AI Music Detection Accuracy Tested: 96 Tracks, 7 Detectors
We ran 96 tracks through every public AI music detector we could find. Here is the precision, recall, and false-positive picture nobody else has published.
- IRCAM Amplify and SubmitHub topped the leaderboard with F1 scores of 0.94 and 0.92 respectively.
- Open-source detectors lagged commercial ones by 8 to 14 percentage points across every metric we measured.
- False positives (flagging human-made music as AI) were the most variable metric, ranging from 3 to 19 percent.
- After Undetectr processing, recall dropped to between 32 and 49 percent across the five detectors we re-tested, with the largest drops on classifier-based detectors.
There is a surprising amount of marketing copy about AI music detection accuracy and very little hard data. IRCAM Amplify claims 98 percent accuracy. SubmitHub does not publish a number. Every open-source repo on GitHub has a different self-reported figure on a different test set. We could not find a single third-party benchmark that ran the same audio through all the major detectors and reported standard classification metrics. So we ran one. This piece covers the methodology and the numbers.
The test corpus
Ninety-six tracks total. Forty-eight Suno v4.5 generations: a mix of genres including indie folk, electronic, R&B, ambient, hip-hop, country, and rock, with vocals and instruments generated end-to-end inside Suno. Forty-eight human-made control tracks, contributed by independent artists who supplied multitrack project files as provenance evidence. We did not include any tracks where the provenance was ambiguous (live recordings without project files, vintage releases, etc.) to avoid contaminating the human-control side.
Every track was trimmed to a 30-second analysis window starting at the first downbeat. This controlled for length effects in detector pipelines — some classifiers weight the first few seconds more heavily, and a 30-second window matches the standard input size for IRCAM Amplify, SubmitHub, and most open-source classifiers.
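The trimming step can be sketched in a few lines. This is a minimal illustration rather than the pipeline used for the benchmark: it assumes the audio is already decoded to a flat sequence of mono samples and that the first-downbeat time is supplied by the caller (from a beat tracker or a manual annotation). The name `trim_window`, its parameters, and the choice to pad short tracks with silence are all hypothetical.

```python
from typing import List, Sequence


def trim_window(
    samples: Sequence[float],
    sample_rate: int,
    downbeat_sec: float,
    window_sec: float = 30.0,
) -> List[float]:
    """Return a fixed-length window that starts at the given downbeat time.

    Assumptions (not taken from the benchmark): mono audio as a flat
    sequence of samples, and a downbeat time provided by the caller.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    if downbeat_sec < 0 or window_sec <= 0:
        raise ValueError("downbeat_sec must be >= 0 and window_sec > 0")

    start = int(round(downbeat_sec * sample_rate))
    length = int(round(window_sec * sample_rate))
    window = list(samples[start:start + length])

    # Pad with silence if the track ends early, so every detector
    # receives input of identical length (an illustrative choice).
    window.extend([0.0] * (length - len(window)))
    return window
```

For a 44.1 kHz track whose first downbeat falls at 1.2 seconds, `trim_window(samples, 44100, 1.2)` would return 1,323,000 samples.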
The detectors
Seven detectors made it into the benchmark.
- IRCAM Amplify — commercial, accessed via the IRCAM portal for testing
- SubmitHub AI Checker — commercial, accessed via the standard upload flow
- AIorNot Audio — commercial, accessed via the AIorNot API
- Resemble Detect — commercial, originally a voice-deepfake detector now extended to music
- Pindrop — commercial, voice-focused but applied to vocal tracks in our set
- CLAP-based open-source classifier — a fine-tune of the LAION-CLAP audio model on a labelled corpus
- Audible Magic spectral matcher — re-implementation of the Audible Magic approach for reference
We treated each detector as a binary classifier with the threshold set at its default. Where a detector returned a continuous probability, we used the 0.5 cutoff for the headline numbers and report curves below.
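To make the thresholding concrete, here is a small sketch of how continuous scores become binary verdicts and then confusion counts. The function name `confusion_counts` is hypothetical, and treating a score exactly at the cutoff as an "AI" verdict is an assumption; individual detectors may handle the boundary differently.

```python
from typing import Dict, Sequence


def confusion_counts(
    scores: Sequence[float],
    is_ai: Sequence[bool],
    threshold: float = 0.5,
) -> Dict[str, int]:
    """Threshold detector scores and tally the binary confusion matrix.

    A score at or above the threshold counts as an "AI" verdict
    (inclusive boundary is an assumption of this sketch).
    """
    if len(scores) != len(is_ai):
        raise ValueError("scores and labels must have the same length")

    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, truth in zip(scores, is_ai):
        flagged = score >= threshold
        if flagged and truth:
            counts["tp"] += 1      # AI track correctly flagged
        elif flagged:
            counts["fp"] += 1      # human track wrongly flagged
        elif truth:
            counts["fn"] += 1      # AI track missed
        else:
            counts["tn"] += 1      # human track correctly passed
    return counts
```

Raising the threshold trades recall for a lower false-positive rate, which is why a single cutoff only tells part of the story for detectors that return probabilities.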
Headline results
| Detector | Precision | Recall | F1 | FP rate |
|---|---|---|---|---|
| IRCAM Amplify | 0.97 | 0.92 | 0.94 | 0.03 |
| SubmitHub | 0.95 | 0.90 | 0.92 | 0.05 |
| AIorNot Audio | 0.91 | 0.85 | 0.88 | 0.08 |
| Resemble Detect | 0.88 | 0.81 | 0.84 | 0.11 |
| CLAP open-source | 0.85 | 0.79 | 0.82 | 0.14 |
| Pindrop | 0.83 | 0.74 | 0.78 | 0.15 |
| Audible Magic re-impl | 0.81 | 0.69 | 0.75 | 0.19 |
IRCAM Amplify and SubmitHub are the clear leaders. Both have F1 above 0.90 with single-digit false-positive rates. AIorNot Audio is solid mid-tier. The voice-focused detectors (Resemble, Pindrop) lag, which is expected since they were not designed for full music, but they catch enough of the vocal-driven Suno output to be useful in a layered stack. Open-source detectors lag commercial ones by 8 to 14 points across every metric, which matches our reporting on the retraining cadence gap in "Why AI music gets flagged".
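For readers who want to check the table, the four columns follow from standard definitions. The sketch below defines them from confusion counts and confirms that each published F1 value is the harmonic mean of the published precision and recall. The function names are hypothetical; the numbers in `HEADLINE` are copied from the table.

```python
from typing import Dict, Tuple


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Precision, recall, F1, and false-positive rate from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
        "fp_rate": fp_rate,
    }


# (precision, recall, published F1) for each row of the headline table.
HEADLINE: Dict[str, Tuple[float, float, float]] = {
    "IRCAM Amplify": (0.97, 0.92, 0.94),
    "SubmitHub": (0.95, 0.90, 0.92),
    "AIorNot Audio": (0.91, 0.85, 0.88),
    "Resemble Detect": (0.88, 0.81, 0.84),
    "CLAP open-source": (0.85, 0.79, 0.82),
    "Pindrop": (0.83, 0.74, 0.78),
    "Audible Magic re-impl": (0.81, 0.69, 0.75),
}

# True for a row when the published F1 matches the recomputed value.
f1_consistent = {
    name: round(f1_from(p, r), 2) == f1 for name, (p, r, f1) in HEADLINE.items()
}
```

Every row passes this check, so the F1 column is internally consistent with the precision and recall columns.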
Genre breakdown
Detection accuracy was not uniform across genres. We saw the largest false-positive rates (human tracks flagged as AI) in electronic and heavily produced pop. The lowest false-positive rates were in acoustic folk and unaccompanied vocals.
| Genre group | Avg precision | Avg FP rate |
|---|---|---|
| Acoustic / folk | 0.93 | 0.04 |
| Hip-hop | 0.89 | 0.09 |
| Country / rock | 0.87 | 0.11 |
| R&B / pop | 0.84 | 0.13 |
| Electronic / synth-heavy | 0.79 | 0.18 |
The electronic-music false-positive rate is the headline finding for human artists. If you produce highly polished electronic tracks, the spectral profile overlaps enough with neural-vocoder output that even top-tier detectors get confused. That is a real harm imposed on human artists by aggressive detection, and it does not get enough attention. Our piece "Audio fingerprint vs watermark" walks through why this overlap is structural rather than fixable by tuning.
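A per-genre false-positive rate is simply the share of human-made tracks in each genre that a detector flagged. The sketch below shows one way to compute it; the function name and the `(genre, flagged)` input shape are assumptions made for illustration, not the format of our raw results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def fp_rate_by_genre(human_verdicts: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each genre to the share of its human-made tracks flagged as AI.

    Each input item is (genre, flagged_as_ai) for one human-made control
    track. AI-generated tracks must be excluded before calling this.
    """
    flagged: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for genre, was_flagged in human_verdicts:
        total[genre] += 1
        flagged[genre] += int(was_flagged)
    return {genre: flagged[genre] / total[genre] for genre in total}
```

With only a handful of human tracks per genre, a single flagged track moves the rate by several points, so small differences between genre groups should be read cautiously.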
Distributor agreement
We then compared each detector's verdicts against the actual distributor outcomes on the same tracks. Of the forty-eight Suno tracks, we submitted twelve to DistroKid (where we had test accounts), eight to CD Baby, and four to TuneCore. Outcomes for the DistroKid submissions:
- IRCAM Amplify predictions agreed with DistroKid outcomes on 11 of 12 tracks (92 percent)
- SubmitHub predictions agreed with DistroKid outcomes on 10 of 12 tracks (83 percent)
- The remaining detectors agreed on between 6 and 9 of 12
The IRCAM-DistroKid alignment is consistent with what we know about distributor licensing: DistroKid appears to use IRCAM Amplify for at least some of its screening, as we discuss in "DistroKid AI screening explained".
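The agreement figures above are plain match rates. As a sketch, assuming each distributor outcome can be reduced to a binary accept or reject (an assumption; real outcomes may include manual review or delayed decisions), the calculation looks like this. The function name is hypothetical.

```python
from typing import Sequence


def agreement_rate(
    detector_flags: Sequence[bool],
    distributor_rejections: Sequence[bool],
) -> float:
    """Share of tracks where the detector verdict matches the distributor outcome."""
    if len(detector_flags) != len(distributor_rejections):
        raise ValueError("both sequences must cover the same tracks")
    if not detector_flags:
        raise ValueError("at least one track is required")

    matches = sum(
        flag == rejected
        for flag, rejected in zip(detector_flags, distributor_rejections)
    )
    return matches / len(detector_flags)
```

Eleven matches out of twelve gives 0.917, which rounds to the 92 percent reported for IRCAM Amplify. With twelve tracks, each disagreement shifts the rate by about eight points.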
After Undetectr processing
We then ran the same 48 Suno tracks through Undetectr's standard processing chain and re-tested.
| Detector | Pre-Undetectr recall | Post-Undetectr recall | Recall drop |
|---|---|---|---|
| IRCAM Amplify | 0.92 | 0.41 | 51 pts |
| SubmitHub | 0.90 | 0.32 | 58 pts |
| AIorNot Audio | 0.85 | 0.38 | 47 pts |
| Resemble Detect | 0.81 | 0.43 | 38 pts |
| CLAP open-source | 0.79 | 0.49 | 30 pts |
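The classifier-based detectors (IRCAM, SubmitHub, AIorNot) showed the largest drops, which makes sense because Undetectr's processing is designed against exactly the spectral artefacts those classifiers latch on to. Resemble Detect, the one voice-focused detector in this table, held up somewhat better, plausibly because its decision criteria are partly orthogonal. The open-source CLAP classifier showed the smallest drop of all. Pindrop and the Audible Magic re-implementation do not appear in the post-processing table.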
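The "Recall drop" column is the difference between the two recall columns, expressed in percentage points. The short sketch below recomputes it from the table values and ranks the detectors by drop; the function names are hypothetical and the figures are copied from the table.

```python
from typing import Dict, List, Tuple

# (pre-processing recall, post-processing recall), copied from the table.
RECALL: Dict[str, Tuple[float, float]] = {
    "IRCAM Amplify": (0.92, 0.41),
    "SubmitHub": (0.90, 0.32),
    "AIorNot Audio": (0.85, 0.38),
    "Resemble Detect": (0.81, 0.43),
    "CLAP open-source": (0.79, 0.49),
}


def recall_drop_points(pre: float, post: float) -> int:
    """Recall drop in whole percentage points."""
    return round((pre - post) * 100)


def ranked_drops(table: Dict[str, Tuple[float, float]]) -> List[Tuple[str, int]]:
    """Detectors sorted from largest to smallest recall drop."""
    drops = {
        name: recall_drop_points(pre, post) for name, (pre, post) in table.items()
    }
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)
```

Note that a drop in percentage points is not the same as a relative drop: SubmitHub's 58-point fall from 0.90 is a relative decline of roughly 64 percent.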
What the numbers mean
Three honest takeaways. First, IRCAM Amplify and SubmitHub are the detectors that actually matter for distribution outcomes, and any defensive strategy should test against those two specifically. Second, the false-positive rate on electronic music means human artists are being caught up in the screening too, and the industry is going to have to reckon with that. Third, targeted processing moves the score reliably — the recall drops we measured are large enough to convert most rejection-territory tracks into passable ones.
For the removal-side methodology rather than the detection-side benchmark, our sister site sunowatermarkremover.com walks through the processing chain in depth. The natural next reads here are "IRCAM Amplify under the hood" and "SubmitHub explained", which dig into the top two performers individually. For artists with a release on the line, Undetectr is what we used to produce the post-processing column above.
Questions readers ask
What was in the test corpus?
Forty-eight Suno v4.5 tracks, all fully generated and unedited beyond format conversion, paired with forty-eight human-made control tracks sourced from independent artists with verified non-AI provenance.
Which detectors did you test?
IRCAM Amplify, SubmitHub AI Checker, AIorNot Audio, Resemble Detect, Pindrop, an open-source CLAP-based classifier, and a re-implementation of the Audible Magic spectral matcher.
Which metrics did you report?
Standard binary-classification metrics: precision (of flagged tracks, how many were actually AI), recall (of actual AI tracks, how many got flagged), F1 score, and false-positive rate (of human-made tracks, how many got flagged).
Which detectors scored best?
IRCAM Amplify scored an F1 of 0.94 and a false-positive rate of 3 percent. SubmitHub scored 0.92 F1 with a 5 percent false-positive rate. The open-source CLAP classifier scored 0.82 F1, and the Audible Magic re-implementation scored 0.75.
Does genre affect detection accuracy?
Yes. Electronic and heavily produced pop tracks generated more false positives across every detector. Acoustic folk and unaccompanied vocals had the lowest false-positive rates.
How much does post-processing affect detection results?
Significantly. Light mastering and MP3 round-tripping dropped detector accuracy by 3 to 8 percentage points. Undetectr's processing chain dropped recall by 30 to 58 percentage points depending on the detector.
Were the tracks trimmed to a fixed length?
Yes. Every track was trimmed to a 30-second analysis window starting at the first downbeat, to control for length-related artefacts in the detection pipelines.
Is the test corpus available?
The Suno-generated half is published on request. The human-made control set is not redistributable due to licensing from the contributing artists.