What AI-Music Detectors Actually Detect: Scaling Up, the Round-Trip Test, and Why Two Methods Become One
DOI: https://doi.org/10.31224/7434
Keywords: AI-music detection, Neural-decoder round-trip, Codec-specificity, Supervised classification scale, Audio forensics, Audio deepfakes
Abstract
Part III (final) of a series. Part I documented a single-generator piano detector and its confounds; Part II surveyed alternative, generalisation-oriented approaches and tested simple hand-crafted decoder-artefact features. This part scales the supervised detector to many generators, implements the neural-decoder round-trip as a second layer, and arrives at a unifying conclusion about what these detectors are really doing.
We scaled the supervised approach to a multi-generator, multi-genre detector (a ResNet-18 trained on ~13,500 tracks spanning twelve AI generators and a large real-music corpus, ~210,000 spectrogram segments) and implemented the neural-decoder round-trip as a second, generator-agnostic layer. The scaled CNN reached ~96% validation accuracy on unseen tracks from known generators and was robust to bitrate, but it failed in two revealing ways. It classified an unseen generator (Lyria) as "100% human," and it failed on an unseen version of a generator it knew (Suno v4) even though it had been trained on both v1–v3.5 and v5.5, confirming that performance degrades on any version absent from training. The round-trip layer, trained on a single codec (Encodec), worked almost perfectly on its own task (99.8% on its own reconstructions) but transferred to nothing else: it flagged Suno, Udio, and even MusicGen at ~0%, a consequence of each decoder leaving a different fingerprint. A single experiment then unified the picture: a real track passed only through Encodec was flagged as AI by both the round-trip layer (99.8%) and the CNN (97.7%). The supervised CNN is therefore itself, in large part, a neural-decoder-artefact detector. This explains every result: both detectors succeed on generators whose codec is represented in their training experience and fail identically on those with proprietary decoders, while both produce false positives on real audio rendered through a familiar codec. The two layers are correlated, not complementary; they share a blind spot. We conclude that robust detection requires three things in concert: exhaustive supervised coverage, exhaustive codec coverage, and a fundamentally different, structural analysis of the music itself. This is realistically the domain of well-resourced, well-motivated teams.
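The codec-specificity argument can be illustrated with a deliberately small toy, which is not the paper's pipeline. In the sketch below every name is hypothetical: `toy_codec_round_trip` is a band-limiting stand-in for a neural codec such as Encodec, `other_toy_codec` is a coarse quantiser standing in for a different, unfamiliar decoder, and `flags_as_decoded` is a one-feature threshold rule standing in for a trained detector. Under those assumptions, a detector tuned to one codec's fingerprint flags real audio that has merely passed through that codec, and misses audio processed by a codec with a different fingerprint.

```python
import numpy as np

KEEP_FRACTION = 0.25  # assumed share of the spectrum the toy codec preserves


def toy_codec_round_trip(signal, keep_fraction=KEEP_FRACTION):
    """Hypothetical stand-in for an encode/decode pass: zero the upper spectrum."""
    spectrum = np.fft.rfft(signal)
    cutoff = int(len(spectrum) * keep_fraction)
    spectrum[cutoff:] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))


def other_toy_codec(signal, step=0.125):
    """A different hypothetical codec: coarse amplitude quantisation."""
    return np.round(signal / step) * step


def high_band_energy_ratio(signal, keep_fraction=KEEP_FRACTION):
    """Fraction of spectral power above the toy codec's cutoff."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    cutoff = int(len(power) * keep_fraction)
    return power[cutoff:].sum() / power.sum()


def flags_as_decoded(signal, threshold=0.01):
    """Toy detector: flag audio whose high band is almost empty."""
    return bool(high_band_energy_ratio(signal) < threshold)


rng = np.random.default_rng(0)
real = rng.standard_normal(16000)           # white noise as a stand-in for a real recording
round_tripped = toy_codec_round_trip(real)  # "real" audio after the familiar codec
other = other_toy_codec(real)               # "real" audio after an unfamiliar codec

print("real:", flags_as_decoded(real))
print("familiar codec:", flags_as_decoded(round_tripped))
print("unfamiliar codec:", flags_as_decoded(other))
```

The toy detector reacts to the codec's fingerprint rather than to anything about how the audio was composed, which is the same pattern the abstract reports for both the round-trip layer and the CNN.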
License
Copyright (c) 2026 Daniel Bordovský

This work is licensed under a Creative Commons Attribution 4.0 International License.