Preprint / Version 1

What AI-Music Detectors Actually Detect: Scaling Up, the Round-Trip Test, and Why Two Methods Become One

Authors

  • Daniel Bordovský, Independent Researcher

DOI:

https://doi.org/10.31224/7434

Keywords:

AI-music detection, Neural-decoder round-trip, Codec-specificity, Supervised classification scale, Audio forensics, Audio deepfakes

Abstract

Part III (final) of a series. Part I documented a single-generator piano detector and its confounds; Part II surveyed alternative, generalisation-oriented approaches and tested simple hand-crafted decoder-artefact features. This part scales the supervised detector to many generators, implements the neural-decoder round-trip as a second layer, and arrives at a unifying conclusion about what these detectors are really doing.

We scaled the supervised approach to a multi-generator, multi-genre detector (a ResNet-18 trained on ~13,500 tracks spanning twelve AI generators and a large real-music corpus, ~210,000 spectrogram segments) and implemented the neural-decoder round-trip as a second, generator-agnostic layer. The scaled CNN reached ~96% validation accuracy on unseen tracks of known generators and was robust to bitrate, but failed in two revealing ways: it classified an unseen generator (Lyria) as "100% human," and it failed on an unseen version of a generator it knew (Suno v4) even though it had trained on both v1–v3.5 and v5.5, which indicates that any version absent from training degrades it. The round-trip layer, trained on a single codec (Encodec), worked near-perfectly on its own task (99.8% on its own reconstructions) but transferred to nothing else, flagging Suno, Udio, and even MusicGen at ~0%, a consequence of each decoder leaving a different fingerprint. A single experiment then unified the picture: a real track passed only through Encodec was flagged as AI by both the round-trip layer (99.8%) and the CNN (97.7%). The supervised CNN is therefore itself, in large part, a neural-decoder-artefact detector. This explains every result: both detectors succeed on generators whose codec is represented in their training and fail identically on those with proprietary decoders, while both produce false positives on real audio rendered through a familiar codec. The two layers are correlated, not complementary; they share a blind spot. We conclude that robust detection requires three things in concert: exhaustive supervised coverage, exhaustive codec coverage, and a fundamentally different, structural analysis of the music itself, and that this is realistically the domain of well-resourced, well-motivated teams.
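The codec-only control experiment described above can be sketched as follows. This is a minimal illustration, not the author's code: `flag_rate`, `round_trip_control`, `toy_round_trip`, and `toy_detector` are hypothetical names, the stand-in "codec" is plain amplitude quantisation rather than Encodec, and the stand-in detector simply looks for that quantisation grid. In a real run, the two trained detectors and an actual encode/decode pass would be supplied instead.

```python
from typing import Callable, Dict, Sequence

import numpy as np

Audio = np.ndarray
Detector = Callable[[Audio], float]   # assumed interface: returns P(AI) in [0, 1]
RoundTrip = Callable[[Audio], Audio]  # assumed interface: encode, then decode


def flag_rate(detector: Detector, clips: Sequence[Audio], threshold: float = 0.5) -> float:
    """Fraction of clips whose score meets the threshold (i.e. labelled AI)."""
    scores = np.array([detector(clip) for clip in clips])
    return float(np.mean(scores >= threshold))


def round_trip_control(
    real_clips: Sequence[Audio],
    round_trip: RoundTrip,
    detectors: Dict[str, Detector],
    threshold: float = 0.5,
) -> Dict[str, Dict[str, float]]:
    """Flag rate of each detector on real clips, before and after the codec pass.

    A detector that reacts to decoder artefacts rather than to the music
    should show a low rate on the originals and a high rate after the pass.
    """
    processed = [round_trip(clip) for clip in real_clips]
    return {
        name: {
            "real": flag_rate(det, real_clips, threshold),
            "round_tripped": flag_rate(det, processed, threshold),
        }
        for name, det in detectors.items()
    }


# --- Toy stand-ins (hypothetical; NOT Encodec and NOT the paper's detectors) ---

def toy_round_trip(clip: Audio) -> Audio:
    """Snap samples to a grid of step 1/32, leaving a deliberate 'fingerprint'."""
    return np.round(clip * 32) / 32


def toy_detector(clip: Audio) -> float:
    """Score = fraction of samples lying on the 1/32 grid."""
    scaled = clip * 32
    return float(np.mean(np.isclose(scaled, np.round(scaled))))


rng = np.random.default_rng(0)
real_clips = [rng.uniform(-1.0, 1.0, size=1000) for _ in range(20)]

results = round_trip_control(real_clips, toy_round_trip, {"toy": toy_detector})
print(results)
```

With real detectors, a jump in the `round_tripped` rate for both layers at once would be the signature of the shared blind spot the abstract reports: both are responding to the codec, not to how the music was composed.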

Author Biography

Daniel Bordovský, Independent Researcher

I am an independent researcher and instrumental music composer with over a decade of experience in the industry. Throughout my career, I have focused on creating cinematic music for commercial licensing, with my work featured in original productions distributed by major platforms including Netflix, Amazon Prime Video, Disney+, and others. My music has reached millions of streams globally and continues to attract thousands of monthly listeners on Spotify.

In recent years, the rapid advancement of generative music systems sparked a new professional focus. As AI-generated music became increasingly sophisticated, I found it personally challenging to reliably distinguish certain synthetic compositions from human-created works. This experience led me to transition much of my attention toward research, where I now dedicate the majority of my work to investigating the technical and forensic challenges posed by generative audio.

Bridging the gap between creative music production and audio forensics, my current research focuses on generative audio security, machine learning architectures, and the structural limitations of synthetic audio detection. My recent empirical work investigates the critical challenge of generalization across diverse generative music systems. Using deep learning frameworks, I evaluate the efficacy of supervised convolutional neural networks (CNNs) trained on mel-spectrogram data, analyze the fine-grained impact of data-collection confounds, and implement neural-decoder round-trip verification to isolate codec-specific architectural artifacts.

By rigorously testing the boundaries of temporal stationarity, hand-crafted features, and self-supervised music foundation models (MERT), my work maps the vulnerabilities of signal-level forensics and advocates for comprehensive, multi-lens ensemble frameworks capable of navigating the rapidly evolving technological arms race of AI-generated music detection.

Posted

2026-06-25