Preprint / Version 1

Generalising AI-Music Detection: Decoder Artefacts, Self-Supervised Features, and the Limits of Hand-Crafted Cues

Authors

  • Daniel Bordovský, Independent Researcher

DOI:

https://doi.org/10.31224/7433

Keywords:

AI-generated music, Synthetic audio detection, Decoder artifacts, Temporal stationarity, Spectral periodicity, Audio forensics

Abstract

A follow-up to "Detecting AI-Generated Piano Music with a Spectrogram CNN: A Proof of Concept, and a Study in Shortcuts."

Our previous study showed that a spectrogram CNN can distinguish one generator's piano output from human recordings with high in-distribution confidence, but fails to generalise: it collapsed on an unseen generator and on an unseen genre, and part of its signal was traceable to data-collection confounds. This follow-up turns from that single working classifier to the broader question it raised — how can AI-music detection be made to generalise across generators? — and reports a small empirical study of the most-discussed answer: detecting the artefacts left by the neural decoders that all current generators use to render audio. We first survey the candidate approaches (supervised spectrogram classification, self-supervised foundation features, neural-decoder artefact detection, musical-structure analysis, and watermarking) with their respective strengths and limitations. We then test, on a multi-generator dataset spanning seven systems (AudioLDM, MusicGen, Mustango, Riffusion, Stable Audio, Suno, Udio) plus real music, whether the decoder artefact can be captured by a simple, training-free signal-processing feature. A naive spectral-periodicity ("comb-strength") feature failed (AUC 0.37): it measures musical harmonicity, which real instrumental music has in abundance. A temporal-stationarity feature — exploiting that decoder artefacts are frozen in frequency while music moves — recovered a real but modest signal (AUC 0.68). Crucially, it separated codec-based generators (AudioLDM, MusicGen) well but was blind to the two most prominent commercial systems, Suno and Udio, whose polished output suppresses the artefact. We conclude that the decoder-artefact route is the most principled path to generator-agnostic detection, but that the artefact in today's best generators is too subtle for hand-crafted features and requires a learned model (e.g. the autoencoder round-trip of Afchar et al.). 
No single approach is a silver bullet; a practical detector is an ensemble, and the field is an arms race.
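
As a rough illustration of the temporal-stationarity idea described above (a component that stays frozen in frequency while the music moves), the following is a minimal sketch and not the feature used in the study. The function names `stft_magnitude` and `stationarity_score`, the frame length, the hop size, and the variance-ratio definition are all assumptions made for this example. The score is the share of log-spectral variance that persists in the time-averaged spectrum, so it lies between 0 and 1.

```python
import numpy as np


def stft_magnitude(signal, frame_len=1024, hop=256):
    """Magnitude spectrogram from Hann-windowed frames; shape (frames, bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def stationarity_score(signal, frame_len=1024, hop=256, eps=1e-10):
    """Hypothetical score: fraction of log-spectral variance that is fixed in time.

    Values near 1 mean the spectral shape barely changes between frames;
    values near 0 mean the energy keeps moving across frequency bins.
    """
    log_mag = np.log(stft_magnitude(signal, frame_len, hop) + eps)
    # Remove per-frame loudness so only the spectral shape is compared.
    log_mag = log_mag - log_mag.mean(axis=1, keepdims=True)
    # Variance of the time-averaged spectrum: structure that persists.
    persistent = log_mag.mean(axis=0).var()
    # Total variance over all frames and bins.
    total = log_mag.var()
    return float(persistent / (total + eps))


# Synthetic demonstration (assumed test signals, not data from the study).
sr = 16000
t = np.arange(2 * sr) / sr
noise = 0.01 * np.random.default_rng(0).standard_normal(t.size)
# Three fixed tones: spectral peaks stay in the same bins.
steady = sum(np.sin(2 * np.pi * f * t) for f in (440.0, 1320.0, 3500.0)) + noise
# Linear chirp from 200 Hz to 7000 Hz: the peak moves through the bins.
sweep = np.sin(2 * np.pi * (200.0 * t + 0.5 * 3400.0 * t**2)) + noise

print("steady tones:", stationarity_score(steady))
print("frequency sweep:", stationarity_score(sweep))
```

On these synthetic signals the fixed tones score higher than the sweep. Real recordings are far less clear-cut, which is consistent with the modest AUC reported above for a hand-crafted feature of this kind.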

Author Biography

Daniel Bordovský, Independent Researcher

I am an independent researcher and instrumental music composer with over a decade of experience in the industry. Throughout my career, I have focused on creating cinematic music for commercial licensing, with my work featured in original productions distributed by major platforms including Netflix, Amazon Prime Video, Disney+, and others. My music has reached millions of streams globally and continues to attract thousands of monthly listeners on Spotify.

In recent years, the rapid advancement of generative music systems sparked a new professional focus. As AI-generated music became increasingly sophisticated, I found it personally challenging to reliably distinguish certain synthetic compositions from human-created works. This experience led me to transition much of my attention toward research, where I now dedicate the majority of my work to investigating the technical and forensic challenges posed by generative audio.

Bridging the gap between creative music production and audio forensics, my current research focuses on generative audio security, machine learning architectures, and the structural limitations of synthetic audio detection. My recent empirical work investigates the critical challenge of generalization across diverse generative music systems. Utilizing deep learning frameworks, my research evaluates the efficacy of supervised convolutional neural networks (CNNs) trained on mel-spectrogram data, analyzes the fine-grained impact of data-collection confounds, and implements neural-decoder round-trip verification to isolate codec-specific architectural artifacts.

By rigorously testing the boundaries of temporal stationarity, hand-crafted features, and self-supervised music foundation models (MERT), my work maps the vulnerabilities of signal-level forensics and advocates for comprehensive, multi-lens ensemble frameworks capable of navigating the rapidly evolving technological arms race of AI-generated music detection.

Posted

2026-06-25