Preprint / Version 1

Detecting AI-Generated Piano Music with a Spectrogram CNN: A Proof of Concept, and a Study in Shortcuts

Authors

  • Daniel Bordovský, Independent Researcher

DOI:

https://doi.org/10.31224/7432

Keywords:

AI-generated music, Synthetic audio detection, Convolutional neural network, Mel-spectrogram, Deep learning shortcuts, Music information retrieval, Audio forensics

Abstract

I investigated whether a convolutional neural network can distinguish AI-generated piano music from human-performed piano recordings by treating the problem as image classification over mel-spectrograms. Using a deliberately narrow scope (solo and ambient piano only), I trained a ResNet-18 on 39 AI-generated tracks (Suno v5.5) and 42 human recordings (81 tracks, 3,914 three-second segments). The initial classifier reached near-perfect validation accuracy; however, subsequent analysis suggested that this performance may have been influenced by factors unrelated to musical content. In particular, I observed a systematic loudness difference between the two classes (mean RMS 0.127 vs. 0.098) and a hard high-frequency cutoff in the generated audio at ~18 kHz. After neutralising these confounds by imposing a common 16 kHz ceiling and matching loudness, in-distribution accuracy remained high (70.6%–99.2%, final 89.0%), and single-track inference labelled held-out tracks with very high confidence (≈90% for AI tracks, ≈95% for human tracks). One residual difference survived cleaning: a pronounced spectral-tilt (brightness) gap, with human recordings carrying markedly more energy across the ~1–9 kHz band. I retain rather than remove this difference, and argue that it cannot be classified as either a confound or a genuine artefact without an out-of-distribution test. The model failed to generalise: a track from a different generator (Udio) was misclassified, and two out-of-genre tracks (punk rock, one AI and one human) were both classified as "real" with full confidence. I additionally report qualitative observations from data collection, notably the generator's frequent non-compliance with prompts and a marked homogeneity of its piano timbre, and I outline the data scale and the move beyond purely spectral features that a reliable detector would require. I conclude that a small, single-genre, single-generator detector is achievable and locally confident, but brittle and partially built on data-collection artefacts.
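The confound-neutralising step described in the abstract (a common 16 kHz ceiling, loudness matching, three-second segments) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the preprint's actual pipeline: the FFT brick-wall filter, the target RMS of 0.1, the order of operations, and the function names `lowpass_fft`, `match_rms`, and `split_segments` are all hypothetical choices made here for clarity.

```python
import numpy as np


def rms(x: np.ndarray) -> float:
    """Root-mean-square level of a mono signal."""
    return float(np.sqrt(np.mean(np.square(x))))


def lowpass_fft(x: np.ndarray, sr: int, cutoff_hz: float = 16000.0) -> np.ndarray:
    """Zero every FFT bin above cutoff_hz (assumed brick-wall filter)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(x))


def match_rms(x: np.ndarray, target_rms: float = 0.1) -> np.ndarray:
    """Scale the signal so its RMS equals target_rms (assumed target level)."""
    current = rms(x)
    if current == 0.0:
        return x.copy()  # silent input: nothing to scale
    return x * (target_rms / current)


def split_segments(x: np.ndarray, sr: int, seconds: float = 3.0) -> np.ndarray:
    """Cut the signal into non-overlapping segments, dropping the remainder."""
    seg_len = int(round(seconds * sr))
    n_segments = len(x) // seg_len
    return x[: n_segments * seg_len].reshape(n_segments, seg_len)


if __name__ == "__main__":
    sr = 44100
    t = np.arange(7 * sr) / sr
    # Synthetic stand-in for a track: a 1 kHz tone plus a 19 kHz component.
    track = 0.5 * np.sin(2 * np.pi * 1000 * t) + 0.3 * np.sin(2 * np.pi * 19000 * t)

    # Filter first, then normalise, because filtering changes the RMS level.
    cleaned = match_rms(lowpass_fft(track, sr), target_rms=0.1)
    segments = split_segments(cleaned, sr)
    print(segments.shape, rms(cleaned))
```

Applying the same ceiling and the same target level to both classes is what removes loudness and bandwidth as usable cues; any remaining class difference, such as the spectral tilt noted in the abstract, is left in the signal.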

Downloads

Download data is not yet available.

Author Biography

Daniel Bordovský, Independent Researcher

I am an independent researcher and instrumental music composer with over a decade of experience in the industry. Throughout my career, I have focused on creating cinematic music for commercial licensing, with my work featured in original productions distributed by major platforms including Netflix, Amazon Prime Video, Disney+, and others. My music has reached millions of streams globally and continues to attract thousands of monthly listeners on Spotify.

In recent years, the rapid advancement of generative music systems sparked a new professional focus. As AI-generated music became increasingly sophisticated, I found it personally challenging to reliably distinguish certain synthetic compositions from human-created works. This experience led me to shift my attention toward research, and I now dedicate most of my work to investigating the technical and forensic challenges posed by generative audio.

Bridging the gap between creative music production and audio forensics, my current research focuses on generative audio security, machine learning architectures, and the structural limitations of synthetic audio detection. My recent empirical work investigates the critical challenge of generalization across diverse generative music systems. Utilizing deep learning frameworks, my research evaluates the efficacy of supervised convolutional neural networks (CNNs) trained on mel-spectrogram data, analyzes the fine-grained impact of data-collection confounds, and implements neural-decoder round-trip verification to isolate codec-specific architectural artifacts.

By rigorously testing the boundaries of temporal stationarity, hand-crafted features, and self-supervised music foundation models (MERT), my work maps the vulnerabilities of signal-level forensics and advocates for comprehensive, multi-lens ensemble frameworks capable of navigating the rapidly evolving technological arms race of AI-generated music detection.

Posted

2026-06-25