Detecting AI-Generated Piano Music with a Spectrogram CNN: A Proof of Concept, and a Study in Shortcuts
DOI: https://doi.org/10.31224/7432
Keywords: AI-generated music, Synthetic audio detection, Convolutional neural network, Mel-spectrogram, Deep learning shortcuts, Music information retrieval, Audio forensics
Abstract
We investigated whether a convolutional neural network can distinguish AI-generated piano music from human-performed piano recordings by treating the problem as image classification over mel-spectrograms. Using a deliberately narrow scope (solo and ambient piano only), we trained a ResNet-18 on 39 AI-generated tracks (Suno v5.5) and 42 human recordings (81 tracks, 3,914 three-second segments in total). The initial classifier reached near-perfect validation accuracy; however, subsequent analysis suggested that this performance may have been influenced by factors unrelated to musical content. In particular, we observed a systematic loudness difference between the two classes (mean RMS 0.127 vs. 0.098) and a hard high-frequency cutoff in the generated audio at ~18 kHz. After neutralising these confounds by imposing a common 16 kHz ceiling and matching loudness, in-distribution accuracy remained high (70.6%–99.2%, final 89.0%), and single-track inference labelled held-out tracks with very high confidence (≈90% for AI tracks, ≈95% for human tracks). One residual difference survived cleaning: a pronounced spectral-tilt (brightness) gap, with human recordings carrying markedly more energy across the ~1–9 kHz band. We retain rather than remove this difference, and argue that it cannot be classified as either a confound or a genuine artefact without an out-of-distribution test. The model failed to generalise: a track from a different generator (Udio) was misclassified, and two out-of-genre tracks (punk rock, one AI and one human) were both classified as "real" with full confidence. We additionally report qualitative observations from data collection, notably the generator's frequent non-compliance with prompts and a marked homogeneity of its piano timbre, and outline the data scale and the move beyond purely spectral features that a reliable detector would require. We conclude that a small, single-genre, single-generator detector is achievable and locally confident, but brittle and partially built on data-collection artefacts.
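The confound-neutralisation step described in the abstract (a common 16 kHz ceiling, loudness matching, and three-second segmentation) could be sketched roughly as follows. This is an illustrative sketch only, not the code used in the study: the function names, the brick-wall FFT filter, the per-track RMS target of 0.1, and the choice to drop a trailing partial segment are all assumptions made for the example.

```python
import numpy as np


def lowpass_fft(x, sr, cutoff_hz=16_000.0):
    """Brick-wall low-pass: zero every frequency bin above cutoff_hz.

    Assumption: the abstract only states a common 16 kHz ceiling; the
    filter type used in the study is not specified.
    """
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(x))


def match_rms(x, target_rms=0.1):
    """Scale a track so that its overall RMS equals target_rms.

    Assumption: target_rms=0.1 is a hypothetical value for illustration.
    """
    rms = np.sqrt(np.mean(x ** 2))
    if rms == 0.0:
        return x.copy()  # silent input: nothing to scale
    return x * (target_rms / rms)


def segment(x, sr, seconds=3.0):
    """Split a track into non-overlapping segments, dropping any remainder."""
    seg_len = int(round(sr * seconds))
    n_segments = len(x) // seg_len
    return x[: n_segments * seg_len].reshape(n_segments, seg_len)


def preprocess(x, sr, cutoff_hz=16_000.0, target_rms=0.1, seconds=3.0):
    """Low-pass, loudness-match, then cut into fixed-length segments."""
    filtered = lowpass_fft(x, sr, cutoff_hz)
    levelled = match_rms(filtered, target_rms)
    return segment(levelled, sr, seconds)
```

Applied to a mono float array `x` sampled at `sr` Hz, `preprocess(x, sr)` returns an array of shape `(n_segments, 3 * sr)`, and each row could then be converted to a mel-spectrogram for the classifier. Applying the same ceiling and level to both classes is what removes the loudness and cutoff cues; the spectral-tilt difference noted in the abstract would be left untouched by this sketch.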
License
Copyright (c) 2026 Daniel Bordovský

This work is licensed under a Creative Commons Attribution 4.0 International License.