Emotion-Conditioned Chiptune Music Generation Using a Hybrid PatchTST-LSTM Model
DOI:
https://doi.org/10.31224/5562
Keywords:
Music Generation, Deep Learning, LSTM, PatchTST, Symbolic Music, Transformer Architecture
Abstract
We propose and evaluate a hybrid deep learning model that combines Patch Time Series Transformers (PatchTST) with Long Short-Term Memory (LSTM) networks for symbolic music generation conditioned on emotional states. Using the YM2413-MDB dataset of annotated chiptune music, we map emotions into Russell’s circumplex model (valence-arousal space) and assess the ability of three models—vanilla PatchTST, vanilla LSTM, and our hybrid architecture—to generate emotion-aligned music. Evaluation metrics include melodic coherence, rhythmic stability, harmonic richness, structural complexity, and a custom Emotion Alignment Score. Experimental results show that while the hybrid PatchTST-LSTM model achieved competitive performance, the vanilla LSTM slightly outperformed it in both validation loss and emotional alignment. The findings suggest that recurrent models remain highly effective for short symbolic music sequences, while Transformer-based approaches may require more complex datasets or longer compositions to demonstrate advantages. We discuss limitations of emotion encoding, evaluation methods, and dataset size, and outline directions for future research. Code is available at https://github.com/qwirty123/PatchTST-LSTM.
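As a rough illustration of the emotion conditioning described in the abstract, the sketch below maps emotion labels to points in a valence-arousal plane and names the quadrant each falls in. The label names, the coordinate values, and the helper functions `condition_vector` and `quadrant` are hypothetical and chosen only for illustration; they are not taken from the paper, the YM2413-MDB annotations, or the linked repository.

```python
from typing import Dict, Tuple

# Hypothetical (valence, arousal) coordinates in [-1, 1] x [-1, 1].
# These values are illustrative assumptions, not the paper's mapping.
EMOTION_TO_VA: Dict[str, Tuple[float, float]] = {
    "cheerful": (0.8, 0.6),
    "tense": (-0.6, 0.7),
    "depressed": (-0.7, -0.5),
    "peaceful": (0.6, -0.6),
}


def condition_vector(label: str) -> Tuple[float, float]:
    """Look up the (valence, arousal) pair for a label, ignoring case."""
    return EMOTION_TO_VA[label.lower()]


def quadrant(valence: float, arousal: float) -> str:
    """Name the circumplex quadrant containing a (valence, arousal) point."""
    v = "positive" if valence >= 0 else "negative"
    a = "high" if arousal >= 0 else "low"
    return f"{v}-valence/{a}-arousal"


if __name__ == "__main__":
    for name in EMOTION_TO_VA:
        print(name, condition_vector(name), quadrant(*condition_vector(name)))
```

A two-dimensional vector of this kind could be concatenated with, or added to, the note-sequence input of a generative model; how the paper's models actually consume the emotion signal is not specified in the abstract.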
License
Copyright (c) 2025 Jing Yuan Sun, Roy Ma

This work is licensed under a Creative Commons Attribution 4.0 International License.