TTV-HRM: Hierarchical Reasoning Architecture for Efficient Text-to-Video Generation
DOI:
https://doi.org/10.31224/6669
Keywords:
Text-to-Video Generation, Hierarchical Reasoning, Efficient Transformers, Spatiotemporal Modeling, Video Tokenization, Resource-Constrained Learning
Abstract
Text-to-video generation is typically inaccessible to most researchers due to its reliance on large-scale models and multi-GPU infrastructure. We introduce the Text-to-Video Hierarchical Reasoning Model (TTV-HRM), a lightweight framework that enables coherent text-conditioned video synthesis on a single commodity GPU. The model employs interleaved hierarchical reasoning in which a high-level transformer captures global semantic structure while a low-level layer refines spatiotemporal details through bidirectional cross-attention. A learned convergence predictor enables early stopping, reducing average inference iterations from 3.0 to 2.1 without quality loss. The 115M-parameter system integrates rotary positional embeddings, SwiGLU feed-forward layers, and a 3D convolutional video autoencoder, training in about four hours on a single NVIDIA T4 GPU at roughly $2 cloud cost, with sub-second inference per clip. On 8-frame 32×32 video generation, TTV-HRM improves frame-wise Fréchet Inception Distance from 120.5 to 62.1 across three epochs using only 45 video–text pairs. Results demonstrate semantic alignment, temporal coherence, and object persistence, showing that hierarchical reasoning can substitute for model scale to make text-to-video research more accessible.
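The interleaved reasoning loop described in the abstract can be sketched as follows. This is a structural illustration only: `high_step`, `low_step`, and `converged` are hypothetical stand-ins for the paper's high-level transformer, low-level spatiotemporal refiner, and learned convergence predictor, whose real interfaces are not specified here.

```python
import numpy as np

def hierarchical_generate(text_emb, high_step, low_step, converged,
                          max_iters=3):
    """Interleaved hierarchical reasoning with early stopping (sketch).

    high_step, low_step, and converged are assumed callables standing
    in for the high-level transformer, the low-level refinement layer,
    and the learned convergence predictor, respectively.
    """
    z_high = np.zeros_like(text_emb)  # global semantic state
    z_low = np.zeros_like(text_emb)   # spatiotemporal detail state
    iters = 0
    for _ in range(max_iters):
        iters += 1
        z_high = high_step(z_high, z_low, text_emb)  # global pass
        z_low = low_step(z_low, z_high)              # detail refinement
        if converged(z_high, z_low):                 # predicted convergence
            break                                    # early stop
    return z_low, iters
```

If the predictor fires before `max_iters` is reached, the loop stops early; averaged over prompts, this is how the reported 2.1 iterations (versus a fixed budget of 3) would arise.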
License
Copyright (c) 2026 Ahsan Umar

This work is licensed under a Creative Commons Attribution 4.0 International License.