TTV-HRM: Hierarchical Reasoning Architecture for Efficient Text-to-Video Generation
DOI:
https://doi.org/10.31224/6669
Keywords:
Text-to-Video Generation, Hierarchical Reasoning, Efficient Transformers, Spatiotemporal Modeling, Video Tokenization, Resource-Constrained Learning
Abstract
Text-to-video generation is typically inaccessible to most researchers due to its reliance on large-scale models and multi-GPU infrastructure. We introduce the Text-to-Video Hierarchical Reasoning Model (TTV-HRM), a lightweight framework that enables coherent text-conditioned video synthesis on a single commodity GPU. The model employs interleaved hierarchical reasoning in which a high-level transformer captures global semantic structure while a low-level layer refines spatiotemporal details through bidirectional cross-attention. A learned convergence predictor enables early stopping, reducing average inference iterations from 3.0 to 2.1 without quality loss. The 115M-parameter system integrates rotary positional embeddings, SwiGLU feed-forward layers, and a 3D convolutional video autoencoder, training in about four hours on a single NVIDIA T4 GPU at roughly $2 cloud cost, with sub-second inference per clip. On 8-frame 32×32 video generation, TTV-HRM improves frame-wise Fréchet Inception Distance from 120.5 to 62.1 across three epochs using only 45 video–text pairs. Results demonstrate semantic alignment, temporal coherence, and object persistence, showing that hierarchical reasoning can substitute for model scale to make text-to-video research more accessible.
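The interleaved reasoning loop described in the abstract can be sketched as follows. This is a structural illustration only: `high_step`, `low_step`, and `converged` are hypothetical stand-ins for the paper's high-level transformer, low-level spatiotemporal refiner, and learned convergence predictor, whose real interfaces are not specified here.

```python
import numpy as np

def hierarchical_generate(text_emb, high_step, low_step, converged,
                          max_iters=3):
    """Interleaved hierarchical reasoning with early stopping (sketch).

    high_step, low_step, and converged are assumed callables standing
    in for the high-level transformer, the low-level refinement layer,
    and the learned convergence predictor, respectively.
    """
    z_high = np.zeros_like(text_emb)  # global semantic state
    z_low = np.zeros_like(text_emb)   # spatiotemporal detail state
    iters = 0
    for _ in range(max_iters):
        iters += 1
        z_high = high_step(z_high, z_low, text_emb)  # global pass
        z_low = low_step(z_low, z_high)              # detail refinement
        if converged(z_high, z_low):                 # predicted convergence
            break                                    # early stop
    return z_low, iters
```

If the predictor fires before `max_iters` is reached, the loop stops early; averaged over prompts, this is how the reported 2.1 iterations (versus a fixed budget of 3) would arise.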
License
Copyright (c) 2026 Ahsan Umar

This work is licensed under a Creative Commons Attribution 4.0 International License.