State Space Models as CPU-Native Neural Network Architectures
Experimental Evidence from ARM NEON Inference with 1.58-bit Quantized Mamba
DOI: https://doi.org/10.31224/6680

Keywords: State Space Models, Mamba, ARM NEON, CPU inference, Apple Silicon, 1-bit quantization, SSM-attention duality, non-GPU architectures, SIMD optimization, ternary neural networks

Abstract
We present experimental evidence that State Space Models (SSMs) are structurally advantageous for neural network inference on CPU and ARM architectures. By porting BitMamba-2, a 1.58-bit quantized Mamba implementation, from x86 AVX2 to ARM NEON, we achieve 82.5 tokens/sec (255M parameters) and 29.6 tokens/sec (1B parameters) on an Apple M1 processor — the first published ARM benchmark for this model family. We experimentally validate the O(1) memory property of SSMs: generation speed remains constant across sequence lengths from 50 to more than 200 tokens, in contrast to Transformer-based models, whose memory grows linearly with context via the KV cache. At comparable model weight sizes (~600 MB), the SSM achieves throughput competitive with quantized Transformers (~30–40 tokens/sec) while offering a constant memory footprint and 1.58-bit compression (vs. 4-bit for Transformers). These results support the thesis that mathematical reformulations — here, the combination of state space recurrence with ternary quantization — can make non-GPU inference structurally competitive rather than merely tolerable.
Downloads
License
Copyright (c) 2026 Gabriel Zo-Hasina Rasatavohary

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.