State Space Models as CPU-Native Neural Network Architectures
Experimental Evidence from ARM NEON Inference with 1.58-bit Quantized Mamba
DOI: https://doi.org/10.31224/6680

Keywords: State Space Models, Mamba, ARM NEON, CPU inference, Apple Silicon, 1-bit quantization, SSM-attention duality, non-GPU architectures, SIMD optimization, ternary neural networks

Abstract
We present experimental evidence that State Space Models (SSMs) are structurally advantageous for neural network inference on CPU and ARM architectures. By porting BitMamba-2, a 1.58-bit quantized Mamba implementation, from x86 AVX2 to ARM NEON, we achieve 82.5 tokens/sec (255M parameters) and 29.6 tokens/sec (1B parameters) on an Apple M1 processor — the first published ARM benchmark for this model family. We experimentally validate the O(1) memory property of SSMs: generation speed remains constant across sequence lengths from 50 to more than 200 tokens, in contrast to Transformer-based models, whose memory grows linearly with context via the KV cache. At comparable model weight sizes (~600 MB), the SSM achieves throughput competitive with quantized Transformers (~30–40 tokens/sec) while offering a constant memory footprint and 1.58-bit compression (vs. 4-bit for Transformers). These results support the thesis that mathematical reformulations — here, the combination of state space recurrence with ternary quantization — can make non-GPU inference structurally competitive rather than merely tolerable.
Downloads
License
Copyright (c) 2026 Gabriel Zo-Hasina Rasatavohary

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.