Cross-Platform Inference of 1.58-bit State Space Models: ARM NEON vs x86 AVX-512 vs GPU CUDA
A Comparative Benchmark of BitMamba-2 with Optimized SIMD Kernels on Consumer, Server, and GPU Hardware
DOI:
https://doi.org/10.31224/6686

Keywords:
State space models, SSM, Mamba, 1.58-bit, SIMD optimization, ARM NEON, AVX-512, CPU inference, edge deployment, O(1) memory, hardware-aware optimization

Abstract
Large language model inference remains predominantly GPU-dependent, limiting deployment on edge devices and in cost-sensitive environments. We investigate whether State Space Models (SSMs) with extreme quantization can achieve practical inference speeds on CPU architectures without GPU acceleration. We benchmark BitMamba-2, a 1.58-bit ternary-quantized Mamba model, across five hardware configurations spanning three instruction set architectures: ARM NEON (Apple M1), x86 AVX-512 (Intel Xeon Silver 4210R), x86 AVX2 (Intel i9-10980HK), and GPU CUDA (NVIDIA RTX 2070 Super). Our C++ implementation features hand-written SIMD kernels for both ARM NEON and x86 AVX-512, enabling a direct architectural comparison on the same model weights. On the 255M-parameter model, the Xeon AVX-512 configuration achieves 112.9 tokens/s and ARM NEON reaches 82.5 tokens/s, both exceeding the throughput of several cloud-hosted API endpoints. The 1B-parameter model runs at 46.8 tokens/s (Xeon) and 29.6 tokens/s (M1), competitive with 4-bit-quantized Transformer models of equivalent weight size. We experimentally confirm the O(1) memory property of SSM recurrence: throughput remains constant across sequence lengths from 50 to 200 tokens, in contrast to the linear KV-cache growth of Transformer architectures. We further quantify the WSL2 virtualization overhead at 10–25× relative to native execution on identical hardware. These results demonstrate that the combination of SSM recurrence and ternary quantization constitutes a viable mathematical reformulation for GPU-free inference at interactive speeds.
License
Copyright (c) 2026 Gabriel Zo-Hasina Rasatavohary

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.