Cross-Platform Inference of 1.58-bit State Space Models: ARM NEON vs x86 AVX-512 vs GPU CUDA
A Comparative Benchmark of BitMamba-2 with Optimized SIMD Kernels on Consumer, Server, and GPU Hardware
DOI:
https://doi.org/10.31224/6686

Keywords:
State space models, SSM, Mamba, 1.58-bit, SIMD optimization, ARM NEON, AVX-512, CPU inference, edge deployment, O(1) memory, hardware-aware optimization

Abstract
Large language model inference remains predominantly GPU-dependent, limiting deployment on edge devices and in cost-sensitive environments. We investigate whether State Space Models (SSMs) with extreme quantization can achieve practical inference speeds on CPU architectures without GPU acceleration. We benchmark BitMamba-2, a 1.58-bit ternary-quantized Mamba model, across five hardware configurations spanning three instruction set architectures: ARM NEON (Apple M1), x86 AVX-512 (Intel Xeon Silver 4210R), x86 AVX2 (Intel i9-10980HK), and GPU CUDA (NVIDIA RTX 2070 Super). Our C++ implementation features hand-written SIMD kernels for both ARM NEON and x86 AVX-512, enabling a direct architectural comparison on the same model weights. On the 255M-parameter model, the Xeon AVX-512 configuration achieves 112.9 tokens/s and ARM NEON reaches 82.5 tokens/s, both exceeding the throughput of several cloud-hosted API endpoints. The 1B-parameter model runs at 46.8 tokens/s (Xeon) and 29.6 tokens/s (M1), competitive with 4-bit-quantized Transformer models of equivalent weight size. We experimentally confirm the O(1) memory property of SSM recurrence: throughput remains constant across sequence lengths from 50 to 200 tokens, in contrast to the linear KV-cache growth of Transformer architectures. We further quantify the WSL2 virtualization overhead at 10–25× relative to native execution on identical hardware. These results demonstrate that the combination of SSM recurrence and ternary quantization constitutes a viable mathematical reformulation for GPU-free inference at interactive speeds.
License
Copyright (c) 2026 Gabriel Zo-Hasina Rasatavohary

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.