On-Device Multi-Type Disfluency Detection with Sub-Millisecond Inference on Apple Silicon
DOI: https://doi.org/10.31224/6814

Keywords: speech disfluency detection, stuttering, on-device inference, CoreML, Apple Neural Engine, voice stress analysis, SEP-28K, mobile speech processing

Abstract
Published multi-type disfluency detection systems achieve their best results with 300M+ parameter server-class backbones, leaving speech-therapy applications without a concrete reference for the detection performance and inference latency achievable on a smartphone.
We present DisfluoSDK, a multi-type disfluency classifier running entirely on-device on Apple Silicon. On SEP-28K (20,131 clips, episode-grouped 5-fold cross-validation) a 617K-parameter CNN achieves macro-F1 0.382 (1.2 MB CoreML) and an adapted ResNet-18 achieves 0.404 (11.2M parameters, 21 MB)—occupying an otherwise unpopulated region of the accuracy–efficiency Pareto frontier where on-device deployment is feasible.
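The episode-grouped protocol mentioned above keeps every clip from the same podcast episode inside a single fold, so speaker and recording conditions never leak across the train/test boundary. A minimal sketch of such a grouping, with hypothetical episode IDs and a greedy size-balancing heuristic (the paper does not specify its exact assignment procedure):

```python
from collections import defaultdict

def episode_grouped_folds(episode_ids, n_folds=5):
    """Assign whole episodes to folds so no episode's clips span a
    train/test boundary; largest episodes go to the smallest fold."""
    counts = defaultdict(int)
    for ep in episode_ids:
        counts[ep] += 1
    fold_sizes = [0] * n_folds
    fold_of = {}
    for ep, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        f = fold_sizes.index(min(fold_sizes))  # currently smallest fold
        fold_of[ep] = f
        fold_sizes[f] += n
    return [fold_of[ep] for ep in episode_ids]

# Hypothetical clip-to-episode mapping: 12 clips from 4 episodes.
eps = ["ep1"] * 5 + ["ep2"] * 3 + ["ep3"] * 2 + ["ep4"] * 2
folds = episode_grouped_folds(eps, n_folds=5)
# Every clip of a given episode lands in the same fold.
assert all(len({f for e, f in zip(eps, folds) if e == ep}) == 1
           for ep in set(eps))
```

In practice the same effect can be obtained with scikit-learn's GroupKFold, passing episode IDs as the `groups` argument.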
A four-way CoreML compute-unit sweep across four hardware generations (M1 Max, A19 Pro, A18, A15; 16,000+ timed trials) shows that the Neural Engine delivers sub-millisecond mean inference across all tested devices (CNN 0.225–0.635 ms), providing ample real-time headroom for speech processing. The sweep also surfaces a desktop/mobile CoreML scheduler divergence in GPU routing with a direct consequence for deployment practice. PyTorch-to-CoreML export fidelity is numerically verified on 500 test-fold spectrograms (cell-level agreement 99.96%/100.00%, ΔF1 ≤ 0.003).
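The export-fidelity metrics above (cell-level agreement and ΔF1) can be reproduced with a short script. This is an illustrative sketch on toy data, not the paper's evaluation code; the variable names and the 4-clip example are invented:

```python
def cell_agreement(a, b):
    """Fraction of (clip, disfluency-type) cells where the PyTorch
    and CoreML predictions agree exactly."""
    cells = sum(len(row) for row in a)
    same = sum(x == y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return same / cells

def macro_f1(y_true, y_pred, n_types):
    """Macro-averaged F1 over per-type binary predictions."""
    f1s = []
    for t in range(n_types):
        tp = sum(yt[t] and yp[t] for yt, yp in zip(y_true, y_pred))
        fp = sum((not yt[t]) and yp[t] for yt, yp in zip(y_true, y_pred))
        fn = sum(yt[t] and (not yp[t]) for yt, yp in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / n_types

# Toy data: 4 clips, 3 disfluency types; CoreML flips one cell.
torch_out = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
coreml_out = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
print(round(cell_agreement(torch_out, coreml_out), 4))  # → 0.9167
```

ΔF1 is then simply `abs(macro_f1(labels, torch_out, 3) - macro_f1(labels, coreml_out, 3))` for the shared ground-truth labels.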
As an auxiliary empirical result, voice-stress features show no practically meaningful linear association with any disfluency type across 14,645 clips (|r| < 0.05; all negligible by Cohen's conventions), supporting the architectural separation of the stress and disfluency modules.
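The negligible-association claim rests on Pearson correlations interpreted against Cohen's conventional |r| cut-offs. A self-contained sketch of that check, with invented example data (the paper's actual stress features and labels are not shown here):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences
    (assumes neither sequence is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cohen_label(r):
    """Cohen's conventional effect-size labels for |r|."""
    a = abs(r)
    if a < 0.1:
        return "negligible"
    if a < 0.3:
        return "small"
    if a < 0.5:
        return "medium"
    return "large"

# Hypothetical per-clip stress scores vs. a binary disfluency label;
# with a binary label this reduces to a point-biserial correlation.
stress = [0.2, 0.5, 0.1, 0.9, 0.4, 0.7]
has_block = [0, 1, 0, 0, 1, 1]
print(cohen_label(pearson_r(stress, has_block)))
```

Under the paper's criterion, every disfluency type would need to fall in the "negligible" band (|r| < 0.1, and in fact |r| < 0.05) for the modules to be kept separate.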
License
Copyright (c) 2026 Nazar Kozak

This work is licensed under a Creative Commons Attribution 4.0 International License.