Large Batch vs. Small Batch Training: Generalization Tradeoffs in Deep Neural Networks
DOI: https://doi.org/10.31224/7356
Keywords: Batch size, deep learning, generalization gap, SGD, learning-rate scaling, sharp minima, flat minima, gradient noise, implicit regularization, CIFAR-10
Abstract
Batch size is a foundational hyperparameter in stochastic gradient descent (SGD) training of deep neural networks, governing both computational efficiency and model generalization. The adverse effect of large batch sizes on test-set performance, known as the generalization gap, is empirically well established, but its dependence on dataset scale, model architecture, and learning-rate scheduling has not been systematically characterized within a unified experimental framework. In this work we present a comprehensive empirical study on five datasets (three synthetic sets of 1K, 10K, and 50K samples; MNIST; and CIFAR-10) and three neural network architectures, sweeping batch sizes from 1 to 1024. Our experiments confirm that large-batch SGD converges to sharp minimizers of the training loss, while small-batch SGD, by exploiting gradient noise as implicit regularization, finds substantially flatter minima that generalize better. We quantify a 6.9× sharpness ratio between batch sizes of 32 and 512, and demonstrate that gradient variance follows the theoretical Var ∝ 1/B law, where B is the batch size, with R² = 0.996. Controlled ablations validate that the linear scaling rule (scaling the learning rate proportionally with batch size) is essential: omitting it degrades test accuracy by up to 10%. Finally, we find that the critical batch size, defined as the threshold beyond which accuracy degrades by more than 1%, scales approximately as √N, where N is the dataset size, and we translate all findings into a practical batch-size selection protocol.
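Two of the relationships stated in the abstract, the Var ∝ 1/B law and the linear scaling rule, can be illustrated with a short sketch. This is not the study's code: the per-sample gradients are hypothetical unit-variance Gaussian scalars, the base learning rate of 0.1 at batch size 32 is an arbitrary example, and the function names `scaled_learning_rate` and `minibatch_gradient_variance` are introduced here for illustration only.

```python
import numpy as np


def scaled_learning_rate(base_lr, base_batch, batch):
    """Linear scaling rule: the learning rate grows proportionally with batch size."""
    return base_lr * batch / base_batch


def minibatch_gradient_variance(per_sample_grads, batch, trials, rng):
    """Empirical variance of the mini-batch mean gradient.

    Mini-batches are drawn with replacement, so the variance of the
    batch mean is expected to equal the per-sample variance divided by `batch`.
    """
    n = per_sample_grads.shape[0]
    idx = rng.integers(0, n, size=(trials, batch))
    batch_means = per_sample_grads[idx].mean(axis=1)
    return batch_means.var()


rng = np.random.default_rng(0)

# ASSUMPTION: synthetic scalar "gradients", not gradients from a real network.
per_sample_grads = rng.normal(loc=0.0, scale=1.0, size=50_000)

batch_sizes = np.array([1, 4, 16, 64, 256, 1024])
variances = np.array(
    [minibatch_gradient_variance(per_sample_grads, int(b), 4000, rng) for b in batch_sizes]
)

# If Var is proportional to 1/B, log(Var) against log(B) is a line with slope close to -1.
slope, intercept = np.polyfit(np.log(batch_sizes), np.log(variances), 1)
print(f"log-log slope of variance vs. batch size: {slope:.3f}")

# ASSUMPTION: base learning rate 0.1 at batch size 32, chosen only as an example.
for b in (32, 128, 512):
    print(f"batch size {b}: learning rate {scaled_learning_rate(0.1, 32, b):.3f}")
```

A fitted slope near -1 on the log-log scale corresponds to the 1/B behavior; with real networks the per-sample gradients are high-dimensional and would have to be collected from the model at a fixed parameter setting.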
License
Copyright (c) 2026 Anish Kumar Pal

This work is licensed under a Creative Commons Attribution 4.0 International License.