Preprint has been published in a journal as an article
DOI of the published article: 10.1109/ICETEMS66917.2026.11469398 (https://ieeexplore.ieee.org/document/11469398)
Preprint / Version 1

RackWeave: Hierarchical Gradient Exchange for Distributed AI


DOI:

https://doi.org/10.31224/6164

Keywords:

Distributed deep learning, parameter server, gradient aggregation, rack-scale computing, network optimization

Abstract

This paper reimagines model update flow in data-parallel training as a balanced I/O service co-designed across NICs, memory hierarchies, and CPUs to overcome the communication bottlenecks that arise when accelerators outpace network bandwidth in distributed AI systems. The architecture slices model state into fine-grained chunks, drives per-core aggregation and optimization pipelines with NUMA-aware buffers, and employs zero-copy RDMA over multiple high-speed interfaces to maximize overlap between transport and computation without cross-core contention. By anchoring a gradient exchange node at the top-of-rack and composing it with hierarchical cross-rack coordination, the design confines most traffic within the rack and minimizes oversubscribed core traversal during synchronization. The implementation interoperates with mainstream training stacks while restoring compute-bound behavior through communication-aware chunk mapping, streaming aggregation, and streamlined update paths. Experiments on representative vision workloads under cloud-like networks demonstrate consistent throughput and cost-efficiency gains versus sharded baselines while preserving accuracy, with scalability bounded by memory/PCIe fabric limits rather than GPU compute. Together, these mechanisms provide a practical template for rack-centric distributed AI training where gradient exchange is treated as a first-class, balanced rack resource instead of a colocated afterthought.
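The hierarchical exchange the abstract describes can be illustrated with a minimal numerical sketch. This is not the paper's implementation (which uses zero-copy RDMA and per-core pipelines); it only shows, under the assumption of simple averaging, why a two-level reduction (intra-rack at a top-of-rack aggregator, then cross-rack over the per-rack partials) produces the same result as a flat all-reduce while sending only one tensor per rack across the oversubscribed core. The function name `aggregate_hierarchical` is hypothetical.

```python
import numpy as np

def aggregate_hierarchical(rack_grads):
    """Two-level gradient reduction (illustrative sketch).

    rack_grads: list of racks; each rack is a list of per-worker
    gradient arrays of identical shape.
    Level 1 averages within each rack (traffic stays on rack-local
    links); level 2 averages the per-rack partials, weighted by each
    rack's worker count so the result matches a flat mean over all
    workers.
    """
    # Level 1: intra-rack aggregation at the top-of-rack node.
    rack_partials = [np.mean(np.stack(g), axis=0) for g in rack_grads]
    # Weight each rack partial by its worker count; only these few
    # partials traverse the cross-rack (core) links.
    weights = np.array([len(g) for g in rack_grads], dtype=float)
    weights /= weights.sum()
    # Level 2: cross-rack reduction over the rack partials.
    return sum(w * p for w, p in zip(weights, rack_partials))
```

For example, with one rack of two workers and another rack of one worker, the hierarchical result equals the flat average over all three gradients, while only two partial tensors cross racks instead of three.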

Posted

2026-01-05