The Real Limits of Distributed LLM Training
Authors/Creators
Description
We analyze a federated, peer-to-peer LLM training architecture that uses delta compression,
BitTorrent-style chunked model distribution, and hierarchical merging to coordinate training
across thousands of consumer GPUs. The architecture is internally coherent and contains several
non-trivial engineering decisions worth documenting; it is also, for the intended use case of
training frontier-scale language models, the wrong shape for the problem. We characterize seven
concrete failure modes – bandwidth, the straggler effect, FedAvg convergence under non-IID data,
the consumer-VRAM ceiling, total cost of training, the security envelope of the delta-validation
rules, and data provenance – each paired with a reproducible Python script. The conclusion is
that for frontier-scale models the centralized cluster is faster, cheaper, and safer by a wide enough
margin that distributed federated training is economically and mathematically dominated. We close with a
short list of regimes where federated training remains the right tool.
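
As a minimal illustration of the kind of reproducible script described above, the sketch below implements plain FedAvg aggregation over synthetic, non-IID client updates. It is not the whitepaper's script: the client count, model size, learning rate, and skew are illustrative assumptions chosen only to show how sample-count-weighted averaging behaves when clients pull toward different local optima.

```python
# Minimal FedAvg sketch (illustrative assumptions; not the whitepaper's script).
# Each client runs a few local steps toward its own optimum (mimicking non-IID
# data); the server takes a sample-count-weighted average of the returned
# parameter vectors.
import numpy as np

rng = np.random.default_rng(0)

def fedavg(client_params, client_sizes):
    """Weighted average of client parameter vectors by local dataset size."""
    weights = np.asarray(client_sizes, dtype=float)
    weights /= weights.sum()
    return np.average(np.stack(client_params), axis=0, weights=weights)

# Hypothetical setup: 8 clients, a 1,000-parameter "model", and per-client
# targets that differ systematically to emulate non-IID local data.
num_clients, dim = 8, 1_000
global_params = np.zeros(dim)
client_targets = [rng.normal(loc=i - num_clients / 2, scale=1.0, size=dim)
                  for i in range(num_clients)]
client_sizes = rng.integers(100, 10_000, size=num_clients)

for round_idx in range(5):
    updates = []
    for target in client_targets:
        local = global_params.copy()
        for _ in range(10):                  # a few local SGD-like steps
            local -= 0.1 * (local - target)  # gradient of 0.5 * ||local - target||^2
        updates.append(local)
    global_params = fedavg(updates, client_sizes)
    drift = np.mean([np.linalg.norm(global_params - t) for t in client_targets])
    print(f"round {round_idx}: mean distance to client optima = {drift:.2f}")
```

Under these assumptions the averaged model settles near a weighted compromise of the client optima rather than any single client's optimum, which is the basic tension behind the FedAvg-under-non-IID-data failure mode discussed in the paper.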
Files

| Name | Size | MD5 |
|---|---|---|
| whitepaper.pdf | 193.5 kB | 80436ae52dfcb5377b0d50a40cd016dd |