Compute, Not the Wire: An Experience Report on Distributed LLM Inference over Heterogeneous Consumer AMD Radeon, RDNA4, and Mixed ROCm+Metal Clusters
Authors/Creators
Description
Every published distributed-LLM-inference result is measured on NVIDIA datacenter or consumer GPUs, Apple Silicon, or AMD CDNA accelerators (MI300X). None characterizes the hardware most enthusiasts actually own: old AMD Radeon Pro GPUs driven through Vulkan-over-Metal (MoltenVK), a consumer RDNA4 card (RX 9070 XT) on freshly added ROCm support, and clusters that mix these ROCm and Metal runtimes over a Tailscale mesh. We fill this empirical gap. Building on the Petals-derived weights-stay-local / activations-travel pipeline, we add Sthambha, a compute-aware layer planner, and run dense and Mixture-of-Experts models from 3B to 70B across a deliberately heterogeneous seven-node testbed of commodity desktop workstations plus a low-power single-board coordinator. Our central, measurement-backed reframing is that per-node compute, not the ~16 KB/token activation wire, governs throughput: in our measurements, a worker-to-worker push that removes a network hop ran slightly slower than client relay. We report a 4-machine 70B chain at ~0.21 tok/s and a catalogue of characterized failure modes: MoE cross-machine coherence collapse, Metal nondeterminism, a page-alignment buffer assert, a 30× speculative-decoding regression on the MoltenVK translation layer, and UDP burst loss. We close with reproducible lessons for moving such a chain toward usable throughput.
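As a rough illustration of the two ideas in the description, the sketch below computes the per-token activation payload and splits layers across nodes in proportion to compute. It is not Sthambha's actual algorithm, which this record does not describe. The hidden width of 8192 at 16-bit precision is an assumption typical of 70B-class dense models (the record does not say how its ~16 KB figure is derived), and `Node`, `compute_score`, and `plan_layers` are hypothetical names introduced here.

```python
from dataclasses import dataclass


def activation_bytes_per_token(hidden_size: int, bytes_per_element: int = 2) -> int:
    """Bytes in one token's hidden-state vector passed between pipeline stages."""
    return hidden_size * bytes_per_element


@dataclass
class Node:
    name: str
    compute_score: float  # hypothetical relative throughput, e.g. from a probe run


def plan_layers(nodes: list[Node], total_layers: int) -> dict[str, range]:
    """Assign contiguous layer ranges proportional to compute_score.

    Uses the largest-remainder method so the counts sum to total_layers.
    A very weak node may receive an empty range.
    """
    total_score = sum(n.compute_score for n in nodes)
    if not nodes or total_score <= 0:
        raise ValueError("need at least one node with positive compute_score")

    # Ideal (fractional) share of layers for each node.
    exact = [total_layers * n.compute_score / total_score for n in nodes]
    counts = [int(x) for x in exact]

    # Hand out the layers lost to rounding, largest fractional part first.
    leftover = total_layers - sum(counts)
    by_fraction = sorted(range(len(nodes)), key=lambda i: -(exact[i] - counts[i]))
    for i in by_fraction[:leftover]:
        counts[i] += 1

    # Turn counts into contiguous [start, stop) layer ranges in node order.
    plan: dict[str, range] = {}
    start = 0
    for node, count in zip(nodes, counts):
        plan[node.name] = range(start, start + count)
        start += count
    return plan


if __name__ == "__main__":
    print(activation_bytes_per_token(8192))  # assumed 70B-class hidden width, fp16
    cluster = [Node("a", 4.0), Node("b", 2.0), Node("c", 1.0), Node("d", 1.0)]
    for name, layers in plan_layers(cluster, 80).items():
        print(name, layers.start, layers.stop)
```

Under these assumptions the payload is 8192 × 2 = 16384 bytes, or 16 KiB per token per hop, which is small next to the cost of running dozens of transformer layers on a slow node. With scores of 4, 2, 1, and 1, an 80-layer model splits into 40, 20, 10, and 10 layers.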
Files
| Name | Size |
|---|---|
| bastola-compute-not-the-wire-2026.pdf (md5:a09bbc827f0d4b88a6829a2a1c9fa87c) | 177.2 kB |
Additional details
Software
- Repository URL: https://github.com/fthrvi/nakshatra
- Programming language: Python, C++
- Development Status: Active