Published June 2, 2026 | Version v2
Preprint Open

Compute, Not the Wire: An Experience Report on Distributed LLM Inference over Heterogeneous Consumer AMD Radeon, RDNA4, and Mixed ROCm+Metal Clusters

Authors/Creators

Description

Every published distributed-LLM-inference result is measured on NVIDIA datacenter or consumer GPUs, Apple Silicon, or AMD CDNA accelerators (MI300X). None characterizes the hardware most enthusiasts actually own: old AMD Radeon Pro GPUs driven through Vulkan-over-Metal (MoltenVK), a consumer RDNA4 card (RX 9070 XT) on freshly added ROCm support, and clusters that mix these ROCm and Metal runtimes over a Tailscale mesh. We fill this empirical gap. Building on the Petals-derived weights-stay-local / activations-travel pipeline, we add Sthambha, a compute-aware layer planner, and run dense and Mixture-of-Experts models from 3B to 70B across a deliberately heterogeneous seven-node testbed of commodity desktop workstations plus a low-power single-board coordinator. Our central, measurement-backed reframing is that per-node compute, not the ~16 KB/token activation wire, governs throughput: a worker-to-worker push that removes a network hop measured slightly slower than client relay. We report a 4-machine 70B chain at ~0.21 tok/s and a catalogue of characterized failure modes: MoE cross-machine coherence collapse, Metal nondeterminism, a page-alignment buffer assert, a 30x speculative-decoding regression on the MoltenVK translation layer, and UDP burst loss. We close with reproducible lessons for getting such a chain toward usable throughput.
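The compute-versus-wire claim can be sanity-checked with back-of-envelope arithmetic. The sketch below is illustrative only: the ~16 KB/token activation size and the ~0.21 tok/s chain throughput are taken from the description, while the fp16 activation format, the hop count, the link speed, and the per-hop latency are assumptions chosen for the example, not measurements from the paper.

```python
# Back-of-envelope: how much of each token's time budget could the wire account for?
# Only the two figures marked "from the description" come from the source;
# everything else is a hypothetical value chosen for illustration.

ACTIVATION_BYTES_PER_TOKEN = 16 * 1024  # ~16 KB/token, from the description
MEASURED_TOK_PER_S = 0.21               # 4-machine 70B chain, from the description

# Assumption: activations are sent as fp16 (2 bytes per element).
# Under that assumption, 16 KB corresponds to a hidden size of 8192.
BYTES_PER_FP16 = 2
implied_hidden_size = ACTIVATION_BYTES_PER_TOKEN // BYTES_PER_FP16


def wire_seconds_per_token(hops: int, link_bytes_per_s: float, latency_s: float) -> float:
    """Network time per token: per-hop serialization plus per-hop latency."""
    per_hop = ACTIVATION_BYTES_PER_TOKEN / link_bytes_per_s + latency_s
    return hops * per_hop


# Hypothetical link: 100 Mbit/s effective throughput, 20 ms latency per hop.
link_bytes_per_s = 100e6 / 8
latency_s = 0.020
hops = 4  # assumed number of activation transfers per token in a 4-machine chain

seconds_per_token = 1.0 / MEASURED_TOK_PER_S
wire_s = wire_seconds_per_token(hops, link_bytes_per_s, latency_s)
wire_fraction = wire_s / seconds_per_token

print(f"implied hidden size (fp16 assumption): {implied_hidden_size}")
print(f"time per token at measured throughput: {seconds_per_token:.2f} s")
print(f"estimated wire time per token:         {wire_s * 1000:.1f} ms")
print(f"wire share of per-token time:          {wire_fraction:.1%}")
```

Under these assumed link parameters the wire accounts for only a few percent of each token's time, which is consistent with the description's observation that removing a network hop did not improve throughput. Different link speeds or latencies would shift the exact share.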

Files

bastola-compute-not-the-wire-2026.pdf

Files (177.2 kB)

md5:a09bbc827f0d4b88a6829a2a1c9fa87c

Additional details

Software

Repository URL
https://github.com/fthrvi/nakshatra
Programming language
Python, C++
Development Status
Active