On-Device Large Language Model Inference at the Network Edge: Architecture, Optimization, and Cross-Platform Runtime Design
Description
This technical report, prepared during the Stanford Ignite Program, presents a comprehensive framework and system architecture (SlyOS) for executing large language models (LLMs) directly on consumer hardware at the network edge.
The rapid growth of LLMs has produced remarkable gains in natural language understanding, generation, and reasoning. However, the computational and financial burden of cloud-hosted inference creates a fundamental bottleneck for latency-sensitive applications, bandwidth-constrained environments, and privacy-conscious deployments.
This report presents a comprehensive treatment of on-device LLM inference—the practice of executing transformer-based language models directly on consumer hardware at the network edge, eliminating the round-trip to centralized GPU clusters. We introduce a cross-platform runtime architecture that unifies inference execution across heterogeneous edge devices spanning iOS, Android, and web browsers through a single model artifact format and a shared abstraction over hardware-specific acceleration primitives (Core ML, NNAPI, WebAssembly SIMD).
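The report does not publish its runtime API, so the interface below is an illustrative sketch of how a preference-ordered abstraction over hardware backends (Core ML, NNAPI, WebAssembly SIMD) might route execution; the `Provider` and `route` names are assumptions, not SlyOS identifiers.

```python
from typing import Callable, List, NamedTuple

class Provider(NamedTuple):
    """One hardware acceleration backend (e.g. Core ML, NNAPI, WASM SIMD)."""
    name: str
    is_available: Callable[[], bool]  # probe for support on this device

def route(providers: List[Provider]) -> Provider:
    """Return the first backend the current device actually supports,
    falling back through the preference-ordered list."""
    for p in providers:
        if p.is_available():
            return p
    raise RuntimeError("no execution provider available on this device")
```

On iOS the preference list might be Core ML then CPU; in a browser, WebAssembly SIMD then plain WebAssembly, with the same model artifact loaded either way.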
The system implements aggressive 4-bit post-training quantization with activation-aware calibration, achieving model footprints between 0.26 and 3.7 gigabytes. Published quantization studies demonstrate that this level of compression preserves generation quality within 0.1-0.2 perplexity points of full-precision baselines for 7B-class models. We describe a device intelligence layer that constructs hardware capability profiles encompassing compute throughput, thermal characteristics, available memory, and GPU architecture to drive automatic model selection, batch size calibration, and execution provider routing.
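As a rough illustration of the compression step, the sketch below implements plain asymmetric group-wise 4-bit post-training quantization in pure Python. The group size of 64 is an assumption, and a real activation-aware scheme (as the report describes) would additionally rescale salient channels using calibration data, which this toy version omits.

```python
def quantize_4bit(weights, group_size=64):
    """Asymmetric 4-bit quantization: one (scale, zero-point) pair per
    group of weights. 4 bits give 16 levels, so codes lie in 0..15."""
    codes, params = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15 or 1.0          # avoid division by zero
        codes.append([round((w - lo) / scale) for w in group])
        params.append((scale, lo))
    return codes, params

def dequantize_4bit(codes, params):
    """Reconstruct approximate float weights from codes and group params."""
    out = []
    for group, (scale, lo) in zip(codes, params):
        out.extend(q * scale + lo for q in group)
    return out
```

Each weight is reconstructed to within half a quantization step (scale / 2) of its original value, which is the error budget that the cited perplexity studies measure at the model level.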
The report further develops a hybrid retrieval-augmented generation (RAG) pipeline that augments on-device generation with server-side vector retrieval over domain-specific knowledge bases, enabling factual grounding without transmitting raw user queries.
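The report does not specify the hybrid RAG protocol in this abstract; the sketch below assumes one plausible reading, in which the device embeds the query locally and transmits only the embedding vector, so the raw query text never leaves the device. `embed_locally`, `hybrid_rag`, and the hash-based stand-in embedding are all illustrative, not SlyOS code.

```python
import hashlib
import math

def embed_locally(text, dim=16):
    """Stand-in for an on-device embedding model: a deterministic
    hash-derived unit vector, so raw query text is never transmitted."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def hybrid_rag(query, server_search, generate, k=3):
    """On-device generation grounded by server-side vector retrieval.
    `server_search` maps an embedding to top-k passages; `generate` is
    the local LLM's completion function."""
    passages = server_search(embed_locally(query), k)  # only the vector is sent
    context = "\n".join(passages)
    return generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```

The design choice being illustrated: retrieval quality benefits from a large server-hosted index, while the privacy-sensitive steps (embedding and generation) stay on the device.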
Files
- SlyOS_Edge_LLM_Research_Paper.pdf (2.6 MB)
- md5:23263a5d2ea0417f4d6dc94a09d31510
Additional details
Software
- Repository URL: https://github.com/BeltoAI/sly.os.git
- Development Status: Concept