On-Device Large Language Model Inference at the Network Edge: Architecture, Optimization, and Cross-Platform Runtime Design
Description
This technical report, prepared during the Stanford Ignite Program, presents a comprehensive framework and system architecture (SlyOS) for executing large language models (LLMs) directly on consumer hardware at the network edge.
The rapid growth of LLMs has produced remarkable gains in natural language understanding, generation, and reasoning. However, the computational and financial burden of cloud-hosted inference creates a fundamental bottleneck for latency-sensitive applications, bandwidth-constrained environments, and privacy-conscious deployments.
This report presents a comprehensive treatment of on-device LLM inference—the practice of executing transformer-based language models directly on consumer hardware at the network edge, eliminating the round-trip to centralized GPU clusters. We introduce a cross-platform runtime architecture that unifies inference execution across heterogeneous edge devices spanning iOS, Android, and web browsers through a single model artifact format and a shared abstraction over hardware-specific acceleration primitives (Core ML, NNAPI, WebAssembly SIMD).
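The report does not publish its runtime API, so the interface below is an illustrative sketch of how a preference-ordered abstraction over hardware backends (Core ML, NNAPI, WebAssembly SIMD) might route execution; the `Provider` and `route` names are assumptions, not SlyOS identifiers.

```python
from typing import Callable, List, NamedTuple

class Provider(NamedTuple):
    """One hardware acceleration backend (e.g. Core ML, NNAPI, WASM SIMD)."""
    name: str
    is_available: Callable[[], bool]  # probe for support on this device

def route(providers: List[Provider]) -> Provider:
    """Return the first backend the current device actually supports,
    falling back through the preference-ordered list."""
    for p in providers:
        if p.is_available():
            return p
    raise RuntimeError("no execution provider available on this device")
```

On iOS the preference list might be Core ML then CPU; in a browser, WebAssembly SIMD then plain WebAssembly, with the same model artifact loaded either way.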
The system implements aggressive 4-bit post-training quantization with activation-aware calibration, achieving model footprints between 0.26 and 3.7 gigabytes. Published quantization studies demonstrate that this level of compression preserves generation quality within 0.1-0.2 perplexity points of full-precision baselines for 7B-class models. We describe a device intelligence layer that constructs hardware capability profiles encompassing compute throughput, thermal characteristics, available memory, and GPU architecture to drive automatic model selection, batch size calibration, and execution provider routing.
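As a rough illustration of the compression step, the sketch below implements plain asymmetric group-wise 4-bit post-training quantization in pure Python. The group size of 64 is an assumption, and a real activation-aware scheme (as the report describes) would additionally rescale salient channels using calibration data, which this toy version omits.

```python
def quantize_4bit(weights, group_size=64):
    """Asymmetric 4-bit quantization: one (scale, zero-point) pair per
    group of weights. 4 bits give 16 levels, so codes lie in 0..15."""
    codes, params = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15 or 1.0          # avoid division by zero
        codes.append([round((w - lo) / scale) for w in group])
        params.append((scale, lo))
    return codes, params

def dequantize_4bit(codes, params):
    """Reconstruct approximate float weights from codes and group params."""
    out = []
    for group, (scale, lo) in zip(codes, params):
        out.extend(q * scale + lo for q in group)
    return out
```

Each weight is reconstructed to within half a quantization step (scale / 2) of its original value, which is the error budget that the cited perplexity studies measure at the model level.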
The report further develops a hybrid retrieval-augmented generation (RAG) pipeline that augments on-device generation with server-side vector retrieval over domain-specific knowledge bases, enabling factual grounding without transmitting raw user queries.
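The report does not specify the hybrid RAG protocol in this abstract; the sketch below assumes one plausible reading, in which the device embeds the query locally and transmits only the embedding vector, so the raw query text never leaves the device. `embed_locally`, `hybrid_rag`, and the hash-based stand-in embedding are all illustrative, not SlyOS code.

```python
import hashlib
import math

def embed_locally(text, dim=16):
    """Stand-in for an on-device embedding model: a deterministic
    hash-derived unit vector, so raw query text is never transmitted."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def hybrid_rag(query, server_search, generate, k=3):
    """On-device generation grounded by server-side vector retrieval.
    `server_search` maps an embedding to top-k passages; `generate` is
    the local LLM's completion function."""
    passages = server_search(embed_locally(query), k)  # only the vector is sent
    context = "\n".join(passages)
    return generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```

The design choice being illustrated: retrieval quality benefits from a large server-hosted index, while the privacy-sensitive steps (embedding and generation) stay on the device.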
Files
- SlyOS_Edge_LLM_Research_Paper.pdf (2.6 MB)
- md5:23263a5d2ea0417f4d6dc94a09d31510
Additional details
Software
- Repository URL: https://github.com/BeltoAI/sly.os.git
- Development Status: Concept