Published April 14, 2026 | Version v1
Publication · Open Access

UniLLMOps: A Unified Framework for End-to-End Large Language Model Production Systems — From Distributed Fine-Tuning to Hybrid Retrieval-Augmented Inference

Authors/Creators

  • Independent Researcher

Description

Deploying Large Language Models in production remains a non-trivial challenge: existing tools address individual stages (parameter-efficient fine-tuning, retrieval-augmented generation, or optimized serving), but integrating them into a coherent, end-to-end pipeline is still an open problem. This paper presents UniLLMOps, an open-source framework that unifies three subsystems into a single production-ready stack: (1) a Distributed Fine-Tuning Pipeline combining QLoRA with DeepSpeed ZeRO-3 and FlashAttention-2; (2) a Hybrid RAG System that fuses dense retrieval, BM25, and knowledge-graph traversal through Reciprocal Rank Fusion with CRAG self-correction; and (3) an inference layer built on vLLM with AWQ 4-bit quantization. Experiments on Llama 3 8B show a 23.7% gain in retrieval faithfulness, a 41.2% reduction in per-GPU training memory, and a 3.8x improvement in inference throughput. All code and evaluation artifacts are publicly available.
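The Reciprocal Rank Fusion step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the UniLLMOps implementation: the document IDs and retriever lists below are hypothetical, and `k=60` is the smoothing constant conventionally used with RRF.

```python
# Reciprocal Rank Fusion (RRF): merge ranked lists from several retrievers
# (here: dense vector search, BM25, knowledge-graph traversal) into one list.
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked doc-ID lists, best first. Returns fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each retriever contributes 1 / (k + rank) for every doc it returns.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Higher fused score = better; sort descending.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-retriever results for one query:
dense = ["d3", "d1", "d2"]   # dense vector retriever
bm25  = ["d1", "d3", "d4"]   # BM25 lexical retriever
kg    = ["d2", "d1", "d5"]   # knowledge-graph traversal
fused = reciprocal_rank_fusion([dense, bm25, kg])
# "d1" appears in all three lists, so it ranks first after fusion.
```

Because RRF works only on ranks, it needs no score normalization across the heterogeneous retrievers, which is why it is a common choice for hybrid retrieval stacks like the one described here.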

Keywords

  • Retrieval-Augmented Generation
  • Large Language Models
  • QLoRA
  • DeepSpeed
  • CRAG
  • MLOps
  • Hybrid Retrieval

Files

Paper_Publication.pdf (372.0 kB)
md5:bbb35f2167b155fa8b48f54d0cda5646

Additional details

Software

Programming language
Python