Published April 14, 2026 | Version v1
Publication · Open Access

UniLLMOps: A Unified Framework for End-to-End Large Language Model Production Systems — From Distributed Fine-Tuning to Hybrid Retrieval-Augmented Inference

Authors/Creators

  • Independent Researcher

Description

Deploying Large Language Models in production remains a non-trivial challenge: existing tools address individual stages (parameter-efficient fine-tuning, retrieval-augmented generation, or optimized serving), but integrating them into a coherent, end-to-end pipeline is still an open problem. This paper presents UniLLMOps, an open-source framework that unifies three subsystems into a single production-ready stack: (1) a Distributed Fine-Tuning Pipeline combining QLoRA with DeepSpeed ZeRO-3 and FlashAttention-2; (2) a Hybrid RAG System that fuses dense retrieval, BM25, and knowledge-graph traversal through Reciprocal Rank Fusion with CRAG self-correction; and (3) an inference layer built on vLLM with AWQ 4-bit quantization. Experiments on Llama 3 8B show a 23.7% gain in retrieval faithfulness, a 41.2% reduction in per-GPU training memory, and a 3.8x improvement in inference throughput. All code and evaluation artifacts are publicly available.
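The Reciprocal Rank Fusion step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the UniLLMOps implementation: the document IDs and retriever lists below are hypothetical, and `k=60` is the smoothing constant conventionally used with RRF.

```python
# Reciprocal Rank Fusion (RRF): merge ranked lists from several retrievers
# (here: dense vector search, BM25, knowledge-graph traversal) into one list.
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked doc-ID lists, best first. Returns fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each retriever contributes 1 / (k + rank) for every doc it returns.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Higher fused score = better; sort descending.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-retriever results for one query:
dense = ["d3", "d1", "d2"]   # dense vector retriever
bm25  = ["d1", "d3", "d4"]   # BM25 lexical retriever
kg    = ["d2", "d1", "d5"]   # knowledge-graph traversal
fused = reciprocal_rank_fusion([dense, bm25, kg])
# "d1" appears in all three lists, so it ranks first after fusion.
```

Because RRF works only on ranks, it needs no score normalization across the heterogeneous retrievers, which is why it is a common choice for hybrid retrieval stacks like the one described here.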

Keywords

  • Retrieval-Augmented Generation
  • Large Language Models
  • QLoRA
  • DeepSpeed
  • CRAG
  • MLOps
  • Hybrid Retrieval

Files

Paper_Publication.pdf (372.0 kB)
md5:bbb35f2167b155fa8b48f54d0cda5646

Additional details

Software

Programming language
Python