Published October 20, 2025 | Version v4
Preprint · Open Access

Topic-Aware Inference Boost: A Fast Microservice Architecture for Reducing Large Language Model Hallucinations

Founder, Gigi Sehgal LLC

Description

Large language models (LLMs) often hallucinate—producing plausible but inaccurate responses—particularly when they misjudge their own confidence [arXiv:2401.01313].

This paper introduces Topic-Aware Inference Boost, a modular microservice architecture designed to mitigate hallucinations through rapid, topic-specific inference augmentation. The system delivers just-in-time expert-level responses from curated subject-matter-expert (SME) models through a lightweight API, without requiring retraining or prompt engineering. The prototype demonstrates end-to-end latency of 1 to 7 seconds on standard CPUs with over 90% inference quality across multiple domain tasks. By decoupling topic specialization from monolithic LLMs, the architecture enables any client model to enhance its reliability through targeted grounding. Phase 2 will extend the framework to allow models to self-evaluate confidence and selectively invoke the service for low-confidence inferences, maintaining real-time performance and high accuracy.
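The flow described above—topic identification, routing to a curated SME model, and the planned Phase 2 confidence gating—can be sketched roughly as follows. This is a minimal illustration, not the paper's actual API: the names `classify_topic`, `SME_REGISTRY`, and `boost` are hypothetical, and the SME models are stubbed as local functions in place of real expert endpoints.

```python
from typing import Callable, Dict, Tuple

# Curated subject-matter-expert (SME) "models", stubbed here as plain
# functions; in the described architecture these sit behind a lightweight API.
SME_REGISTRY: Dict[str, Callable[[str], str]] = {
    "medicine": lambda q: f"[medicine SME] grounded answer to: {q}",
    "law": lambda q: f"[law SME] grounded answer to: {q}",
}

def classify_topic(query: str) -> str:
    """Toy keyword router standing in for the topic-identification method."""
    if "statute" in query or "contract" in query:
        return "law"
    return "medicine"

def boost(query: str, base_answer: str, confidence: float,
          threshold: float = 0.7) -> Tuple[str, bool]:
    """Phase-2-style gating: invoke an SME only for low-confidence inferences.

    Returns the (possibly boosted) answer and whether an SME was invoked.
    """
    if confidence >= threshold:
        return base_answer, False   # client model is confident; no SME call
    topic = classify_topic(query)
    sme = SME_REGISTRY.get(topic)
    if sme is None:
        return base_answer, False   # no curated expert for this topic
    return sme(query), True         # just-in-time expert-level response

answer, boosted = boost("Is this contract clause enforceable?",
                        "Probably?", confidence=0.4)
print(boosted)  # True: low confidence routed the query to the law SME
```

Because the gate fires only below the confidence threshold, high-confidence inferences skip the service entirely, which is how the design keeps the augmentation path off the critical latency budget for most queries.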

 

Note to Readers: This document, formerly titled "InferBoost," has been renamed Topic-Aware Inference Boost to improve technical clarity and to disambiguate the research from external websites currently using the "InferBoost" term. The underlying architecture, topic-identification methodology, and performance metrics remain unchanged.

 

Files

TopicAwareInferenceBoost.pdf (64.1 kB)
md5:d07d58b75433cfa1133c05e2c73e16e6

Additional details

Related works

Is new version of
Preprint: 10.5281/zenodo.17429009 (DOI)

Dates

Updated
2025-10-23
Title updated for technical clarity and disambiguation; no other changes.