Production-Ready AI Inference for Healthcare with Triton, FastAPI, and Kubernetes
Description
This work presents a production-ready AI inference architecture for healthcare and pharmaceutical applications, designed to address the stringent requirements of scalability, compliance, and reliability. The system integrates:
- FastAPI Gateway for authentication, request validation, and routing
- Optional NLP/CV Preprocessor as an independent Kubernetes microservice for PHI de-identification and multimodal data handling
- Triton Inference Server for serving ONNX/TorchScript models at scale
- Model Registry + CI/CD with GitHub Actions for automated deployment and model versioning
- Kubernetes (k8s.yaml, hpa.yaml, preprocessor.yaml) for deployment, scaling, and orchestration
- Observability with Prometheus + Grafana for monitoring latency, throughput, and failures
- Security & Compliance as outlined in SECURITY.md, including TLS, OAuth2/JWT, structured audit logs, and HIPAA-aligned controls
Key features include horizontal pod autoscaling, self-healing with readiness/liveness probes, rollback and promotion strategies for safe model lifecycle management, and support for both NLP (clinical notes) and CV (medical imaging) pipelines. The architecture is illustrated in architecture.png and is validated through modular YAML and curl snippets that demonstrate deployment and inference in real environments.
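The horizontal pod autoscaling mentioned above can be sketched as a standard Kubernetes HPA manifest. This fragment is an assumption in the spirit of the repository's hpa.yaml, not a copy of it; the deployment name, replica bounds, and CPU target are illustrative.

```yaml
# Illustrative HPA sketch (autoscaling/v2); values are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-inference
  minReplicas: 2          # keep capacity for failover
  maxReplicas: 10         # cap cost under load spikes
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```

In practice, inference workloads often scale better on a custom metric such as queue depth or request latency exported via Prometheus, but CPU utilization is the simplest starting point.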
This publication is intended to serve as a reference architecture for practitioners, researchers, and engineers deploying AI systems in sensitive clinical contexts. By releasing the design openly, it encourages reuse, adaptation, and further validation in production environments requiring both technical performance and regulatory compliance.
Files (282.3 kB)
architecture (1).png

| Name | MD5 | Size |
|---|---|---|
| | md5:523dfa88b95f9313e8675c5865332cc5 | 106.3 kB |
| | md5:a76d8b73180e5e2e9e3d1baa04ffafc2 | 176.1 kB |
Additional details
Software
- Repository URL
- https://huggingface.co/spaces/digopala/ai-inference-architecture-healthcare
- Programming language
- Python, YAML
- Development Status
- Active
References
- architecture.png – AI inference system architecture diagram. (2025). digopala AI Inference Healthcare Repo. https://huggingface.co/spaces/digopala/ai-inference-architecture-healthcare/blob/main/architecture.png
- hpa.yaml – Horizontal Pod Autoscaler configuration for AI inference. (2025). digopala AI Inference Healthcare Repo. https://huggingface.co/spaces/digopala/ai-inference-architecture-healthcare/blob/main/hpa.yaml
- k8s.yaml – Core Kubernetes Deployment Manifest. (2025). digopala AI Inference Healthcare Repo. https://huggingface.co/spaces/digopala/ai-inference-architecture-healthcare/blob/main/k8s.yaml
- preprocessor.yaml – Independent NLP/CV microservice configuration. (2025). digopala AI Inference Healthcare Repo. https://huggingface.co/spaces/digopala/ai-inference-architecture-healthcare/blob/main/preprocessor.yaml
- SECURITY.md – TLS, OAuth2/JWT, and audit log policies. (2025). digopala AI Inference Healthcare Repo. https://huggingface.co/spaces/digopala/ai-inference-architecture-healthcare/blob/main/SECURITY.md