Production-Ready AI Inference for Healthcare with Triton, FastAPI, and Kubernetes
Description
This work presents a production-ready AI inference architecture for healthcare and pharmaceutical applications, designed to address the stringent requirements of scalability, compliance, and reliability. The system integrates:
- FastAPI Gateway for authentication, request validation, and routing
- Optional NLP/CV Preprocessor as an independent Kubernetes microservice for PHI de-identification and multimodal data handling
- Triton Inference Server for serving ONNX/TorchScript models at scale
- Model Registry + CI/CD with GitHub Actions for automated deployment and model versioning
- Kubernetes (k8s.yaml, hpa.yaml, preprocessor.yaml) for deployment, scaling, and orchestration
- Observability with Prometheus + Grafana for monitoring latency, throughput, and failures
- Security & Compliance as outlined in SECURITY.md, including TLS, OAuth2/JWT, structured audit logs, and HIPAA-aligned controls
Key features include horizontal pod autoscaling, self-healing with readiness/liveness probes, rollback and promotion strategies for safe model lifecycle management, and support for both NLP (clinical notes) and CV (medical imaging) pipelines. The architecture is illustrated in architecture.png and is validated through modular YAML and curl snippets that demonstrate deployment and inference in real environments.
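The horizontal pod autoscaling mentioned above can be sketched as a standard Kubernetes HPA manifest. This fragment is an assumption in the spirit of the repository's hpa.yaml, not a copy of it; the deployment name, replica bounds, and CPU target are illustrative.

```yaml
# Illustrative HPA sketch (autoscaling/v2); values are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-inference
  minReplicas: 2          # keep capacity for failover
  maxReplicas: 10         # cap cost under load spikes
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```

In practice, inference workloads often scale better on a custom metric such as queue depth or request latency exported via Prometheus, but CPU utilization is the simplest starting point.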
This publication is intended to serve as a reference architecture for practitioners, researchers, and engineers deploying AI systems in sensitive clinical contexts. By releasing the design openly, it encourages reuse, adaptation, and further validation in production environments requiring both technical performance and regulatory compliance.
Files (282.3 kB)
architecture (1).png

| Name | MD5 | Size |
|---|---|---|
| | md5:523dfa88b95f9313e8675c5865332cc5 | 106.3 kB |
| | md5:a76d8b73180e5e2e9e3d1baa04ffafc2 | 176.1 kB |
Additional details
Software
- Repository URL
- https://huggingface.co/spaces/digopala/ai-inference-architecture-healthcare
- Programming language
- Python, YAML
- Development Status
- Active
References
- architecture.png – AI inference system architecture diagram. (2025). digopala AI Inference Healthcare Repo. https://huggingface.co/spaces/digopala/ai-inference-architecture-healthcare/blob/main/architecture.png
- hpa.yaml – Horizontal Pod Autoscaler configuration for AI inference. (2025). digopala AI Inference Healthcare Repo. https://huggingface.co/spaces/digopala/ai-inference-architecture-healthcare/blob/main/hpa.yaml
- k8s.yaml – Core Kubernetes Deployment Manifest. (2025). digopala AI Inference Healthcare Repo. https://huggingface.co/spaces/digopala/ai-inference-architecture-healthcare/blob/main/k8s.yaml
- preprocessor.yaml – Independent NLP/CV microservice configuration. (2025). digopala AI Inference Healthcare Repo. https://huggingface.co/spaces/digopala/ai-inference-architecture-healthcare/blob/main/preprocessor.yaml
- SECURITY.md – TLS, OAuth2/JWT, and audit log policies. (2025). digopala AI Inference Healthcare Repo. https://huggingface.co/spaces/digopala/ai-inference-architecture-healthcare/blob/main/SECURITY.md