RAG API Core Architecture
This page provides a comprehensive technical overview of the RAG API Core backend architecture, including its modular design, component interactions, configuration options, and development practices. It serves as the primary entry point for developers, architects, and technical teams working with or extending the system.
Platform Behavior
When handling a request, the platform gathers information from sources such as search indexes, databases, or files. This retrieval step usually dominates response time, since waiting on external data is the slowest part of the process; the rest of the pipeline is designed to process and return results efficiently once the data is available.
The architecture follows a modular, factory-based pattern that enables easy extension and maintenance. Key components include FastAPI for the web framework, Azure services for AI and storage, and a flexible orchestrator system for handling RAG and chat workflows. The system is designed for enterprise deployment with strong observability, security, and performance considerations.
High-Level Architecture
Architecture Components
- Client: External applications or users making API requests
- FastAPI App: The main web application framework handling HTTP requests and responses
- Router Factory: Creates and configures API routers for different features (RAG, Chat, Multimodal, etc.)
- Feature Routers: Modular routers handling specific API endpoints and request routing
- Orchestrator: Core business logic coordinator that manages the RAG/chat workflow
- Fetchers: Components that retrieve relevant knowledge from various sources (Azure AI Search, databases, etc.)
- Prompt Builder: Assembles prompts using templates and retrieved context
- Azure OpenAI: External AI service for generating responses
- Telemetry: Asynchronous logging and monitoring of requests and responses
- Final Response: Formatted output with answer, sources, and metadata
Data Flow
- Client sends request to FastAPI app
- Request routed through appropriate feature router
- Orchestrator coordinates the workflow:
- Fetchers retrieve relevant knowledge
- Prompt Builder creates the AI prompt
- Azure OpenAI generates the response
- Telemetry captures the interaction
- Final response returned to client
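The sketch below illustrates this flow in simplified form. It is a minimal, hypothetical example: the class and method names (`Orchestrator.handle`, `fetch`, `build`, `complete`, `save`) are assumptions for illustration, not the actual module API.

```python
# Minimal sketch of the data flow above; all names are illustrative
# assumptions, not the actual rag_api_core interfaces.
import asyncio


class Orchestrator:
    def __init__(self, fetchers, prompt_builder, llm_client, telemetry):
        self.fetchers = fetchers
        self.prompt_builder = prompt_builder
        self.llm_client = llm_client
        self.telemetry = telemetry

    async def handle(self, request: dict) -> dict:
        # 1. Fetchers retrieve relevant knowledge from the configured sources
        batches = await asyncio.gather(*(f.fetch(request["query"]) for f in self.fetchers))
        sources = [doc for batch in batches for doc in batch]

        # 2. Prompt Builder assembles the prompt from templates and retrieved context
        prompt = self.prompt_builder.build(request, sources)

        # 3. Azure OpenAI generates the response
        answer = await self.llm_client.complete(prompt)

        # 4. Telemetry is captured asynchronously so it never blocks the response
        asyncio.create_task(self.telemetry.save({"request": request, "answer": answer}))

        # 5. Final response returned to the client with answer, sources, and metadata
        return {"answer": answer, "sources": sources}
```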
Other Features
- Containerization: `Dockerfile` and `docker-compose.yml` included
- Config: environment variables and YAML files (see `docs/configuration/environment_variables.md`)
- OpenAPI: auto-generated via FastAPI, available at `/api/v2/openapi.json` and `/api/v2/docs`
- Health endpoints: `/api/health`, `/api/health/live`, `/api/health/ready`, `/api/health/check` (see Health & Monitoring)
Health & Monitoring
The platform provides several health endpoints for monitoring the status of all critical services and dependencies. These endpoints are designed for both automated monitoring and manual inspection:
Health Endpoints
- `/api/health`: Basic liveness check. Returns a simple status to confirm the API is running.
- `/api/health/live`: Liveness probe for container orchestrators (e.g., Kubernetes). Indicates if the service is up.
- `/api/health/ready`: Readiness probe. Checks if the service is ready to accept requests (e.g., all dependencies are available).
- `/api/health/check`: Deep health check. Performs comprehensive checks on all major dependencies (LLM, Azure Search, Storage, etc.) and returns detailed status and diagnostics.
- `/api/v2/health/check`: Health dashboard UI (v2). Visual dashboard for real-time status, manual refresh, and diagnostics.
- `/api/v2/health/service-health`: JSON summary used by the dashboard; accepts `?test_services=true|false`.
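As a quick illustration, the endpoints above can be probed with any HTTP client; the snippet below uses httpx and a placeholder base URL.

```python
# Example health probe using httpx; the base URL is a placeholder.
import httpx

BASE_URL = "https://your-rag-api.example.com"

with httpx.Client(timeout=10.0) as client:
    live = client.get(f"{BASE_URL}/api/health/live")
    deep = client.get(f"{BASE_URL}/api/health/check")
    print(live.status_code, deep.json().get("status"))
```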
Monitoring Features
- Automated Monitoring: Health endpoints are suitable for integration with monitoring tools, alerting systems, and container orchestrators.
- Manual Inspection: The health dashboard UI provides a human-friendly view for troubleshooting and diagnostics.
- Detailed Diagnostics: The deep health check returns detailed information about each service, including response times, error details, and configuration status.
- Role-Based Access: Health endpoints can be secured and exposed only to authorized users or systems.
Value for Developers & Operators
- Quickly identify if the API or any dependency is down or misconfigured
- Integrate with uptime monitoring and alerting tools
- Use the dashboard for real-time troubleshooting and support
- Understand which services are healthy, degraded, or failing
API Reference & Sidebar
- API Reference: Health Endpoints
- API Reference: OpenAPI/Swagger
- API Reference: Authentication
- API Reference: Observability
Observability & Monitoring
The system provides comprehensive observability features for production deployment and debugging:
Health Dashboard
- Endpoint: `/api/v2/health/check`
- JSON: `/api/v2/health/service-health` (append `?test_services=true` for deep mode)
- Features: Real-time health status, service availability, dependency checks
- Components Monitored: Database connections, Azure services, external APIs
- Update Frequency: Configurable intervals with caching
Log Viewer
- Endpoint: `/api/v2/logs/ui` (Kudu-based interface)
- Features: Filterable logs by level, time range, and component
- Integration: Azure App Service logging infrastructure
- Security: Access controlled through Azure authentication
Telemetry System
- Architecture: Asynchronous adapter pattern for event/trace export
- Data Captured: Request/response metrics, performance timing, error details
- Storage: Configurable backends (Azure Application Insights, custom databases)
- Impact: Non-blocking implementation to avoid affecting response latency
Health Endpoints
- Live Check (`/api/health/live`): Basic service availability
- Ready Check (`/api/health/ready`): Full dependency verification
- Deep Check (`/api/health/check`): Comprehensive system validation
- Response Format: JSON with detailed status and diagnostic information
Security
The architecture incorporates multiple layers of security designed for enterprise deployment:
Authentication & Authorization
- API Management Ready: Designed to work behind Azure API Management or similar gateways
- JWT Integration: Supports JWT token validation through external proxies
- Azure AD Integration: Leverages Azure Active Directory for identity management
- Managed Identity: Uses Azure managed identities for service-to-service authentication
Input Validation
- Pydantic v2: Comprehensive request/response validation with automatic error handling
- Type Safety: Strong typing throughout the application prevents type-related vulnerabilities
- Sanitization: Input sanitization for text, file uploads, and API parameters
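As a hedged sketch of what this validation layer looks like, the Pydantic v2 model below shows request validation with constrained fields and a sanitizing validator; the field names are illustrative, not the actual schemas in `schemas/`.

```python
# Illustrative Pydantic v2 request model; field names are assumptions,
# not the actual schemas shipped in rag_api_core/schemas.
from pydantic import BaseModel, Field, field_validator


class ChatRequest(BaseModel):
    query: str = Field(min_length=1, max_length=4000)
    top_k: int = Field(default=5, ge=1, le=50)
    session_id: str | None = None

    @field_validator("query")
    @classmethod
    def sanitize_query(cls, value: str) -> str:
        # Basic sanitization: trim whitespace and reject blank input
        value = value.strip()
        if not value:
            raise ValueError("query must not be blank")
        return value
```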
Data Protection
- Encryption: Data encrypted in transit and at rest using Azure standards
- Key Management: Azure Key Vault integration for secrets and certificates
- Access Control: Role-based access control for different API endpoints
Documentation Security
- Public/Private Separation: Sensitive documentation accessible only to authorized users
- Content Filtering: Dynamic content filtering based on user permissions
- Audit Logging: All documentation access logged for compliance
Development Workflow
The system supports efficient development practices with comprehensive tooling:
Local Development
- Launcher Scripts: Multiple launcher options in `/launchers` for different scenarios
- Uvicorn Integration: Direct FastAPI server execution for development
- Hot Reload: Automatic code reloading during development
- Environment Management: Isolated environments with virtualenv/conda support
Testing Strategy
- Unit Tests: Comprehensive pytest coverage in `/tests/unit`
- Integration Tests: End-to-end testing in `/tests/integration`
- Test Clients: Reusable test utilities for API validation
- CI/CD Integration: Automated testing in deployment pipelines
API Documentation
- OpenAPI Generation: Automatic API specification generation via FastAPI
- Interactive Docs: Swagger UI at `/api/v2/docs`
- Schema Documentation: Detailed request/response schemas
- Regeneration Script: `scripts/generate_openapi.py` for documentation updates
Code Quality
- Linting: Code quality checks and formatting standards
- Type Checking: Static type analysis with mypy
- Pre-commit Hooks: Automated quality checks before commits
- Code Coverage: Test coverage reporting and thresholds
Directory Structure
The codebase follows a modular organization that separates concerns and enables independent development:
rag_api_core/
├── ab_testing/ # A/B testing framework and experiment management
├── config/ # Configuration loading and validation
├── configs/ # YAML configuration files for different environments
├── endpoints/ # API endpoint definitions (v1, v2 versions)
├── factory/ # Factory patterns for app and router creation
├── schemas/ # Pydantic models for request/response validation
├── services/ # Business logic and external service integrations
├── static/ # Static assets (CSS, JS, images)
├── templates/ # Jinja2 templates for HTML responses
└── utils/ # Shared utilities and cross-cutting concerns
├── exception_handlers.py # Global error handling
├── health_checks.py # Health monitoring utilities
├── id_utils.py # ID generation and validation
├── index_manager_multi.py # Search index management
├── keyvault.py # Azure Key Vault integration
├── logging_security.py # Secure logging utilities
└── ...
Key Architectural Patterns
- Versioned APIs: Separate endpoint directories for API versioning
- Factory Pattern: Centralized creation of apps, routers, and services
- Modular Services: Independent service modules for different capabilities
- Configuration Management: Environment-based configuration with validation
Template Resolution Priority
The system uses a hierarchical template resolution system that provides flexibility while maintaining sensible defaults:
Resolution Order (Highest Priority First)
- Inline Templates: Templates provided directly in the API request payload
- File References: Templates loaded from the filesystem or Azure Blob Storage based on the `prompts_source` configuration
- PromptManager Cache: Cached default templates managed by the PromptManager service
- Built-in Fallbacks: Hardcoded fallback templates for guaranteed functionality
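A minimal sketch of this resolution order, assuming hypothetical loader objects (the real PromptManager and file-loading interfaces may differ):

```python
# Sketch of the template resolution hierarchy; the loader interfaces are
# hypothetical, only the priority order comes from this document.
def resolve_template(inline_template, file_loader, prompt_manager, builtin_fallback):
    # 1. Inline template supplied directly in the API request payload
    if inline_template:
        return inline_template
    # 2. File reference from filesystem or Azure Blob Storage (prompts_source config)
    from_file = file_loader.load()
    if from_file:
        return from_file
    # 3. Cached default template from the PromptManager service
    cached = prompt_manager.get_default()
    if cached:
        return cached
    # 4. Hardcoded built-in fallback guarantees a usable template
    return builtin_fallback
```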
Template Types
- System Templates: Instructions and context provided to the AI model
- Response Templates: Formatting templates for structuring AI responses
- User Templates: Custom templates provided by API consumers
Configuration
prompts:
source: "filesystem" # or "blob" for Azure Blob Storage
base_path: "/app/prompts"
cache_ttl_seconds: 300
Template Validation
- Syntax Checking: Jinja2 template compilation validation
- Variable Verification: Required variable presence checking
- Security Scanning: Prevention of dangerous template constructs
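The syntax and variable checks above can be implemented with the public Jinja2 API; the helper below is a hedged sketch (the `required_vars` argument and error format are assumptions).

```python
# Sketch of template validation using Jinja2's public API:
# Environment.parse for syntax checking and meta.find_undeclared_variables
# for verifying that required variables are referenced by the template.
from jinja2 import Environment, TemplateSyntaxError, meta


def validate_template(source: str, required_vars: set[str]) -> list[str]:
    env = Environment()
    try:
        ast = env.parse(source)  # raises TemplateSyntaxError on invalid Jinja2
    except TemplateSyntaxError as exc:
        return [f"syntax error at line {exc.lineno}: {exc.message}"]

    referenced = meta.find_undeclared_variables(ast)
    missing = required_vars - referenced
    return [f"missing required variables: {sorted(missing)}"] if missing else []
```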
Azure OpenAI Authentication Strategy
The system dynamically selects authentication methods for Azure OpenAI based on configuration and availability:
Authentication Decision Flow
At call time:
├── use_managed_identity: false AND api_key present?
│   ├── Yes → Use API key authentication (header: api-key)
│   └── No → Use managed identity authentication
└── Managed identity path: DefaultAzureCredential with scope https://cognitiveservices.azure.com/.default
Authentication Methods
- API Key Authentication
  - Header: `api-key: <your-key>`
  - Configuration: `AZURE_OPENAI_API_KEY` environment variable
  - Use Case: Development, testing, or when managed identity is not available
- Managed Identity Authentication
  - Token Acquisition: `DefaultAzureCredential` with the Cognitive Services scope
  - Identity Types: System-assigned or user-assigned managed identities
  - Azure Resources: App Service, Container Apps, Functions, AKS
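Putting the decision flow together, the sketch below selects between the two methods using the openai and azure-identity SDKs. The `AZURE_OPENAI_ENDPOINT` variable and the `api_version` value are assumptions; only `AZURE_OPENAI_API_KEY`, the `api-key` header, and the Cognitive Services scope come from this section.

```python
# Hedged sketch of the authentication decision flow; configuration names
# marked as assumptions may differ from the actual code.
import os

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI


def build_openai_client(use_managed_identity: bool) -> AzureOpenAI:
    endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]  # assumption: endpoint env var name
    api_key = os.environ.get("AZURE_OPENAI_API_KEY")

    if not use_managed_identity and api_key:
        # API key authentication (sent as the api-key header)
        return AzureOpenAI(azure_endpoint=endpoint, api_key=api_key, api_version="2024-02-01")

    # Managed identity: DefaultAzureCredential with the Cognitive Services scope
    token_provider = get_bearer_token_provider(
        DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
    )
    return AzureOpenAI(
        azure_endpoint=endpoint,
        azure_ad_token_provider=token_provider,
        api_version="2024-02-01",
    )
```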
Credential Chain (DefaultAzureCredential)
- Environment variables (`AZURE_CLIENT_ID`, `AZURE_CLIENT_SECRET`, `AZURE_TENANT_ID`)
- Managed identity
- Azure CLI authentication
- Azure PowerShell authentication
- Interactive browser authentication (development only)
Security Considerations
- Key Rotation: API keys should be rotated regularly
- Least Privilege: Managed identities should have minimal required permissions
- Network Security: Use private endpoints for Azure OpenAI when possible
- Audit Logging: Authentication attempts are logged for security monitoring
Telemetry & Feedback
The system captures comprehensive telemetry data for monitoring, debugging, and continuous improvement:
Telemetry Triggers
Each successful API response automatically triggers _save_response_telemetry() with:
- Neutral Rating: Default rating of 0 (can be updated via feedback endpoints)
- Response Data: Generated answer, source documents, template information
- Performance Metrics: Request timing, token counts, model information
- Context: User ID, session information, A/B test variants
Data Structure
{
"user_id": "user123",
"request_id": "req_456",
"timestamp": "2024-01-15T10:30:00Z",
"model": "gpt-4",
"tokens_used": 150,
"response_time_ms": 2500,
"sources_count": 3,
"rating": 0,
"experiment_id": "exp_789",
"template_version": "v2.1"
}
Feedback Integration
- Rating System: Users can provide feedback (1-5 stars) via dedicated endpoints
- Correlation: Feedback linked to original requests via correlation IDs
- Analytics: Aggregated feedback data for model and prompt improvement
Failure Handling
- Non-blocking: Telemetry failures never affect API responses
- Retry Logic: Failed telemetry submissions are queued for retry
- Graceful Degradation: System continues operating if telemetry backend is unavailable
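A minimal sketch of this non-blocking behaviour, assuming an adapter exposing an async `save(payload)` method (the retry queue in the real system may work differently):

```python
# Fire-and-forget telemetry sketch: failures are logged and retried with
# backoff, and the API response never waits on this work.
import asyncio
import logging

logger = logging.getLogger("telemetry")


async def _save_with_retry(adapter, payload: dict, attempts: int = 3) -> None:
    for attempt in range(1, attempts + 1):
        try:
            await adapter.save(payload)
            return
        except Exception as exc:  # telemetry errors must never reach the client
            logger.warning("telemetry save failed (%d/%d): %s", attempt, attempts, exc)
            await asyncio.sleep(2 ** attempt)  # simple exponential backoff


def save_response_telemetry(adapter, payload: dict) -> None:
    # Must be called from within the running event loop (e.g., a request handler)
    asyncio.create_task(_save_with_retry(adapter, payload))
```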
Privacy & Compliance
- Data Minimization: Only necessary data collected for operational purposes
- Retention Policies: Configurable data retention periods
- Anonymization: Personally identifiable information is hashed or removed
Health & Diagnostics
The system provides comprehensive health monitoring for production reliability:
Startup Diagnostics
- Deep Probe: 8-second timeout comprehensive system check during startup
- Status Reporting: Concise status summary with first error details
- Dependency Validation: Verifies all required services and connections
- Configuration Verification: Validates all required configuration parameters
Health Endpoints
- `/api/health/live`: Basic liveness check (service is running)
- `/api/health/ready`: Readiness check (service can handle requests)
- `/api/health/check`: Deep health check (all dependencies verified)
External Service Monitoring
- Azure Functions Proxy: Optional health checks for external function apps
- Caching: Health status caching to reduce external API calls
- Timeout Handling: Configurable timeouts for external health checks
Health Response Format
{
"status": "healthy",
"timestamp": "2024-01-15T10:30:00Z",
"version": "2.1.0",
"checks": {
"database": "healthy",
"azure_openai": "healthy",
"azure_search": "healthy"
},
"details": {
"uptime_seconds": 3600,
"memory_usage_mb": 256,
"active_connections": 15
}
}
Monitoring Integration
- Azure Application Insights: Automatic health metric collection
- Alerting: Configurable alerts for health status changes
- Dashboards: Real-time health visualization in Azure portal
Error Handling Layers
The system implements comprehensive error handling across multiple layers:
| Layer | Examples | HTTP Status | Strategy |
|---|---|---|---|
| Input Validation | Missing templates, invalid images, malformed requests | 400 Bad Request | Explicit validation with detailed error messages |
| Configuration | Missing LLM config, invalid connection strings | 500 Internal Server Error | Fail fast with clear configuration errors |
| Upstream Services | Azure OpenAI errors, search service failures | 502 Bad Gateway | Map external errors to appropriate HTTP status with truncated messages |
| Template Rendering | Jinja2 syntax errors, missing variables | 400 Bad Request | User-controllable errors with helpful guidance |
| Business Logic | Orchestrator failures, data processing errors | 500 Internal Server Error | Generic wrapper with correlation IDs for debugging |
| Infrastructure | Network timeouts, resource exhaustion | 503 Service Unavailable | Graceful degradation with retry mechanisms |
Error Response Format
{
"error": {
"code": "VALIDATION_ERROR",
"message": "Template validation failed",
"details": "Missing required field: 'system_prompt'",
"correlation_id": "req_123456789"
}
}
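A hedged sketch of a global exception handler that produces this envelope is shown below; the `RAGError` type, its fields, and the `x-correlation-id` header are illustrative assumptions.

```python
# Illustrative FastAPI exception handler emitting the error envelope above.
import uuid

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()


class RAGError(Exception):
    """Hypothetical application error carrying code, message, and HTTP status."""

    def __init__(self, code: str, message: str, details: str = "", status: int = 500):
        self.code, self.message, self.details, self.status = code, message, details, status


@app.exception_handler(RAGError)
async def rag_error_handler(request: Request, exc: RAGError) -> JSONResponse:
    # Reuse an incoming correlation header if present, otherwise mint a new ID
    correlation_id = request.headers.get("x-correlation-id", f"req_{uuid.uuid4().hex[:12]}")
    return JSONResponse(
        status_code=exc.status,
        content={
            "error": {
                "code": exc.code,
                "message": exc.message,
                "details": exc.details,
                "correlation_id": correlation_id,
            }
        },
    )
```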
Error Tracking
- Correlation IDs: Unique identifiers for request tracing
- Telemetry Integration: All errors logged with context and stack traces
- User-Friendly Messages: Sanitized error messages for client consumption
- Debug Information: Detailed errors available in development mode
Performance Considerations
- Retrieval latency dominates; tune `fetch_args` (top_k, filters)
- Template rendering is lightweight (Jinja instantiation per request)
- Multimodal data URLs inflate payload (~33% base64 overhead)
- PromptManager TTL reduces I/O for system/response templates
- Telemetry fire-and-forget avoids latency impact
Extensibility Points
The modular architecture provides multiple extension points for customization and enhancement:
Adding New Fetchers
- Interface: Implement the `BaseFetcher` abstract class
- Registration: Register with the orchestrator factory
- Data Sources: Support for databases, APIs, file systems, cloud storage
- Configuration: Environment-based configuration for new fetchers
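As an illustration, a new fetcher might look like the toy example below; the actual `BaseFetcher` interface is not shown in this document, so the `fetch` signature and document shape are assumptions.

```python
# Toy fetcher sketch; the real BaseFetcher interface and document schema
# may differ, this only illustrates the extension point.
class KeywordListFetcher:
    """Searches an in-memory document list by naive keyword overlap."""

    def __init__(self, documents: list[dict], top_k: int = 5):
        self.documents = documents  # each dict assumed to carry a "text" field
        self.top_k = top_k

    async def fetch(self, query: str) -> list[dict]:
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(doc["text"].lower().split())), doc)
            for doc in self.documents
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for score, doc in scored[: self.top_k] if score > 0]
```

Such a fetcher would then be registered with the orchestrator factory and configured through environment variables, as described above.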
New Endpoint Groups
- Router Pattern: Create a module returning an `APIRouter` instance
- Feature Flags: Gate new endpoints with configuration flags
- Versioning: Include in appropriate versioned router factory
- Documentation: Automatic OpenAPI integration
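A minimal sketch of a feature-gated router module; the flag name `ENABLE_SUMMARIES_API` and the route prefix are hypothetical.

```python
# Sketch of an endpoint-group module returning an APIRouter, gated by a
# hypothetical feature flag.
import os

from fastapi import APIRouter


def create_summaries_router() -> APIRouter:
    router = APIRouter(prefix="/summaries", tags=["summaries"])

    @router.get("/ping")
    async def ping() -> dict:
        return {"status": "ok"}

    return router


def include_if_enabled(parent: APIRouter) -> None:
    # The feature flag decides whether the new endpoint group is exposed at all
    if os.environ.get("ENABLE_SUMMARIES_API", "false").lower() == "true":
        parent.include_router(create_summaries_router())
```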
Telemetry Customization
- Adapter Pattern: Replace the telemetry adapter by implementing `.save(payload)`
- Backends: Support for Application Insights, DataDog, custom databases
- Data Format: Flexible payload structure for different monitoring systems
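Only the `.save(payload)` contract comes from this document; the sketch below swaps in a hypothetical JSONL file backend to show the adapter shape.

```python
# Hedged sketch of a replacement telemetry adapter; only the .save(payload)
# contract is taken from this document.
import json
from typing import Protocol


class TelemetryAdapter(Protocol):
    async def save(self, payload: dict) -> None: ...


class JsonlFileTelemetry:
    """Example backend appending telemetry events to a local JSONL file."""

    def __init__(self, path: str):
        self.path = path

    async def save(self, payload: dict) -> None:
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(payload) + "\n")
```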
A/B Testing Framework
- Experiment Manager: Implement real experiment management in factory stub
- Variant Selection: Dynamic routing based on user segments or percentages
- Metrics Collection: Integration with telemetry for experiment analytics
Authentication Providers
- Provider Interface: Pluggable authentication modules
- Token Validation: Custom JWT or OAuth2 implementations
- User Context: Integration with user management and authorization systems
Custom Orchestrators
- Workflow Logic: Replace or extend core orchestration logic
- Prompt Engineering: Custom prompt building and optimization
- Response Processing: Post-processing and formatting extensions
Adding Streaming Support (Future Enhancement)
The architecture is designed to support real-time streaming responses for enhanced user experience:
Implementation Strategy
- Insertion Point: Add streaming interface after orchestrator call completion
- Protocol Support: Server-Sent Events (SSE) and WebSocket compatibility
- Response Format: Chunked responses with partial content and metadata
Streaming Architecture
Client Request → FastAPI → Orchestrator → Streaming Interface
↓
Azure OpenAI (streaming)
↓
Token-by-token streaming
↓
SSE/WebSocket response
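A hedged sketch of what the streaming interface could look like with Server-Sent Events; the route path and the `stream_answer` generator are assumptions, not existing code.

```python
# Illustrative SSE streaming endpoint; the route and token source are
# assumptions used to show the shape of the future enhancement.
from collections.abc import AsyncIterator

from fastapi import APIRouter
from fastapi.responses import StreamingResponse

router = APIRouter()


async def stream_answer(query: str) -> AsyncIterator[str]:
    # In the real system this would relay tokens from Azure OpenAI's streaming API
    for token in ["Retrieval", "-augmented ", "generation ", "response."]:
        yield token


@router.get("/api/v2/chat/stream")
async def chat_stream(query: str) -> StreamingResponse:
    async def event_source() -> AsyncIterator[str]:
        async for token in stream_answer(query):
            yield f"data: {token}\n\n"  # SSE framing: one event per chunk
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_source(), media_type="text/event-stream")
```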
Key Considerations
- Telemetry Preservation: Complete telemetry saved after streaming finishes
- Error Handling: Streaming errors handled gracefully without breaking connections
- Backpressure: Client-controlled streaming rate to prevent overwhelming clients
- Compatibility: Maintain existing non-streaming endpoints for backward compatibility
Benefits
- Real-time Responses: Immediate display of AI-generated content
- Better UX: Progressive loading for long-form content
- Resource Efficiency: Reduced memory usage for large responses
- Analytics: Maintain full telemetry and feedback capabilities
Glossary
Core Components
- Orchestrator: The central coordinator that manages the entire RAG/chat workflow, including retrieval, prompt assembly, and AI model interaction
- Fetchers: Modular components responsible for retrieving relevant knowledge from various data sources (Azure AI Search, databases, JSON files, etc.)
- Prompt Builder: Service that assembles AI prompts using templates, retrieved context, and user input
- Router Factory: Factory pattern implementation that creates and configures API routers for different feature sets
API Concepts
- Flexible Endpoint: RAG/Chat API surface that allows template and configuration overrides for customization
- Feature Routers: Modular routers handling specific API endpoint groups (RAG, Chat, Multimodal, Management)
- Versioned APIs: Separate API versions (v1, v2) with independent routing and feature sets
Data & Metadata
- Sources (Metadata): Retrieved document fragments and context used for grounding AI responses
- Correlation ID: Unique identifier assigned to each request for tracing and debugging across system components
- User Context: Information about the requesting user, including identity, permissions, and session data
Infrastructure
- Managed Identity: Azure security feature allowing services to authenticate without explicit credentials
- A/B Testing: Experimental framework for testing different prompts, models, or configurations
- Telemetry: Asynchronous collection of operational data, performance metrics, and user interactions
Development
- Factory Pattern: Design pattern used for creating complex objects (apps, routers, services) with configuration
- Pydantic Models: Data validation and serialization using Python type hints
- Jinja2 Templates: Templating engine for dynamic prompt and response generation
Changelog
- 2025-08-26: Initial architecture documentation.