RAG API Core Architecture
This page provides a comprehensive technical overview of the RAG API Core backend architecture, including its modular design, component interactions, configuration options, and development practices. It serves as the primary entry point for developers, architects, and technical teams working with or extending the system.
Platform Behavior
When handling a request, the platform gathers information from sources such as search indexes, databases, or files. This retrieval step usually dominates response time, since waiting on external data is the slowest part of the process; the rest of the pipeline is designed to process and return results efficiently once the data is available.
The architecture follows a modular, factory-based pattern that enables easy extension and maintenance. Key components include FastAPI for the web framework, Azure services for AI and storage, and a flexible orchestrator system for handling RAG and chat workflows. The system is designed for enterprise deployment with strong observability, security, and performance considerations.
High-Level Architecture
Architecture Components
- Client: External applications or users making API requests
- FastAPI App: The main web application framework handling HTTP requests and responses
- Router Factory: Creates and configures API routers for different features (RAG, Chat, Multimodal, etc.)
- Feature Routers: Modular routers handling specific API endpoints and request routing
- Orchestrator: Core business logic coordinator that manages the RAG/chat workflow
- Fetchers: Components that retrieve relevant knowledge from various sources (Azure AI Search, databases, etc.)
- Prompt Builder: Assembles prompts using templates and retrieved context
- Azure OpenAI: External AI service for generating responses
- Telemetry: Asynchronous logging and monitoring of requests and responses
- Final Response: Formatted output with answer, sources, and metadata
Data Flow
- Client sends request to FastAPI app
- Request routed through appropriate feature router
- Orchestrator coordinates the workflow:
- Fetchers retrieve relevant knowledge
- Prompt Builder creates the AI prompt
- Azure OpenAI generates the response
- Telemetry captures the interaction
- Final response returned to client
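The sketch below illustrates this flow in simplified form. It is a minimal, hypothetical example: the class and method names (`Orchestrator.handle`, `fetch`, `build`, `complete`, `save`) are assumptions for illustration, not the actual module API.

```python
# Minimal sketch of the data flow above; all names are illustrative
# assumptions, not the actual rag_api_core interfaces.
import asyncio


class Orchestrator:
    def __init__(self, fetchers, prompt_builder, llm_client, telemetry):
        self.fetchers = fetchers
        self.prompt_builder = prompt_builder
        self.llm_client = llm_client
        self.telemetry = telemetry

    async def handle(self, request: dict) -> dict:
        # 1. Fetchers retrieve relevant knowledge from the configured sources
        batches = await asyncio.gather(*(f.fetch(request["query"]) for f in self.fetchers))
        sources = [doc for batch in batches for doc in batch]

        # 2. Prompt Builder assembles the prompt from templates and retrieved context
        prompt = self.prompt_builder.build(request, sources)

        # 3. Azure OpenAI generates the response
        answer = await self.llm_client.complete(prompt)

        # 4. Telemetry is captured asynchronously so it never blocks the response
        asyncio.create_task(self.telemetry.save({"request": request, "answer": answer}))

        # 5. Final response returned to the client with answer, sources, and metadata
        return {"answer": answer, "sources": sources}
```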
Other Features
- Containerization: `Dockerfile` and `docker-compose.yml` included
- Config: environment variables and YAML files (see `docs/configuration/environment_variables.md`)
- OpenAPI: auto-generated via FastAPI, available at `/api/v2/openapi.json` and `/api/v2/docs`
- Health endpoints: `/api/health`, `/api/health/live`, `/api/health/ready`, `/api/health/check` (see Health & Monitoring)
Health & Monitoring
The platform provides several health endpoints for monitoring the status of all critical services and dependencies. These endpoints are designed for both automated monitoring and manual inspection:
Health Endpoints
- `/api/health`: Basic liveness check. Returns a simple status to confirm the API is running.
- `/api/health/live`: Liveness probe for container orchestrators (e.g., Kubernetes). Indicates if the service is up.
- `/api/health/ready`: Readiness probe. Checks if the service is ready to accept requests (e.g., all dependencies are available).
- `/api/health/check`: Deep health check. Performs comprehensive checks on all major dependencies (LLM, Azure Search, Storage, etc.) and returns detailed status and diagnostics.
- `/api/v2/health/check`: Health dashboard UI (v2). Visual dashboard for real-time status, manual refresh, and diagnostics.
- `/api/v2/health/service-health`: JSON summary used by the dashboard; accepts `?test_services=true|false`.
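As a quick illustration, the endpoints above can be probed with any HTTP client; the snippet below uses httpx and a placeholder base URL.

```python
# Example health probe using httpx; the base URL is a placeholder.
import httpx

BASE_URL = "https://your-rag-api.example.com"

with httpx.Client(timeout=10.0) as client:
    live = client.get(f"{BASE_URL}/api/health/live")
    deep = client.get(f"{BASE_URL}/api/health/check")
    print(live.status_code, deep.json().get("status"))
```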
Monitoring Features
- Automated Monitoring: Health endpoints are suitable for integration with monitoring tools, alerting systems, and container orchestrators.
- Manual Inspection: The health dashboard UI provides a human-friendly view for troubleshooting and diagnostics.
- Detailed Diagnostics: The deep health check returns detailed information about each service, including response times, error details, and configuration status.
- Role-Based Access: Health endpoints can be secured and exposed only to authorized users or systems.
Value for Developers & Operators
- Quickly identify if the API or any dependency is down or misconfigured
- Integrate with uptime monitoring and alerting tools
- Use the dashboard for real-time troubleshooting and support
- Understand which services are healthy, degraded, or failing
API Reference & Sidebar
- API Reference: Health Endpoints
- API Reference: OpenAPI/Swagger
- API Reference: Authentication
- API Reference: Observability
Observability & Monitoring
The system provides comprehensive observability features for production deployment and debugging:
Health Dashboard
- Endpoint: `/api/v2/health/check`
- JSON: `/api/v2/health/service-health` (append `?test_services=true` for deep mode)
- Features: Real-time health status, service availability, dependency checks
- Components Monitored: Database connections, Azure services, external APIs
- Update Frequency: Configurable intervals with caching
Log Viewer
- Endpoint: `/api/v2/logs/ui` (Kudu-based interface)
- Features: Filterable logs by level, time range, and component
- Integration: Azure App Service logging infrastructure
- Security: Access controlled through Azure authentication
Telemetry System
- Architecture: Asynchronous adapter pattern for event/trace export
- Data Captured: Request/response metrics, performance timing, error details
- Storage: Configurable backends (Azure Application Insights, custom databases)
- Impact: Non-blocking implementation to avoid affecting response latency
Health Endpoints
- Live Check (`/api/health/live`): Basic service availability
- Ready Check (`/api/health/ready`): Full dependency verification
- Deep Check (`/api/health/check`): Comprehensive system validation
- Response Format: JSON with detailed status and diagnostic information
Security
The architecture incorporates multiple layers of security designed for enterprise deployment:
Authentication & Authorization
- API Management Ready: Designed to work behind Azure API Management or similar gateways
- JWT Integration: Supports JWT token validation through external proxies
- Azure AD Integration: Leverages Azure Active Directory for identity management
- Managed Identity: Uses Azure managed identities for service-to-service authentication
Input Validation
- Pydantic v2: Comprehensive request/response validation with automatic error handling
- Type Safety: Strong typing throughout the application prevents type-related vulnerabilities
- Sanitization: Input sanitization for text, file uploads, and API parameters
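As a hedged sketch of what this validation layer looks like, the Pydantic v2 model below shows request validation with constrained fields and a sanitizing validator; the field names are illustrative, not the actual schemas in `schemas/`.

```python
# Illustrative Pydantic v2 request model; field names are assumptions,
# not the actual schemas shipped in rag_api_core/schemas.
from pydantic import BaseModel, Field, field_validator


class ChatRequest(BaseModel):
    query: str = Field(min_length=1, max_length=4000)
    top_k: int = Field(default=5, ge=1, le=50)
    session_id: str | None = None

    @field_validator("query")
    @classmethod
    def sanitize_query(cls, value: str) -> str:
        # Basic sanitization: trim whitespace and reject blank input
        value = value.strip()
        if not value:
            raise ValueError("query must not be blank")
        return value
```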
Data Protection
- Encryption: Data encrypted in transit and at rest using Azure standards
- Key Management: Azure Key Vault integration for secrets and certificates
- Access Control: Role-based access control for different API endpoints
Documentation Security
- Public/Private Separation: Sensitive documentation accessible only to authorized users
- Content Filtering: Dynamic content filtering based on user permissions
- Audit Logging: All documentation access logged for compliance
Development Workflow
The system supports efficient development practices with comprehensive tooling:
Local Development
- Launcher Scripts: Multiple launcher options in `/launchers` for different scenarios
- Uvicorn Integration: Direct FastAPI server execution for development
- Hot Reload: Automatic code reloading during development
- Environment Management: Isolated environments with virtualenv/conda support
Testing Strategy
- Unit Tests: Comprehensive pytest coverage in `/tests/unit`
- Integration Tests: End-to-end testing in `/tests/integration`
- Test Clients: Reusable test utilities for API validation
- CI/CD Integration: Automated testing in deployment pipelines
API Documentation
- OpenAPI Generation: Automatic API specification generation via FastAPI
- Interactive Docs: Swagger UI at `/api/v2/docs`
- Schema Documentation: Detailed request/response schemas
- Regeneration Script: `scripts/generate_openapi.py` for documentation updates
Code Quality
- Linting: Code quality checks and formatting standards
- Type Checking: Static type analysis with mypy
- Pre-commit Hooks: Automated quality checks before commits
- Code Coverage: Test coverage reporting and thresholds
Directory Structure
The codebase follows a modular organization that separates concerns and enables independent development:
rag_api_core/
├── ab_testing/ # A/B testing framework and experiment management
├── config/ # Configuration loading and validation
├── configs/ # YAML configuration files for different environments
├── endpoints/ # API endpoint definitions (v1, v2 versions)
├── factory/ # Factory patterns for app and router creation
├── schemas/ # Pydantic models for request/response validation
├── services/ # Business logic and external service integrations
├── static/ # Static assets (CSS, JS, images)
├── templates/ # Jinja2 templates for HTML responses
└── utils/ # Shared utilities and cross-cutting concerns
├── exception_handlers.py # Global error handling
├── health_checks.py # Health monitoring utilities
├── id_utils.py # ID generation and validation
├── index_manager_multi.py # Search index management
├── keyvault.py # Azure Key Vault integration
├── logging_security.py # Secure logging utilities
└── ...
Key Architectural Patterns
- Versioned APIs: Separate endpoint directories for API versioning
- Factory Pattern: Centralized creation of apps, routers, and services
- Modular Services: Independent service modules for different capabilities
- Configuration Management: Environment-based configuration with validation
Template Resolution Priority
The system uses a hierarchical template resolution system that provides flexibility while maintaining sensible defaults:
Resolution Order (Highest Priority First)
- Inline Templates: Templates provided directly in the API request payload
- File References: Templates loaded from the filesystem or Azure Blob Storage based on the `prompts_source` configuration
- PromptManager Cache: Cached default templates managed by the PromptManager service
- Built-in Fallbacks: Hardcoded fallback templates for guaranteed functionality
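A minimal sketch of this resolution order, assuming hypothetical loader objects (the real PromptManager and file-loading interfaces may differ):

```python
# Sketch of the template resolution hierarchy; the loader interfaces are
# hypothetical, only the priority order comes from this document.
def resolve_template(inline_template, file_loader, prompt_manager, builtin_fallback):
    # 1. Inline template supplied directly in the API request payload
    if inline_template:
        return inline_template
    # 2. File reference from filesystem or Azure Blob Storage (prompts_source config)
    from_file = file_loader.load()
    if from_file:
        return from_file
    # 3. Cached default template from the PromptManager service
    cached = prompt_manager.get_default()
    if cached:
        return cached
    # 4. Hardcoded built-in fallback guarantees a usable template
    return builtin_fallback
```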
Template Types
- System Templates: Instructions and context provided to the AI model
- Response Templates: Formatting templates for structuring AI responses
- User Templates: Custom templates provided by API consumers
Configuration
prompts:
source: "filesystem" # or "blob" for Azure Blob Storage
base_path: "/app/prompts"
cache_ttl_seconds: 300
Template Validation
- Syntax Checking: Jinja2 template compilation validation
- Variable Verification: Required variable presence checking
- Security Scanning: Prevention of dangerous template constructs
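The syntax and variable checks above can be implemented with the public Jinja2 API; the helper below is a hedged sketch (the `required_vars` argument and error format are assumptions).

```python
# Sketch of template validation using Jinja2's public API:
# Environment.parse for syntax checking and meta.find_undeclared_variables
# for verifying that required variables are referenced by the template.
from jinja2 import Environment, TemplateSyntaxError, meta


def validate_template(source: str, required_vars: set[str]) -> list[str]:
    env = Environment()
    try:
        ast = env.parse(source)  # raises TemplateSyntaxError on invalid Jinja2
    except TemplateSyntaxError as exc:
        return [f"syntax error at line {exc.lineno}: {exc.message}"]

    referenced = meta.find_undeclared_variables(ast)
    missing = required_vars - referenced
    return [f"missing required variables: {sorted(missing)}"] if missing else []
```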
Azure OpenAI Authentication Strategy
The system dynamically selects authentication methods for Azure OpenAI based on configuration and availability:
Authentication Decision Flow
At call time:
├── use_managed_identity: false AND api_key present?
│   ├── Yes → Use API key authentication (header: api-key)
│   └── No → Use managed identity authentication
└── Managed identity path: DefaultAzureCredential with scope https://cognitiveservices.azure.com/.default
Authentication Methods
- API Key Authentication
  - Header: `api-key: <your-key>`
  - Configuration: `AZURE_OPENAI_API_KEY` environment variable
  - Use Case: Development, testing, or when managed identity is not available
- Managed Identity Authentication
  - Token Acquisition: `DefaultAzureCredential` with the Cognitive Services scope
  - Identity Types: System-assigned or user-assigned managed identities
  - Azure Resources: App Service, Container Apps, Functions, AKS
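Putting the decision flow together, the sketch below selects between the two methods using the openai and azure-identity SDKs. The `AZURE_OPENAI_ENDPOINT` variable and the `api_version` value are assumptions; only `AZURE_OPENAI_API_KEY`, the `api-key` header, and the Cognitive Services scope come from this section.

```python
# Hedged sketch of the authentication decision flow; configuration names
# marked as assumptions may differ from the actual code.
import os

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI


def build_openai_client(use_managed_identity: bool) -> AzureOpenAI:
    endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]  # assumption: endpoint env var name
    api_key = os.environ.get("AZURE_OPENAI_API_KEY")

    if not use_managed_identity and api_key:
        # API key authentication (sent as the api-key header)
        return AzureOpenAI(azure_endpoint=endpoint, api_key=api_key, api_version="2024-02-01")

    # Managed identity: DefaultAzureCredential with the Cognitive Services scope
    token_provider = get_bearer_token_provider(
        DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
    )
    return AzureOpenAI(
        azure_endpoint=endpoint,
        azure_ad_token_provider=token_provider,
        api_version="2024-02-01",
    )
```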
Credential Chain (DefaultAzureCredential)
- Environment variables (`AZURE_CLIENT_ID`, `AZURE_CLIENT_SECRET`, `AZURE_TENANT_ID`)
- Managed identity
- Azure CLI authentication
- Azure PowerShell authentication
- Interactive browser authentication (development only)
Security Considerations
- Key Rotation: API keys should be rotated regularly
- Least Privilege: Managed identities should have minimal required permissions
- Network Security: Use private endpoints for Azure OpenAI when possible
- Audit Logging: Authentication attempts are logged for security monitoring
Telemetry & Feedback
The system captures comprehensive telemetry data for monitoring, debugging, and continuous improvement:
Telemetry Triggers
Each successful API response automatically triggers _save_response_telemetry() with:
- Neutral Rating: Default rating of 0 (can be updated via feedback endpoints)
- Response Data: Generated answer, source documents, template information
- Performance Metrics: Request timing, token counts, model information
- Context: User ID, session information, A/B test variants
Data Structure
{
"user_id": "user123",
"request_id": "req_456",
"timestamp": "2024-01-15T10:30:00Z",
"model": "gpt-4",
"tokens_used": 150,
"response_time_ms": 2500,
"sources_count": 3,
"rating": 0,
"experiment_id": "exp_789",
"template_version": "v2.1"
}
Feedback Integration
- Rating System: Users can provide feedback (1-5 stars) via dedicated endpoints
- Correlation: Feedback linked to original requests via correlation IDs
- Analytics: Aggregated feedback data for model and prompt improvement
Failure Handling
- Non-blocking: Telemetry failures never affect API responses
- Retry Logic: Failed telemetry submissions are queued for retry
- Graceful Degradation: System continues operating if telemetry backend is unavailable
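A minimal sketch of this non-blocking behaviour, assuming an adapter exposing an async `save(payload)` method (the retry queue in the real system may work differently):

```python
# Fire-and-forget telemetry sketch: failures are logged and retried with
# backoff, and the API response never waits on this work.
import asyncio
import logging

logger = logging.getLogger("telemetry")


async def _save_with_retry(adapter, payload: dict, attempts: int = 3) -> None:
    for attempt in range(1, attempts + 1):
        try:
            await adapter.save(payload)
            return
        except Exception as exc:  # telemetry errors must never reach the client
            logger.warning("telemetry save failed (%d/%d): %s", attempt, attempts, exc)
            await asyncio.sleep(2 ** attempt)  # simple exponential backoff


def save_response_telemetry(adapter, payload: dict) -> None:
    # Must be called from within the running event loop (e.g., a request handler)
    asyncio.create_task(_save_with_retry(adapter, payload))
```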
Privacy & Compliance
- Data Minimization: Only necessary data collected for operational purposes
- Retention Policies: Configurable data retention periods
- Anonymization: Personally identifiable information is hashed or removed
Health & Diagnostics
The system provides comprehensive health monitoring for production reliability:
Startup Diagnostics
- Deep Probe: 8-second timeout comprehensive system check during startup
- Status Reporting: Concise status summary with first error details
- Dependency Validation: Verifies all required services and connections
- Configuration Verification: Validates all required configuration parameters
Health Endpoints
- `/api/health/live`: Basic liveness check (service is running)
- `/api/health/ready`: Readiness check (service can handle requests)
- `/api/health/check`: Deep health check (all dependencies verified)
External Service Monitoring
- Azure Functions Proxy: Optional health checks for external function apps
- Caching: Health status caching to reduce external API calls
- Timeout Handling: Configurable timeouts for external health checks
Health Response Format
{
"status": "healthy",
"timestamp": "2024-01-15T10:30:00Z",
"version": "2.1.0",
"checks": {
"database": "healthy",
"azure_openai": "healthy",
"azure_search": "healthy"
},
"details": {
"uptime_seconds": 3600,
"memory_usage_mb": 256,
"active_connections": 15
}
}
Monitoring Integration
- Azure Application Insights: Automatic health metric collection
- Alerting: Configurable alerts for health status changes
- Dashboards: Real-time health visualization in Azure portal
Error Handling Layers
The system implements comprehensive error handling across multiple layers:
| Layer | Examples | HTTP Status | Strategy |
|---|---|---|---|
| Input Validation | Missing templates, invalid images, malformed requests | 400 Bad Request | Explicit validation with detailed error messages |
| Configuration | Missing LLM config, invalid connection strings | 500 Internal Server Error | Fail fast with clear configuration errors |
| Upstream Services | Azure OpenAI errors, search service failures | 502 Bad Gateway | Map external errors to appropriate HTTP status with truncated messages |
| Template Rendering | Jinja2 syntax errors, missing variables | 400 Bad Request | User-controllable errors with helpful guidance |
| Business Logic | Orchestrator failures, data processing errors | 500 Internal Server Error | Generic wrapper with correlation IDs for debugging |
| Infrastructure | Network timeouts, resource exhaustion | 503 Service Unavailable | Graceful degradation with retry mechanisms |
Error Response Format
{
"error": {
"code": "VALIDATION_ERROR",
"message": "Template validation failed",
"details": "Missing required field: 'system_prompt'",
"correlation_id": "req_123456789"
}
}
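A hedged sketch of a global exception handler that produces this envelope is shown below; the `RAGError` type, its fields, and the `x-correlation-id` header are illustrative assumptions.

```python
# Illustrative FastAPI exception handler emitting the error envelope above.
import uuid

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()


class RAGError(Exception):
    """Hypothetical application error carrying code, message, and HTTP status."""

    def __init__(self, code: str, message: str, details: str = "", status: int = 500):
        self.code, self.message, self.details, self.status = code, message, details, status


@app.exception_handler(RAGError)
async def rag_error_handler(request: Request, exc: RAGError) -> JSONResponse:
    # Reuse an incoming correlation header if present, otherwise mint a new ID
    correlation_id = request.headers.get("x-correlation-id", f"req_{uuid.uuid4().hex[:12]}")
    return JSONResponse(
        status_code=exc.status,
        content={
            "error": {
                "code": exc.code,
                "message": exc.message,
                "details": exc.details,
                "correlation_id": correlation_id,
            }
        },
    )
```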
Error Tracking
- Correlation IDs: Unique identifiers for request tracing
- Telemetry Integration: All errors logged with context and stack traces
- User-Friendly Messages: Sanitized error messages for client consumption
- Debug Information: Detailed errors available in development mode
Performance Considerations
- Retrieval latency dominates; tune `fetch_args` (top_k, filters)
- Template rendering is lightweight (Jinja instantiation per request)
- Multimodal data URLs inflate payload (~33% base64 overhead)
- PromptManager TTL reduces I/O for system/response templates
- Telemetry fire-and-forget avoids latency impact
Extensibility Points
The modular architecture provides multiple extension points for customization and enhancement:
Adding New Fetchers
- Interface: Implement the `BaseFetcher` abstract class
- Registration: Register with the orchestrator factory
- Data Sources: Support for databases, APIs, file systems, cloud storage
- Configuration: Environment-based configuration for new fetchers
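As an illustration, a new fetcher might look like the toy example below; the actual `BaseFetcher` interface is not shown in this document, so the `fetch` signature and document shape are assumptions.

```python
# Toy fetcher sketch; the real BaseFetcher interface and document schema
# may differ, this only illustrates the extension point.
class KeywordListFetcher:
    """Searches an in-memory document list by naive keyword overlap."""

    def __init__(self, documents: list[dict], top_k: int = 5):
        self.documents = documents  # each dict assumed to carry a "text" field
        self.top_k = top_k

    async def fetch(self, query: str) -> list[dict]:
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(doc["text"].lower().split())), doc)
            for doc in self.documents
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for score, doc in scored[: self.top_k] if score > 0]
```

Such a fetcher would then be registered with the orchestrator factory and configured through environment variables, as described above.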
New Endpoint Groups
- Router Pattern: Create a module returning an `APIRouter` instance
- Feature Flags: Gate new endpoints with configuration flags
- Versioning: Include in appropriate versioned router factory
- Documentation: Automatic OpenAPI integration
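A minimal sketch of a feature-gated router module; the flag name `ENABLE_SUMMARIES_API` and the route prefix are hypothetical.

```python
# Sketch of an endpoint-group module returning an APIRouter, gated by a
# hypothetical feature flag.
import os

from fastapi import APIRouter


def create_summaries_router() -> APIRouter:
    router = APIRouter(prefix="/summaries", tags=["summaries"])

    @router.get("/ping")
    async def ping() -> dict:
        return {"status": "ok"}

    return router


def include_if_enabled(parent: APIRouter) -> None:
    # The feature flag decides whether the new endpoint group is exposed at all
    if os.environ.get("ENABLE_SUMMARIES_API", "false").lower() == "true":
        parent.include_router(create_summaries_router())
```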
Telemetry Customization
- Adapter Pattern: Replace the telemetry adapter by implementing `.save(payload)`
- Backends: Support for Application Insights, DataDog, custom databases
- Data Format: Flexible payload structure for different monitoring systems
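Only the `.save(payload)` contract comes from this document; the sketch below swaps in a hypothetical JSONL file backend to show the adapter shape.

```python
# Hedged sketch of a replacement telemetry adapter; only the .save(payload)
# contract is taken from this document.
import json
from typing import Protocol


class TelemetryAdapter(Protocol):
    async def save(self, payload: dict) -> None: ...


class JsonlFileTelemetry:
    """Example backend appending telemetry events to a local JSONL file."""

    def __init__(self, path: str):
        self.path = path

    async def save(self, payload: dict) -> None:
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(payload) + "\n")
```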
A/B Testing Framework
- Experiment Manager: Implement real experiment management in factory stub
- Variant Selection: Dynamic routing based on user segments or percentages
- Metrics Collection: Integration with telemetry for experiment analytics
Authentication Providers
- Provider Interface: Pluggable authentication modules
- Token Validation: Custom JWT or OAuth2 implementations
- User Context: Integration with user management and authorization systems
Custom Orchestrators
- Workflow Logic: Replace or extend core orchestration logic
- Prompt Engineering: Custom prompt building and optimization
- Response Processing: Post-processing and formatting extensions
Adding Streaming Support (Future Enhancement)
The architecture is designed to support real-time streaming responses for enhanced user experience:
Implementation Strategy
- Insertion Point: Add streaming interface after orchestrator call completion
- Protocol Support: Server-Sent Events (SSE) and WebSocket compatibility
- Response Format: Chunked responses with partial content and metadata
Streaming Architecture
Client Request → FastAPI → Orchestrator → Streaming Interface
↓
Azure OpenAI (streaming)
↓
Token-by-token streaming
↓
SSE/WebSocket response
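A hedged sketch of what the streaming interface could look like with Server-Sent Events; the route path and the `stream_answer` generator are assumptions, not existing code.

```python
# Illustrative SSE streaming endpoint; the route and token source are
# assumptions used to show the shape of the future enhancement.
from collections.abc import AsyncIterator

from fastapi import APIRouter
from fastapi.responses import StreamingResponse

router = APIRouter()


async def stream_answer(query: str) -> AsyncIterator[str]:
    # In the real system this would relay tokens from Azure OpenAI's streaming API
    for token in ["Retrieval", "-augmented ", "generation ", "response."]:
        yield token


@router.get("/api/v2/chat/stream")
async def chat_stream(query: str) -> StreamingResponse:
    async def event_source() -> AsyncIterator[str]:
        async for token in stream_answer(query):
            yield f"data: {token}\n\n"  # SSE framing: one event per chunk
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_source(), media_type="text/event-stream")
```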
Key Considerations
- Telemetry Preservation: Complete telemetry saved after streaming finishes
- Error Handling: Streaming errors handled gracefully without breaking connections
- Backpressure: Client-controlled streaming rate to prevent overwhelming clients
- Compatibility: Maintain existing non-streaming endpoints for backward compatibility
Benefits
- Real-time Responses: Immediate display of AI-generated content
- Better UX: Progressive loading for long-form content
- Resource Efficiency: Reduced memory usage for large responses
- Analytics: Maintain full telemetry and feedback capabilities
Glossary
Core Components
- Orchestrator: The central coordinator that manages the entire RAG/chat workflow, including retrieval, prompt assembly, and AI model interaction
- Fetchers: Modular components responsible for retrieving relevant knowledge from various data sources (Azure AI Search, databases, JSON files, etc.)
- Prompt Builder: Service that assembles AI prompts using templates, retrieved context, and user input
- Router Factory: Factory pattern implementation that creates and configures API routers for different feature sets
API Concepts
- Flexible Endpoint: RAG/Chat API surface that allows template and configuration overrides for customization
- Feature Routers: Modular routers handling specific API endpoint groups (RAG, Chat, Multimodal, Management)
- Versioned APIs: Separate API versions (v1, v2) with independent routing and feature sets
Data & Metadata
- Sources (Metadata): Retrieved document fragments and context used for grounding AI responses
- Correlation ID: Unique identifier assigned to each request for tracing and debugging across system components
- User Context: Information about the requesting user, including identity, permissions, and session data
Infrastructure
- Managed Identity: Azure security feature allowing services to authenticate without explicit credentials
- A/B Testing: Experimental framework for testing different prompts, models, or configurations
- Telemetry: Asynchronous collection of operational data, performance metrics, and user interactions
Development
- Factory Pattern: Design pattern used for creating complex objects (apps, routers, services) with configuration
- Pydantic Models: Data validation and serialization using Python type hints
- Jinja2 Templates: Templating engine for dynamic prompt and response generation
Changelog
- 2025-08-26: Initial architecture documentation.