RAG Service

Release Tracker

This page tracks recent releases and changes to the API.

Version History

  • 2025-10-02 (v2.4.0):

    • Fix: Resolved critical LiteLLM integration issues that prevented all non-Azure model providers from initializing. The root cause was an UnboundLocalError in the verbose-mode setup function, which attempted to modify the global _litellm_verbose_enabled variable without a global declaration, combined with deprecated API usage that called litellm.set_verbose() as a function instead of assigning it as a property.
    • Solution: Added a global _litellm_verbose_enabled declaration to the nested function scope; updated verbose activation to use the litellm.set_verbose = True property assignment; enhanced exception logging to capture import failures with full stack traces (see the sketch below this entry).
    • Provider Support: All 17 configured providers now validate successfully on startup (1 Azure + 1 OpenAI + 15 OpenRouter models), enabling full multi-provider routing capabilities including OpenAI, Anthropic Claude, Google Gemini, and Meta Llama via OpenRouter.
    • Model Corrections: Fixed 4 incorrect OpenRouter model IDs preventing Claude variants from working: anthropic/claude-4-sonnet → anthropic/claude-sonnet-4, anthropic/claude-4.1-opus → anthropic/claude-opus-4.1, anthropic/claude-4-opus → anthropic/claude-opus-4, anthropic/claude-3-sonnet → anthropic/claude-3.5-sonnet (upgraded from deprecated 3.0). Model names validated against the live OpenRouter API.
    • Environment Variables: Added support for both LITELLM_VERBOSE (legacy) and LITELLM_LOG=DEBUG (recommended) for debug logging, suppressing deprecation warnings when using new format.
    • UI Enhancement: Changed default temperature in testing page from 0.7 to 0.0 for more deterministic test results, better suited for debugging and validation workflows.
    • Business Impact: Restored multi-provider diversity for cost optimization, redundancy/fallback, model comparison, and access to latest Claude/GPT/Gemini variants. Enables provider-specific routing strategies and specialized model selection per use case.
    • Developer: Updated rag_shared/core/models/llm_router.py with scope declarations and property-based verbose mode; corrected model IDs in resources/configs/development_config.yml; adjusted temperature fallback in rag_api_core/templates/v2/unified_rag_test.html.
    • Migration: Restart server to apply fixes. All 17 models should show validate=ok in startup logs. Optionally switch to LITELLM_LOG=DEBUG to suppress deprecation warnings.
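A minimal sketch of the verbose-mode fix described in this entry, assuming the scope/property pattern from the bullets above; the function names and environment-variable parsing details are illustrative, not the exact llm_router.py code:

```python
import logging
import os

import litellm  # LiteLLM SDK

_litellm_verbose_enabled = False  # module-level flag, as described in the fix


def _configure_litellm_logging() -> None:
    """Enable LiteLLM debug output based on environment variables."""

    def _enable_verbose() -> None:
        # Without this declaration, assigning to the module-level flag from
        # this nested scope raised UnboundLocalError (the original bug).
        global _litellm_verbose_enabled
        if _litellm_verbose_enabled:
            return
        # Property assignment replaces the deprecated litellm.set_verbose() call.
        litellm.set_verbose = True
        _litellm_verbose_enabled = True

    try:
        # LITELLM_LOG=DEBUG is the recommended switch; LITELLM_VERBOSE is legacy.
        if (os.getenv("LITELLM_LOG", "").upper() == "DEBUG"
                or os.getenv("LITELLM_VERBOSE", "").lower() in {"1", "true", "yes"}):
            _enable_verbose()
    except Exception:
        # Capture failures with a full stack trace instead of swallowing them.
        logging.getLogger(__name__).exception("Failed to enable LiteLLM verbose mode")
```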
  • 2025-10-02 (v2.3.0):

    • Feature: Real-time streaming responses for /flexible_rag and /flexible_chat endpoints. Clients can now receive tokens as they're generated by the LLM, providing better user experience for long responses.
    • Implementation: Single endpoint handles both streaming and non-streaming via stream: bool parameter in request body. No separate streaming endpoints needed.
    • Protocol: Server-Sent Events (SSE) with the text/event-stream media type. Each token is sent as a data: {"token": "..."} event; the final event includes metadata (see the client sketch below).
    • Backend: Leverages existing streaming infrastructure in llm_router.py (Azure and LiteLLM streaming already implemented). Queue-based token capture bridges callback mechanism with async generator.
    • UI: Unified RAG test page now includes "Enable Streaming" toggle. Tokens display in real-time using EventSource API.
    • Performance: First token typically arrives 100-500ms faster than non-streaming. Similar total time (streaming overhead minimal). Tokens streamed directly without server-side accumulation.
    • Compatibility: Works with Azure OpenAI, OpenAI, Anthropic, Google Gemini, Ollama, and other LiteLLM-supported providers.
    • Testing: Added test_streaming.ps1 PowerShell script for integration testing. Validates both streaming and non-streaming modes.
    • Documentation: Created comprehensive docs/STREAMING_IMPLEMENTATION.md with architecture details, usage examples, client code samples (JavaScript, Python, PowerShell), error handling, and troubleshooting guide.
    • Business Impact: Improved user experience for interactive applications. Real-time feedback reduces perceived latency. Enables chat-like experiences and progress indicators.
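A hedged Python client sketch for the streaming mode introduced here. The endpoint path, the stream flag, and the data: {"token": "..."} event shape come from this entry; the base URL, request-body field names, and the absence of authentication headers are placeholder assumptions:

```python
import json

import requests  # any HTTP client with streaming support works

BASE_URL = "http://localhost:8000"  # placeholder host


def stream_flexible_rag(query: str) -> str:
    """Consume data: {"token": "..."} SSE events from the /flexible_rag endpoint."""
    body = {"query": query, "stream": True}  # "query" field name is illustrative
    tokens = []
    with requests.post(f"{BASE_URL}/flexible_rag", json=body, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if not line or not line.startswith("data: "):
                continue  # skip blank SSE separators
            event = json.loads(line[len("data: "):])
            if "token" in event:
                tokens.append(event["token"])  # tokens arrive as they are generated
            else:
                print("final event metadata:", event)
    return "".join(tokens)


if __name__ == "__main__":
    print(stream_flexible_rag("Summarize the latest release."))
```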
  • 2025-10-02 (v2.2.1):

    • Fix: Resolved critical stale alias bug where model metadata persisted incorrect alias values across requests. When switching between models (e.g., default → OpenRouter → default), response metadata now correctly reflects the actual model used instead of showing stale values from previous requests.
    • Root Cause: The LLM router's _last_model_info instance variable persisted across requests. For default-model requests (no override), the orchestrator called generate() without passing model_id, causing the alias to be set to None. Subsequent requests would then incorrectly inherit the alias from the previous override.
    • Solution: Implemented an intelligent alias injection system that distinguishes between default-model and override scenarios (sketched below):
      • Default model: Injects {"llm": "default"} into model_override dict, ensuring alias is passed through orchestrator to generate() call
      • Override model: Preserves original override, letting LLM router correctly populate metadata from the override's llm parameter
      • Post-orchestrator: Only updates alias for default model case (when no override present) to maintain consistency
    • Metadata Consistency: All three metadata locations now show correct values:
      • response_metadata.model_used.alias: Correctly shows "default" or override alias
      • response_metadata.timing_breakdown.model_call.alias: Matches model_used (when present)
      • metadata[0].alias: Orchestrator-captured metadata now includes correct alias for both default and override scenarios
    • Testing: Created comprehensive test_model_switching.ps1 that validates model switching behavior across 5 scenarios: default → override → different override → back to default → consecutive default. All metadata fields verified for consistency.
    • Business Impact: Accurate observability and debugging. Users can now trust that metadata correctly identifies which model processed each request, essential for cost tracking, performance analysis, and model comparison workflows.
    • Developer: Three-part fix: (1) Inject default alias into model_override when no override present, (2) Skip post-orchestrator alias override when explicit override used, (3) Add debug logging to trace alias resolution flow.
    • Migration: No action required. Metadata will automatically show correct values after server restart. Existing monitoring tools will see more accurate model usage tracking.
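A simplified sketch of the alias-injection behavior described in this entry; the function names and dict shapes are illustrative, not the actual orchestrator code:

```python
from typing import Any, Dict, Optional


def resolve_model_override(model_override: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Make sure an alias always reaches generate(), even for the default model."""
    if not model_override:
        # Default-model case: inject {"llm": "default"} so the orchestrator
        # forwards an explicit alias instead of leaving stale router state.
        return {"llm": "default"}
    # Override case: preserve the caller's dict so the router populates
    # metadata from the override's "llm" parameter.
    return model_override


def finalize_alias(metadata: Dict[str, Any], model_override: Optional[Dict[str, Any]]) -> None:
    """Post-orchestrator step: only patch the alias when no explicit override was used."""
    if not model_override:
        metadata.setdefault("model_used", {})["alias"] = "default"
```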
  • 2025-10-01 (v2.2.0):

    • Features: Model-specific LLM parameters now correctly load from configuration and override global defaults. Each model's params section (max_tokens, temperature, top_p, frequency_penalty, presence_penalty, n, reasoning_effort, logit_bias, stream) is properly extracted and applied.
    • Fix: Corrected critical bug where MultiProviderLLM router ignored per-model configuration settings, causing all models to use global defaults regardless of their individual configurations.
    • Parameter Hierarchy: Implemented proper three-tier precedence: global defaults (e.g., llms.defaults.params.max_tokens: 2000) < model-specific params (e.g., oai_gpt4o.params.max_tokens: 6000) < request-time overrides (e.g., generate(max_tokens=100)); see the merge sketch below.
    • Developer: Enhanced generate() method to load model entry from config after route resolution, extract valid API parameters from model's params block, merge with proper precedence, and filter out routing-specific keys (model_id, provider, deployment, api_base_url, azure_endpoint, api_version, use_managed_identity, api_key) to prevent API errors.
    • Testing: Added test_token_debug.py showing parameter flow from config → router → API with debug logging confirming correct values at each stage. Test demonstrates that oai_gpt4o with config max_tokens: 6000 now sends 6000 to the API instead of the global default 2000.
    • Documentation: Created comprehensive INVESTIGATION_REPORT.md documenting root cause analysis (router only loaded global defaults at init, never read model-specific params), fix implementation (post-route parameter loading), verification (debug test showing before/after API calls), and impact (all 10 models now properly configured).
    • Business Impact: Enables proper per-model tuning strategies. Models can have different token limits for different use cases (e.g., summary model with lower tokens, detailed analysis model with higher tokens). Temperature and creativity controls work as configured per model instead of being universally applied.
    • Provider Behavior: Testing revealed provider-specific minimum token enforcement: OpenAI respects low limits but finishes sentences gracefully (~16 tokens for 10-token request), while OpenRouter providers (Claude, GPT-5) enforce higher minimums (~280-400 tokens) regardless of requested max_tokens, indicating API-level safeguards against unusably short responses.
    • Migration: No action required. Existing configurations will automatically benefit from correct parameter loading. Recommended: Audit model-specific params sections to ensure they reflect intended settings, as previously ignored values will now take effect.
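A small sketch of the three-tier precedence and routing-key filtering described above; the helper name and example values other than those quoted in the entry are illustrative:

```python
from typing import Any, Dict

# Routing-only keys filtered out before the provider API call (per this entry).
ROUTING_KEYS = {
    "model_id", "provider", "deployment", "api_base_url",
    "azure_endpoint", "api_version", "use_managed_identity", "api_key",
}


def build_llm_params(
    global_defaults: Dict[str, Any],
    model_params: Dict[str, Any],
    request_overrides: Dict[str, Any],
) -> Dict[str, Any]:
    """Merge with precedence: global defaults < model-specific params < request overrides."""
    merged = {**global_defaults, **model_params, **request_overrides}
    return {k: v for k, v in merged.items() if k not in ROUTING_KEYS}


# Mirrors the changelog example: global max_tokens 2000, oai_gpt4o raises it to 6000,
# and a request-time override of 100 wins over both; routing keys never reach the API.
params = build_llm_params(
    {"max_tokens": 2000, "temperature": 0.7},
    {"max_tokens": 6000, "deployment": "gpt-4o"},
    {"max_tokens": 100},
)
assert params == {"max_tokens": 100, "temperature": 0.7}
```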
  • 2025-09-25 (v2.1.9):

    • Features: Saving the "Update current default" forms now versions the prompt and immediately promotes that version as the active default—no separate Make Default click required.
    • Developer: POST /api/v2/prompts accepts system_prompt_make_default=on / response_template_make_default=on; the service returns the saved manifest entry, flips current_id, and redirects with saved=true&default_set=true (example below).
    • Docs: Public and private prompt manager guides updated with business/developer guidance plus curl/Python examples for the auto-default workflow.
    • Migration: None required; existing prompts remain untouched. Operators can continue using explicit action=make_default_prompt if desired.
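A hedged example of the auto-default save flow. The endpoint and the *_make_default=on flags come from this entry; the prompt form-field name and base URL are assumptions:

```python
import requests

BASE_URL = "http://localhost:8000"  # placeholder

# Save a new system prompt version and promote it to the active default in one call.
resp = requests.post(
    f"{BASE_URL}/api/v2/prompts",
    data={
        "system_prompt": "You are a helpful assistant.",  # form-field name is an assumption
        "system_prompt_make_default": "on",               # documented auto-default flag
    },
    allow_redirects=False,
)
# On success the service redirects with saved=true&default_set=true.
print(resp.status_code, resp.headers.get("Location"))
```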
  • 2025-09-25 (v2.1.8):

    • Features: Prompt repository rows and the selected prompt drawer now expose a Make default action; the backend accepts POST /api/v2/prompts with action=make_default_prompt to switch the manifest’s current record.
    • UX: Prompt previews render inside a dedicated bordered panel, preserve whitespace, and load immediately on first click (no second-click retry).
    • Docs: Updated landing, public prompt guidance, and private prompt manager docs with business/developer examples for the new workflow.
    • Migration: None required.
  • 2025-09-24 (v2.1.7):

    • Features: GET /api/v2/unified-test?json=true now returns the active model, LLM parameter defaults, the configured system prompt, and prompt metadata for programmatic bootstraps (example below).
    • UX: Unified Test UI renders immediately with skeleton states before hydrating from the new JSON endpoint, reducing perceived load times.
    • Observability: Metrics dashboards and external tools can consume the JSON payload instead of scraping HTML to keep in sync with the built-in UI.
    • Docs: Added public docs covering the bootstrap JSON shape plus curl/Python examples for the unified test endpoint.
    • Migration: No required action. Clients may adopt the JSON bootstrap to prefill forms or cache defaults.
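A minimal example of consuming the bootstrap JSON; the base URL is a placeholder and key names beyond those listed in the entry are not assumed:

```python
import requests

BASE_URL = "http://localhost:8000"  # placeholder

resp = requests.get(f"{BASE_URL}/api/v2/unified-test", params={"json": "true"})
resp.raise_for_status()
bootstrap = resp.json()

# The payload carries the active model, LLM parameter defaults, the configured
# system prompt, and prompt metadata; exact key names may differ.
for key, value in bootstrap.items():
    print(f"{key}: {value}")
```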
  • 2025-09-24 (v2.1.6):

    • Features: Settings dry-run mode (POST /api/v2/settings?dry_run=1) returns a JSON preview of inferred type changes without persisting; a new diagnostics endpoint, GET /api/v2/settings/blob-status, reports blob connectivity and container existence; flexible-chat now surfaces response_metadata.reasoning_tokens (multi-key fallback) matching flexible-rag (examples below).
    • Reliability: Auto-creation attempt for missing blob container during config save; graceful logging if creation fails.
    • Internal: Extracted centralized form value coercion (numbers, bools, list[str]) reducing Pydantic serialization warnings; structured JSON log lines ({"event":"config_change",...}) emitted for each applied mutation.
    • Tests: Added unit test test_coerce_config_value.py validating coercion matrix; enhanced integration tests for reasoning tokens fallback and effort matrix.
    • Docs: Updated landing & tracker to reflect dry-run, blob-status endpoint, reasoning token parity, and structured logging.
    • Migration: No action required. Optional adoption: clients may call dry-run before applying bulk parameter changes.
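Hedged examples of the dry-run preview and the blob-status diagnostics; the base URL and the form field in the dry-run body are illustrative:

```python
import requests

BASE_URL = "http://localhost:8000"  # placeholder

# Preview inferred type changes without persisting anything.
preview = requests.post(
    f"{BASE_URL}/api/v2/settings",
    params={"dry_run": 1},
    data={"max_tokens": "4000"},  # form field name is illustrative
)
print(preview.json())

# Check blob connectivity and container existence before a real save.
status = requests.get(f"{BASE_URL}/api/v2/settings/blob-status")
print(status.json())
```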
  • 2025-09-23 (v2.1.5):

    • Features: Added response_metadata.reasoning_tokens (when providers return reasoning token usage) and nested token_usage.detail.reasoning (example below). The models startup cache now includes a supports_reasoning flag.
    • Docs: Updated Chat, Multimodal, Flexible RAG and Unified Test UI docs to mention reasoning token reporting.
    • Internal: Normalized reasoning effort default application and safe extraction from varying provider usage payload shapes.
    • Migration: No action required. Clients may optionally read the new reasoning_tokens field.
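A minimal example of reading the optional reasoning-token fields; the base URL and request body are placeholders, and absent fields simply mean the provider did not report reasoning usage:

```python
import requests

BASE_URL = "http://localhost:8000"  # placeholder

# Any flexible endpoint response; the request body field is illustrative.
payload = requests.post(f"{BASE_URL}/flexible_rag", json={"query": "hello"}).json()

meta = payload.get("response_metadata", {})
reasoning_tokens = meta.get("reasoning_tokens")  # present only when the provider reports it
detail = meta.get("token_usage", {}).get("detail", {}).get("reasoning")
print("reasoning tokens:", reasoning_tokens, "| detail.reasoning:", detail)
```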
  • 2025-09-20 (v2.1.0):

    • Features: Multi-completions end-to-end with answers[] in responses; answer mirrors answers[0]. Effective n precedence standardized: override_config.n > override_config.params.n > config.llms.defaults.params.n (see the precedence sketch below).
    • UX: Unified Test UI dark mode overhaul using theme variables; removed hardcoded light colors; instant tooltips with simplified Top P explanation.
    • Models: Startup prewarm and cache for models/validation/pings; /api/v2/models leverages cache; non-model keys (e.g., defaults) filtered from listings; deep validation shows "Ping n/a" when appropriate.
    • Settings: Added LLM Parameters side card with debounced auto-save to /api/v2/settings (manual Save preserved); POST redirect fix; GET settings path corrected; storage fallback to filesystem when Blob container missing. UI surfaces saved_source/save_error.
    • Endpoints: Flexible form now includes multimodal inline (Multimodal tab removed). Send Request button wiring fixed with unit test.
    • Docs: Public and private docs updated to reflect answers[], parameter reflection via response_metadata.parameters (optional), and n-precedence. Added new integration test covering top_p, frequency_penalty, presence_penalty, repetition_penalty, and stop.
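A small sketch of the standardized n precedence from this entry; the dict shapes are illustrative:

```python
from typing import Any, Dict, Optional


def effective_n(override_config: Dict[str, Any], config: Dict[str, Any]) -> Optional[int]:
    """override_config.n > override_config.params.n > config.llms.defaults.params.n."""
    if override_config.get("n") is not None:
        return override_config["n"]
    params_n = override_config.get("params", {}).get("n")
    if params_n is not None:
        return params_n
    return config.get("llms", {}).get("defaults", {}).get("params", {}).get("n")


cfg = {"llms": {"defaults": {"params": {"n": 1}}}}
assert effective_n({"n": 3, "params": {"n": 2}}, cfg) == 3  # top-level override wins
assert effective_n({"params": {"n": 2}}, cfg) == 2          # then params.n
assert effective_n({}, cfg) == 1                            # then the global default
```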
  • 2025-09-20: Unified Test page simplified to a single Flexible form. The separate "Multimodal" tab was removed; multimodal is now an optional section inside the Flexible form. The UI auto-selects -mm endpoints when images are included.

  • 2025-09-19: Unified Test page adds an LLM Parameters side panel with debounced auto-save (~800ms) to POST /api/v2/settings. Manual “Save Parameters” remains available. If the page URL includes subscription-key, it’s forwarded on save.
  • 2025-09-19: Flexible endpoints now surface response_metadata.token_usage, response_metadata.reasoning_effort, and for multimodal also response_metadata.reasoning_effective (requested vs. sent, provider, sanitized). Multimodal endpoints align timing shape with regular endpoints and always include timings.orchestrator and a retrieval block. Health docs restored with detailed per-service steps; Chat docs updated for model selection, reasoning controls, and skip_knowledge_base behavior.
  • 2025-09-16: Knowledge stats: added file_type_breakdown and index_breakdown fields; UI shows both charts. Recent ingestion list now wraps long filenames so timestamps remain visible.
  • 2025-09-15: Added POST /api/v2/search/indexes/clear/{alias_or_name} to delete all documents in a selected index (accepts alias or concrete name).
  • 2025-09-15: /api/v2/knowledge/search-file now returns one row per file (not per chunk), supports substring/partial filename search, and the UI displays: Index, Source File, File Type, Chunk Count, and Latest Ingestion.
  • 2025-09-15: Ingestion Table UI now supports multi-column sorting for filename and timestamp columns. Sorting uses the other column as a tiebreaker for stable, intuitive ordering.
  • 2025-09-14: Added support for generating and downloading fine-tuning files for OpenAI models.
  • 2025-09-12: First V2 release (initial public release).
