Release Tracker
This page tracks recent releases and changes to the API.
Version History
- 2025-10-02 (v2.4.0):
  - Fix: Resolved critical LiteLLM integration issues preventing all 17 non-Azure model providers from initializing. Root cause was an `UnboundLocalError` in the verbose mode setup function attempting to modify the global `_litellm_verbose_enabled` variable without a proper scope declaration, combined with deprecated API usage calling `litellm.set_verbose()` as a function instead of a property assignment.
  - Solution: Added a `global _litellm_verbose_enabled` declaration to the nested function scope; updated verbose activation to use the `litellm.set_verbose = True` property assignment; enhanced exception logging to capture import failures with full stack traces. (See the sketch after this entry.)
  - Provider Support: All 17 configured providers now validate successfully on startup (1 Azure + 1 OpenAI + 15 OpenRouter models), enabling full multi-provider routing capabilities including OpenAI, Anthropic Claude, Google Gemini, and Meta Llama via OpenRouter.
  - Model Corrections: Fixed 4 incorrect OpenRouter model IDs preventing Claude variants from working: `anthropic/claude-4-sonnet` → `anthropic/claude-sonnet-4`, `anthropic/claude-4.1-opus` → `anthropic/claude-opus-4.1`, `anthropic/claude-4-opus` → `anthropic/claude-opus-4`, `anthropic/claude-3-sonnet` → `anthropic/claude-3.5-sonnet` (upgraded from the deprecated 3.0). Model names validated against the live OpenRouter API.
  - Environment Variables: Added support for both `LITELLM_VERBOSE` (legacy) and `LITELLM_LOG=DEBUG` (recommended) for debug logging; the new format suppresses deprecation warnings.
  - UI Enhancement: Changed the default temperature in the testing page from 0.7 to 0.0 for more deterministic test results, better suited to debugging and validation workflows.
  - Business Impact: Restored multi-provider diversity for cost optimization, redundancy/fallback, model comparison, and access to the latest Claude/GPT/Gemini variants. Enables provider-specific routing strategies and specialized model selection per use case.
  - Developer: Updated `rag_shared/core/models/llm_router.py` with scope declarations and property-based verbose mode; corrected model IDs in `resources/configs/development_config.yml`; adjusted the temperature fallback in `rag_api_core/templates/v2/unified_rag_test.html`.
  - Migration: Restart the server to apply the fixes. All 17 models should show `validate=ok` in the startup logs. Optionally switch to `LITELLM_LOG=DEBUG` to suppress deprecation warnings.
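
  For reference, a minimal sketch of the corrected verbose-mode setup. The flag name, the `global` declaration, and the property-style `litellm.set_verbose = True` assignment come from the notes above; the function body and environment checks are illustrative only.

  ```python
  import os
  import litellm

  _litellm_verbose_enabled = False  # module-level flag tracked by the router

  def _enable_litellm_verbose_if_requested() -> None:
      # Without this declaration, reading the flag below would raise
      # UnboundLocalError, because the assignment at the end makes the
      # name local to this function.
      global _litellm_verbose_enabled

      if _litellm_verbose_enabled:
          return
      if os.getenv("LITELLM_LOG", "").upper() == "DEBUG" or os.getenv("LITELLM_VERBOSE"):
          litellm.set_verbose = True  # property assignment, not litellm.set_verbose()
          _litellm_verbose_enabled = True
  ```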
- 2025-10-02 (v2.3.0):
  - Feature: Real-time streaming responses for the `/flexible_rag` and `/flexible_chat` endpoints. Clients can now receive tokens as they're generated by the LLM, providing a better user experience for long responses.
  - Implementation: A single endpoint handles both streaming and non-streaming via a `stream: bool` parameter in the request body. No separate streaming endpoints needed.
  - Protocol: Server-Sent Events (SSE) with the `text/event-stream` media type. Each token is sent as a `data: {"token": "..."}` event; the final event includes metadata. (See the client sketch after this entry.)
  - Backend: Leverages the existing streaming infrastructure in `llm_router.py` (Azure and LiteLLM streaming already implemented). Queue-based token capture bridges the callback mechanism with an async generator.
  - UI: The unified RAG test page now includes an "Enable Streaming" toggle. Tokens display in real time using the EventSource API.
  - Performance: The first token typically arrives 100-500 ms faster than non-streaming. Total time is similar (streaming overhead is minimal). Tokens are streamed directly without server-side accumulation.
  - Compatibility: Works with Azure OpenAI, OpenAI, Anthropic, Google Gemini, Ollama, and other LiteLLM-supported providers.
  - Testing: Added the `test_streaming.ps1` PowerShell script for integration testing. Validates both streaming and non-streaming modes.
  - Documentation: Created a comprehensive `docs/STREAMING_IMPLEMENTATION.md` with architecture details, usage examples, client code samples (JavaScript, Python, PowerShell), error handling, and a troubleshooting guide.
  - Business Impact: Improved user experience for interactive applications. Real-time feedback reduces perceived latency. Enables chat-like experiences and progress indicators.
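
  A minimal Python client sketch for the protocol described above. The base URL and the `query` field in the request body are assumptions about the deployment and request schema; `stream: true` and the `data: {"token": "..."}` event shape come from the notes.

  ```python
  import json
  import requests

  resp = requests.post(
      "http://localhost:8000/flexible_chat",  # assumed base URL
      json={"query": "Summarize the latest release notes.", "stream": True},
      stream=True,  # keep the connection open and read the SSE stream incrementally
  )

  for line in resp.iter_lines(decode_unicode=True):
      if not line or not line.startswith("data: "):
          continue  # skip blank keep-alive lines between SSE events
      event = json.loads(line[len("data: "):])
      if "token" in event:
          print(event["token"], end="", flush=True)  # tokens arrive as they are generated
      else:
          print("\n--- final event (metadata) ---")
          print(event)
  ```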
- 2025-10-02 (v2.2.1):
  - Fix: Resolved a critical stale alias bug where model metadata persisted incorrect alias values across requests. When switching between models (e.g., default → OpenRouter → default), response metadata now correctly reflects the actual model used instead of showing stale values from previous requests.
  - Root Cause: The LLM router's `_last_model_info` instance variable persisted across requests. For default model requests (no override), the orchestrator called `generate()` without passing `model_id`, causing the alias to be set to `None`. Subsequent requests would incorrectly inherit the alias from a previous override.
  - Solution: Implemented an alias injection system that distinguishes between default model and override scenarios (sketched after this entry):
    - Default model: injects `{"llm": "default"}` into the model_override dict, ensuring the alias is passed through the orchestrator to the `generate()` call
    - Override model: preserves the original override, letting the LLM router correctly populate metadata from the override's `llm` parameter
    - Post-orchestrator: only updates the alias for the default model case (when no override is present) to maintain consistency
  - Metadata Consistency: All three metadata locations now show correct values:
    - `response_metadata.model_used.alias`: correctly shows "default" or the override alias
    - `response_metadata.timing_breakdown.model_call.alias`: matches `model_used` (when present)
    - `metadata[0].alias`: orchestrator-captured metadata now includes the correct alias for both default and override scenarios
  - Testing: Created a comprehensive `test_model_switching.ps1` that validates model switching behavior across 5 scenarios: default → override → different override → back to default → consecutive default. All metadata fields are verified for consistency.
  - Business Impact: Accurate observability and debugging. Users can now trust that metadata correctly identifies which model processed each request, which is essential for cost tracking, performance analysis, and model comparison workflows.
  - Developer: Three-part fix: (1) inject the default alias into model_override when no override is present, (2) skip the post-orchestrator alias override when an explicit override is used, (3) add debug logging to trace alias resolution flow.
  - Migration: No action required. Metadata will automatically show correct values after a server restart. Existing monitoring tools will see more accurate model usage tracking.
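
  A minimal sketch of that default-vs-override decision. The function name and return shape are hypothetical; the `{"llm": "default"}` injection and the pass-through behaviour come from the notes above.

  ```python
  from typing import Optional, Tuple

  def prepare_model_override(model_override: Optional[dict]) -> Tuple[dict, bool]:
      """Return the override to forward to the orchestrator and whether the default model is in use."""
      if not model_override:
          # Default-model request: inject an explicit alias so generate() never
          # inherits a stale alias left over from a previous override request.
          return {"llm": "default"}, True
      # Explicit override: pass it through untouched; the LLM router populates
      # metadata from the override's "llm" parameter.
      return dict(model_override), False
  ```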
- 2025-10-01 (v2.2.0):
  - Features: Model-specific LLM parameters now correctly load from configuration and override global defaults. Each model's `params` section (max_tokens, temperature, top_p, frequency_penalty, presence_penalty, n, reasoning_effort, logit_bias, stream) is properly extracted and applied.
  - Fix: Corrected a critical bug where the `MultiProviderLLM` router ignored per-model configuration settings, causing all models to use global defaults regardless of their individual configurations.
  - Parameter Hierarchy: Implemented proper three-tier precedence: global defaults (e.g., `llms.defaults.params.max_tokens: 2000`) < model-specific params (e.g., `oai_gpt4o.params.max_tokens: 6000`) < request-time overrides (e.g., `generate(max_tokens=100)`). (See the merge sketch after this entry.)
  - Developer: Enhanced the `generate()` method to load the model entry from config after route resolution, extract valid API parameters from the model's `params` block, merge with proper precedence, and filter out routing-specific keys (model_id, provider, deployment, api_base_url, azure_endpoint, api_version, use_managed_identity, api_key) to prevent API errors.
  - Testing: Added `test_token_debug.py`, showing parameter flow from config → router → API with debug logging confirming correct values at each stage. The test demonstrates that `oai_gpt4o` with config `max_tokens: 6000` now sends 6000 to the API instead of the global default 2000.
  - Documentation: Created a comprehensive `INVESTIGATION_REPORT.md` documenting the root cause analysis (the router only loaded global defaults at init and never read model-specific params), the fix implementation (post-route parameter loading), verification (a debug test showing before/after API calls), and impact (all 10 models now properly configured).
  - Business Impact: Enables proper per-model tuning strategies. Models can have different token limits for different use cases (e.g., a summary model with lower tokens, a detailed analysis model with higher tokens). Temperature and creativity controls work as configured per model instead of being universally applied.
  - Provider Behavior: Testing revealed provider-specific minimum token enforcement: OpenAI respects low limits but finishes sentences gracefully (~16 tokens for a 10-token request), while OpenRouter providers (Claude, GPT-5) enforce higher minimums (~280-400 tokens) regardless of the requested max_tokens, indicating API-level safeguards against unusably short responses.
  - Migration: No action required. Existing configurations will automatically benefit from correct parameter loading. Recommended: audit model-specific `params` sections to ensure they reflect intended settings, as previously ignored values will now take effect.
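
  A minimal sketch of the precedence merge described above, assuming plain dicts for the config entries. The helper name is hypothetical; the precedence order and the filtered routing keys come from the notes.

  ```python
  ROUTING_KEYS = {
      "model_id", "provider", "deployment", "api_base_url", "azure_endpoint",
      "api_version", "use_managed_identity", "api_key",
  }

  def resolve_params(global_defaults: dict, model_entry: dict, overrides: dict) -> dict:
      """Merge params with precedence: global defaults < model params < request overrides."""
      merged = dict(global_defaults)                      # lowest precedence
      merged.update(model_entry.get("params", {}))        # per-model config wins over defaults
      merged.update({k: v for k, v in overrides.items() if v is not None})  # request wins over all
      return {k: v for k, v in merged.items() if k not in ROUTING_KEYS}     # never send routing keys

  # Example from the notes: oai_gpt4o with params.max_tokens: 6000
  params = resolve_params(
      {"max_tokens": 2000, "temperature": 0.7},
      {"params": {"max_tokens": 6000}},
      {"temperature": 0.0},
  )
  assert params == {"max_tokens": 6000, "temperature": 0.0}
  ```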
- 2025-09-25 (v2.1.9):
  - Features: Saving the "Update current default" forms now versions the prompt and immediately promotes that version as the active default; no separate Make Default click is required.
  - Developer: `POST /api/v2/prompts` accepts `system_prompt_make_default=on` / `response_template_make_default=on`; the service returns the saved manifest entry, flips `current_id`, and redirects with `saved=true&default_set=true`. (See the example after this entry.)
  - Docs: Public and private prompt manager guides updated with business/developer guidance plus curl/Python examples for the auto-default workflow.
  - Migration: None required; existing prompts remain untouched. Operators can continue using the explicit `action=make_default_prompt` if desired.
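
  A hedged Python example of the auto-default save. The base URL and the prompt text field name are assumptions about the form schema; the `system_prompt_make_default=on` flag and the `saved=true&default_set=true` redirect come from the notes.

  ```python
  import requests

  resp = requests.post(
      "http://localhost:8000/api/v2/prompts",  # assumed base URL
      data={
          "system_prompt": "You are a concise assistant.",  # assumed field name
          "system_prompt_make_default": "on",               # version and promote in one call
      },
      allow_redirects=False,  # inspect the redirect target instead of following it
  )
  print(resp.status_code, resp.headers.get("Location"))  # expect ...saved=true&default_set=true
  ```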
- 2025-09-25 (v2.1.8):
  - Features: Prompt repository rows and the selected prompt drawer now expose a Make default action; the backend accepts `POST /api/v2/prompts` with `action=make_default_prompt` to switch the manifest's current record. (See the example after this entry.)
  - UX: Prompt previews render inside a dedicated bordered panel, preserve whitespace, and load immediately on first click (no second-click retry).
  - Docs: Updated landing, public prompt guidance, and private prompt manager docs with business/developer examples for the new workflow.
  - Migration: None required.
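
  A hedged sketch of the explicit action. The base URL and the prompt identifier field name are assumptions; only the `action=make_default_prompt` value comes from the notes.

  ```python
  import requests

  resp = requests.post(
      "http://localhost:8000/api/v2/prompts",  # assumed base URL
      data={
          "action": "make_default_prompt",
          "prompt_id": "<existing-prompt-id>",  # hypothetical field name for the target record
      },
  )
  resp.raise_for_status()
  ```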
- 2025-09-24 (v2.1.7):
  - Features: `GET /api/v2/unified-test?json=true` now returns the active model, LLM parameter defaults, the configured system prompt, and prompt metadata for programmatic bootstraps. (See the example after this entry.)
  - UX: The Unified Test UI renders immediately with skeleton states before hydrating from the new JSON endpoint, reducing perceived load times.
  - Observability: Metrics dashboards and external tools can consume the JSON payload instead of scraping HTML to stay in sync with the built-in UI.
  - Docs: Added public docs covering the bootstrap JSON shape plus curl/Python examples for the unified test endpoint.
  - Migration: No action required. Clients may adopt the JSON bootstrap to prefill forms or cache defaults.
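
  An example of fetching the bootstrap payload described above. The base URL is an assumption, and the top-level field names are not spelled out here because the documented bootstrap schema is authoritative.

  ```python
  import requests

  resp = requests.get(
      "http://localhost:8000/api/v2/unified-test",  # assumed base URL
      params={"json": "true"},
  )
  resp.raise_for_status()
  bootstrap = resp.json()
  # Active model, parameter defaults, system prompt, and prompt metadata per the notes.
  print(sorted(bootstrap.keys()))
  ```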
- 2025-09-24 (v2.1.6):
  - Features: Settings dry-run mode (`POST /api/v2/settings?dry_run=1`) returns a JSON preview of inferred type changes without persisting; new diagnostics endpoint `GET /api/v2/settings/blob-status` for blob connectivity/container existence; flexible-chat now surfaces `response_metadata.reasoning_tokens` (multi-key fallback), matching flexible-rag. (See the example after this entry.)
  - Reliability: Auto-creation attempt for a missing blob container during config save; graceful logging if creation fails.
  - Internal: Extracted centralized form value coercion (numbers, bools, list[str]), reducing Pydantic serialization warnings; structured JSON log lines (`{"event":"config_change",...}`) emitted for each applied mutation.
  - Tests: Added unit test `test_coerce_config_value.py` validating the coercion matrix; enhanced integration tests for the reasoning tokens fallback and effort matrix.
  - Docs: Updated landing & tracker to reflect dry-run, the blob-status endpoint, reasoning token parity, and structured logging.
  - Migration: No action required. Optional adoption: clients may call dry-run before applying bulk parameter changes.
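
  A hedged example of the dry-run preview and the blob diagnostics call. The base URL and the settings field names in the form body are assumptions; the `dry_run=1` query parameter and the blob-status path come from the notes.

  ```python
  import requests

  BASE = "http://localhost:8000"  # assumed base URL

  # Preview inferred type changes without persisting anything.
  preview = requests.post(
      f"{BASE}/api/v2/settings",
      params={"dry_run": 1},
      data={"max_tokens": "4000", "stream": "true"},  # assumed parameter names
  )
  print(preview.json())  # JSON preview of what would change

  # Check blob connectivity / container existence before a real save.
  status = requests.get(f"{BASE}/api/v2/settings/blob-status")
  print(status.json())
  ```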
- 2025-09-23 (v2.1.5):
  - Features: Added `response_metadata.reasoning_tokens` (when providers return reasoning token usage) and nested `token_usage.detail.reasoning`. The models startup cache now includes a `supports_reasoning` flag. (See the example after this entry.)
  - Docs: Updated Chat, Multimodal, Flexible RAG, and Unified Test UI docs to mention reasoning token reporting.
  - Internal: Normalized reasoning effort default application and safe extraction from varying provider usage payload shapes.
  - Migration: No action required. Clients may optionally read the new `reasoning_tokens` field.
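
  A small illustration of where the new fields appear in a response body. The surrounding payload is a made-up example; only the field paths come from the notes.

  ```python
  # `body` stands in for the parsed JSON of a chat/RAG response.
  body = {
      "answer": "example answer",
      "response_metadata": {
          "reasoning_tokens": 128,                        # top-level convenience field
          "token_usage": {"detail": {"reasoning": 128}},  # nested detail, when provided
      },
  }

  reasoning = body.get("response_metadata", {}).get("reasoning_tokens")
  if reasoning is not None:
      print(f"reasoning tokens used: {reasoning}")
  ```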
- 2025-09-20 (v2.1.0):
  - Features: Multi-completions end-to-end with `answers[]` in responses; `answer` mirrors `answers[0]`. Effective `n` precedence standardized: `override_config.n` > `override_config.params.n` > `config.llms.defaults.params.n`. (See the example after this entry.)
  - UX: Unified Test UI dark mode overhaul using theme variables; removed hardcoded light colors; instant tooltips with a simplified Top P explanation.
  - Models: Startup prewarm and cache for models/validation/pings; `/api/v2/models` leverages the cache; non-model keys (e.g., `defaults`) are filtered from listings; deep validation shows "Ping n/a" when appropriate.
  - Settings: Added an LLM Parameters side card with debounced auto-save to `/api/v2/settings` (manual Save preserved); POST redirect fix; GET settings path corrected; storage falls back to the filesystem when the Blob container is missing. The UI surfaces saved_source/save_error.
  - Endpoints: The Flexible form now includes multimodal inline (Multimodal tab removed). Send Request button wiring fixed, with a unit test.
  - Docs: Public and private docs updated to reflect `answers[]`, parameter reflection via `response_metadata.parameters` (optional), and n-precedence. Added a new integration test covering `top_p`, `frequency_penalty`, `presence_penalty`, `repetition_penalty`, and `stop`.
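
  A hedged example of requesting multiple completions and reading `answers[]`. The base URL and the `query` field are assumptions about the flexible endpoint's schema; `override_config.n`, `answers[]`, and the `answer` mirroring come from the notes.

  ```python
  import requests

  resp = requests.post(
      "http://localhost:8000/flexible_chat",  # assumed base URL
      json={
          "query": "Suggest a tagline for a release tracker.",
          "override_config": {"n": 3},        # highest-precedence n
      },
  )
  resp.raise_for_status()
  body = resp.json()

  print(len(body["answers"]))                  # expect 3 completions
  assert body["answer"] == body["answers"][0]  # answer mirrors answers[0]
  ```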
- 2025-09-20: Unified Test page simplified to a single Flexible form. The separate "Multimodal" tab was removed; multimodal is now an optional section inside the Flexible form. The UI auto-selects -mm endpoints when images are included.
- 2025-09-19: Unified Test page adds an LLM Parameters side panel with debounced auto-save (~800ms) to POST /api/v2/settings. Manual “Save Parameters” remains available. If the page URL includes subscription-key, it’s forwarded on save.
- 2025-09-19: Flexible endpoints now surface `response_metadata.token_usage`, `response_metadata.reasoning_effort`, and, for multimodal, also `response_metadata.reasoning_effective` (requested vs. sent, provider, sanitized). Multimodal endpoints align the timing shape with regular endpoints and always include `timings.orchestrator` and a `retrieval` block. Health docs restored with detailed per-service steps; Chat docs updated for model selection, reasoning controls, and `skip_knowledge_base` behavior.
- 2025-09-16: Knowledge stats: added `file_type_breakdown` and `index_breakdown` fields; the UI shows both charts. The recent ingestion list now wraps long filenames so timestamps remain visible.
- 2025-09-15: Added POST /api/v2/search/indexes/clear/{alias_or_name} to delete all documents in a selected index (accepts an alias or a concrete name).
- 2025-09-15: /api/v2/knowledge/search-file now returns one row per file (not per chunk), supports substring/partial filename search, and the UI displays: Index, Source File, File Type, Chunk Count, and Latest Ingestion.
- 2025-09-15: Ingestion Table UI now supports multi-column sorting for filename and timestamp columns. Sorting uses the other column as a tiebreaker for stable, intuitive ordering.
- 2025-09-14: Added support for generating and downloading fine-tuning files for OpenAI models.
- 2025-09-12: First V2 version released (initial public release)