RAG Service

Chat Endpoints: Conversational AI and Knowledge Retrieval

These endpoints provide a unified interface for interacting with your organization's knowledge base using advanced conversational AI. They are designed to:

  • Enable natural language chat with your data, documents, and business content—not just with a generic LLM.
  • Support both single-turn Q&A and multi-turn conversations, maintaining context and history for richer, more relevant answers.
  • Allow users to ask questions, summarize content, extract facts, or perform research across your indexed knowledge base.
  • Integrate Retrieval-Augmented Generation (RAG): every answer can cite, link, or explain the sources used, giving transparency and traceability.
  • Accept flexible, fine-grained configuration for search, filtering, prompt engineering, and output formatting—empowering both business users and developers.
  • Support multimodal input (text + images) for scenarios like document analysis, diagram Q&A, or visual context.

Business Value:

  • These endpoints let you build chatbots, assistants, and automation that are grounded in your actual data—not just generic model knowledge.
  • They help ensure answers are accurate, up-to-date, and explainable, with links back to the original content.
  • You can tailor the experience for customer support, internal knowledge management, research, compliance, and more.

How it works:

  • The API receives a user question (and optionally, chat history, images, or custom instructions).
  • It queries your knowledge base using Azure AI Search and other connectors, retrieving the most relevant content.
  • The system combines this content with your prompt templates and sends it to the LLM for answer generation.
  • The response includes the answer, sources, and rich metadata for traceability and further automation.
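
For orientation, here is a minimal sketch of that flow from a Python client. It is not the only way to call the service: the host in BASE_URL and the TOKEN placeholder are assumptions you must adapt to your deployment, and the payload fields follow the /flexible-chat schema documented below.

import requests

TOKEN = "<your-jwt>"                    # placeholder; see the Authentication section
BASE_URL = "https://yourhost/api/v2"    # assumed host; adjust to your deployment

payload = {
    "question": "What can you tell me about yourself?",
    "history": [],                      # prior turns, if any
    "fetch_args": {                     # per-fetcher retrieval settings
        "AzureSearchFetcher": {"top_k": 5, "vector_search": True}
    },
}

resp = requests.post(
    f"{BASE_URL}/flexible-chat",
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=60,
)
resp.raise_for_status()
body = resp.json()

print(body["answer"])                   # generated answer for this turn
for src in body.get("metadata", []):    # retrieved sources used for grounding
    print(src.get("filename"), (src.get("text") or "")[:80])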


What's new (2025-09-20)

  • Multi-completions: responses may include answers[] when n > 1; answer mirrors answers[0]. Standardized precedence for n: override_config.n > override_config.params.n > config.llms.defaults.params.n.
  • Optional parameter reflection: when available, response_metadata.parameters includes effective values like temperature, top_p, max_tokens, penalties, and stop.

What's new (2025-09-19)

  • Responses now include response_metadata.token_usage when available from the router.
  • response_metadata.reasoning_effort reflects the requested/defaulted value; for providers that don’t accept "auto", the router sends a provider-safe value while preserving your intent in metadata. response_metadata.reasoning_tokens (v2.1.1+) may appear when the provider reports separate reasoning token usage.
  • Timings parity: response_metadata.timings exposes orchestrator sub-steps; multimodal responses also include a separate retrieval block indicating whether KB search was attempted or skipped.
  • All variants now include response_metadata.reasoning_effective showing requested vs. sent values and whether sanitization occurred.

Conversational API with Retrieval-Augmented Generation and rich templating controls.

  • POST /flexible-chat — primary multi-turn chat
  • POST /flexible-chat-mm — multimodal chat (text + images)
  • POST /flexible-rag — single-turn Q&A

What this covers

  • Exact request/response shapes as implemented in code
  • All optional prompt/template override knobs (inline and file-based)
  • Fetch arguments passed through to Azure AI Search via the orchestrator
  • How history and sources are represented and returned

Chat Functionality: API Endpoints Collection

This page documents all endpoints used for chat and conversational AI, with practical examples and usage tips.


1) POST /flexible-chat (Primary)

Request body (FlexibleChatRequest)

  • question (string, required): The latest user message for this turn.
  • fetch_args (object, optional, default {}): Per-fetcher configuration. Common for Azure AI Search fetcher:
    • AzureSearchFetcher (object):
      • query (string): Overrides the query text sent to search; defaults to question when omitted.
      • filter (string): OData filter, e.g. speaker eq 'david'.
      • top_k (int): Number of docs; typical 1–10.
      • include_total_count (bool)
      • facets (array of string): e.g. ["speaker,count:5", "topic"].
      • highlight_fields (array of string): e.g. ["text"].
      • select_fields (array of string): e.g. [ "id", "filename", "block_id", "chunk_index", "part", "speaker", "timestamp", "tokens", "video_url", "keyword", "topic", "text" ].
      • vector_search (bool): Toggle hybrid/vector search.
  • history (array, optional, default []): Prior messages, each item is a ChatMessage:
    • role (string): user or assistant.
    • content (string)
    • sources (array, optional): Source docs attached to that message (see response metadata shape below).
  • ab_testing (object, optional):
    • user_id (string, required)
    • session_id (string, optional)
    • experiment_name (string, optional)
  • System prompt customization (independent of the response template customization below; you may provide the inline template, the file path, both, or neither):
    • system_prompt_template (string, optional): Inline Jinja2 system prompt template.
    • system_prompt_file (string, optional): Path to system prompt file. Source is repo filesystem or blob, per config.
  • Response template customization:
    • response_template (string, optional): Inline Jinja2 response template to format the assistant answer.
    • response_template_file (string, optional): Path to response template file (filesystem/blob per config).
  • template_variables (object, optional, default {}): Variables available when rendering templates (always includes question, history, user_id, metadata).
  • override_config (object, optional, default {}): Runtime configuration overrides (currently logged/forward-looking). You can also set a per-request model alias override using override_config.llm (preferred) or override_config.model.
    • Reasoning controls (provider-aware):
      • override_config.reasoning_effort or override_config.reasoning.effort — accepted values: "low" | "medium" | "high" | "auto".
      • Notes:
        • If omitted, the system may default to "auto" for non-Azure providers; Azure providers map to their supported options.
        • Some non-Azure providers reject "auto"; the router preserves your intent and records what was actually sent via response_metadata.reasoning_effective.
  • metadata (object, optional, default {}): Freeform metadata you want echoed back/used in templates.

Note: The chat endpoint always attempts a light retrieval for the latest turn; there is no skip_knowledge_base flag on /flexible-chat (that option exists only on /flexible-rag).

Minimal example:

{
    "question": "What can you tell me about yourself?",
    "fetch_args": {
        "AzureSearchFetcher": {
            "top_k": 5,
            "vector_search": true
        }
    },
    "history": []
}

Full example with inline prompts and template variables:

{
    "question": "What can you tell me about yourself?",
    "fetch_args": {
        "AzureSearchFetcher": {
            "query": "David role summary",
            "filter": "speaker eq 'david'",
            "top_k": 5,
            "facets": ["speaker,count:5", "topic"],
            "highlight_fields": ["text"],
            "select_fields": ["id", "filename", "chunk_index", "speaker", "timestamp", "text"],
            "vector_search": true
        }
    },
    "history": [
        {"role": "user", "content": "What can you tell me about yourself?"},
        {"role": "assistant", "content": "David is a principal…"}
    ],
    "ab_testing": {"user_id": "u-123", "session_id": "s-1", "experiment_name": "chat-layout"},
    "system_prompt_template": "You are a helpful assistant. Focus on facts.",
    "response_template": "Answer: {{ answer }}\nSources: {{ metadata | length }}",
    "template_variables": {"tone": "concise"},
        "override_config": {"llm": {"temperature": 0.1}, "reasoning_effort": "auto"},
    "metadata": {"customer_tier": "gold"}
}
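
As a further illustration of the templating knobs, the sketch below sends a multi-line inline Jinja2 response template together with template_variables. The variable names follow the rendering context described above (question, history, user_id, metadata plus your custom variables such as tone), but the fields available on each source object (filename, chunk_index) depend on your select_fields, so treat this as an assumption to adapt.

import requests

# TOKEN as in the earlier sketch (your JWT bearer token).
response_template = (
    "{{ answer }}\n\n"
    "Sources ({{ metadata | length }}):\n"
    "{% for src in metadata %}- {{ src.filename }} (chunk {{ src.chunk_index }})\n{% endfor %}"
)

payload = {
    "question": "What can you tell me about yourself?",
    "history": [],
    "fetch_args": {"AzureSearchFetcher": {"top_k": 5, "vector_search": True}},
    "system_prompt_template": "You are a helpful assistant. Keep the tone {{ tone }}.",
    "response_template": response_template,
    "template_variables": {"tone": "concise"},
}

r = requests.post(
    "https://yourhost/api/v2/flexible-chat",
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
print(r.json()["template_info"])  # reports inline sources and the variables used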

Response body (FlexibleChatResponse)

  • response_id (string): Server-generated UUID for telemetry correlation.
  • answer (string): Assistant message for this turn. When answers[] is present, this equals the first item.
  • answers[] (array, optional): Alternative completions when n > 1.
  • metadata (array of object): Retrieved sources for this turn only. Typical fields include those requested in select_fields.
  • history (array of ChatMessage): Full history including this turn; assistant messages include sources for traceability.
  • ab_testing (object, optional):
    • experiment_name (string)
    • variant_name (string)
    • session_id (string, optional)
  • errors (array, optional): Structured upstream issues (if any). Many upstream errors are logged but suppressed with a safe fallback answer.
  • data (object, optional): Raw bag with fetcher results for rich clients.
  • template_info (object):
    • system_prompt_source (string): default | inline | file:<path> | blob:<path>
    • response_template_source (string): default | inline | file:<path> | blob:<path>
    • template_variables_used (array of string)
  • processing_time (number): Seconds.
  • tokens_used (number, optional)
  • response_metadata (object): includes
    • config_overrides_applied (bool)
    • custom_templates_used (bool)
    • request_metadata (object)
    • conversation_length (int), is_multi_turn (bool)
    • total_sources_retrieved (int)
    • conversation_sources_summary (object): counts of assistant turns with sources
    • question (string)
    • system_prompt (string)
    • errors_present (bool)
    • model_used (object): alias/provider/deployment/model when available
    • token_usage (object, optional): { prompt, completion, total } when available from the model/router
    • parameters (object, optional): effective parameter reflection (e.g., temperature, top_p, max_tokens, frequency_penalty, presence_penalty, repetition_penalty, stop). Not all providers populate this.
    • reasoning_effort (string, optional): requested/defaulted effort; UI may display "auto" while provider receives default behavior
    • reasoning_effective (object, optional): { requested, sent_effort, provider, source, sanitized } — reflects the effective values sent to the provider and whether any sanitization occurred
    • timings (object):
      • For /flexible-rag: always includes sub-steps from the orchestrator and endpoint wrapper, e.g. orchestrator.fetch_total, orchestrator.fetchers, orchestrator.prompt_build, orchestrator.llm_generate, orchestrator.metadata_extract, orchestrator.history_build, as well as top-level orchestrator_call and build_response.
      • For /flexible-chat: not currently returned.
    • retrieval (object, multimodal only): { attempted: bool, skipped_reason: string|null, fetch_args_present: bool }
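
A short sketch of how a client might inspect these fields; it only reads values documented on this page and uses .get() throughout because several fields are optional and provider-dependent.

def summarize_chat_response(d: dict) -> None:
    """Print commonly used fields from a parsed /flexible-chat response body."""
    print("answer:", d["answer"])
    for msg in d.get("history", []):
        n_sources = len(msg.get("sources") or [])
        print(f'  {msg["role"]}: {msg["content"][:60]} ({n_sources} sources)')

    meta = d.get("response_metadata") or {}
    print("model_used:", meta.get("model_used"))
    print("token_usage:", meta.get("token_usage"))              # optional
    print("parameters:", meta.get("parameters"))                # optional, provider-dependent
    print("reasoning_effective:", meta.get("reasoning_effective"))
    print("timings:", meta.get("timings"))                      # /flexible-rag sub-steps; not on /flexible-chat
    print("template_info:", d.get("template_info"))
    print("processing_time (s):", d.get("processing_time"))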

Multi-completions (N)

Business impact: Compare multiple alternatives for quality, style, or safety and select the best response.

Developer details:

  • Request multiple completions with override_config.n (preferred) or override_config.params.n.
  • Precedence: override_config.n > override_config.params.n > config.llms.defaults.params.n.
  • When answers[] is present, answer === answers[0] for backward compatibility.

Examples

curl -X POST "https://yourhost/api/v2/flexible-chat" \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
        "question": "Suggest two concise introductions.",
        "history": [],
        "override_config": {"llm": "default", "n": 2, "params": {"temperature": 0.2}},
        "fetch_args": {"AzureSearchFetcher": {"top_k": 0}}
    }'

Python (requests):

import requests

# TOKEN holds your JWT bearer token (see Authentication below).
payload = {
    "question": "Suggest two concise introductions.",
    "history": [],
    "override_config": {"llm": "default", "n": 2, "params": {"temperature": 0.2}},
    "fetch_args": {"AzureSearchFetcher": {"top_k": 0}}
}
r = requests.post(
    "https://yourhost/api/v2/flexible-chat",
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
d = r.json()
print(d.get("answers"), d.get("answer"))

2) POST /flexible-chat-mm (Multimodal)

Request body (FlexibleChatRequest, multimodal extension)

  • All fields from /flexible-chat are supported.
  • To send images, add a metadata.images array:
    • Each image: { "url": "https://..." } or { "data_url": "data:image/png;base64,..." }
    • Optional: detail ("low" | "high") for Azure OpenAI image detail level.

Example:

{
    "question": "What is in this image?",
    "metadata": {
        "images": [
            { "url": "https://example.com/pic.png", "detail": "high" },
            { "data_url": "data:image/png;base64,iVBORw0KGgo..." }
        ]
    },
    "history": []
}

Validation notes:

  • At least one image is required; 400 if missing.
  • url must be https, or use data_url for inline content.
  • Data URLs > ~5MB are rejected (413).
  • Images are only processed on the latest user turn.
  • If the selected model alias does not support multimodal, the API returns 400 with a helpful message and try_models suggestions. Use /api/v2/models?validate=true to list aliases and check each model’s supports_multimodal flag.
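
If you need to send a local file rather than an https URL, a data_url can be built with the Python standard library. This is only a minimal sketch; the file name is hypothetical, and the image must stay under the ~5MB limit noted above.

import base64
from pathlib import Path

image_path = Path("diagram.png")  # hypothetical local PNG
b64 = base64.b64encode(image_path.read_bytes()).decode("ascii")

payload = {
    "question": "What is in this image?",
    "history": [],
    "metadata": {
        "images": [
            {"data_url": f"data:image/png;base64,{b64}"}
        ]
    },
}
# POST payload to /flexible-chat-mm as in the earlier examples.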

Response body

  • Identical to /flexible-chat, with sources and history reflecting multimodal context.

Tips: Use for tasks requiring image+text input (e.g., diagram Q&A, OCR, visual context).


3) POST /flexible-rag (Single-turn Q&A)

Request body (FlexibleRagRequest)

  • question (string, required): The user’s question for this turn.
  • skip_knowledge_base (boolean, optional, default false): When true, skips knowledge base search for faster responses without retrieval-augmented generation.
  • fetch_args (object, optional, default {}): Same as chat; see above for AzureSearchFetcher options.
  • history (array, optional, default []): Prior messages (rarely used for single-turn, but supported for context).
  • ab_testing (object, optional):
    • user_id (string, required)
    • session_id (string, optional)
    • experiment_name (string, optional)
  • System prompt customization:
    • system_prompt_template (string, optional): Inline Jinja2 system prompt template.
    • system_prompt_file (string, optional): Path to system prompt file.
  • Response template customization:
    • response_template (string, optional): Inline Jinja2 response template.
    • response_template_file (string, optional): Path to response template file.
  • template_variables (object, optional, default {}): Variables for template rendering.
  • override_config (object, optional, default {}): Runtime config overrides. You can select a model alias using override_config.llm.
  • metadata (object, optional, default {}): Freeform metadata.

Minimal example:

{
    "question": "What is RAG?",
    "fetch_args": {
        "AzureSearchFetcher": {
            "top_k": 3,
            "vector_search": true
        }
    }
}

Full example with custom prompt and template:

{
    "question": "Summarize the main topics in the transcript.",
    "fetch_args": {
        "AzureSearchFetcher": {
            "query": "main topics",
            "top_k": 5,
            "facets": ["topic"],
            "vector_search": true
        }
    },
    "system_prompt_template": "You are a summarizer. Focus on key topics only.",
    "response_template": "Summary: {{ answer }}",
    "template_variables": {"audience": "executive"},
    "override_config": {"llm": {"temperature": 0.2}},
    "metadata": {"request_id": "abc-123"}
}

Example with knowledge base search skipped (faster response):

{
    "question": "What is the capital of France?",
    "skip_knowledge_base": true,
    "system_prompt_template": "You are a helpful assistant. Answer directly without referencing external sources.",
    "template_variables": {"style": "concise"}
}
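
The sketch below illustrates the trade-off by sending the same question to /flexible-rag with and without skip_knowledge_base and comparing processing_time and the number of retrieved sources. The host and TOKEN are assumptions carried over from the earlier sketches, and actual latencies depend on your index and model.

import requests

def ask(question: str, skip_kb: bool) -> dict:
    """Call /flexible-rag, optionally skipping knowledge base retrieval."""
    payload = {"question": question, "skip_knowledge_base": skip_kb}
    if not skip_kb:
        payload["fetch_args"] = {"AzureSearchFetcher": {"top_k": 3, "vector_search": True}}
    r = requests.post(
        "https://yourhost/api/v2/flexible-rag",
        json=payload,
        headers={"Authorization": f"Bearer {TOKEN}"},  # TOKEN as in earlier sketches
    )
    r.raise_for_status()
    return r.json()

with_kb = ask("What is RAG?", skip_kb=False)
without_kb = ask("What is RAG?", skip_kb=True)

print("with KB:   ", with_kb["processing_time"], "s,", len(with_kb.get("metadata", [])), "sources")
print("without KB:", without_kb["processing_time"], "s,", len(without_kb.get("metadata", [])), "sources")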

Response body (FlexibleRagResponse)

  • response_id (string): Server-generated UUID.
  • answer (string): The answer for this turn.
  • metadata (array of object): Retrieved sources for this turn.
  • history (array): Single-turn history (or with context if provided).
  • ab_testing (object, optional):
    • experiment_name (string)
    • variant_name (string)
    • session_id (string, optional)
  • errors (array, optional): Structured upstream issues.
  • data (object, optional): Raw fetcher results.
  • template_info (object):
    • system_prompt_source, response_template_source, template_variables_used
  • processing_time (number): Seconds.
  • tokens_used (number, optional)
  • response_metadata (object):
    • config_overrides_applied, custom_templates_used, request_metadata, question, system_prompt, errors_present, etc.
    • queried_indexes (array, optional): unique index names referenced by retrieved results for this turn.

Tips:

  • Use for single-turn Q&A, document summarization, or when chat history is not needed.
  • All prompt/template overrides and fetch_args are supported as in chat.


Model selection and discovery

  • Default alias is resolved from configuration. You can override per request with override_config.llm (or override_config.model).
  • Discover configured models via GET /api/v2/models. Add ?validate=true to include validation results; add &deep=true for client init/auth checks. Use &filter_unhealthy=true to filter out models that failed validation.
  • Each model entry may include supports_multimodal: true|false. Use this to choose aliases for /flexible-chat-mm and /flexible-rag-mm.
  • Responses include response_metadata.model_used where available, with alias, provider, deployment, and model for telemetry and auditing.
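
A hedged sketch of using the discovery endpoint to pick a multimodal-capable alias. The alias and supports_multimodal field names follow the wording on this page, but the exact response envelope is not specified here, so the code reads it defensively; adjust to your deployment.

import requests

r = requests.get(
    "https://yourhost/api/v2/models",
    params={"validate": "true"},
    headers={"Authorization": f"Bearer {TOKEN}"},  # TOKEN as in earlier sketches
)
data = r.json()

# Exact envelope may vary by version: accept either a bare list or a "models" key.
entries = data if isinstance(data, list) else data.get("models", [])

multimodal_aliases = [e.get("alias") for e in entries if e.get("supports_multimodal")]
print("Aliases suitable for /flexible-chat-mm:", multimodal_aliases)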

Authentication

All endpoints require both a subscription key and a valid JWT as configured. See the Authentication page.

Error Codes

  • 400 Bad Request: Invalid input
  • 401 Unauthorized: Missing/invalid credentials
  • 500 Internal Server Error: Unexpected failure

For advanced developer details, see the developer docs and source: rag_api_core/endpoints/v2/flexible_rag.py, schemas/v2/requests.py, schemas/v2/responses.py, and endpoints/v2/flexible_multimodal.py.