Chat Endpoints: Conversational AI and Knowledge Retrieval
These endpoints provide a unified interface for interacting with your organization's knowledge base using advanced conversational AI. They are designed to:
- Enable natural language chat with your data, documents, and business content—not just with a generic LLM.
- Support both single-turn Q&A and multi-turn conversations, maintaining context and history for richer, more relevant answers.
- Allow users to ask questions, summarize content, extract facts, or perform research across your indexed knowledge base.
- Integrate Retrieval-Augmented Generation (RAG): every answer can cite, link, or explain the sources used, giving transparency and traceability.
- Accept flexible, fine-grained configuration for search, filtering, prompt engineering, and output formatting—empowering both business users and developers.
- Support multimodal input (text + images) for scenarios like document analysis, diagram Q&A, or visual context.
Business value:
- These endpoints let you build chatbots, assistants, and automation that are grounded in your actual data, not just generic model knowledge.
- They help ensure answers are accurate, up to date, and explainable, with links back to the original content.
- You can tailor the experience for customer support, internal knowledge management, research, compliance, and more.
How it works:
- The API receives a user question (and optionally chat history, images, or custom instructions).
- It queries your knowledge base using Azure AI Search and other connectors, retrieving the most relevant content.
- The system combines this content with your prompt templates and sends it to the LLM for answer generation.
- The response includes the answer, sources, and rich metadata for traceability and further automation.
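For orientation, here is a minimal sketch of that round trip against /flexible-chat (the host and token are placeholders; the full request and response shapes are documented below):

import requests

TOKEN = "..."  # placeholder JWT

payload = {
    "question": "What are our refund policies?",
    "fetch_args": {"AzureSearchFetcher": {"top_k": 5, "vector_search": True}},
    "history": [],
}
r = requests.post(
    "https://yourhost/api/v2/flexible-chat",
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
d = r.json()
print(d["answer"])                 # generated answer for this turn
for src in d.get("metadata", []):  # retrieved sources, for traceability
    print(src.get("filename"), src.get("id"))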
What's new (2025-09-20)
- Multi-completions: responses may include answers[] when n > 1; answer mirrors answers[0]. Standardized precedence for n: override_config.n > override_config.params.n > config.llms.defaults.params.n.
- Optional parameter reflection: when available, response_metadata.parameters includes effective values like temperature, top_p, max_tokens, penalties, and stop.
What's new (2025-09-19)
- Responses now include response_metadata.token_usage when available from the router.
- response_metadata.reasoning_effort reflects requested/defaulted value; for providers that don’t accept "auto", the router sends a provider-safe value while preserving your intent in metadata. response_metadata.reasoning_tokens (v2.1.1+) may appear when the provider reports separate reasoning token usage.
- Timings parity: response_metadata.timings exposes orchestrator sub-steps; multimodal responses also include a separate retrieval block indicating whether KB search was attempted or skipped.
- All variants now include response_metadata.reasoning_effective showing requested vs. sent values and whether sanitization occurred.
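To illustrate, a small helper (a sketch; every field below is optional, so guard each access) that surfaces these metadata fields from a parsed response body:

def summarize_response_metadata(d: dict) -> None:
    # d is the parsed JSON body of any chat/RAG response.
    meta = d.get("response_metadata", {})
    print("token usage:", meta.get("token_usage"))      # {"prompt", "completion", "total"} when reported
    print("effective params:", meta.get("parameters"))  # temperature/top_p/etc., provider-dependent
    eff = meta.get("reasoning_effective")
    if eff and eff.get("sanitized"):
        # The router preserved your intent but sent a provider-safe value.
        print(f"requested {eff['requested']!r}, sent {eff['sent_effort']!r}")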
Conversational API with Retrieval-Augmented Generation and rich templating controls.
- POST /flexible-chat — primary multi-turn chat
- POST /flexible-chat-mm — multimodal chat (text + images)
What this covers
- Exact request/response shapes as implemented in code
- All optional prompt/template override knobs (inline and file-based)
- Fetch arguments passed through to Azure AI Search via the orchestrator
- How history and sources are represented and returned
Chat Functionality: API Endpoints Collection
This page documents all endpoints used for chat and conversational AI, with practical examples and usage tips.
1) POST /flexible-chat (Primary)
Request body (FlexibleChatRequest)
- question (string, required): The latest user message for this turn.
- fetch_args (object, optional, default {}): Per-fetcher configuration. Common options for the Azure AI Search fetcher:
  - AzureSearchFetcher (object):
    - query (string): Overrides the query text sent to search; defaults to question when omitted.
    - filter (string): OData filter, e.g. speaker eq 'david'.
    - top_k (int): Number of docs; typical 1–10.
    - include_total_count (bool)
    - facets (array of string): e.g. ["speaker,count:5", "topic"].
    - highlight_fields (array of string): e.g. ["text"].
    - select_fields (array of string): e.g. ["id", "filename", "block_id", "chunk_index", "part", "speaker", "timestamp", "tokens", "video_url", "keyword", "topic", "text"].
    - vector_search (bool): Toggle hybrid/vector search.
- history (array, optional, default []): Prior messages; each item is a ChatMessage:
  - role (string): user or assistant.
  - content (string)
  - sources (array, optional): Source docs attached to that message (see the response metadata shape below).
- ab_testing (object, optional):
  - user_id (string, required)
  - session_id (string, optional)
  - experiment_name (string, optional)
- System prompt customization (independent options; provide none, one, or both of template and file, for system and response alike):
  - system_prompt_template (string, optional): Inline Jinja2 system prompt template.
  - system_prompt_file (string, optional): Path to a system prompt file. Source is the repo filesystem or blob storage, per config.
- Response template customization:
  - response_template (string, optional): Inline Jinja2 response template used to format the assistant answer.
  - response_template_file (string, optional): Path to a response template file (filesystem/blob per config).
- template_variables (object, optional, default {}): Variables available when rendering templates (always includes question, history, user_id, metadata).
- override_config (object, optional, default {}): Runtime configuration overrides (currently logged/forward-looking). You can also set a per-request model alias override using override_config.llm (preferred) or override_config.model.
- Reasoning controls (provider-aware):
  - override_config.reasoning_effort or override_config.reasoning.effort. Accepted values: "low" | "medium" | "high" | "auto".
  - Notes:
    - If omitted, the system may default to "auto" for non-Azure providers; Azure providers map to their supported options.
    - Some non-Azure providers reject "auto"; the router preserves your intent and records what was actually sent via response_metadata.reasoning_effective.
- metadata (object, optional, default {}): Freeform metadata you want echoed back/used in templates.
Note:
- The chat endpoint always attempts a light retrieval for the latest turn; there is no skip_knowledge_base flag on /flexible-chat (that option exists only on /flexible-rag).
Minimal example:
{
"question": "What can you tell me about yourself?",
"fetch_args": {
"AzureSearchFetcher": {
"top_k": 5,
"vector_search": true
}
},
"history": []
}
Full example with inline prompts and template variables:
{
"question": "What can you tell me about yourself?",
"fetch_args": {
"AzureSearchFetcher": {
"query": "David role summary",
"filter": "speaker eq 'david'",
"top_k": 5,
"facets": ["speaker,count:5", "topic"],
"highlight_fields": ["text"],
"select_fields": ["id", "filename", "chunk_index", "speaker", "timestamp", "text"],
"vector_search": true
}
},
"history": [
{"role": "user", "content": "What can you tell me about yourself?"},
{"role": "assistant", "content": "David is a principal…"}
],
"ab_testing": {"user_id": "u-123", "session_id": "s-1", "experiment_name": "chat-layout"},
"system_prompt_template": "You are a helpful assistant. Focus on facts.",
"response_template": "Answer: {{ answer }}\nSources: {{ metadata | length }}",
"template_variables": {"tone": "concise"},
"override_config": {"llm": {"temperature": 0.1}, "reasoning_effort": "auto"},
"metadata": {"customer_tier": "gold"}
}
Response body (FlexibleChatResponse)
- response_id (string): Server-generated UUID for telemetry correlation.
- answer (string): Assistant message for this turn. When answers[] is present, this equals the first item.
- answers[] (array, optional): Alternative completions when n > 1.
- metadata (array of object): Retrieved sources for this turn only. Typical fields include those requested in select_fields.
- history (array of ChatMessage): Full history including this turn; assistant messages include sources for traceability.
- ab_testing (object, optional):
  - experiment_name (string)
  - variant_name (string)
  - session_id (string, optional)
- errors (array, optional): Structured upstream issues (if any). Many upstream errors are logged but suppressed with a safe fallback answer.
- data (object, optional): Raw bag with fetcher results for rich clients.
- template_info (object):
  - system_prompt_source (string): default | inline | file:<path> | blob:<path>
  - response_template_source (string): default | inline | file:<path> | blob:<path>
  - template_variables_used (array of string)
- processing_time (number): Seconds.
- tokens_used (number, optional)
- response_metadata (object): includes
  - config_overrides_applied (bool)
  - custom_templates_used (bool)
  - request_metadata (object)
  - conversation_length (int), is_multi_turn (bool)
  - total_sources_retrieved (int)
  - conversation_sources_summary (object): counts of assistant turns with sources
  - question (string)
  - system_prompt (string)
  - errors_present (bool)
  - model_used (object): alias/provider/deployment/model when available
  - token_usage (object, optional): { prompt, completion, total } when available from the model/router
  - parameters (object, optional): effective parameter reflection (e.g., temperature, top_p, max_tokens, frequency_penalty, presence_penalty, repetition_penalty, stop). Not all providers populate this.
  - reasoning_effort (string, optional): requested/defaulted effort; the UI may display "auto" while the provider receives default behavior
  - reasoning_effective (object, optional): { requested, sent_effort, provider, source, sanitized }, reflecting the effective values sent to the provider and whether any sanitization occurred
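A short sketch (assuming d is the parsed response body) that walks the fields above, including per-turn source traceability:

def show_turn(d: dict) -> None:
    # answer always holds the primary completion; answers[] appears only when n > 1.
    print(d["answer"])
    for alt in d.get("answers", [])[1:]:
        print("alternative:", alt)
    # Assistant messages in history carry sources for traceability.
    for msg in d.get("history", []):
        if msg.get("role") == "assistant":
            for src in msg.get("sources") or []:
                print("  cited:", src.get("filename"), src.get("id"))
    info = d.get("template_info", {})
    print("templates:", info.get("system_prompt_source"), info.get("response_template_source"))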
Multi-completions (N)
Business impact: compare multiple alternatives for quality, style, or safety, and select the best response (see the selection sketch after the examples below).
Developer details
- Request multiple completions with override_config.n (preferred) or override_config.params.n.
- Precedence: override_config.n > override_config.params.n > config.llms.defaults.params.n.
- When answers[] is present, answer === answers[0] for backward compatibility.
Examples
curl -X POST "https://yourhost/api/v2/flexible-chat" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"question": "Suggest two concise introductions.",
"history": [],
"override_config": {"llm": "default", "n": 2, "params": {"temperature": 0.2}},
"fetch_args": {"AzureSearchFetcher": {"top_k": 0}}
}'
import requests

TOKEN = "..."  # placeholder JWT
payload = {
"question": "Suggest two concise introductions.",
"history": [],
"override_config": {"llm": "default", "n": 2, "params": {"temperature": 0.2}},
"fetch_args": {"AzureSearchFetcher": {"top_k": 0}}
}
r = requests.post("https://yourhost/api/v2/flexible-chat", json=payload, headers={"Authorization": f"Bearer {TOKEN}"})
d = r.json()
print(d.get("answers"), d.get("answer"))
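The selection step is up to the client. A deliberately naive sketch (the length heuristic is illustrative only; substitute your own scoring function or human review):

candidates = d.get("answers") or [d["answer"]]
best = min(candidates, key=len)  # e.g., prefer the most concise introduction
print(best)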
- timings (object):
  - For /flexible-rag: always includes sub-steps from the orchestrator and endpoint wrapper, e.g. orchestrator.fetch_total, orchestrator.fetchers, orchestrator.prompt_build, orchestrator.llm_generate, orchestrator.metadata_extract, orchestrator.history_build, as well as top-level orchestrator_call and build_response.
  - For /flexible-chat: not currently returned.
- retrieval (object, multimodal only): { attempted: bool, skipped_reason: string|null, fetch_args_present: bool }
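For example, to log the sub-step timings from a /flexible-rag response (a sketch; keys appear only when the orchestrator reports them, and values are assumed to be seconds):

timings = d.get("response_metadata", {}).get("timings", {})
for step in ("orchestrator.fetch_total", "orchestrator.llm_generate", "orchestrator_call", "build_response"):
    if step in timings:
        print(f"{step}: {timings[step]:.3f}s")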
2) POST /flexible-chat-mm (Multimodal)
Request body (FlexibleChatRequest, multimodal extension)
- All fields from /flexible-chat are supported.
- To send images, add a metadata.images array:
  - Each image: { "url": "https://..." } or { "data_url": "data:image/png;base64,..." }
  - Optional: detail ("low" | "high") for the Azure OpenAI image detail level.
Example:
{
"question": "What is in this image?",
"metadata": {
"images": [
{ "url": "https://example.com/pic.png", "detail": "high" },
{ "data_url": "data:image/png;base64,iVBORw0KGgo..." }
]
},
"history": []
}
Validation notes:
- At least one image is required; the API returns 400 if none is provided.
- url must use https; use data_url for inline content.
- Data URLs larger than ~5MB are rejected (413).
- Images are only processed on the latest user turn.
- If the selected model alias does not support multimodal, the API returns 400 with a helpful message and try_models suggestions. Use /api/v2/models?validate=true to list aliases and check each model’s supports_multimodal flag.
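Before uploading, you can mirror these server-side checks client-side. A sketch (the file path and MIME type are placeholders; the size limit matches the ~5MB / 413 rule above):

import base64
import pathlib

MAX_DATA_URL_BYTES = 5 * 1024 * 1024  # data URLs over ~5MB are rejected with 413

def image_to_data_url(path: str, mime: str = "image/png") -> str:
    # Base64-encode a local image into a data_url for metadata.images.
    raw = pathlib.Path(path).read_bytes()
    data_url = f"data:{mime};base64," + base64.b64encode(raw).decode("ascii")
    if len(data_url) > MAX_DATA_URL_BYTES:
        raise ValueError("too large for an inline data_url; host the image and pass an https url instead")
    return data_url

payload = {
    "question": "What is in this image?",
    "metadata": {"images": [{"data_url": image_to_data_url("diagram.png")}]},
    "history": [],
}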
Response body
- Identical to /flexible-chat, with sources and history reflecting multimodal context.
Tips:
- Use for tasks requiring image+text input (e.g., diagram Q&A, OCR, visual context).
3) POST /flexible-rag (Single-turn Q&A)
Request body (FlexibleRagRequest)
- question (string, required): The user's question for this turn.
- skip_knowledge_base (boolean, optional, default false): When true, skips knowledge base search for faster responses without retrieval-augmented generation.
- fetch_args (object, optional, default {}): Same as chat; see above for AzureSearchFetcher options.
- history (array, optional, default []): Prior messages (rarely used for single-turn, but supported for context).
- ab_testing (object, optional):
  - user_id (string, required)
  - session_id (string, optional)
  - experiment_name (string, optional)
- System prompt customization:
  - system_prompt_template (string, optional): Inline Jinja2 system prompt template.
  - system_prompt_file (string, optional): Path to a system prompt file.
- Response template customization:
  - response_template (string, optional): Inline Jinja2 response template.
  - response_template_file (string, optional): Path to a response template file.
- template_variables (object, optional, default {}): Variables for template rendering.
- override_config (object, optional, default {}): Runtime config overrides. You can select a model alias using override_config.llm.
- metadata (object, optional, default {}): Freeform metadata.
Minimal example:
{
"question": "What is RAG?",
"fetch_args": {
"AzureSearchFetcher": {
"top_k": 3,
"vector_search": true
}
}
}
Full example with custom prompt and template:
{
"question": "Summarize the main topics in the transcript.",
"fetch_args": {
"AzureSearchFetcher": {
"query": "main topics",
"top_k": 5,
"facets": ["topic"],
"vector_search": true
}
},
"system_prompt_template": "You are a summarizer. Focus on key topics only.",
"response_template": "Summary: {{ answer }}",
"template_variables": {"audience": "executive"},
"override_config": {"llm": {"temperature": 0.2}},
"metadata": {"request_id": "abc-123"}
}
Example with knowledge base search skipped (faster response):
{
"question": "What is the capital of France?",
"skip_knowledge_base": true,
"system_prompt_template": "You are a helpful assistant. Answer directly without referencing external sources.",
"template_variables": {"style": "concise"}
}
Response body (FlexibleRagResponse)
- response_id (string): Server-generated UUID.
- answer (string): The answer for this turn.
- metadata (array of object): Retrieved sources for this turn.
- history (array): Single-turn history (or with context if provided).
- ab_testing (object, optional):
  - experiment_name (string)
  - variant_name (string)
  - session_id (string, optional)
- errors (array, optional): Structured upstream issues.
- data (object, optional): Raw fetcher results.
- template_info (object): system_prompt_source, response_template_source, template_variables_used
- processing_time (number): Seconds.
- tokens_used (number, optional)
- response_metadata (object): config_overrides_applied, custom_templates_used, request_metadata, question, system_prompt, errors_present, etc.
- queried_indexes (array, optional): Unique index names referenced by retrieved results for this turn.
Tips:
- Use for single-turn Q&A, document summarization, or when chat history is not needed.
- All prompt/template overrides and fetch_args are supported as in chat.
Model selection and discovery
- The default alias is resolved from configuration. You can override it per request with override_config.llm (or override_config.model).
- Discover configured models via GET /api/v2/models. Add ?validate=true to include validation results; add &deep=true for client init/auth checks. Use &filter_unhealthy=true to filter out models that failed validation.
- Each model entry may include supports_multimodal: true|false. Use this to choose aliases for /flexible-chat-mm and /flexible-rag-mm.
- Responses include response_metadata.model_used where available, with alias, provider, deployment, and model for telemetry and auditing.
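For example, to pick a multimodal-capable alias before calling /flexible-chat-mm (a sketch; the exact shape of the models response body is an assumption here, so adjust the parsing to the actual schema):

import requests

r = requests.get(
    "https://yourhost/api/v2/models",
    params={"validate": "true", "filter_unhealthy": "true"},
    headers={"Authorization": f"Bearer {TOKEN}"},
)
models = r.json()  # assumed: a list of model entries
mm_aliases = [m.get("alias") for m in models if m.get("supports_multimodal")]
print("multimodal aliases:", mm_aliases)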
Authentication
All endpoints require both a subscription key and a valid JWT as configured. See the Authentication page.
Error Codes
- 400 Bad Request: Invalid input
- 401 Unauthorized: Missing/invalid credentials
- 500 Internal Server Error: Unexpected failure
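A small client-side helper mapping these codes to actionable errors (a sketch; the error response body format is not specified here):

import requests

def raise_for_api_error(r: requests.Response) -> dict:
    if r.status_code == 400:
        raise ValueError(f"invalid input: {r.text}")
    if r.status_code == 401:
        raise PermissionError("missing or invalid credentials: check subscription key and JWT")
    if r.status_code == 500:
        raise RuntimeError("unexpected server failure; retry or inspect service logs")
    r.raise_for_status()  # any other non-2xx status
    return r.json()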
For advanced developer details, see the developer docs and source: rag_api_core/endpoints/v2/flexible_rag.py, schemas/v2/requests.py, schemas/v2/responses.py, and endpoints/v2/flexible_multimodal.py.