RAG Service

Chat Endpoints: Conversational AI and Knowledge Retrieval

These endpoints provide a unified interface for interacting with your organization's knowledge base using advanced conversational AI. They are designed to:

  • Enable natural language chat with your data, documents, and business content—not just with a generic LLM.
  • Support both single-turn Q&A and multi-turn conversations, maintaining context and history for richer, more relevant answers.
  • Allow users to ask questions, summarize content, extract facts, or perform research across your indexed knowledge base.
  • Integrate Retrieval-Augmented Generation (RAG): every answer can cite, link, or explain the sources used, giving transparency and traceability.
  • Accept flexible, fine-grained configuration for search, filtering, prompt engineering, and output formatting—empowering both business users and developers.
  • Support multimodal input (text + images) for scenarios like document analysis, diagram Q&A, or visual context.

Business Value:

  • These endpoints let you build chatbots, assistants, and automation that are grounded in your actual data—not just generic model knowledge.
  • They help ensure answers are accurate, up-to-date, and explainable, with links back to the original content.
  • You can tailor the experience for customer support, internal knowledge management, research, compliance, and more.

How it works:

  • The API receives a user question (and optionally, chat history, images, or custom instructions).
  • It queries your knowledge base using Azure AI Search and other connectors, retrieving the most relevant content.
  • The system combines this content with your prompt templates and sends it to the LLM for answer generation.
  • The response includes the answer, sources, and rich metadata for traceability and further automation.
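
For orientation, here is a minimal sketch of that flow from a Python client. It is not the only way to call the service: the host in BASE_URL and the TOKEN placeholder are assumptions you must adapt to your deployment, and the payload fields follow the /flexible-chat schema documented below.

import requests

TOKEN = "<your-jwt>"                    # placeholder; see the Authentication section
BASE_URL = "https://yourhost/api/v2"    # assumed host; adjust to your deployment

payload = {
    "question": "What can you tell me about yourself?",
    "history": [],                      # prior turns, if any
    "fetch_args": {                     # per-fetcher retrieval settings
        "AzureSearchFetcher": {"top_k": 5, "vector_search": True}
    },
}

resp = requests.post(
    f"{BASE_URL}/flexible-chat",
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=60,
)
resp.raise_for_status()
body = resp.json()

print(body["answer"])                   # generated answer for this turn
for src in body.get("metadata", []):    # retrieved sources used for grounding
    print(src.get("filename"), (src.get("text") or "")[:80])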


What's new (2025-09-20)

  • Multi-completions: responses may include answers[] when n > 1; answer mirrors answers[0]. Standardized precedence for n: override_config.n > override_config.params.n > config.llms.defaults.params.n.
  • Optional parameter reflection: when available, response_metadata.parameters includes effective values like temperature, top_p, max_tokens, penalties, and stop.

What's new (2025-09-19)

  • Responses now include response_metadata.token_usage when available from the router.
  • response_metadata.reasoning_effort reflects the requested/defaulted value; for providers that don’t accept "auto", the router sends a provider-safe value while preserving your intent in metadata. response_metadata.reasoning_tokens (v2.1.1+) may appear when the provider reports separate reasoning token usage.
  • Timings parity: response_metadata.timings exposes orchestrator sub-steps; multimodal responses also include a separate retrieval block indicating whether KB search was attempted or skipped.
  • All variants now include response_metadata.reasoning_effective showing requested vs. sent values and whether sanitization occurred.

Conversational API with Retrieval-Augmented Generation and rich templating controls.

  • POST /flexible-chat — primary multi-turn chat
  • POST /flexible-chat-mm — multimodal chat (text + images)
  • POST /flexible-rag — single-turn Q&A

What this covers

  • Exact request/response shapes as implemented in code
  • All optional prompt/template override knobs (inline and file-based)
  • Fetch arguments passed through to Azure AI Search via the orchestrator
  • How history and sources are represented and returned

Chat Functionality: API Endpoints Collection

This page documents all endpoints used for chat and conversational AI, with practical examples and usage tips.


1) POST /flexible-chat (Primary)

Request body (FlexibleChatRequest)

  • question (string, required): The latest user message for this turn.
  • fetch_args (object, optional, default {}): Per-fetcher configuration. Common for Azure AI Search fetcher:
    • AzureSearchFetcher (object):
      • query (string): Overrides the query text sent to search; defaults to question when omitted.
      • filter (string): OData filter, e.g. speaker eq 'david'.
      • top_k (int): Number of docs; typical 1–10.
      • include_total_count (bool)
      • facets (array of string): e.g. ["speaker,count:5", "topic"].
      • highlight_fields (array of string): e.g. ["text"].
      • select_fields (array of string): e.g. [ "id", "filename", "block_id", "chunk_index", "part", "speaker", "timestamp", "tokens", "video_url", "keyword", "topic", "text" ].
      • vector_search (bool): Toggle hybrid/vector search.
  • history (array, optional, default []): Prior messages, each item is a ChatMessage:
    • role (string): user or assistant.
    • content (string)
    • sources (array, optional): Source docs attached to that message (see response metadata shape below).
  • ab_testing (object, optional):
    • user_id (string, required)
    • session_id (string, optional)
    • experiment_name (string, optional)
  • System prompt customization (independent of the response template customization below; you may provide the inline template, the file path, both, or neither):
    • system_prompt_template (string, optional): Inline Jinja2 system prompt template.
    • system_prompt_file (string, optional): Path to system prompt file. Source is repo filesystem or blob, per config.
  • Response template customization:
    • response_template (string, optional): Inline Jinja2 response template to format the assistant answer.
    • response_template_file (string, optional): Path to response template file (filesystem/blob per config).
  • template_variables (object, optional, default {}): Variables available when rendering templates (always includes question, history, user_id, metadata).
  • override_config (object, optional, default {}): Runtime configuration overrides (currently logged/forward-looking). You can also set a per-request model alias override using override_config.llm (preferred) or override_config.model.
    • Reasoning controls (provider-aware):
      • override_config.reasoning_effort or override_config.reasoning.effort — accepted values: "low" | "medium" | "high" | "auto".
      • Notes:
        • If omitted, the system may default to "auto" for non-Azure providers; Azure providers map to their supported options.
        • Some non-Azure providers reject "auto"; the router preserves your intent and records what was actually sent via response_metadata.reasoning_effective.
  • metadata (object, optional, default {}): Freeform metadata you want echoed back/used in templates.

Note: The chat endpoint always attempts a light retrieval for the latest turn; there is no skip_knowledge_base flag on /flexible-chat (that option exists only on /flexible-rag).

Minimal example:

{
    "question": "What can you tell me about yourself?",
    "fetch_args": {
        "AzureSearchFetcher": {
            "top_k": 5,
            "vector_search": true
        }
    },
    "history": []
}

Full example with inline prompts and template variables:

{
    "question": "What can you tell me about yourself?",
    "fetch_args": {
        "AzureSearchFetcher": {
            "query": "David role summary",
            "filter": "speaker eq 'david'",
            "top_k": 5,
            "facets": ["speaker,count:5", "topic"],
            "highlight_fields": ["text"],
            "select_fields": ["id", "filename", "chunk_index", "speaker", "timestamp", "text"],
            "vector_search": true
        }
    },
    "history": [
        {"role": "user", "content": "What can you tell me about yourself?"},
        {"role": "assistant", "content": "David is a principal…"}
    ],
    "ab_testing": {"user_id": "u-123", "session_id": "s-1", "experiment_name": "chat-layout"},
    "system_prompt_template": "You are a helpful assistant. Focus on facts.",
    "response_template": "Answer: {{ answer }}\nSources: {{ metadata | length }}",
    "template_variables": {"tone": "concise"},
        "override_config": {"llm": {"temperature": 0.1}, "reasoning_effort": "auto"},
    "metadata": {"customer_tier": "gold"}
}
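
As a further illustration of the templating knobs, the sketch below sends a multi-line inline Jinja2 response template together with template_variables. The variable names follow the rendering context described above (question, history, user_id, metadata plus your custom variables such as tone), but the fields available on each source object (filename, chunk_index) depend on your select_fields, so treat this as an assumption to adapt.

import requests

# TOKEN as in the earlier sketch (your JWT bearer token).
response_template = (
    "{{ answer }}\n\n"
    "Sources ({{ metadata | length }}):\n"
    "{% for src in metadata %}- {{ src.filename }} (chunk {{ src.chunk_index }})\n{% endfor %}"
)

payload = {
    "question": "What can you tell me about yourself?",
    "history": [],
    "fetch_args": {"AzureSearchFetcher": {"top_k": 5, "vector_search": True}},
    "system_prompt_template": "You are a helpful assistant. Keep the tone {{ tone }}.",
    "response_template": response_template,
    "template_variables": {"tone": "concise"},
}

r = requests.post(
    "https://yourhost/api/v2/flexible-chat",
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
print(r.json()["template_info"])  # reports inline sources and the variables used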

Response body (FlexibleChatResponse)

  • response_id (string): Server-generated UUID for telemetry correlation.
  • answer (string): Assistant message for this turn. When answers[] is present, this equals the first item.
  • answers[] (array, optional): Alternative completions when n > 1.
  • metadata (array of object): Retrieved sources for this turn only. Typical fields include those requested in select_fields.
  • history (array of ChatMessage): Full history including this turn; assistant messages include sources for traceability.
  • ab_testing (object, optional):
    • experiment_name (string)
    • variant_name (string)
    • session_id (string, optional)
  • errors (array, optional): Structured upstream issues (if any). Many upstream errors are logged but suppressed with a safe fallback answer.
  • data (object, optional): Raw bag with fetcher results for rich clients.
  • template_info (object):
    • system_prompt_source (string): default | inline | file:<path> | blob:<path>
    • response_template_source (string): default | inline | file:<path> | blob:<path>
    • template_variables_used (array of string)
  • processing_time (number): Seconds.
  • tokens_used (number, optional)
  • response_metadata (object): includes
    • config_overrides_applied (bool)
    • custom_templates_used (bool)
    • request_metadata (object)
    • conversation_length (int), is_multi_turn (bool)
    • total_sources_retrieved (int)
    • conversation_sources_summary (object): counts of assistant turns with sources
    • question (string)
    • system_prompt (string)
    • errors_present (bool)
    • model_used (object): alias/provider/deployment/model when available
    • token_usage (object, optional): { prompt, completion, total } when available from the model/router
    • parameters (object, optional): effective parameter reflection (e.g., temperature, top_p, max_tokens, frequency_penalty, presence_penalty, repetition_penalty, stop). Not all providers populate this.
    • reasoning_effort (string, optional): requested/defaulted effort; UI may display "auto" while provider receives default behavior
    • reasoning_effective (object, optional): { requested, sent_effort, provider, source, sanitized } — reflects the effective values sent to the provider and whether any sanitization occurred
    • timings (object):
      • For /flexible-rag: always includes sub-steps from the orchestrator and endpoint wrapper, e.g. orchestrator.fetch_total, orchestrator.fetchers, orchestrator.prompt_build, orchestrator.llm_generate, orchestrator.metadata_extract, orchestrator.history_build, as well as top-level orchestrator_call and build_response.
      • For /flexible-chat: not currently returned.
    • retrieval (object, multimodal only): { attempted: bool, skipped_reason: string|null, fetch_args_present: bool }
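
A short sketch of how a client might inspect these fields; it only reads values documented on this page and uses .get() throughout because several fields are optional and provider-dependent.

def summarize_chat_response(d: dict) -> None:
    """Print commonly used fields from a parsed /flexible-chat response body."""
    print("answer:", d["answer"])
    for msg in d.get("history", []):
        n_sources = len(msg.get("sources") or [])
        print(f'  {msg["role"]}: {msg["content"][:60]} ({n_sources} sources)')

    meta = d.get("response_metadata") or {}
    print("model_used:", meta.get("model_used"))
    print("token_usage:", meta.get("token_usage"))              # optional
    print("parameters:", meta.get("parameters"))                # optional, provider-dependent
    print("reasoning_effective:", meta.get("reasoning_effective"))
    print("timings:", meta.get("timings"))                      # /flexible-rag sub-steps; not on /flexible-chat
    print("template_info:", d.get("template_info"))
    print("processing_time (s):", d.get("processing_time"))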

Multi-completions (N)

Business impact: Compare multiple alternatives for quality, style, or safety and select the best response.

Developer details:

  • Request multiple completions with override_config.n (preferred) or override_config.params.n.
  • Precedence: override_config.n > override_config.params.n > config.llms.defaults.params.n.
  • When answers[] is present, answer === answers[0] for backward compatibility.

Examples

curl -X POST "https://yourhost/api/v2/flexible-chat" \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
        "question": "Suggest two concise introductions.",
        "history": [],
        "override_config": {"llm": "default", "n": 2, "params": {"temperature": 0.2}},
        "fetch_args": {"AzureSearchFetcher": {"top_k": 0}}
    }'

Python (requests):

import requests

# TOKEN holds your JWT bearer token (see Authentication below).
payload = {
    "question": "Suggest two concise introductions.",
    "history": [],
    "override_config": {"llm": "default", "n": 2, "params": {"temperature": 0.2}},
    "fetch_args": {"AzureSearchFetcher": {"top_k": 0}}
}
r = requests.post(
    "https://yourhost/api/v2/flexible-chat",
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
d = r.json()
print(d.get("answers"), d.get("answer"))

2) POST /flexible-chat-mm (Multimodal)

Request body (FlexibleChatRequest, multimodal extension)

  • All fields from /flexible-chat are supported.
  • To send images, add a metadata.images array:
    • Each image: { "url": "https://..." } or { "data_url": "data:image/png;base64,..." }
    • Optional: detail ("low" | "high") for Azure OpenAI image detail level.

Example:

{
    "question": "What is in this image?",
    "metadata": {
        "images": [
            { "url": "https://example.com/pic.png", "detail": "high" },
            { "data_url": "data:image/png;base64,iVBORw0KGgo..." }
        ]
    },
    "history": []
}

Validation notes:

  • At least one image is required; 400 if missing.
  • url must be https, or use data_url for inline content.
  • Data URLs > ~5MB are rejected (413).
  • Images are only processed on the latest user turn.
  • If the selected model alias does not support multimodal, the API returns 400 with a helpful message and try_models suggestions. Use /api/v2/models?validate=true to list aliases and check each model’s supports_multimodal flag.
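
If you need to send a local file rather than an https URL, a data_url can be built with the Python standard library. This is only a minimal sketch; the file name is hypothetical, and the image must stay under the ~5MB limit noted above.

import base64
from pathlib import Path

image_path = Path("diagram.png")  # hypothetical local PNG
b64 = base64.b64encode(image_path.read_bytes()).decode("ascii")

payload = {
    "question": "What is in this image?",
    "history": [],
    "metadata": {
        "images": [
            {"data_url": f"data:image/png;base64,{b64}"}
        ]
    },
}
# POST payload to /flexible-chat-mm as in the earlier examples.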

Response body

  • Identical to /flexible-chat, with sources and history reflecting multimodal context.

Tips: Use for tasks requiring image+text input (e.g., diagram Q&A, OCR, visual context).


3) POST /flexible-rag (Single-turn Q&A)

Request body (FlexibleRagRequest)

  • question (string, required): The user’s question for this turn.
  • skip_knowledge_base (boolean, optional, default false): When true, skips knowledge base search for faster responses without retrieval-augmented generation.
  • fetch_args (object, optional, default {}): Same as chat; see above for AzureSearchFetcher options.
  • history (array, optional, default []): Prior messages (rarely used for single-turn, but supported for context).
  • ab_testing (object, optional):
    • user_id (string, required)
    • session_id (string, optional)
    • experiment_name (string, optional)
  • System prompt customization:
    • system_prompt_template (string, optional): Inline Jinja2 system prompt template.
    • system_prompt_file (string, optional): Path to system prompt file.
  • Response template customization:
    • response_template (string, optional): Inline Jinja2 response template.
    • response_template_file (string, optional): Path to response template file.
  • template_variables (object, optional, default {}): Variables for template rendering.
  • override_config (object, optional, default {}): Runtime config overrides. You can select a model alias using override_config.llm.
  • metadata (object, optional, default {}): Freeform metadata.

Minimal example:

{
    "question": "What is RAG?",
    "fetch_args": {
        "AzureSearchFetcher": {
            "top_k": 3,
            "vector_search": true
        }
    }
}

Full example with custom prompt and template:

{
    "question": "Summarize the main topics in the transcript.",
    "fetch_args": {
        "AzureSearchFetcher": {
            "query": "main topics",
            "top_k": 5,
            "facets": ["topic"],
            "vector_search": true
        }
    },
    "system_prompt_template": "You are a summarizer. Focus on key topics only.",
    "response_template": "Summary: {{ answer }}",
    "template_variables": {"audience": "executive"},
    "override_config": {"llm": {"temperature": 0.2}},
    "metadata": {"request_id": "abc-123"}
}

Example with knowledge base search skipped (faster response):

{
    "question": "What is the capital of France?",
    "skip_knowledge_base": true,
    "system_prompt_template": "You are a helpful assistant. Answer directly without referencing external sources.",
    "template_variables": {"style": "concise"}
}
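
The sketch below illustrates the trade-off by sending the same question to /flexible-rag with and without skip_knowledge_base and comparing processing_time and the number of retrieved sources. The host and TOKEN are assumptions carried over from the earlier sketches, and actual latencies depend on your index and model.

import requests

def ask(question: str, skip_kb: bool) -> dict:
    """Call /flexible-rag, optionally skipping knowledge base retrieval."""
    payload = {"question": question, "skip_knowledge_base": skip_kb}
    if not skip_kb:
        payload["fetch_args"] = {"AzureSearchFetcher": {"top_k": 3, "vector_search": True}}
    r = requests.post(
        "https://yourhost/api/v2/flexible-rag",
        json=payload,
        headers={"Authorization": f"Bearer {TOKEN}"},  # TOKEN as in earlier sketches
    )
    r.raise_for_status()
    return r.json()

with_kb = ask("What is RAG?", skip_kb=False)
without_kb = ask("What is RAG?", skip_kb=True)

print("with KB:   ", with_kb["processing_time"], "s,", len(with_kb.get("metadata", [])), "sources")
print("without KB:", without_kb["processing_time"], "s,", len(without_kb.get("metadata", [])), "sources")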

Response body (FlexibleRagResponse)

  • response_id (string): Server-generated UUID.
  • answer (string): The answer for this turn.
  • metadata (array of object): Retrieved sources for this turn.
  • history (array): Single-turn history (or with context if provided).
  • ab_testing (object, optional):
    • experiment_name (string)
    • variant_name (string)
    • session_id (string, optional)
  • errors (array, optional): Structured upstream issues.
  • data (object, optional): Raw fetcher results.
  • template_info (object):
    • system_prompt_source, response_template_source, template_variables_used
  • processing_time (number): Seconds.
  • tokens_used (number, optional)
  • response_metadata (object):
    • config_overrides_applied, custom_templates_used, request_metadata, question, system_prompt, errors_present, etc.
    • queried_indexes (array, optional): unique index names referenced by retrieved results for this turn.

Tips:

  • Use for single-turn Q&A, document summarization, or when chat history is not needed.
  • All prompt/template overrides and fetch_args are supported as in chat.


Model selection and discovery

  • Default alias is resolved from configuration. You can override per request with override_config.llm (or override_config.model).
  • Discover configured models via GET /api/v2/models. Add ?validate=true to include validation results; add &deep=true for client init/auth checks. Use &filter_unhealthy=true to filter out models that failed validation.
  • Each model entry may include supports_multimodal: true|false. Use this to choose aliases for /flexible-chat-mm and /flexible-rag-mm.
  • Responses include response_metadata.model_used where available, with alias, provider, deployment, and model for telemetry and auditing.
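
A hedged sketch of using the discovery endpoint to pick a multimodal-capable alias. The alias and supports_multimodal field names follow the wording on this page, but the exact response envelope is not specified here, so the code reads it defensively; adjust to your deployment.

import requests

r = requests.get(
    "https://yourhost/api/v2/models",
    params={"validate": "true"},
    headers={"Authorization": f"Bearer {TOKEN}"},  # TOKEN as in earlier sketches
)
data = r.json()

# Exact envelope may vary by version: accept either a bare list or a "models" key.
entries = data if isinstance(data, list) else data.get("models", [])

multimodal_aliases = [e.get("alias") for e in entries if e.get("supports_multimodal")]
print("Aliases suitable for /flexible-chat-mm:", multimodal_aliases)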

Authentication

All endpoints require both a subscription key and a valid JWT as configured. See the Authentication page.

Error Codes

  • 400 Bad Request: Invalid input
  • 401 Unauthorized: Missing/invalid credentials
  • 500 Internal Server Error: Unexpected failure

For advanced developer details, see the developer docs and source: rag_api_core/endpoints/v2/flexible_rag.py, schemas/v2/requests.py, schemas/v2/responses.py, and endpoints/v2/flexible_multimodal.py.