Data Pipelines
This page describes the data pipelines in RAG API Core, how they are implemented, and how they interact with the API service.
Overview
The data pipelines are responsible for:
- Ingesting documents and files from various sources
- Parsing, chunking, and transforming content
- Indexing processed data into the knowledge base (e.g., Azure AI Search)
- Maintaining metadata and tracking ingestion status
Implementation
- Ingestion: Data is ingested via scripts, scheduled jobs, or manual uploads. Supported formats include PDF, DOCX, TXT, and the other types listed under Supported File Types below.
- Parsing & Chunking: Documents are parsed and split into smaller, semantically meaningful chunks for efficient retrieval.
- Indexing: Chunks and metadata are indexed into Azure AI Search or other supported backends.
- Status Tracking: Table Storage is used to track ingestion status, errors, and progress for each document (see the indexing and status-tracking sketch below).
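As a concrete illustration of the parsing-and-chunking step, the sketch below splits extracted text into fixed-size, overlapping chunks. It is a minimal example; the chunk size, overlap, and splitting strategy are illustrative assumptions, not necessarily the exact logic used by RAG API Core.

```python
# Minimal chunking sketch. Chunk size and overlap are illustrative
# assumptions, not the values used by RAG API Core.
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into overlapping character-based chunks."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # keep some overlap so retrieval preserves context
    return chunks


if __name__ == "__main__":
    sample = "RAG API Core ingests documents, chunks them, and indexes the chunks. " * 50
    print(len(chunk_text(sample)))
```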
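The indexing and status-tracking steps might look roughly like the sketch below, which uploads one search document per chunk to an Azure AI Search index and records progress in Table Storage. The index name, table name, field names, and entity schema are assumptions made for illustration; the pipeline's real schema may differ.

```python
# Sketch of indexing chunks and tracking status. Index/table names and
# field names ("content", "source", "status", ...) are illustrative assumptions.
import os

from azure.core.credentials import AzureKeyCredential
from azure.data.tables import TableClient
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint=os.environ["SEARCH_ENDPOINT"],
    index_name="knowledge-base",                      # assumed index name
    credential=AzureKeyCredential(os.environ["SEARCH_KEY"]),
)
table_client = TableClient.from_connection_string(
    os.environ["STORAGE_CONNECTION_STRING"],
    table_name="IngestionStatus",                     # assumed table name
)


def index_document(doc_id: str, source: str, chunks: list[str]) -> None:
    # Upload one search document per chunk.
    documents = [
        {"id": f"{doc_id}-{i}", "content": chunk, "source": source}
        for i, chunk in enumerate(chunks)
    ]
    search_client.upload_documents(documents=documents)

    # Record ingestion status for the source document.
    table_client.upsert_entity({
        "PartitionKey": "ingestion",
        "RowKey": doc_id,
        "status": "indexed",
        "chunkCount": len(chunks),
    })
```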
Communication with the API
- The API service queries the indexed knowledge base (populated by the data pipelines) to retrieve relevant context for the RAG and chat endpoints (a retrieval sketch follows this list).
- Data pipelines and the API are decoupled: pipelines prepare and update the knowledge base, while the API only reads from it at runtime.
- Ingestion jobs can be triggered independently of the API, and new data becomes available to the API as soon as it is indexed.
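On the API side, retrieval is simply a read-only query against the same index the pipelines write to. The sketch below assumes a plain keyword query and a `content` field; the actual endpoints may use vector or hybrid search and different field names.

```python
# Sketch of the API's read-only access to the knowledge base.
# Query style and field names are assumptions for illustration.
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint=os.environ["SEARCH_ENDPOINT"],
    index_name="knowledge-base",            # same index the pipelines populate
    credential=AzureKeyCredential(os.environ["SEARCH_KEY"]),
)


def retrieve_context(query: str, top: int = 5) -> list[str]:
    """Fetch the most relevant chunks to ground a RAG or chat response."""
    results = search_client.search(search_text=query, top=top)
    return [doc["content"] for doc in results]
```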
Typical Workflow
1. New documents are uploaded or discovered by the pipeline.
2. The pipeline parses, chunks, and indexes the content (an orchestration sketch follows these steps).
3. Metadata and status are updated in Table Storage.
4. The API can now retrieve and use the new content for RAG and chat responses.
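Put together, the workflow above can be driven by a small orchestration loop such as the one below. `discover_documents` and `parse_document`, along with the `chunk_text` and `index_document` helpers from the earlier sketches, are hypothetical names used purely for illustration, not the pipeline's actual entry points.

```python
# Hypothetical orchestration of the typical workflow; the callables passed in
# are illustrative stand-ins for the pipeline's real discovery/parsing/indexing code.
def run_ingestion(discover_documents, parse_document, chunk_text, index_document):
    for doc_id, source, raw_bytes in discover_documents():   # step 1: discover
        text = parse_document(source, raw_bytes)              # step 2: parse
        chunks = chunk_text(text)                              # step 2: chunk
        index_document(doc_id, source, chunks)                 # steps 2-3: index + update status
        print(f"Indexed {doc_id} ({len(chunks)} chunks)")
```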
Supported File Types
- CSV Files
- JSON Files
- Text Files (.txt)
- Markdown Files (.md)
- PDF Files
- Word Files (.docx)
- Excel Files
- Links / URLs
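One common way to handle this mix of formats is an extension-to-parser dispatch table, sketched below with a few widely used libraries (pypdf, python-docx, pandas). The mapping is illustrative rather than the pipeline's actual parser registry, and URL ingestion would go through an HTTP fetch step that is omitted here.

```python
# Illustrative extension-to-parser dispatch; not the pipeline's actual registry.
from pathlib import Path

import pandas as pd
from docx import Document          # python-docx
from pypdf import PdfReader


def parse_pdf(path: Path) -> str:
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)


def parse_docx(path: Path) -> str:
    return "\n".join(p.text for p in Document(str(path)).paragraphs)


def parse_tabular(path: Path) -> str:
    reader = pd.read_excel if path.suffix.lower() == ".xlsx" else pd.read_csv
    return reader(path).to_csv(index=False)


PARSERS = {
    ".txt": lambda p: p.read_text(encoding="utf-8"),
    ".md": lambda p: p.read_text(encoding="utf-8"),
    ".json": lambda p: p.read_text(encoding="utf-8"),
    ".pdf": parse_pdf,
    ".docx": parse_docx,
    ".csv": parse_tabular,
    ".xlsx": parse_tabular,
}


def parse_file(path: Path) -> str:
    parser = PARSERS.get(path.suffix.lower())
    if parser is None:
        raise ValueError(f"Unsupported file type: {path.suffix}")
    return parser(path)
```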
Value
- Enables up-to-date, context-rich responses from the API
- Decouples data preparation from real-time inference
- Supports scalable, automated knowledge base management