Data Pipelines
This page describes the data pipelines in RAG API Core, how they are implemented, and how they interact with the API service.
Overview
The data pipelines are responsible for:
- Ingesting documents and files from various sources
- Parsing, chunking, and transforming content
- Indexing processed data into the knowledge base (e.g., Azure AI Search)
- Maintaining metadata and tracking ingestion status
Implementation
- Ingestion: Data is ingested via scripts, scheduled jobs, or manual uploads. Supported formats include PDF, DOCX, TXT, and the other types listed under Supported File Types below.
- Parsing & Chunking: Documents are parsed and split into smaller, semantically meaningful chunks for efficient retrieval.
- Indexing: Chunks and metadata are indexed into Azure AI Search or other supported backends.
- Status Tracking: Table Storage is used to track ingestion status, errors, and progress for each document (see the indexing and status-tracking sketch below).
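As a concrete illustration of the parsing-and-chunking step, the sketch below splits extracted text into fixed-size, overlapping chunks. It is a minimal example; the chunk size, overlap, and splitting strategy are illustrative assumptions, not necessarily the exact logic used by RAG API Core.

```python
# Minimal chunking sketch. Chunk size and overlap are illustrative
# assumptions, not the values used by RAG API Core.
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into overlapping character-based chunks."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # keep some overlap so retrieval preserves context
    return chunks


if __name__ == "__main__":
    sample = "RAG API Core ingests documents, chunks them, and indexes the chunks. " * 50
    print(len(chunk_text(sample)))
```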
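The indexing and status-tracking steps might look roughly like the sketch below, which uploads one search document per chunk to an Azure AI Search index and records progress in Table Storage. The index name, table name, field names, and entity schema are assumptions made for illustration; the pipeline's real schema may differ.

```python
# Sketch of indexing chunks and tracking status. Index/table names and
# field names ("content", "source", "status", ...) are illustrative assumptions.
import os

from azure.core.credentials import AzureKeyCredential
from azure.data.tables import TableClient
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint=os.environ["SEARCH_ENDPOINT"],
    index_name="knowledge-base",                      # assumed index name
    credential=AzureKeyCredential(os.environ["SEARCH_KEY"]),
)
table_client = TableClient.from_connection_string(
    os.environ["STORAGE_CONNECTION_STRING"],
    table_name="IngestionStatus",                     # assumed table name
)


def index_document(doc_id: str, source: str, chunks: list[str]) -> None:
    # Upload one search document per chunk.
    documents = [
        {"id": f"{doc_id}-{i}", "content": chunk, "source": source}
        for i, chunk in enumerate(chunks)
    ]
    search_client.upload_documents(documents=documents)

    # Record ingestion status for the source document.
    table_client.upsert_entity({
        "PartitionKey": "ingestion",
        "RowKey": doc_id,
        "status": "indexed",
        "chunkCount": len(chunks),
    })
```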
Communication with the API
- The API service queries the indexed knowledge base (populated by the data pipelines) to retrieve relevant context for the RAG and chat endpoints (a retrieval sketch follows this list).
- Data pipelines and the API are decoupled: pipelines prepare and update the knowledge base, while the API only reads from it at runtime.
- Ingestion jobs can be triggered independently of the API, and new data becomes available to the API as soon as it is indexed.
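On the API side, retrieval is simply a read-only query against the same index the pipelines write to. The sketch below assumes a plain keyword query and a `content` field; the actual endpoints may use vector or hybrid search and different field names.

```python
# Sketch of the API's read-only access to the knowledge base.
# Query style and field names are assumptions for illustration.
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint=os.environ["SEARCH_ENDPOINT"],
    index_name="knowledge-base",            # same index the pipelines populate
    credential=AzureKeyCredential(os.environ["SEARCH_KEY"]),
)


def retrieve_context(query: str, top: int = 5) -> list[str]:
    """Fetch the most relevant chunks to ground a RAG or chat response."""
    results = search_client.search(search_text=query, top=top)
    return [doc["content"] for doc in results]
```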
Typical Workflow
1. New documents are uploaded or discovered by the pipeline.
2. The pipeline parses, chunks, and indexes the content (an orchestration sketch follows these steps).
3. Metadata and status are updated in Table Storage.
4. The API can now retrieve and use the new content for RAG and chat responses.
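Put together, the workflow above can be driven by a small orchestration loop such as the one below. `discover_documents` and `parse_document`, along with the `chunk_text` and `index_document` helpers from the earlier sketches, are hypothetical names used purely for illustration, not the pipeline's actual entry points.

```python
# Hypothetical orchestration of the typical workflow; the callables passed in
# are illustrative stand-ins for the pipeline's real discovery/parsing/indexing code.
def run_ingestion(discover_documents, parse_document, chunk_text, index_document):
    for doc_id, source, raw_bytes in discover_documents():   # step 1: discover
        text = parse_document(source, raw_bytes)              # step 2: parse
        chunks = chunk_text(text)                              # step 2: chunk
        index_document(doc_id, source, chunks)                 # steps 2-3: index + update status
        print(f"Indexed {doc_id} ({len(chunks)} chunks)")
```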
Supported File Types
- CSV Files
- JSON Files
- Text Files (.txt)
- Markdown Files (.md)
- PDF Files
- Word Files (.docx)
- Excel Files
- Links / URLs
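One common way to handle this mix of formats is an extension-to-parser dispatch table, sketched below with a few widely used libraries (pypdf, python-docx, pandas). The mapping is illustrative rather than the pipeline's actual parser registry, and URL ingestion would go through an HTTP fetch step that is omitted here.

```python
# Illustrative extension-to-parser dispatch; not the pipeline's actual registry.
from pathlib import Path

import pandas as pd
from docx import Document          # python-docx
from pypdf import PdfReader


def parse_pdf(path: Path) -> str:
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)


def parse_docx(path: Path) -> str:
    return "\n".join(p.text for p in Document(str(path)).paragraphs)


def parse_tabular(path: Path) -> str:
    reader = pd.read_excel if path.suffix.lower() == ".xlsx" else pd.read_csv
    return reader(path).to_csv(index=False)


PARSERS = {
    ".txt": lambda p: p.read_text(encoding="utf-8"),
    ".md": lambda p: p.read_text(encoding="utf-8"),
    ".json": lambda p: p.read_text(encoding="utf-8"),
    ".pdf": parse_pdf,
    ".docx": parse_docx,
    ".csv": parse_tabular,
    ".xlsx": parse_tabular,
}


def parse_file(path: Path) -> str:
    parser = PARSERS.get(path.suffix.lower())
    if parser is None:
        raise ValueError(f"Unsupported file type: {path.suffix}")
    return parser(path)
```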
Value
- Enables up-to-date, context-rich responses from the API
- Decouples data preparation from real-time inference
- Supports scalable, automated knowledge base management