RAG Service

TXT Pipeline

This document explains the TXT ingestion pipeline in detail.

Overview

The pipeline processes .txt files under conversionfiles/txt/raw/, producing parsed artifacts and indexing chunks into Azure AI Search.

Triggers & Code

  • Event Grid: TxtProcessor (BlobCreated for conversionfiles/txt/raw) — in processors/txt/processor.py.
  • HTTP: TxtProcessorHttp at /api/txt-process?blob_url=<url> — in processors/txt/http.py using Blueprints; GET with blob_url performs processing.

Contracts

  • Input: blob URL under conversionfiles/txt/raw/<file>.txt.
  • Output: artifacts under conversionfiles/txt/parsed/<filename_without_ext>/.
  • Error: HTTP returns status JSON; Event Grid writes diagnostics.
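The raw-to-parsed path contract above can be sketched as a small helper. This is an illustrative function, not the pipeline's actual code; the prefix constants follow the paths named in this page.

```python
from pathlib import PurePosixPath

RAW_PREFIX = "conversionfiles/txt/raw/"
PARSED_PREFIX = "conversionfiles/txt/parsed/"

def parsed_prefix_for(blob_path: str) -> str:
    """Map a raw TXT blob path to its parsed-artifact folder.

    Illustrative sketch of the input/output contract described above.
    """
    if not blob_path.startswith(RAW_PREFIX) or not blob_path.endswith(".txt"):
        raise ValueError(f"not a raw TXT blob: {blob_path}")
    stem = PurePosixPath(blob_path).stem  # filename without extension
    return f"{PARSED_PREFIX}{stem}/"
```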

upload_id and Status Tracking

  • Same as the JSON pipeline: read from blob metadata when present, otherwise derived as a deterministic hash.
  • Status updated throughout with counts and stages.

Processing Stages (Technical)

  1. Download the raw text file.
  2. Normalize and split into chunks (configurable chunk size/overlap if exposed).
  3. Build documents with required fields: id, source_file, source_path, file_type=txt, mime_type=text/plain, ingestion_timestamp, chunk_index, text, extra_metadata.
  4. Write artifacts:
     • txt_documents.jsonl
     • txt_documents_index_ready.jsonl
     • txt_summary.json
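Stage 2 (split into chunks with configurable size and overlap) can be sketched minimally. This is a simplified fixed-width splitter; the real pipeline may split on sentence or paragraph boundaries and read its sizes from configuration.

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split normalized text into fixed-size chunks with overlap.

    Simplified sketch: fixed-width windows advanced by (chunk_size - overlap).
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text), 1), step)]
```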

Field Schema (allowed)

A subset of the fields allowed by the JSON pipeline; includes contentVector for the embedded path.
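Building one index-ready document from a chunk might look like the sketch below. Field names follow stage 3 above; the `id` format and the helper itself are illustrative assumptions, not the pipeline's actual code.

```python
from datetime import datetime, timezone

# Allowed fields per this page (contentVector is added on the embedded path).
ALLOWED_FIELDS = {
    "id", "source_file", "source_path", "file_type", "mime_type",
    "ingestion_timestamp", "chunk_index", "text", "extra_metadata",
    "contentVector",
}

def build_document(chunk: str, index: int, source_path: str) -> dict:
    """Assemble one chunk document (illustrative; id format is assumed)."""
    name = source_path.rsplit("/", 1)[-1]
    doc = {
        "id": f"{name}-{index}",  # hypothetical id scheme
        "source_file": name,
        "source_path": source_path,
        "file_type": "txt",
        "mime_type": "text/plain",
        "ingestion_timestamp": datetime.now(timezone.utc).isoformat(),
        "chunk_index": index,
        "text": chunk,
        "extra_metadata": {},
    }
    assert set(doc) <= ALLOWED_FIELDS  # only allowed fields are uploaded
    return doc
```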

Embeddings and Indexing

  • Same pattern as JSON: lazy-load Retrieval, pre-split, fallback chunking on context window errors.
  • Unique ids for multi-part chunks with -pN suffix; retain original_id.
  • Write txt_documents_index_ready_embedded.jsonl and upload to the index.
  • Write txt_index_upload_failures.json on batch errors.
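The -pN suffixing described above can be sketched like this: when fallback chunking re-splits an over-long document, each part gets a unique suffixed id and keeps the original. The function and its signature are illustrative, not the service's actual code.

```python
def split_document(doc: dict, parts: list[str]) -> list[dict]:
    """Give each re-split part a unique -pN id and retain original_id.

    Illustrative sketch of the fallback-chunking id scheme named above.
    """
    if len(parts) == 1:
        return [dict(doc, text=parts[0])]
    out = []
    for n, part in enumerate(parts, start=1):
        sub = dict(doc, text=part)
        sub["original_id"] = doc["id"]
        sub["id"] = f"{doc['id']}-p{n}"  # unique per part
        out.append(sub)
    return out
```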

Configuration

  • File mappings to txt/raw in resources/configs/development_config.yml.
  • ai_search.* for Search; embeddings are configured globally under app.models.embeddings.
  • Optional chunking environment variables, mirroring the JSON pipeline, where defined.
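A hypothetical fragment of resources/configs/development_config.yml, showing where the settings named above might live. All key names and values below are illustrative except those this page names (ai_search, app.models.embeddings, the txt/raw mapping).

```yaml
# Hypothetical fragment — structure and values are assumptions.
ai_search:
  endpoint: https://<search-service>.search.windows.net
  index_name: rag-index
app:
  models:
    embeddings:
      deployment: <embedding-deployment-name>
file_mappings:
  txt: conversionfiles/txt/raw
```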

Error Modes & Observability

  • Download failures, chunking issues, embedding context errors (auto-fallback), schema rejections, duplicate ids.
  • Status store updates processing_stage and progress; logs summarize chunk counts and upload results.

Testing

Local HTTP

GET /api/txt-process?blob_url=<encoded> returns a JSON status body. Confirm that artifacts appear under conversionfiles/txt/parsed/<file>/.
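The blob_url query parameter must be URL-encoded. A minimal sketch of building the local request URL, assuming the Azure Functions Core Tools default port 7071 and a placeholder storage account:

```python
from urllib.parse import urlencode

# Placeholder blob URL — substitute your own storage account and path.
blob_url = "https://<account>.blob.core.windows.net/conversionfiles/txt/raw/report.txt"
query = urlencode({"blob_url": blob_url})  # percent-encodes the URL
request_url = f"http://localhost:7071/api/txt-process?{query}"
```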

Artifacts

  • conversionfiles/txt/parsed/<filename_without_ext>/ containing the above files.

Troubleshooting

  • Duplicate id failures: confirm suffixed ids for chunk parts.
  • Missing vectors: ensure embeddings enabled and contentVector present.
  • Schema issues: only allowed fields are uploaded.