TXT Pipeline
This document explains the TXT ingestion pipeline in detail.
Overview
Processes .txt files under conversionfiles/txt/raw/, producing parsed artifacts and indexing chunks into Azure AI Search.
Triggers & Code
- Event Grid: TxtProcessor (BlobCreated for conversionfiles/txt/raw), in processors/txt/processor.py.
- HTTP: TxtProcessorHttp at /api/txt-process?blob_url=<url>, in processors/txt/http.py using Blueprints; a GET with blob_url performs processing.
Contracts
- Input: blob URL under conversionfiles/txt/raw/<file>.txt.
- Output: artifacts under conversionfiles/txt/parsed/<filename_without_ext>/.
- Error: HTTP returns status JSON; Event Grid writes diagnostics.
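The input-to-output mapping in this contract can be sketched as a small path helper. This is a minimal illustration rather than the pipeline's actual code: the function name parsed_prefix_for is hypothetical, and it assumes the container path appears verbatim in the blob URL.

```python
import posixpath
from urllib.parse import unquote, urlparse

RAW_PREFIX = "conversionfiles/txt/raw/"
PARSED_PREFIX = "conversionfiles/txt/parsed/"


def parsed_prefix_for(blob_url: str) -> str:
    """Map a raw .txt blob URL to its parsed-artifact folder (hypothetical helper)."""
    # Keep only the URL path, decode percent-escapes, and drop the leading slash.
    path = unquote(urlparse(blob_url).path).lstrip("/")
    if not path.startswith(RAW_PREFIX) or not path.lower().endswith(".txt"):
        raise ValueError(f"not a raw TXT blob: {blob_url}")
    # <filename_without_ext> is the file name with its extension removed.
    stem, _ext = posixpath.splitext(posixpath.basename(path))
    return f"{PARSED_PREFIX}{stem}/"
```

For example, a blob URL ending in conversionfiles/txt/raw/notes.txt maps to conversionfiles/txt/parsed/notes/, while a URL outside txt/raw raises ValueError.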
upload_id and Status Tracking
- Same as the JSON pipeline: upload_id is taken from metadata or, if absent, derived as a deterministic hash.
- Status updated throughout with counts and stages.
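One way to picture the upload_id resolution is the sketch below. It is an assumption-laden illustration: the helper name resolve_upload_id, the metadata key "upload_id", the choice of SHA-256, and hashing the blob path (rather than some other input) are all hypothetical and not confirmed by this document.

```python
import hashlib
from typing import Mapping, Optional


def resolve_upload_id(metadata: Optional[Mapping[str, str]], blob_path: str) -> str:
    """Prefer an upload_id from metadata; otherwise derive one deterministically."""
    if metadata:
        value = metadata.get("upload_id")
        if value:
            return value
    # Assumption: the fallback hashes the blob path; the real pipeline may hash other inputs.
    digest = hashlib.sha256(blob_path.encode("utf-8")).hexdigest()
    return digest[:32]
```

Because the fallback depends only on its input, reprocessing the same blob yields the same upload_id, which lets status tracking stay attached to one record.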
Processing Stages (Technical)
- Download the raw text file.
- Normalize and split into chunks (configurable chunk size/overlap if exposed).
- Build documents with required fields: id, source_file, source_path, file_type=txt, mime_type=text/plain, ingestion_timestamp, chunk_index, text, extra_metadata.
- Write artifacts:
  - txt_documents.jsonl
  - txt_documents_index_ready.jsonl
  - txt_summary.json
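The normalize, split, and build steps can be sketched as follows. This is a simplified illustration under stated assumptions: chunking is by character count (the real pipeline may split differently), the defaults of 1000 and 100 are placeholders, and the id format "<stem>-<chunk_index>" is hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, List


def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    # Normalize line endings and trim outer whitespace before splitting.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n").strip()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(normalized), step):
        chunks.append(normalized[start:start + chunk_size])
        if start + chunk_size >= len(normalized):
            break  # the last window already reached the end of the text
    return chunks


def build_documents(text: str, source_path: str, **split_kwargs) -> List[Dict]:
    """Build one document per chunk using the required field names."""
    source_file = source_path.rsplit("/", 1)[-1]
    stem = source_file.rsplit(".", 1)[0]
    timestamp = datetime.now(timezone.utc).isoformat()
    return [
        {
            "id": f"{stem}-{index}",  # assumed id format, not confirmed by the source
            "source_file": source_file,
            "source_path": source_path,
            "file_type": "txt",
            "mime_type": "text/plain",
            "ingestion_timestamp": timestamp,
            "chunk_index": index,
            "text": chunk,
            "extra_metadata": {},
        }
        for index, chunk in enumerate(split_text(text, **split_kwargs))
    ]
```

Each resulting dictionary would then be serialized as one line of a .jsonl artifact, for instance with json.dumps.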
Field Schema (allowed)
A subset of fields consistent with the JSON pipeline; includes contentVector on the embedded path.
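Restricting documents to an allow-list might look like the sketch below. The ALLOWED_FIELDS set is an assumption assembled from the field names mentioned in this document; the real index schema may allow more or fewer fields, and to_index_ready is a hypothetical helper name.

```python
from typing import Any, Dict, FrozenSet

# Assumed allow-list built from fields named in this document; not the confirmed schema.
ALLOWED_FIELDS: FrozenSet[str] = frozenset({
    "id", "source_file", "source_path", "file_type", "mime_type",
    "ingestion_timestamp", "chunk_index", "text", "extra_metadata",
    "original_id", "contentVector",
})


def to_index_ready(doc: Dict[str, Any],
                   allowed: FrozenSet[str] = ALLOWED_FIELDS) -> Dict[str, Any]:
    """Return a copy of doc that keeps only fields the index is assumed to accept."""
    return {key: value for key, value in doc.items() if key in allowed}
```

Dropping unknown keys before upload avoids schema rejections caused by stray working fields.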
Embeddings and Indexing
- Same pattern as JSON: lazy-load Retrieval, pre-split, fallback chunking on context window errors.
- Unique ids for multi-part chunks with -pN suffix; retain original_id.
- Write txt_documents_index_ready_embedded.jsonl and upload to the index.
- Write txt_index_upload_failures.json on batch errors.
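The -pN suffixing for multi-part chunks can be illustrated with the sketch below. It makes several assumptions: a character limit stands in for the embedding model's context window, part numbering starts at 1, and split_into_parts is a hypothetical name.

```python
from typing import Any, Dict, List


def split_into_parts(doc: Dict[str, Any], max_chars: int) -> List[Dict[str, Any]]:
    """Split an oversized chunk into parts with ids '<id>-pN', keeping original_id."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    text = doc["text"]
    if len(text) <= max_chars:
        return [doc]  # small enough: no suffix needed
    pieces = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    return [
        # Assumption: N is 1-based; the parent id is preserved in original_id.
        {**doc, "id": f"{doc['id']}-p{n}", "original_id": doc["id"], "text": piece}
        for n, piece in enumerate(pieces, start=1)
    ]
```

Because every part carries a distinct id, uploading all parts in one batch does not collide on the index key, and original_id still allows the parts to be grouped back together.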
Configuration
- File mappings to txt/raw in resources/configs/development_config.yml.
- ai_search.* for Search; embeddings are configured globally under app.models.embeddings.
- Optional chunking envs similar to JSON if defined.
Error Modes & Observability
- Download failures, chunking issues, embedding context errors (auto-fallback), schema rejections, duplicate ids.
- Status store updates processing_stage and progress; logs summarize chunk counts and upload results.
Testing
Local HTTP
/api/txt-process?blob_url=<encoded> returns a JSON status body. Confirm artifacts under conversionfiles/txt/parsed/<file>/.
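Building the request URL with a properly encoded blob_url could look like this sketch. The host http://localhost:7071 is an assumption (the usual default for a locally run Azure Functions host), and build_process_url is a hypothetical helper.

```python
from urllib.parse import quote


def build_process_url(blob_url: str, host: str = "http://localhost:7071") -> str:
    """Return the local txt-process URL with blob_url percent-encoded."""
    # safe="" also encodes ':' and '/', so the blob URL travels as a single query value.
    return f"{host}/api/txt-process?blob_url={quote(blob_url, safe='')}"
```

The resulting URL can be opened with any HTTP client; the response body should be the JSON status described above.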
Artifacts
conversionfiles/txt/parsed/<filename_without_ext>/ containing the above files.
Troubleshooting
- Duplicate id failures: confirm suffixed ids for chunk parts.
- Missing vectors: ensure embeddings enabled and contentVector present.
- Schema issues: only allowed fields are uploaded.
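When chasing duplicate id failures, a quick check over a JSONL artifact can help. The sketch below assumes each non-empty line is a JSON object with an "id" field; find_duplicate_ids is a hypothetical helper, not part of the pipeline.

```python
import json
from collections import Counter
from typing import Iterable, List


def find_duplicate_ids(jsonl_lines: Iterable[str]) -> List[str]:
    """Return the sorted ids that occur more than once in a JSONL stream."""
    counts = Counter(
        json.loads(line)["id"] for line in jsonl_lines if line.strip()
    )
    return sorted(doc_id for doc_id, n in counts.items() if n > 1)
```

Passing an open file handle for a downloaded copy of txt_documents_index_ready_embedded.jsonl works directly, since a file object iterates over its lines; an empty result means every id is unique.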