RAG Service

TXT Pipeline

This document explains the TXT ingestion pipeline in detail.

Overview

The pipeline processes .txt files under conversionfiles/txt/raw/, producing parsed artifacts and indexing chunks into Azure AI Search.

Triggers & Code

  • Event Grid: TxtProcessor (BlobCreated for conversionfiles/txt/raw) — in processors/txt/processor.py.
  • HTTP: TxtProcessorHttp at /api/txt-process?blob_url=<url> — in processors/txt/http.py using Blueprints; GET with blob_url performs processing.

Contracts

  • Input: blob URL under conversionfiles/txt/raw/<file>.txt.
  • Output: artifacts under conversionfiles/txt/parsed/<filename_without_ext>/.
  • Error: HTTP returns status JSON; Event Grid writes diagnostics.
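The raw-to-parsed path contract above can be sketched as a small helper. This is an illustrative function, not the pipeline's actual code; the prefix constants follow the paths named in this page.

```python
from pathlib import PurePosixPath

RAW_PREFIX = "conversionfiles/txt/raw/"
PARSED_PREFIX = "conversionfiles/txt/parsed/"

def parsed_prefix_for(blob_path: str) -> str:
    """Map a raw TXT blob path to its parsed-artifact folder.

    Illustrative sketch of the input/output contract described above.
    """
    if not blob_path.startswith(RAW_PREFIX) or not blob_path.endswith(".txt"):
        raise ValueError(f"not a raw TXT blob: {blob_path}")
    stem = PurePosixPath(blob_path).stem  # filename without extension
    return f"{PARSED_PREFIX}{stem}/"
```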

upload_id and Status Tracking

  • Same as the JSON pipeline: read from blob metadata when present, otherwise derived as a deterministic hash.
  • Status updated throughout with counts and stages.

Processing Stages (Technical)

  1. Download the raw text file.
  2. Normalize and split into chunks (configurable chunk size/overlap if exposed).
  3. Build documents with required fields: id, source_file, source_path, file_type=txt, mime_type=text/plain, ingestion_timestamp, chunk_index, text, extra_metadata.
  4. Write artifacts:
     • txt_documents.jsonl
     • txt_documents_index_ready.jsonl
     • txt_summary.json
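Stage 2 (split into chunks with configurable size and overlap) can be sketched minimally. This is a simplified fixed-width splitter; the real pipeline may split on sentence or paragraph boundaries and read its sizes from configuration.

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split normalized text into fixed-size chunks with overlap.

    Simplified sketch: fixed-width windows advanced by (chunk_size - overlap).
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text), 1), step)]
```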

Field Schema (allowed)

A subset of the fields allowed by the JSON pipeline; includes contentVector for the embedded path.
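Building one index-ready document from a chunk might look like the sketch below. Field names follow stage 3 above; the `id` format and the helper itself are illustrative assumptions, not the pipeline's actual code.

```python
from datetime import datetime, timezone

# Allowed fields per this page (contentVector is added on the embedded path).
ALLOWED_FIELDS = {
    "id", "source_file", "source_path", "file_type", "mime_type",
    "ingestion_timestamp", "chunk_index", "text", "extra_metadata",
    "contentVector",
}

def build_document(chunk: str, index: int, source_path: str) -> dict:
    """Assemble one chunk document (illustrative; id format is assumed)."""
    name = source_path.rsplit("/", 1)[-1]
    doc = {
        "id": f"{name}-{index}",  # hypothetical id scheme
        "source_file": name,
        "source_path": source_path,
        "file_type": "txt",
        "mime_type": "text/plain",
        "ingestion_timestamp": datetime.now(timezone.utc).isoformat(),
        "chunk_index": index,
        "text": chunk,
        "extra_metadata": {},
    }
    assert set(doc) <= ALLOWED_FIELDS  # only allowed fields are uploaded
    return doc
```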

Embeddings and Indexing

  • Same pattern as JSON: lazy-load Retrieval, pre-split, fallback chunking on context window errors.
  • Unique ids for multi-part chunks with -pN suffix; retain original_id.
  • Write txt_documents_index_ready_embedded.jsonl and upload to the index.
  • Write txt_index_upload_failures.json on batch errors.
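The -pN suffixing described above can be sketched like this: when fallback chunking re-splits an over-long document, each part gets a unique suffixed id and keeps the original. The function and its signature are illustrative, not the service's actual code.

```python
def split_document(doc: dict, parts: list[str]) -> list[dict]:
    """Give each re-split part a unique -pN id and retain original_id.

    Illustrative sketch of the fallback-chunking id scheme named above.
    """
    if len(parts) == 1:
        return [dict(doc, text=parts[0])]
    out = []
    for n, part in enumerate(parts, start=1):
        sub = dict(doc, text=part)
        sub["original_id"] = doc["id"]
        sub["id"] = f"{doc['id']}-p{n}"  # unique per part
        out.append(sub)
    return out
```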

Configuration

  • File mappings to txt/raw in resources/configs/development_config.yml.
  • ai_search.* for Search; embeddings are configured globally under app.models.embeddings.
  • Optional chunking environment variables, mirroring the JSON pipeline, where defined.
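A hypothetical fragment of resources/configs/development_config.yml, showing where the settings named above might live. All key names and values below are illustrative except those this page names (ai_search, app.models.embeddings, the txt/raw mapping).

```yaml
# Hypothetical fragment — structure and values are assumptions.
ai_search:
  endpoint: https://<search-service>.search.windows.net
  index_name: rag-index
app:
  models:
    embeddings:
      deployment: <embedding-deployment-name>
file_mappings:
  txt: conversionfiles/txt/raw
```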

Error Modes & Observability

  • Download failures, chunking issues, embedding context errors (auto-fallback), schema rejections, duplicate ids.
  • Status store updates processing_stage and progress; logs summarize chunk counts and upload results.

Testing

Local HTTP

GET /api/txt-process?blob_url=<encoded> returns a JSON status body. Confirm that artifacts appear under conversionfiles/txt/parsed/<file>/.
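The blob_url query parameter must be URL-encoded. A minimal sketch of building the local request URL, assuming the Azure Functions Core Tools default port 7071 and a placeholder storage account:

```python
from urllib.parse import urlencode

# Placeholder blob URL — substitute your own storage account and path.
blob_url = "https://<account>.blob.core.windows.net/conversionfiles/txt/raw/report.txt"
query = urlencode({"blob_url": blob_url})  # percent-encodes the URL
request_url = f"http://localhost:7071/api/txt-process?{query}"
```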

Artifacts

  • conversionfiles/txt/parsed/<filename_without_ext>/ containing the above files.

Troubleshooting

  • Duplicate id failures: confirm suffixed ids for chunk parts.
  • Missing vectors: ensure embeddings enabled and contentVector present.
  • Schema issues: only allowed fields are uploaded.