DOCX Pipeline

Overview

Processes Word documents (.docx / .doc) under conversionfiles/docx/raw/.

Event Grid: DocxProcessor is triggered for blobs under conversionfiles/docx/raw (see docx/docx_processor.py).
Optional HTTP Blueprint:

Parse paragraphs/sections with ordering.
Build documents with section headings, chunk indices, and text.
Write artifacts: docx_documents.jsonl, docx_documents_index_ready.jsonl, docx_summary.json.
Generate embeddings and upload to the index; write diagnostic failures to docx_index_upload_failures.json.
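
The parse and build steps above can be sketched as follows. This is a minimal illustration, not the code in docx/docx_processor.py: it assumes paragraphs have already been read in document order as (style name, text) pairs, that "Heading 1" and "Heading 2" (Word's built-in style names) mark sections and subsections, and that ids are formed from the file stem plus the chunk index. The function name build_documents and the id format are hypothetical.

```python
from pathlib import PurePosixPath


def build_documents(paragraphs, source_file):
    """Turn ordered (style_name, text) pairs into chunk documents.

    Hypothetical helper: the real pipeline may track more heading
    levels and populate additional fields.
    """
    stem = PurePosixPath(source_file).stem
    docs = []
    section = None
    subsection = None
    for style, text in paragraphs:
        text = text.strip()
        if not text:
            continue  # skip empty paragraphs
        if style == "Heading 1":
            # A new section resets the current subsection.
            section, subsection = text, None
            continue
        if style == "Heading 2":
            subsection = text
            continue
        index = len(docs)  # chunk_index follows document order
        docs.append({
            "id": f"{stem}-{index}",  # assumed id format
            "source_file": source_file,
            "chunk_index": index,
            "section_title": section,
            "subsection_title": subsection,
            "text": text,
        })
    return docs
```

Body paragraphs that appear before any heading simply carry None for both titles in this sketch.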

Section extraction
- Preserves heading hierarchy when available; sets section_title and subsection_title.
- Splits large paragraphs using character/token thresholds; each part gets a -pN id suffix and an original_id.
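
A rough sketch of the paragraph splitting described above, assuming a plain character threshold (the token-based threshold is not shown). The helper name split_paragraph, the default of 2000 characters, the preference for breaking at whitespace, and numbering parts from 1 are all assumptions made for illustration.

```python
def split_paragraph(doc_id, text, max_chars=2000):
    """Split text into parts of at most max_chars characters.

    Hypothetical helper. Paragraphs under the threshold are returned
    unchanged; larger ones yield parts with a -pN id suffix and an
    original_id pointing back at the unsplit id.
    """
    if len(text) <= max_chars:
        return [{"id": doc_id, "text": text}]
    parts = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to break at the last space inside the window.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunk = text[start:end].strip()
        if chunk:
            parts.append({
                "id": f"{doc_id}-p{len(parts) + 1}",  # assumed 1-based N
                "original_id": doc_id,
                "text": chunk,
            })
        start = end
    return parts
```

Keeping original_id on every part makes it possible to regroup the parts of one paragraph after retrieval.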
Schema mapping (allow-list)
- id, source_file, source_path, file_type, mime_type, ingestion_timestamp, chunk_index, section_title, subsection_title, text, extra_metadata, contentVector.
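
An allow-list mapping like this one could be applied as in the sketch below. The field names come from the list above; the function name to_index_record and the choice to drop (rather than reject) unknown keys are assumptions.

```python
# Fields permitted in index-ready records, taken from the allow-list above.
ALLOWED_FIELDS = (
    "id", "source_file", "source_path", "file_type", "mime_type",
    "ingestion_timestamp", "chunk_index", "section_title",
    "subsection_title", "text", "extra_metadata", "contentVector",
)


def to_index_record(doc):
    """Keep only allow-listed keys; anything else is silently dropped."""
    return {key: doc[key] for key in ALLOWED_FIELDS if key in doc}
```

Filtering before upload is one way to avoid schema rejections caused by stray fields, at the cost of hiding them; a stricter variant could log the dropped keys.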
Embeddings & upload
- Uses the same retrieval/embedding fallback strategy as the JSON and PDF pipelines.
- Writes docx_index_upload_failures.json with {id, error_message, vector_dim, text_len}.
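
For illustration, a failure entry with the four fields named above might be assembled like this. The helper name failure_record is hypothetical, and treating a missing vector or text as length 0 is an assumption.

```python
import json


def failure_record(doc, error):
    """Build one diagnostic entry for docx_index_upload_failures.json."""
    vector = doc.get("contentVector") or []
    text = doc.get("text") or ""
    return {
        "id": doc.get("id"),
        "error_message": str(error),
        "vector_dim": len(vector),  # 0 when no embedding was produced
        "text_len": len(text),
    }


def failures_to_json(failures):
    """Serialize the collected entries as a JSON array (assumed layout)."""
    return json.dumps(failures, ensure_ascii=False, indent=2)
```

Recording vector_dim and text_len alongside the error helps distinguish dimension mismatches from oversized or empty text.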

docx_summary.json: { "file_name": "sample.docx", "total_documents": 42, "processed_at": "...Z" }
docx_documents_index_ready.jsonl: one record per chunk.
docx_documents_index_ready_embedded.jsonl: includes contentVector.
docx_index_upload_failures.json: present only on failures.
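
A minimal sketch of how the summary and JSONL artifacts could be produced. The field names follow the docx_summary.json example above; the exact timestamp format is not given in this document, so the UTC ISO 8601 string with a trailing Z used here is an assumption, as are the helper names.

```python
import json
from datetime import datetime, timezone


def build_summary(file_name, docs, now=None):
    """Return the dict written to docx_summary.json."""
    now = now or datetime.now(timezone.utc)
    return {
        "file_name": file_name,
        "total_documents": len(docs),
        # Assumed format: UTC, second precision, trailing "Z".
        "processed_at": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def to_jsonl(docs):
    """Serialize one JSON object per line, one record per chunk."""
    return "".join(json.dumps(d, ensure_ascii=False) + "\n" for d in docs)
```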

storage.blob_storage.file_mappings routes .docx|.doc to docx/raw.
ai_search.* provides Search settings; embeddings are configured globally under app.models.embeddings.
Optional chunking environment variables, similar to those of the JSON pipeline, apply if defined.
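
The extension routing described above amounts to a lookup from file extension to target folder. The sketch below shows the idea only; the dictionary FILE_MAPPINGS and the function target_folder are hypothetical stand-ins, and the actual shape of storage.blob_storage.file_mappings may differ.

```python
from pathlib import PurePosixPath

# Hypothetical in-memory view of the mapping: extension -> folder.
FILE_MAPPINGS = {
    ".docx": "docx/raw",
    ".doc": "docx/raw",
}


def target_folder(blob_name):
    """Return the raw folder for a blob, or None if it is not mapped."""
    extension = PurePosixPath(blob_name).suffix.lower()
    return FILE_MAPPINGS.get(extension)
```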

Common failure modes: download failures, parse/extraction issues, embedding context errors (handled by auto-fallback), schema rejections, and duplicate ids.

Status store updates stage/progress; logs summarize chunks, sizes, and upload results.