DOCX Pipeline
Overview
Processes Word documents (.docx / .doc) under conversionfiles/docx/raw/.
Triggers & Code
- Event Grid: DocxProcessor for conversionfiles/docx/raw (see docx/docx_processor.py).
- Optional HTTP Blueprint:
Contracts
- Input: blob URL under conversionfiles/docx/raw/<file>.docx|.doc.
- Output: artifacts under conversionfiles/docx/parsed/<filename_without_ext>/.
- Error: failures logged and recorded; diagnostic artifact on upload errors.
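The input/output contract above can be sketched as a small path-mapping helper. This is a minimal illustration, not the pipeline's actual code: the function name parsed_prefix_for is hypothetical, and it assumes conversionfiles is the first segment of the blob URL path.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

RAW_PREFIX = "docx/raw/"
PARSED_PREFIX = "docx/parsed/"
ALLOWED_EXTS = {".docx", ".doc"}


def parsed_prefix_for(blob_url: str, container: str = "conversionfiles") -> str:
    """Map a raw blob URL to the folder that should hold its artifacts.

    Assumption: the URL path looks like /<container>/docx/raw/<file>.docx.
    """
    path = unquote(urlparse(blob_url).path).lstrip("/")
    expected = f"{container}/{RAW_PREFIX}"
    if not path.startswith(expected):
        raise ValueError(f"blob is not under {expected}: {path}")
    name = PurePosixPath(path[len(expected):])
    if name.suffix.lower() not in ALLOWED_EXTS:
        raise ValueError(f"unsupported extension: {name.suffix!r}")
    # The output folder is named after the file without its extension.
    return f"{container}/{PARSED_PREFIX}{name.stem}/"
```

Rejecting unexpected prefixes and extensions up front keeps a stray event from producing artifacts in the wrong folder.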
Stages (Technical)
- Parse paragraphs/sections with ordering.
- Build documents with section headings, chunk indices, and text.
- Write artifacts: docx_documents.jsonl, docx_documents_index_ready.jsonl, docx_summary.json.
- Embeddings + index upload; diagnostic failures to docx_index_upload_failures.json.
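The parse and build stages might look roughly like the sketch below. It assumes paragraphs have already been extracted in document order as dicts with a Word style name and text; the input shape, the id format, and the function names are illustrative assumptions, not taken from docx_processor.py.

```python
import json
from typing import Iterable


def build_documents(paragraphs: Iterable[dict], source_file: str) -> list[dict]:
    """Turn ordered paragraphs into chunk records with heading context.

    Assumed input shape: {"style": "Heading 1", "text": "..."}.
    """
    docs: list[dict] = []
    section = None
    subsection = None
    base = source_file.rsplit(".", 1)[0]
    for para in paragraphs:
        text = para["text"].strip()
        if not text:
            continue  # skip empty paragraphs
        style = para.get("style", "")
        if style.startswith("Heading"):
            # Headings set context for later chunks; they are not chunks themselves.
            if style == "Heading 1":
                section, subsection = text, None
            else:
                subsection = text
            continue
        index = len(docs)
        docs.append({
            "id": f"{base}-{index}",  # hypothetical id scheme
            "source_file": source_file,
            "chunk_index": index,
            "section_title": section,
            "subsection_title": subsection,
            "text": text,
        })
    return docs


def to_jsonl(docs: list[dict]) -> str:
    """Serialize one JSON object per line, as a .jsonl artifact expects."""
    return "".join(json.dumps(d, ensure_ascii=False) + "\n" for d in docs)
```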
Implementation Details
- Section extraction
  - Preserves heading hierarchy when available; sets section_title and subsection_title.
  - Splits large paragraphs using character/token thresholds; parts get a -pN id suffix and original_id.
- Schema mapping (allow-list)
  - id, source_file, source_path, file_type, mime_type, ingestion_timestamp, chunk_index, section_title, subsection_title, text, extra_metadata, contentVector.
- Embeddings & upload
  - Same retrieval/embedding fallback strategy as JSON/PDF.
  - Writes docx_index_upload_failures.json with {id, error_message, vector_dim, text_len}.
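The paragraph splitting and allow-list mapping described above could be sketched as follows. The 2000-character threshold, the 1-based part numbering, and the function names are assumptions for illustration; only the -pN suffix, original_id, and the field list come from this document. A token-based threshold would need a tokenizer and is not shown.

```python
ALLOWED_FIELDS = (
    "id", "source_file", "source_path", "file_type", "mime_type",
    "ingestion_timestamp", "chunk_index", "section_title",
    "subsection_title", "text", "extra_metadata", "contentVector",
)


def split_chunk(doc: dict, max_chars: int = 2000) -> list[dict]:
    """Split an oversized chunk into parts with -pN ids and original_id."""
    text = doc["text"]
    if len(text) <= max_chars:
        return [doc]
    parts: list[dict] = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to break on whitespace so words stay intact.
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        piece = text[start:end].strip()
        if piece:
            n = len(parts) + 1  # assumption: parts are numbered from 1
            parts.append({
                **doc,
                "id": f"{doc['id']}-p{n}",
                "original_id": doc["id"],
                "text": piece,
            })
        start = end
    return parts


def to_index_record(doc: dict) -> dict:
    """Keep only allow-listed fields; everything else is dropped.

    Note: original_id is not in the allow-list, so this projection drops it.
    """
    return {k: doc[k] for k in ALLOWED_FIELDS if k in doc}
```

Projecting through an explicit allow-list means an unexpected field fails quietly here rather than as a schema rejection at upload time.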
Artifacts and Examples
- docx_summary.json: { "file_name": "sample.docx", "total_documents": 42, "processed_at": "...Z" }
- docx_documents_index_ready.jsonl: one record per chunk.
- docx_documents_index_ready_embedded.jsonl: includes contentVector.
- docx_index_upload_failures.json: present only on failures.
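As a rough sketch of how the summary and the failure diagnostics could be produced, the helpers below build both payloads and include the failures file only when there is something to report. The function names and the exact timestamp format are assumptions; the field names follow the shapes listed in this document.

```python
import json
from datetime import datetime, timezone


def build_summary(file_name: str, docs: list[dict]) -> dict:
    """Build the docx_summary.json payload with a UTC timestamp ending in Z."""
    processed_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "file_name": file_name,
        "total_documents": len(docs),
        "processed_at": processed_at,
    }


def failure_record(doc: dict, error: Exception) -> dict:
    """Describe one failed upload as {id, error_message, vector_dim, text_len}."""
    vector = doc.get("contentVector") or []
    return {
        "id": doc["id"],
        "error_message": str(error),
        "vector_dim": len(vector),
        "text_len": len(doc.get("text", "")),
    }


def render_artifacts(file_name: str, docs: list[dict],
                     failures: list[dict]) -> dict[str, str]:
    """Return {artifact_name: content}; the failures file appears only on failures."""
    artifacts = {"docx_summary.json": json.dumps(build_summary(file_name, docs))}
    if failures:
        artifacts["docx_index_upload_failures.json"] = json.dumps(failures, indent=2)
    return artifacts
```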
Configuration
- storage.blob_storage.file_mappings routes .docx|.doc to docx/raw.
- ai_search.* provides Search settings; embeddings are configured globally under app.models.embeddings.
- Optional chunking envs similar to JSON if defined.
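To illustrate the extension routing, here is a minimal sketch that treats file_mappings as a plain extension-to-folder dictionary. The real structure of storage.blob_storage.file_mappings is not shown in this document, so both the dictionary shape and the helper name route_for are assumptions.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical shape for storage.blob_storage.file_mappings.
FILE_MAPPINGS = {
    ".docx": "docx/raw",
    ".doc": "docx/raw",
}


def route_for(file_name: str, mappings: dict = FILE_MAPPINGS) -> Optional[str]:
    """Return the target folder for a file, or None if its extension is unmapped."""
    ext = PurePosixPath(file_name).suffix.lower()
    return mappings.get(ext)
```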
Error Modes
- Download failures, parse/extraction issues, embedding context errors (auto-fallback), schema rejections, duplicate ids.
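Of the error modes above, duplicate ids are the easiest to catch before upload. The check below is a small sketch with a hypothetical helper name; whether the pipeline rejects, renames, or overwrites duplicates is not stated here.

```python
from collections import Counter


def find_duplicate_ids(docs: list[dict]) -> list[str]:
    """Return the ids that occur more than once, sorted for stable logging."""
    counts = Counter(d["id"] for d in docs)
    return sorted(doc_id for doc_id, n in counts.items() if n > 1)
```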
Observability
- Status store updates stage/progress; logs summarize chunks, sizes, and upload results.
Testing
- Event Grid: upload a .docx to conversionfiles/docx/raw/.
- HTTP (if available): call route with blob_url.
- Confirm artifacts in conversionfiles/docx/parsed/<file>/ and index updates.
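The artifact check can be partly automated. The sketch below takes a list of blob names (however they were obtained) and reports which expected artifacts are missing. It assumes names are relative to the conversionfiles container; the helper name missing_artifacts is hypothetical.

```python
REQUIRED_ARTIFACTS = (
    "docx_documents.jsonl",
    "docx_documents_index_ready.jsonl",
    "docx_summary.json",
)


def missing_artifacts(blob_names: list[str], file_stem: str) -> list[str]:
    """List required artifacts not found under docx/parsed/<file_stem>/."""
    prefix = f"docx/parsed/{file_stem}/"
    present = {name[len(prefix):] for name in blob_names if name.startswith(prefix)}
    return [a for a in REQUIRED_ARTIFACTS if a not in present]
```

An empty result means the three always-written artifacts are present; docx_index_upload_failures.json is deliberately not required, since it exists only on failures.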