RAG Service

DOCX Pipeline

Overview

The DOCX pipeline processes Word documents (.docx and .doc) uploaded under conversionfiles/docx/raw/.

Triggers & Code

  • Event Grid: DocxProcessor for conversionfiles/docx/raw (see docx/docx_processor.py).
  • Optional HTTP Blueprint:

Contracts

  • Input: blob URL under conversionfiles/docx/raw/<file>.docx|.doc.
  • Output: artifacts under conversionfiles/docx/parsed/<filename_without_ext>/.
  • Error: failures are logged and recorded; index upload errors also produce a diagnostic artifact (docx_index_upload_failures.json).
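
The input/output contract above can be sketched as a small path helper. This is a minimal sketch, not the service's code: the function name parsed_prefix_for and the storage host in the usage note are hypothetical, and it assumes the blob URL path starts with conversionfiles/.

```python
from posixpath import basename, splitext
from urllib.parse import unquote, urlparse

RAW_PREFIX = "conversionfiles/docx/raw/"
PARSED_PREFIX = "conversionfiles/docx/parsed/"


def parsed_prefix_for(blob_url: str) -> str:
    """Map a raw DOCX blob URL to its parsed-artifact folder (hypothetical helper)."""
    # Decode percent-escapes and drop the leading slash of the URL path.
    path = unquote(urlparse(blob_url).path).lstrip("/")
    if not path.startswith(RAW_PREFIX):
        raise ValueError(f"not under {RAW_PREFIX}: {path}")
    stem, ext = splitext(basename(path))
    if ext.lower() not in (".docx", ".doc"):
        raise ValueError(f"unsupported extension: {ext!r}")
    # Output folder is named after the file name without its extension.
    return f"{PARSED_PREFIX}{stem}/"
```

For example, a URL ending in conversionfiles/docx/raw/sample.docx maps to conversionfiles/docx/parsed/sample/.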

Stages (Technical)

  • Parse paragraphs/sections with ordering.
  • Build documents with section headings, chunk indices, and text.
  • Write artifacts: docx_documents.jsonl, docx_documents_index_ready.jsonl, docx_summary.json.
  • Generate embeddings and upload to the index; upload failures are written to docx_index_upload_failures.json.
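
The first two stages can be illustrated with a short sketch. It assumes the parser has already produced (style_name, text) pairs in document order and that headings use Word's built-in "Heading 1" and "Heading 2" style names; the Chunk class and build_chunks function are hypothetical names, not the pipeline's actual API.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_index: int
    section_title: str
    subsection_title: str
    text: str


def build_chunks(paragraphs):
    """Turn ordered (style_name, text) pairs into chunks with heading context."""
    chunks = []
    section = subsection = ""
    for style, text in paragraphs:
        text = text.strip()
        if not text:
            continue  # skip empty paragraphs
        if style == "Heading 1":
            # A new section resets the subsection.
            section, subsection = text, ""
        elif style == "Heading 2":
            subsection = text
        else:
            # Body text: chunk_index follows document order.
            chunks.append(Chunk(len(chunks), section, subsection, text))
    return chunks
```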

Implementation Details

  • Section extraction
    • Preserves heading hierarchy when available; sets section_title and subsection_title.
    • Splits large paragraphs using character/token thresholds; each part gets a -pN id suffix and an original_id field.
  • Schema mapping (allow-list)
    • id, source_file, source_path, file_type, mime_type, ingestion_timestamp, chunk_index, section_title, subsection_title, text, extra_metadata, contentVector.
  • Embeddings & upload
    • Uses the same retrieval/embedding fallback strategy as the JSON and PDF pipelines.
    • Writes docx_index_upload_failures.json with {id, error_message, vector_dim, text_len}.
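
The splitting, allow-list mapping, and failure record described above might look roughly like the following. This is a sketch under stated assumptions: only a character threshold is modeled (no token counting), the 2000-character default is hypothetical, and part numbering is assumed to start at 1. The function names are illustrative.

```python
ALLOWED_FIELDS = (
    "id", "source_file", "source_path", "file_type", "mime_type",
    "ingestion_timestamp", "chunk_index", "section_title",
    "subsection_title", "text", "extra_metadata", "contentVector",
)


def split_document(doc: dict, max_chars: int = 2000) -> list[dict]:
    """Split an oversized document into parts with -pN ids (assumed to start at 1)."""
    text = doc["text"]
    if len(text) <= max_chars:
        return [doc]
    parts = []
    for n, start in enumerate(range(0, len(text), max_chars), start=1):
        part = dict(doc)
        part["id"] = f"{doc['id']}-p{n}"
        part["original_id"] = doc["id"]  # link back to the unsplit document
        part["text"] = text[start:start + max_chars]
        parts.append(part)
    return parts


def to_index_record(doc: dict) -> dict:
    """Keep only allow-listed fields; anything else is dropped."""
    return {key: doc[key] for key in ALLOWED_FIELDS if key in doc}


def failure_record(doc: dict, error: Exception) -> dict:
    """Build one entry for docx_index_upload_failures.json."""
    vector = doc.get("contentVector") or []
    return {
        "id": doc["id"],
        "error_message": str(error),
        "vector_dim": len(vector),
        "text_len": len(doc.get("text", "")),
    }
```

Note that original_id is not in the allow-list, so this sketch drops it at mapping time; whether the service carries it elsewhere (for example in extra_metadata) is not stated here.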

Artifacts and Examples

  • docx_summary.json: { "file_name": "sample.docx", "total_documents": 42, "processed_at": "...Z" }
  • docx_documents_index_ready.jsonl: one record per chunk.
  • docx_documents_index_ready_embedded.jsonl: includes contentVector.
  • docx_index_upload_failures.json: present only on failures.
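
As an illustration of the artifact formats, the sketch below serializes records to JSONL (one JSON object per line) and builds a summary with the three fields shown above. The helper names are hypothetical, and the second-precision UTC timestamp is an assumption; the source only shows that processed_at ends in "Z".

```python
import json
from datetime import datetime, timezone


def to_jsonl(records) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def build_summary(file_name: str, records) -> dict:
    """Build the docx_summary.json payload for a processed file."""
    # Assumed timestamp format: UTC, second precision, trailing "Z".
    processed_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "file_name": file_name,
        "total_documents": len(records),
        "processed_at": processed_at,
    }
```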

Configuration

  • storage.blob_storage.file_mappings routes .docx|.doc to docx/raw.
  • ai_search.* provides Search settings; embeddings are configured globally under app.models.embeddings.
  • Optional chunking environment variables, similar to those of the JSON pipeline, apply if defined.
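
To make the routing rule concrete, here is a sketch that treats storage.blob_storage.file_mappings as a plain extension-to-folder dictionary. That shape is an assumption for illustration only; the real configuration format is not shown in this document, and route_for is a hypothetical helper.

```python
from posixpath import splitext

# Hypothetical shape for storage.blob_storage.file_mappings.
FILE_MAPPINGS = {".docx": "docx/raw", ".doc": "docx/raw"}


def route_for(file_name: str, mappings=FILE_MAPPINGS):
    """Return the raw folder for a file name, or None if the extension is unmapped."""
    ext = splitext(file_name)[1].lower()  # case-insensitive extension match
    return mappings.get(ext)
```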

Error Modes

  • Download failures, parse/extraction issues, embedding context errors (auto-fallback), schema rejections, duplicate ids.
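
Of the error modes listed, duplicate ids are the easiest to catch before upload. A minimal sketch of such a pre-check, assuming each record is a dictionary with an id key (the function name is hypothetical and this is not necessarily how the service detects duplicates):

```python
from collections import Counter


def duplicate_ids(records):
    """Return the sorted ids that occur more than once in the batch."""
    counts = Counter(record["id"] for record in records)
    return sorted(doc_id for doc_id, n in counts.items() if n > 1)
```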

Observability

  • Status store updates stage/progress; logs summarize chunks, sizes, and upload results.

Testing

  • Event Grid: upload a .docx to conversionfiles/docx/raw/.
  • HTTP (if available): call route with blob_url.
  • Confirm artifacts in conversionfiles/docx/parsed/<file>/ and index updates.
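
The artifact check can be scripted once the blob names under the parsed folder have been listed by whatever storage client is in use. The sketch below only does the comparison; missing_artifacts is a hypothetical helper, and the expected names are the three artifacts written by the pipeline's write stage.

```python
EXPECTED_ARTIFACTS = (
    "docx_documents.jsonl",
    "docx_documents_index_ready.jsonl",
    "docx_summary.json",
)


def missing_artifacts(blob_names, stem):
    """Return expected artifact names that are absent for the given file stem."""
    prefix = f"conversionfiles/docx/parsed/{stem}/"
    present = set(blob_names)
    return [name for name in EXPECTED_ARTIFACTS if prefix + name not in present]
```

An empty result means all three artifacts were found; docx_index_upload_failures.json is left out on purpose because it is present only on failures.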