Excel Pipeline

Overview

Processes spreadsheets (.xlsx / .xls) under conversionfiles/excel/raw/.

Event Grid: the ExcelProcessor handler is triggered for blobs under conversionfiles/excel/raw/ (see excel/excel_processor.py).
HTTP: an optional route, available only if one is defined.

Input: Blob in conversionfiles/excel/raw/<file>.xlsx (optionally .xls).
Output (parsed):
- conversionfiles/excel/parsed/<file>/excel_documents.jsonl
- conversionfiles/excel/parsed/<file>/excel_documents_index_ready.jsonl
- conversionfiles/excel/parsed/<file>/excel_summary.json
Side effects: Status table rows updated with processing_stage and counters.
Error contract:
- The Event Grid path writes excel_index_upload_failures.json when an index batch fails.
- The HTTP route (if exposed) returns a JSON status instead of writing the failures file.

Iterate over sheets and tables, mapping each row to a text block tagged with its sheet name and row index.
Write the artifacts: excel_documents.jsonl, excel_documents_index_ready.jsonl, excel_summary.json.
Generate embeddings and upload to the index; batch failures are recorded in excel_index_upload_failures.json.
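
The row-to-text mapping in the first step can be sketched as follows. This is a minimal illustration, not the code in excel/excel_processor.py: the function name rows_to_documents, its parameters, and the assumption that data rows start at spreadsheet row 2 (below a header row) are all hypothetical. The "Column: value" content format and the <file>-<sheet>-r<row> ID follow the examples elsewhere in this document.

```python
from typing import Any, Dict, Iterable, Iterator, Sequence


def rows_to_documents(
    file_stem: str,
    sheet_name: str,
    header: Sequence[str],
    rows: Iterable[Sequence[Any]],
    first_data_row: int = 2,  # assumption: row 1 holds the header
    source_url: str = "",
) -> Iterator[Dict[str, Any]]:
    """Yield one document per non-empty row (hypothetical helper)."""
    for offset, row in enumerate(rows):
        row_number = first_data_row + offset
        # Pair each cell with its column name, skipping empty cells.
        pairs = [
            f"{column}: {value}"
            for column, value in zip(header, row)
            if value not in (None, "")
        ]
        if not pairs:
            continue  # a fully empty row produces no document
        doc_id = f"{file_stem}-{sheet_name}-r{row_number}"
        yield {
            "id": doc_id,
            "original_id": doc_id,
            "content": ", ".join(pairs),
            "sheet": sheet_name,
            "row": row_number,
            "source_url": source_url,
        }
```

Row numbers are derived from the position in the sheet rather than from a running document counter, so skipping an empty row does not shift the IDs of the rows after it.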

excel_documents.jsonl (one line): {"id":"-Sheet1-r2","original_id":"-Sheet1-r2","content":"Name: Alice, Score: 95","sheet":"Sheet1","row":2,"source_url":""}
excel_documents_index_ready.jsonl (one line): {"id":"-Sheet1-r2","content":"Name: Alice, Score: 95","sheet":"Sheet1","row":2,"source_url":""}
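
In the two sample lines above, the index-ready record is the parsed record without original_id. A minimal sketch of that projection, assuming this is the only difference between the two files (the real pipeline may transform more); INDEX_READY_FIELDS and to_index_ready are hypothetical names:

```python
import json

# Assumption: the fields kept in the index-ready file, taken from the sample line.
INDEX_READY_FIELDS = ("id", "content", "sheet", "row", "source_url")


def to_index_ready(line: str) -> str:
    """Project one excel_documents.jsonl line onto the index-ready fields."""
    doc = json.loads(line)
    kept = {name: doc[name] for name in INDEX_READY_FIELDS if name in doc}
    # Compact separators reproduce the one-line style used in the samples.
    return json.dumps(kept, separators=(",", ":"))
```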

File mapping: resources/configs/development_config.yml maps excel/raw to its handler and parsed output path.
AI Search: ai_search.index.name and the list of allowed fields; embeddings are configured globally under app.models.embeddings.
Chunking/granularity: per-row or per-table. IDs must be deterministic (built from sheet and row) and unique across sheets.
Status store: Table name and partition keys set in shared config (see monitoring docs).
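
A deterministic ID can be built from the file, sheet, and row, as in the sketch below. It assumes the target is an Azure AI Search index, where document keys may contain only letters, digits, underscores, dashes, and equals signs, so characters such as the spaces common in sheet names are replaced. The function name make_document_id and the underscore replacement are illustrative choices, not the pipeline's confirmed behavior.

```python
import re

# Anything outside the characters allowed in a document key is replaced.
_UNSAFE_KEY_CHARS = re.compile(r"[^A-Za-z0-9_\-=]")


def make_document_id(file_stem: str, sheet_name: str, row_number: int) -> str:
    """Build a <file>-<sheet>-r<row> ID that is stable across reruns."""
    safe_file = _UNSAFE_KEY_CHARS.sub("_", file_stem)
    safe_sheet = _UNSAFE_KEY_CHARS.sub("_", sheet_name)
    return f"{safe_file}-{safe_sheet}-r{row_number}"
```

Replacing characters can map two different sheet names to the same ID (for example "Q1 2024" and "Q1_2024"), so a duplicate check before upload is still worthwhile.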

Download/parse failures (corrupt or encrypted workbooks): logged and surfaced in status.
Duplicate IDs across sheets or rows cause index conflicts; use a unique ID scheme (e.g., <file>-<sheet>-r<row>).
Search schema rejections: documents containing fields that are not in the allow-list are rejected.
Embedding/Indexing issues: batch errors summarized into excel_index_upload_failures.json.
Observability: summary file includes sheet/table counts; logs enumerate processed rows and upload results.
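
Two of the failure modes above, duplicate IDs and fields outside the allow-list, can be caught before upload by scanning the index-ready lines. The following is a minimal sketch of such a check; the function name precheck and the shape of its return value are hypothetical, and the allow-list is passed in explicitly rather than read from the pipeline config.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List


def precheck(lines: Iterable[str], allowed_fields: Iterable[str]) -> Dict[str, List[str]]:
    """Report duplicate IDs and fields that are not in the allow-list."""
    allowed = set(allowed_fields)
    id_counts: Counter = Counter()
    disallowed = set()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        doc = json.loads(line)
        id_counts[str(doc.get("id"))] += 1
        disallowed.update(set(doc) - allowed)
    return {
        "duplicate_ids": sorted(i for i, n in id_counts.items() if n > 1),
        "disallowed_fields": sorted(disallowed),
    }
```

An empty list under both keys means neither problem was found in the scanned lines; it does not guarantee that the upload itself succeeds.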

Upload an .xlsx file to conversionfiles/excel/raw/ and check .../parsed/<file>/.
- Validate that the excel_documents.jsonl line count ≈ the number of non-empty rows processed.
- Confirm that excel_documents_index_ready.jsonl is present and that no excel_index_upload_failures.json was written; if the failures file is present, inspect it.
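
These checks can be scripted once the parsed folder has been downloaded to a local directory. The sketch below assumes that local copy exists and that the failures file, when present, sits in the same folder as the other artifacts; check_parsed_output and its return keys are hypothetical names.

```python
import json
from pathlib import Path
from typing import Any, Dict


def check_parsed_output(parsed_dir: str, expected_rows: int) -> Dict[str, Any]:
    """Summarize the artifacts in a local copy of .../parsed/<file>/."""
    parsed = Path(parsed_dir)

    # Count non-blank lines; each one is a document.
    with (parsed / "excel_documents.jsonl").open(encoding="utf-8") as handle:
        doc_count = sum(1 for line in handle if line.strip())

    # The failures file is only expected after index batch errors.
    failures_path = parsed / "excel_index_upload_failures.json"
    failures = None
    if failures_path.exists():
        failures = json.loads(failures_path.read_text(encoding="utf-8"))

    return {
        "doc_count": doc_count,
        "row_count_matches": doc_count == expected_rows,
        "index_ready_present": (parsed / "excel_documents_index_ready.jsonl").exists(),
        "failures": failures,
    }
```

A missing excel_documents.jsonl raises FileNotFoundError, which is itself a useful signal that parsing did not complete.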