Excel Pipeline
Overview
Processes spreadsheets (.xlsx / .xls) under conversionfiles/excel/raw/.
Triggers & Code
- Event Grid: ExcelProcessor for conversionfiles/excel/raw (see excel/excel_processor.py).
- Optional HTTP route if defined.
Contracts
- Input: Blob in conversionfiles/excel/raw/<file>.xlsx (optionally .xls).
- Output (parsed):
  - conversionfiles/excel/parsed/<file>/excel_documents.jsonl
  - conversionfiles/excel/parsed/<file>/excel_documents_index_ready.jsonl
  - conversionfiles/excel/parsed/<file>/excel_summary.json
- Side effects: Status table rows updated with processing_stage and counters.
- Error contract:
  - Event Grid path writes excel_index_upload_failures.json on index batch errors.
  - HTTP route returns JSON status (if exposed) instead of writing a failures file.
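The error contract above can be sketched as a single reporting helper. This is a minimal illustration, not the code in excel/excel_processor.py: the function name `report_batch_errors`, the `via_http` flag, and the payload keys (`status`, `failed_count`, `errors`) are all assumptions made for the example. Only the file name excel_index_upload_failures.json comes from this document.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Optional

FAILURES_FILE = "excel_index_upload_failures.json"


def report_batch_errors(
    errors: List[Dict[str, Any]],
    parsed_dir: Path,
    via_http: bool,
) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: report index batch errors per trigger type."""
    if via_http:
        # HTTP path: return a JSON-serializable status, write nothing.
        return {
            "status": "failed" if errors else "ok",
            "failed_count": len(errors),
            "errors": errors,
        }
    if errors:
        # Event Grid path: persist the failures next to the parsed artifacts.
        parsed_dir.mkdir(parents=True, exist_ok=True)
        payload = {"failed_count": len(errors), "errors": errors}
        (parsed_dir / FAILURES_FILE).write_text(
            json.dumps(payload, indent=2), encoding="utf-8"
        )
    return None
```

In this sketch the failures file is written only when there is at least one error, which matches the testing guidance that a clean run leaves no failures file behind.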
Stages (Technical)
- Iterate sheets and tables; map rows to text blocks with sheet_name and row indices.
- Write artifacts: excel_documents.jsonl, excel_documents_index_ready.jsonl, excel_summary.json.
- Embeddings + index upload; failures in excel_index_upload_failures.json.
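The row-mapping stage can be sketched as follows. This is a simplified illustration under stated assumptions: the sheet is passed in as plain Python lists (header row first) rather than read with a spreadsheet library, the function name `rows_to_documents` is hypothetical, and setting `original_id` equal to `id` is a guess based on the artifact examples in this document.

```python
from typing import Any, Dict, List, Sequence


def rows_to_documents(
    file_name: str,
    sheet_name: str,
    rows: Sequence[Sequence[Any]],
    source_url: str = "",
) -> List[Dict[str, Any]]:
    """Build one document per non-empty data row of a sheet."""
    if not rows:
        return []
    header = [str(h) for h in rows[0]]
    documents: List[Dict[str, Any]] = []
    # Excel rows are 1-based and row 1 holds the header, so data starts at row 2.
    for row_number, row in enumerate(rows[1:], start=2):
        pairs = [
            f"{name}: {value}"
            for name, value in zip(header, row)
            if value is not None and str(value).strip() != ""
        ]
        if not pairs:
            continue  # skip rows with no usable cells
        doc_id = f"{file_name}-{sheet_name}-r{row_number}"
        documents.append(
            {
                "id": doc_id,
                "original_id": doc_id,  # assumption: same value as id
                "content": ", ".join(pairs),
                "sheet": sheet_name,
                "row": row_number,
                "source_url": source_url,
            }
        )
    return documents


docs = rows_to_documents(
    "scores",
    "Sheet1",
    [["Name", "Score"], ["Alice", 95], [None, None], ["Bob", 88]],
)
print(docs[0]["content"])  # Name: Alice, Score: 95
```

Because the ID is derived only from file, sheet, and row number, reprocessing the same workbook yields the same IDs, and the empty row 3 is skipped without shifting the numbering of row 4.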
Artifact examples (minimal)
- excel_documents.jsonl (one line):
  {"id":" -Sheet1-r2","original_id":" -Sheet1-r2","content":"Name: Alice, Score: 95","sheet":"Sheet1","row":2,"source_url":" "}
- excel_documents_index_ready.jsonl (one line):
  {"id":" -Sheet1-r2","content":"Name: Alice, Score: 95","sheet":"Sheet1","row":2,"source_url":" "}
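The two example lines differ only in the `original_id` field, which suggests the index-ready file is a projection of each document onto the fields the search index accepts. A minimal sketch of that projection follows. The allow-list here is inferred from the example lines and is an assumption, as are the helper names `to_index_ready` and `to_jsonl`.

```python
import json
from typing import Any, Dict, Iterable, Sequence

# Assumed allow-list, inferred from the index-ready example line.
ALLOWED_FIELDS = ("id", "content", "sheet", "row", "source_url")


def to_index_ready(
    doc: Dict[str, Any],
    allowed: Sequence[str] = ALLOWED_FIELDS,
) -> Dict[str, Any]:
    """Keep only allow-listed fields, in allow-list order."""
    return {key: doc[key] for key in allowed if key in doc}


def to_jsonl(docs: Iterable[Dict[str, Any]]) -> str:
    """Serialize documents as compact JSON, one object per line."""
    return "\n".join(json.dumps(d, separators=(",", ":")) for d in docs)


parsed_doc = {
    "id": "demo-Sheet1-r2",
    "original_id": "demo-Sheet1-r2",
    "content": "Name: Alice, Score: 95",
    "sheet": "Sheet1",
    "row": 2,
    "source_url": "",
}
index_ready_line = to_jsonl([to_index_ready(parsed_doc)])
```

Filtering before upload is one way to avoid schema rejections for fields the index does not define.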
Configuration
- File mapping: resources/configs/development_config.yml → maps excel/raw to handler and parsed path.
- AI Search: ai_search.index.name, allowed fields; embeddings are configured globally under app.models.embeddings.
- Chunking/granularity: Per-row or per-table; ensure IDs are deterministic (sheet + row); avoid duplicates across sheets.
- Status store: Table name and partition keys set in shared config (see monitoring docs).
Error modes & observability
- Download/parse failures (bad Excel, encrypted): logged and surfaced in status.
- Duplicate IDs across sheets/rows → index conflicts; ensure unique id scheme (e.g., <file>-<sheet>-r<row>).
- Search schema rejections: fields not in allow-list.
- Embedding/Indexing issues: batch errors summarized into excel_index_upload_failures.json.
- Observability: summary file includes sheet/table counts; logs enumerate processed rows and upload results.
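Duplicate IDs are cheap to detect before upload. The sketch below is one possible pre-upload check. The helper name `find_duplicate_ids` is hypothetical, and the sample IDs are made up to show why the sheet name belongs in the ID.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List


def find_duplicate_ids(docs: Iterable[Dict[str, Any]]) -> List[str]:
    """Return the sorted list of ids that occur more than once."""
    counts = Counter(doc["id"] for doc in docs)
    return sorted(doc_id for doc_id, n in counts.items() if n > 1)


# Row 2 of two different sheets, with the sheet name left out of the id.
docs_without_sheet = [{"id": "report-r2"}, {"id": "report-r2"}]
# The same rows using the <file>-<sheet>-r<row> scheme.
docs_with_sheet = [{"id": "report-Sheet1-r2"}, {"id": "report-Sheet2-r2"}]

print(find_duplicate_ids(docs_without_sheet))  # ['report-r2']
print(find_duplicate_ids(docs_with_sheet))     # []
```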
Testing
- Upload an .xlsx to conversionfiles/excel/raw/ and check .../parsed/<file>/.
- Validate that the excel_documents.jsonl line count ≈ number of non-empty rows processed.
- Confirm that excel_documents_index_ready.jsonl is present and that no failures file exists, or inspect excel_index_upload_failures.json if it is present.
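These checks can be scripted against a local copy of the parsed folder. The sketch below assumes the artifacts have already been downloaded to a local directory. The function name `check_parsed_dir` and the shape of the returned report are hypothetical. It compares the line count exactly, so treat `matches_expected` as a hint if the pipeline skips some rows.

```python
import json
from pathlib import Path
from typing import Any, Dict


def check_parsed_dir(parsed_dir: Path, expected_rows: int) -> Dict[str, Any]:
    """Summarize the state of a locally downloaded parsed/<file>/ folder."""
    docs_path = parsed_dir / "excel_documents.jsonl"
    text = docs_path.read_text(encoding="utf-8")
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ids = [json.loads(ln)["id"] for ln in lines]
    return {
        "line_count": len(lines),
        "matches_expected": len(lines) == expected_rows,
        "ids_unique": len(ids) == len(set(ids)),
        "index_ready_present": (
            parsed_dir / "excel_documents_index_ready.jsonl"
        ).exists(),
        "failures_present": (
            parsed_dir / "excel_index_upload_failures.json"
        ).exists(),
    }
```

A clean run should report `index_ready_present` as true and `failures_present` as false.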