RAG Service

Excel Pipeline

Overview

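The Excel pipeline processes spreadsheets (.xlsx / .xls) uploaded under conversionfiles/excel/raw/ and writes the parsed artifacts under conversionfiles/excel/parsed/<file>/.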

Triggers & Code

  • Event Grid: ExcelProcessor handles blob events for conversionfiles/excel/raw/ (see excel/excel_processor.py).
  • HTTP route: optional; available only if one is defined for this pipeline.

Contracts

  • Input: Blob in conversionfiles/excel/raw/<file>.xlsx (optionally .xls).
  • Output (parsed):
    • conversionfiles/excel/parsed/<file>/excel_documents.jsonl
    • conversionfiles/excel/parsed/<file>/excel_documents_index_ready.jsonl
    • conversionfiles/excel/parsed/<file>/excel_summary.json
  • Side effects: Status table rows updated with processing_stage and counters.
  • Error contract:
    • Event Grid path writes excel_index_upload_failures.json on index batch errors.
    • HTTP route returns JSON status (if exposed) instead of writing failures file.
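
The input/output contract above can be sketched as a small path-mapping helper. This is only an illustration of the naming convention, not the code in excel/excel_processor.py; the function name parsed_artifact_paths and the validation rules are assumptions.

```python
from pathlib import PurePosixPath

# Artifact names taken from the output contract above.
ARTIFACT_NAMES = (
    "excel_documents.jsonl",
    "excel_documents_index_ready.jsonl",
    "excel_summary.json",
)


def parsed_artifact_paths(raw_blob_path: str) -> list[str]:
    """Map .../excel/raw/<file>.xlsx to the expected .../excel/parsed/<file>/ artifacts.

    Hypothetical helper: the real pipeline resolves these paths from its config.
    """
    raw = PurePosixPath(raw_blob_path)
    if raw.suffix.lower() not in {".xlsx", ".xls"}:
        raise ValueError(f"not an Excel blob: {raw_blob_path}")
    if raw.parent.name != "raw":
        raise ValueError(f"blob is not under a raw/ folder: {raw_blob_path}")
    # <file> in the contract is the blob name without its extension.
    parsed_dir = raw.parent.parent / "parsed" / raw.stem
    return [str(parsed_dir / name) for name in ARTIFACT_NAMES]
```

For example, conversionfiles/excel/raw/report.xlsx maps to three paths under conversionfiles/excel/parsed/report/.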

Stages (Technical)

  • Iterate over sheets and tables; map each row to a text block tagged with its sheet_name and row index.
  • Write the artifacts: excel_documents.jsonl, excel_documents_index_ready.jsonl, excel_summary.json.
  • Generate embeddings and upload to the index; batch failures are recorded in excel_index_upload_failures.json.
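
A minimal sketch of the row-to-text-block mapping in the first stage, assuming rows have already been read into plain Python lists (the actual parser and its function names are not shown in this document). rows_to_documents, file_stem, and first_data_row are hypothetical names; the "Column: value" content format and the <file>-<sheet>-r<row> ID follow the examples elsewhere on this page.

```python
from typing import Any, Sequence


def rows_to_documents(
    file_stem: str,
    sheet_name: str,
    header: Sequence[str],
    rows: Sequence[Sequence[Any]],
    first_data_row: int = 2,  # assumption: row 1 holds the header
    source_url: str = "",
) -> list[dict]:
    """Turn data rows into document dicts, skipping rows with no values."""
    docs = []
    for offset, row in enumerate(rows):
        row_number = first_data_row + offset
        # Keep only cells that actually contain a value.
        pairs = [
            f"{col}: {val}"
            for col, val in zip(header, row)
            if val is not None and str(val).strip() != ""
        ]
        if not pairs:
            continue  # empty row: no document emitted
        doc_id = f"{file_stem}-{sheet_name}-r{row_number}"
        docs.append(
            {
                "id": doc_id,
                "original_id": doc_id,
                "content": ", ".join(pairs),
                "sheet": sheet_name,
                "row": row_number,
                "source_url": source_url,
            }
        )
    return docs
```

Because the row number is derived from the row's position rather than from a running counter of emitted documents, skipping an empty row does not shift the IDs of the rows after it.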

Artifact examples (minimal)

  • excel_documents.jsonl (one line): {"id":"-Sheet1-r2","original_id":"-Sheet1-r2","content":"Name: Alice, Score: 95","sheet":"Sheet1","row":2,"source_url":""}
  • excel_documents_index_ready.jsonl (one line): {"id":"-Sheet1-r2","content":"Name: Alice, Score: 95","sheet":"Sheet1","row":2,"source_url":""}
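
Judging from the two example lines above, the index-ready record is the full record without original_id. The sketch below derives one from the other by filtering to an allow-list; the ALLOWED_FIELDS tuple is an assumption read off the example, and the real list comes from the AI Search configuration.

```python
import json

# Assumption: field set inferred from the index-ready example above.
ALLOWED_FIELDS = ("id", "content", "sheet", "row", "source_url")


def to_index_ready(doc: dict, allowed_fields=ALLOWED_FIELDS) -> dict:
    """Keep only the fields the search index is expected to accept."""
    return {key: doc[key] for key in allowed_fields if key in doc}


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines: one compact JSON object per line."""
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)


full_doc = {
    "id": "-Sheet1-r2",
    "original_id": "-Sheet1-r2",
    "content": "Name: Alice, Score: 95",
    "sheet": "Sheet1",
    "row": 2,
    "source_url": "",
}
index_ready_line = to_jsonl([to_index_ready(full_doc)])
```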

Configuration

  • File mapping: resources/configs/development_config.yml → maps excel/raw to handler and parsed path.
  • AI Search: ai_search.index.name, allowed fields; embeddings are configured globally under app.models.embeddings.
  • Chunking/granularity: Per-row or per-table; ensure IDs are deterministic (sheet + row); avoid duplicates across sheets.
  • Status store: Table name and partition keys set in shared config (see monitoring docs).

Error modes & observability

  • Download/parse failures (corrupt or encrypted Excel files): logged and surfaced in the status table.
  • Duplicate IDs across sheets/rows → index conflicts; use a unique ID scheme (e.g., <file>-<sheet>-r<row>).
  • Search schema rejections: documents containing fields that are not in the index allow-list are rejected.
  • Embedding/indexing issues: batch errors are summarized in excel_index_upload_failures.json.
  • Observability: summary file includes sheet/table counts; logs enumerate processed rows and upload results.
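
The duplicate-ID failure mode can be caught before upload with a simple pre-flight check. This is a sketch under the assumption that documents are dicts with an "id" key, as in the artifact examples; find_duplicate_ids is a hypothetical helper, not part of the pipeline.

```python
from collections import Counter


def find_duplicate_ids(docs) -> dict:
    """Return {id: occurrence_count} for every ID that appears more than once."""
    counts = Counter(doc["id"] for doc in docs)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


# IDs built from the row number alone collide across sheets ...
row_only = [{"id": "r2"}, {"id": "r2"}, {"id": "r3"}]
# ... while <file>-<sheet>-r<row> keeps them apart.
qualified = [{"id": "scores-Sheet1-r2"}, {"id": "scores-Sheet2-r2"}]
```

An empty result means every ID in the batch is unique; anything else lists the IDs that would conflict in the index.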

Testing

  • Upload .xlsx to conversionfiles/excel/raw/ and check .../parsed/<file>/.
    • Validate that the excel_documents.jsonl line count ≈ the number of non-empty rows processed.
    • Confirm that excel_documents_index_ready.jsonl is present and that no excel_index_upload_failures.json was written; if the failures file is present, inspect it.
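
The checks above can be scripted against a local copy of the parsed folder. This is a minimal sketch that assumes the artifacts have already been downloaded to disk; check_parsed_folder and the shape of its report are hypothetical, and the structure of the failures JSON is not assumed beyond it being valid JSON.

```python
import json
from pathlib import Path


def check_parsed_folder(parsed_dir) -> dict:
    """Summarize the artifacts found in a local copy of .../parsed/<file>/."""
    parsed_dir = Path(parsed_dir)

    def count_lines(name: str):
        path = parsed_dir / name
        if not path.exists():
            return None  # artifact missing
        with path.open(encoding="utf-8") as fh:
            # Count non-blank lines: one JSON document per line.
            return sum(1 for line in fh if line.strip())

    failures_path = parsed_dir / "excel_index_upload_failures.json"
    failures = None
    if failures_path.exists():
        failures = json.loads(failures_path.read_text(encoding="utf-8"))

    return {
        "documents": count_lines("excel_documents.jsonl"),
        "index_ready": count_lines("excel_documents_index_ready.jsonl"),
        "summary_present": (parsed_dir / "excel_summary.json").exists(),
        "failures": failures,  # None means no failures file was written
    }
```

A healthy run would show matching documents and index_ready counts, summary_present set to True, and failures set to None.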