RAG Service

PDF Pipeline

Overview

Processes PDFs under conversionfiles/pdf/raw/, extracts text and tables (optionally with OCR), and indexes the result as searchable chunks.

Triggers & Code

  • Event Grid: PdfProcessor handles blobs created under conversionfiles/pdf/raw/ (see pdf/funcs.py or pdf/pdf_processor.py, depending on where the function is registered).
  • Optional HTTP route, if one is present in Blueprints.

Contracts

  • Input
      • Blob URL must be under https://<account>.blob.core.windows.net/conversionfiles/pdf/raw/<file>.pdf.
      • Content-Type: application/pdf.
  • Output
      • Artifacts under conversionfiles/pdf/parsed/<filename_without_ext>/.
      • Documents adhere to the unified index schema subset (see below).
  • Error
      • Emits failure artifacts and status updates; see Error Modes.
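
The input and output contracts can be checked mechanically. The following is a minimal sketch, not the pipeline's actual code: the helper name parsed_prefix_for is hypothetical, and it assumes the output folder is derived from the blob's file name without its extension.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

RAW_PREFIX = "/conversionfiles/pdf/raw/"


def parsed_prefix_for(blob_url: str) -> str:
    """Validate a raw-PDF blob URL and return its parsed-artifact prefix.

    Hypothetical helper: raises ValueError when the URL breaks the input contract.
    """
    parts = urlparse(blob_url)
    if parts.scheme != "https" or not parts.netloc.endswith(".blob.core.windows.net"):
        raise ValueError(f"not an Azure Blob URL: {blob_url}")
    path = unquote(parts.path)
    if not path.startswith(RAW_PREFIX):
        raise ValueError(f"blob is not under {RAW_PREFIX}: {path}")
    name = PurePosixPath(path).name
    if not name.lower().endswith(".pdf"):
        raise ValueError(f"not a .pdf blob: {name}")
    # Output contract: conversionfiles/pdf/parsed/<filename_without_ext>/
    return f"conversionfiles/pdf/parsed/{PurePosixPath(name).stem}/"
```

For example, a blob at .../conversionfiles/pdf/raw/sample.pdf maps to conversionfiles/pdf/parsed/sample/.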

Stages (Technical)

  1. Download PDF.
  2. Extract content using the configured method (Form Recognizer or library) with page segmentation.
  3. Build chunked documents with page numbers/sections; populate standard fields and ingestion_timestamp.
  4. Write artifacts:
      • pdf_documents.jsonl
      • pdf_documents_index_ready.jsonl
      • pdf_summary.json
  5. Generate embeddings and index (as in the JSON/TXT pipelines), write pdf_documents_index_ready_embedded.jsonl, and upload.
  6. On failures, write pdf_index_upload_failures.json.
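
The artifact-writing stage can be sketched as follows. This is an illustration only: the function names are hypothetical, and it assumes that "pages" in the summary counts the distinct page numbers that produced at least one chunk, which the source does not confirm.

```python
import json
from datetime import datetime, timezone


def to_jsonl(documents: list[dict]) -> str:
    """Serialize documents as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(doc, ensure_ascii=False) + "\n" for doc in documents)


def build_summary(file_name: str, documents: list[dict]) -> dict:
    """Build a pdf_summary.json-style dict (field names taken from the example artifact)."""
    # Assumption: "pages" is the number of distinct pages represented in the chunks.
    pages = {doc["page_number"] for doc in documents if "page_number" in doc}
    return {
        "file_name": file_name,
        "total_documents": len(documents),
        "pages": len(pages),
        "processed_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

The string returned by to_jsonl would be uploaded as pdf_documents.jsonl or pdf_documents_index_ready.jsonl, and the summary dict as pdf_summary.json.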

Notes

  • Large files may be paginated and chunked to respect token limits.
  • Vector field contentVector must be present for indexing.
  • Configure Form Recognizer via form_recognizer section in resources/configs/development_config.yml.
  • Validate ai_search configuration for index/embedding endpoints.
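
Because contentVector must be present for indexing, a pre-upload check can catch bad documents early. A minimal sketch, assuming vectors are plain lists of floats and using the 1536 default dimension mentioned under Implementation Details; the function name is hypothetical.

```python
def ids_with_bad_vectors(documents: list[dict], expected_dim: int = 1536) -> list:
    """Return the ids of documents whose contentVector is missing or has the wrong length."""
    bad = []
    for doc in documents:
        vector = doc.get("contentVector")
        if not isinstance(vector, list) or len(vector) != expected_dim:
            bad.append(doc.get("id"))
    return bad
```

An empty result means every document carries a vector of the expected dimension.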

Implementation Details

  • Code structure
      • The event handler orchestrates: blob download → extraction → mapping → artifacts → embeddings → upload.
      • Progress is reported through a shared status store, shared.status_store.get_status_store(), with stages downloading_file, extracting, documents_created, generating_embeddings, uploading_to_index, indexing_completed.
  • Extraction
      • Primary path: Azure Form Recognizer (model prebuilt-document) with paging.
      • Fallback path: library-based parsing, extracting text per page, if Form Recognizer is unavailable.
  • Chunking
      • Page-level chunks, with optional intra-page splitting based on character or token heuristics.
      • Each chunk gets a deterministic id; if a chunk is split, subsequent parts get the suffix -pN and carry original_id.
  • Mapping to schema
      • Allowed fields: id, source_file, source_path, file_type, mime_type, ingestion_timestamp, page_number, section_title, subsection_title, type, text, extra_metadata, contentVector.
      • Keys outside this list are dropped before upload.
  • Embeddings
      • Lazy-loads rag_shared Retrieval; embeddings are written to the contentVector field, and their dimension matches the config (default 1536).
      • On a context-window error, the fallback rechunks the text and batches the embedding requests.
  • Index upload
      • Uploads batches to unified_text_index and collects per-item results.
      • Writes pdf_index_upload_failures.json with id, status_code, error_message, text_len, and vector_dim for diagnosis.
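
The chunking and schema-mapping rules above can be sketched together. This is a simplified illustration under stated assumptions, not the pipeline's implementation: ids are derived here from a SHA-1 hash of file name and page number, splitting is by a fixed character count, and N in the -pN suffix starts at 1 for the second part. All function names are hypothetical.

```python
import hashlib

ALLOWED_FIELDS = {
    "id", "source_file", "source_path", "file_type", "mime_type",
    "ingestion_timestamp", "page_number", "section_title",
    "subsection_title", "type", "text", "extra_metadata", "contentVector",
}


def chunk_id(source_file: str, page_number: int) -> str:
    """Deterministic id: the same file and page always hash to the same value."""
    digest = hashlib.sha1(f"{source_file}:{page_number}".encode("utf-8")).hexdigest()
    return digest[:16]


def split_page(source_file: str, page_number: int, text: str, max_chars: int = 2000) -> list[dict]:
    """Split one page into chunks of at most max_chars characters."""
    base_id = chunk_id(source_file, page_number)
    pieces = [text[i:i + max_chars] for i in range(0, len(text), max_chars)] or [""]
    chunks = []
    for n, piece in enumerate(pieces):
        doc = {
            "id": base_id,
            "source_file": source_file,
            "file_type": "pdf",
            "page_number": page_number,
            "text": piece,
        }
        if n > 0:
            # Subsequent parts get the -pN suffix and point back to the first part.
            doc["id"] = f"{base_id}-p{n}"
            doc["original_id"] = base_id
        chunks.append(doc)
    return chunks


def to_index_document(doc: dict) -> dict:
    """Keep only allowed fields; with the list above, original_id is dropped here."""
    return {key: value for key, value in doc.items() if key in ALLOWED_FIELDS}
```

Because ids depend only on the file name and page number in this sketch, reprocessing the same PDF yields the same ids, so re-uploads overwrite rather than duplicate index entries.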

Artifacts and Schema Examples

  • pdf_summary.json
      • { "file_name": "sample.pdf", "total_documents": 123, "pages": 10, "processed_at": "...Z" }
  • pdf_documents_index_ready.jsonl (one JSON object per line)
      • { "id": "abc123", "source_file": "sample.pdf", "file_type": "pdf", "page_number": 5, "text": "...", "ingestion_timestamp": "...Z" }
  • pdf_documents_index_ready_embedded.jsonl
      • Same as index_ready plus contentVector: [float, ...].
  • pdf_index_upload_failures.json
      • { "failures": [ { "id": "...", "error_message": "...", "vector_dim": 1536 } ] }

Configuration

  • form_recognizer.endpoint, form_recognizer.api_key, form_recognizer.model_id, form_recognizer.pages_per_call
  • Embeddings are configured globally under app.models.embeddings (deployment, api_version, endpoint/auth)
  • Chunking environment variables (if exposed as in the JSON pipeline): PDF_EMBED_TEXT_MAX_CHARS, PDF_EMBED_CHUNK_OVERLAP, PDF_EMBED_FALLBACK_*
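
If the chunking variables are exposed, they could be read along these lines. The default values (8000 and 200), the ChunkSettings class, and the function name are assumptions made for illustration; the source does not state the real defaults.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkSettings:
    max_chars: int
    overlap: int


def load_chunk_settings(env=None, default_max_chars: int = 8000, default_overlap: int = 200) -> ChunkSettings:
    """Read chunking settings from environment variables (defaults are hypothetical)."""
    env = os.environ if env is None else env
    max_chars = int(env.get("PDF_EMBED_TEXT_MAX_CHARS", default_max_chars))
    overlap = int(env.get("PDF_EMBED_CHUNK_OVERLAP", default_overlap))
    if max_chars <= 0:
        raise ValueError("PDF_EMBED_TEXT_MAX_CHARS must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("PDF_EMBED_CHUNK_OVERLAP must be in [0, max_chars)")
    return ChunkSettings(max_chars=max_chars, overlap=overlap)
```

Passing a plain dict as env makes the loader easy to exercise in tests without touching the process environment.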

Error Modes

  • Download failures → status file_download_error.
  • Form Recognizer errors → extraction_error with FR diagnostic.
  • Embedding context-window overflow → automatic retry with fallback chunking.
  • Missing contentVector on upload → failures categorized as missing_vector.
  • Schema mismatches → failures categorized as schema_related.
  • Duplicate IDs → avoided by the -pN suffix on split chunks; any remaining collisions are recorded.
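
The context-window retry can be illustrated with a simple halving strategy. This is a stand-in technique, not the pipeline's actual fallback (which rechunks and batches according to the PDF_EMBED_FALLBACK_* settings): ContextWindowError is a hypothetical exception type, and embed_fn stands for whatever callable produces a vector for a piece of text.

```python
from typing import Callable


class ContextWindowError(Exception):
    """Hypothetical stand-in for an embedding client's 'input too long' error."""


def embed_with_fallback(text: str, embed_fn: Callable[[str], list], min_chars: int = 200) -> list[tuple]:
    """Return (piece, vector) pairs, halving the text whenever it exceeds the context window."""
    try:
        return [(text, embed_fn(text))]
    except ContextWindowError:
        # Stop splitting once pieces are too small to be useful; surface the error instead.
        if len(text) <= max(min_chars, 1):
            raise
        mid = len(text) // 2
        return (
            embed_with_fallback(text[:mid], embed_fn, min_chars)
            + embed_with_fallback(text[mid:], embed_fn, min_chars)
        )
```

Each returned piece would become its own chunk, following the -pN id convention described under Implementation Details.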

Observability

  • Status table fields: state, progress, processing_stage, documents_created, documents_uploaded, error_message.
  • Logs include counts, sizes (bytes/MB), and index upload summaries.
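
A status record with these fields might be modeled as below. This is a sketch only: the real store is obtained through shared.status_store.get_status_store(), whose interface is not shown here. The state values "processing" and "completed" and the rule that progress is the fraction of stages reached are assumptions.

```python
from dataclasses import dataclass

STAGES = [
    "downloading_file",
    "extracting",
    "documents_created",
    "generating_embeddings",
    "uploading_to_index",
    "indexing_completed",
]


@dataclass
class StatusRecord:
    state: str = "processing"  # assumed value
    progress: float = 0.0
    processing_stage: str = ""
    documents_created: int = 0
    documents_uploaded: int = 0
    error_message: str = ""


def advance(record: StatusRecord, stage: str) -> StatusRecord:
    """Move the record to the given stage; unknown stage names raise ValueError."""
    index = STAGES.index(stage)
    record.processing_stage = stage
    # Assumption: progress is the share of stages reached so far.
    record.progress = round((index + 1) / len(STAGES), 2)
    if stage == STAGES[-1]:
        record.state = "completed"  # assumed value
    return record
```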

Testing

  • Event Grid: upload a sample PDF to conversionfiles/pdf/raw/ and watch the logs.
  • HTTP (if exposed): call /api/pdf/process?blob_url=<encoded> or a similar route.
  • Verify that conversionfiles/pdf/parsed/<filename>/ contains the artifacts; open pdf_index_upload_failures.json if present.
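
When testing over HTTP, the blob URL has to be percent-encoded before it goes into the query string. A minimal sketch, assuming the route is /api/pdf/process (the actual route may differ); the function name and the host in the usage example are hypothetical.

```python
from urllib.parse import quote


def build_process_url(function_base: str, blob_url: str) -> str:
    """Build the request URL with blob_url fully percent-encoded."""
    # safe="" also encodes ":" and "/", so the blob URL survives as a single query value.
    encoded = quote(blob_url, safe="")
    return f"{function_base.rstrip('/')}/api/pdf/process?blob_url={encoded}"
```

For example, build_process_url("https://func.example", "https://acct.blob.core.windows.net/conversionfiles/pdf/raw/sample.pdf") yields a URL that can be passed to curl or a browser.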