PDF Pipeline

Overview

Processes PDFs under conversionfiles/pdf/raw/, extracts text and tables (optionally with OCR), and indexes the resulting chunks for search.

Event Grid: PdfProcessor handles conversionfiles/pdf/raw/ (see pdf/funcs.py or pdf/pdf_processor.py, depending on where the function is registered).
HTTP: an optional route, if one is registered in Blueprints.

Input
Blob URL must be under https://<account>.blob.core.windows.net/conversionfiles/pdf/raw/<file>.pdf.
Content-Type: application/pdf.
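The input constraint can be checked before any download is attempted. The following is a minimal sketch, not the pipeline's actual validation; the function name is_valid_pdf_blob_url and the constant RAW_PREFIX are hypothetical.

```python
from urllib.parse import urlparse

# Path prefix taken from the input rule above; the constant name is illustrative.
RAW_PREFIX = "/conversionfiles/pdf/raw/"


def is_valid_pdf_blob_url(url: str) -> bool:
    """Return True if the URL looks like an accepted raw-PDF blob URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    path = parsed.path
    return (
        parsed.scheme == "https"
        and host.endswith(".blob.core.windows.net")
        and path.startswith(RAW_PREFIX)
        and len(path) > len(RAW_PREFIX)  # a file name must follow the prefix
        and path.lower().endswith(".pdf")
    )
```

A check like this only inspects the URL shape; the Content-Type would still need to be verified against the blob's properties.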
Output
Artifacts under conversionfiles/pdf/parsed/<filename_without_ext>/.
Documents adhere to the unified index schema subset (see below).
Error
Emits failure artifacts and status updates; see Error Modes.

Download the PDF.
Extract content with the configured method (Form Recognizer or library), segmented by page.
Build chunked documents with page numbers and sections; populate the standard fields and ingestion_timestamp.
Write artifacts:
pdf_documents.jsonl
pdf_documents_index_ready.jsonl
pdf_summary.json
Generate embeddings and index (as in the JSON/TXT pipelines): write pdf_documents_index_ready_embedded.jsonl, then upload.
On upload failures, write pdf_index_upload_failures.json.
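The artifact-writing step can be sketched as follows. This is an illustration under assumptions, not the pipeline's code: the helper names (artifact_prefix, to_jsonl, build_summary) are hypothetical, and the blob upload itself is left out.

```python
import json
from datetime import datetime, timezone
from pathlib import PurePosixPath


def artifact_prefix(file_name: str) -> str:
    """Output folder for a source file: parsed/<filename_without_ext>/."""
    stem = PurePosixPath(file_name).stem
    return f"conversionfiles/pdf/parsed/{stem}/"


def to_jsonl(documents: list[dict]) -> str:
    """Serialize documents as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(doc, ensure_ascii=False) + "\n" for doc in documents)


def build_summary(file_name: str, documents: list[dict], pages: int) -> dict:
    """Build a summary shaped like the pdf_summary.json artifact."""
    now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
    return {
        "file_name": file_name,
        "total_documents": len(documents),
        "pages": pages,
        "processed_at": now,
    }
```

For example, artifact_prefix("sample.pdf") + "pdf_documents.jsonl" gives the blob path for the first artifact, and to_jsonl(documents) gives its body.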

Large files may be paginated and chunked to respect token limits.
The vector field contentVector must be present on every document sent for indexing.
Configure Form Recognizer via form_recognizer section in resources/configs/development_config.yml.
Validate ai_search configuration for index/embedding endpoints.

Code structure
Event handler orchestrates: blob download → extraction → mapping → artifacts → embeddings → upload.
Uses a shared status store: shared.status_store.get_status_store(); stages: downloading_file, extracting, documents_created, generating_embeddings, uploading_to_index, indexing_completed.
Extraction
Primary path: Azure Form Recognizer (model prebuilt-document) with paging.
Fallback path: library-based parsing if FR is unavailable, extracting text per page.
Chunking
Page-level chunks with optional intra-page splitting based on character or token heuristics.
Each chunk gets a deterministic id; when a chunk is split, each subsequent part appends the suffix -pN to the id and adds original_id.
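A minimal sketch of page-level chunking with deterministic ids. The hashing scheme, the max_chars default, and the assumption that N in -pN is the 1-based part number are all illustrative; the actual id derivation may differ.

```python
import hashlib


def base_id(source_path: str, page_number: int) -> str:
    """Derive a stable id from the source path and page (assumed scheme)."""
    key = f"{source_path}#page={page_number}".encode("utf-8")
    return hashlib.sha1(key).hexdigest()[:16]


def chunk_page(source_path: str, page_number: int, text: str,
               max_chars: int = 2000) -> list[dict]:
    """Split one page into character-bounded chunks with deterministic ids."""
    root = base_id(source_path, page_number)
    # An empty page still yields a single (empty) chunk.
    pieces = [text[i:i + max_chars] for i in range(0, len(text), max_chars)] or [""]
    docs = []
    for part, piece in enumerate(pieces, start=1):
        doc = {"id": root, "page_number": page_number, "text": piece}
        if part > 1:
            # Later parts get the -pN suffix and point back to the first id.
            doc["id"] = f"{root}-p{part}"
            doc["original_id"] = root
        docs.append(doc)
    return docs
```

Because the id depends only on the path and page number, reprocessing the same file produces the same ids, so re-uploads overwrite rather than duplicate.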
Mapping to schema
Allowed fields: id, source_file, source_path, file_type, mime_type, ingestion_timestamp, page_number, section_title, subsection_title, type, text, extra_metadata, contentVector.
Non-allowed keys are dropped before upload.
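Dropping non-allowed keys amounts to a whitelist filter. A minimal sketch, using the field list above; the names ALLOWED_FIELDS and to_index_document are hypothetical.

```python
# Field names copied from the allowed-fields list above.
ALLOWED_FIELDS = frozenset({
    "id", "source_file", "source_path", "file_type", "mime_type",
    "ingestion_timestamp", "page_number", "section_title",
    "subsection_title", "type", "text", "extra_metadata", "contentVector",
})


def to_index_document(doc: dict) -> dict:
    """Return a copy of doc containing only keys allowed by the index schema."""
    return {key: value for key, value in doc.items() if key in ALLOWED_FIELDS}
```

Under the list as written, original_id is not an allowed top-level field, so this sketch would drop it before upload.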
Embeddings
Lazy-loads Retrieval from rag_shared; vectors are written to the contentVector field, and their dimension must match the configured value (default 1536).
When a request exceeds the model's context window, the fallback rechunks the text and batches the embedding requests.
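The fallback can be pictured as "try the batch; on a context-window error, split each text into smaller overlapping pieces and embed those in batches". The sketch below is an assumption about the shape of that logic, not the rag_shared API: ContextWindowError, embed_batch, and all parameter defaults are hypothetical.

```python
from typing import Callable


class ContextWindowError(Exception):
    """Hypothetical error raised when a text exceeds the model's context window."""


def split_with_overlap(text: str, max_chars: int, overlap: int) -> list[str]:
    """Split text into pieces of at most max_chars that overlap by `overlap` chars."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be greater than overlap")
    step = max_chars - overlap
    stop = max(len(text) - overlap, 1)
    return [text[i:i + max_chars] for i in range(0, stop, step)]


def batched(items: list[str], size: int) -> list[list[str]]:
    """Group items into consecutive batches of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def embed_with_fallback(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 16,
    fallback_max_chars: int = 1000,
    overlap: int = 100,
) -> list[tuple[str, list[float]]]:
    """Embed texts in batches; rechunk a batch's texts if it overflows the context."""
    pairs: list[tuple[str, list[float]]] = []
    for batch in batched(texts, batch_size):
        try:
            pairs.extend(zip(batch, embed_batch(batch)))
        except ContextWindowError:
            # Rechunk every text in the failing batch, then embed the pieces in batches.
            for text in batch:
                pieces = split_with_overlap(text, fallback_max_chars, overlap)
                for sub in batched(pieces, batch_size):
                    pairs.extend(zip(sub, embed_batch(sub)))
    return pairs
```

Returning (text, vector) pairs keeps each vector tied to the piece it was computed from, since rechunking can produce more vectors than input texts.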
Index upload
Uploads batches to unified_text_index; collects per-item results.
Writes pdf_index_upload_failures.json with id, status_code, error_message, text_len, and vector_dim for diagnosis.
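A failure entry with the fields listed above could be assembled like this. It is a sketch only: the function names are hypothetical, and how status_code and error_message are obtained from the per-item upload results is not shown.

```python
import json


def failure_record(doc: dict, status_code: int, error_message: str) -> dict:
    """Build one diagnostic entry for a document that failed to upload."""
    vector = doc.get("contentVector") or []
    return {
        "id": doc.get("id"),
        "status_code": status_code,
        "error_message": error_message,
        "text_len": len(doc.get("text") or ""),
        "vector_dim": len(vector),  # 0 signals a missing vector
    }


def failures_payload(records: list[dict]) -> str:
    """Serialize records under a top-level "failures" key."""
    return json.dumps({"failures": records}, indent=2)
```

Recording text_len and vector_dim makes two common causes visible at a glance: oversized text and a vector whose dimension does not match the index.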

pdf_summary.json
{ "file_name": "sample.pdf", "total_documents": 123, "pages": 10, "processed_at": "...Z" }
pdf_documents_index_ready.jsonl (one JSON object per line)
{ "id": "abc123", "source_file": "sample.pdf", "file_type": "pdf", "page_number": 5, "text": "...", "ingestion_timestamp": "...Z" }
pdf_documents_index_ready_embedded.jsonl
Same as index_ready plus contentVector: [float, ...].
pdf_index_upload_failures.json
{ "failures": [ { "id": "...", "error_message": "...", "vector_dim": 1536 } ] }

Form Recognizer: form_recognizer.endpoint, form_recognizer.api_key, form_recognizer.model_id, form_recognizer.pages_per_call.
Embeddings: configured globally under app.models.embeddings (deployment, api_version, endpoint/auth).
Chunking environment variables (if exposed as in the JSON pipeline): PDF_EMBED_TEXT_MAX_CHARS, PDF_EMBED_CHUNK_OVERLAP, PDF_EMBED_FALLBACK_*.
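If the chunking variables are exposed, reading them might look like the sketch below. The default values (8000 and 200) are placeholders chosen for illustration, not documented defaults, and the PDF_EMBED_FALLBACK_* variables are omitted because their full names are not given here.

```python
import os
from typing import Mapping, Optional


def read_int(env: Mapping[str, str], name: str, default: int) -> int:
    """Read a non-negative integer setting, falling back to default when unset."""
    raw = env.get(name)
    if raw is None or not raw.strip():
        return default
    value = int(raw)
    if value < 0:
        raise ValueError(f"{name} must be non-negative, got {value}")
    return value


def load_chunking_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect chunking settings from the environment (defaults are placeholders)."""
    env = os.environ if env is None else env
    return {
        "max_chars": read_int(env, "PDF_EMBED_TEXT_MAX_CHARS", 8000),
        "overlap": read_int(env, "PDF_EMBED_CHUNK_OVERLAP", 200),
    }
```

Passing the environment as a mapping keeps the function easy to exercise in tests without mutating os.environ.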

Status table fields: state, progress, processing_stage, documents_created, documents_uploaded, error_message.
Logs include counts, sizes (bytes/MB), and index upload summaries.

Event Grid: upload a PDF to conversionfiles/pdf/raw/.
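HTTP (if available): call /api/pdf/process?blob_url=<encoded>, or a similar route, where <encoded> is the URL-encoded blob URL.
Verify that conversionfiles/pdf/parsed/<filename_without_ext>/ contains the artifacts; if pdf_index_upload_failures.json is present, open it to see which documents failed.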
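For the HTTP path, the blob URL has to be percent-encoded before it is placed in the query string. A minimal sketch, assuming the /api/pdf/process route exists in your deployment; the function name and the base URL in the example are hypothetical.

```python
from urllib.parse import quote


def build_process_url(function_base_url: str, blob_url: str) -> str:
    """Build the request URL with the blob URL percent-encoded as a query value."""
    encoded = quote(blob_url, safe="")  # encode ':' and '/' as well
    return f"{function_base_url.rstrip('/')}/api/pdf/process?blob_url={encoded}"
```

For instance, build_process_url("http://localhost:7071", "https://<account>.blob.core.windows.net/conversionfiles/pdf/raw/sample.pdf") yields a URL that can be requested with any HTTP client.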