PDF Pipeline
Overview
Processes PDFs under conversionfiles/pdf/raw/, extracts text/tables (and optionally OCR), and indexes searchable chunks.
Triggers & Code
- Event Grid: `PdfProcessor` for `conversionfiles/pdf/raw` (see `pdf/funcs.py` or `pdf/pdf_processor.py`, depending on registration).
- Optional HTTP route if present in Blueprints.
Contracts
- Input
  - Blob URL must be under `https://<account>.blob.core.windows.net/conversionfiles/pdf/raw/<file>.pdf`.
  - Content-Type: `application/pdf`.
- Output
  - Artifacts under `conversionfiles/pdf/parsed/<filename_without_ext>/`.
  - Documents adhere to the unified index schema subset (see below).
- Error
  - Emits failure artifacts and status updates; see Error Modes.
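The input and output contracts above can be checked before any processing starts. The following is a minimal sketch, not the pipeline's actual validator: the function name `parsed_prefix_for` and the error messages are hypothetical, and it assumes the output folder name is simply the blob's file name without its extension.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

RAW_PREFIX = "/conversionfiles/pdf/raw/"
PARSED_PREFIX = "conversionfiles/pdf/parsed/"


def parsed_prefix_for(blob_url: str) -> str:
    """Validate a blob URL against the input contract and return the
    artifact prefix described by the output contract (hypothetical helper)."""
    parts = urlparse(blob_url)
    path = unquote(parts.path)
    if parts.scheme != "https" or not parts.netloc.endswith(".blob.core.windows.net"):
        raise ValueError(f"not an https blob storage URL: {blob_url}")
    if not path.startswith(RAW_PREFIX) or not path.lower().endswith(".pdf"):
        raise ValueError(f"blob is not a PDF under {RAW_PREFIX}: {path}")
    # Assumption: <filename_without_ext> is the final path segment minus ".pdf".
    stem = PurePosixPath(path).stem
    return f"{PARSED_PREFIX}{stem}/"
```

Rejecting a URL this early keeps blobs from other containers or folders from reaching the extraction stage.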
Stages (Technical)
- Download PDF.
- Extraction using configured method (Form Recognizer or library) with page segmentation.
- Build chunked documents with page numbers/sections; populate standard fields and `ingestion_timestamp`.
- Write artifacts:
  - `pdf_documents.jsonl`
  - `pdf_documents_index_ready.jsonl`
  - `pdf_summary.json`
- Embeddings + indexing (as per JSON/TXT), write `pdf_documents_index_ready_embedded.jsonl`, and upload.
- On failures, write `pdf_index_upload_failures.json`.
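The stage sequence can be pictured as a small runner that reports a stage name before each step. This is only an illustrative sketch: `run_stages`, the `report` callback, and the shared `context` dict are hypothetical, and the real handler uses the shared status store with the stage names listed under Implementation Details.

```python
from typing import Callable, Dict, List, Tuple

# A stage is a (stage_name, step) pair; each step mutates a shared context dict.
Stage = Tuple[str, Callable[[Dict], None]]


def run_stages(stages: List[Stage], context: Dict, report: Callable[[str], None]) -> bool:
    """Run steps in order, reporting each stage name before it starts.

    Returns True when every step succeeded. On the first exception the error
    is stored in context["error_message"] and the remaining steps are skipped.
    """
    for name, step in stages:
        report(name)
        try:
            step(context)
        except Exception as exc:  # sketch only: real code would catch narrower errors
            context["error_message"] = f"{name}: {exc}"
            return False
    report("indexing_completed")
    return True
```

A usage example with stub steps standing in for the real download and extraction code:

```python
seen = []
ok = run_stages(
    [
        ("downloading_file", lambda ctx: ctx.update(pdf=b"%PDF-")),
        ("extracting", lambda ctx: ctx.update(pages=["page one text"])),
    ],
    {},
    seen.append,
)
```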
Notes
- Large files may be paginated and chunked to respect token limits.
- Vector field `contentVector` must be present for indexing.
- Configure Form Recognizer via the `form_recognizer` section in `resources/configs/development_config.yml`.
- Validate the `ai_search` configuration for index/embedding endpoints.
Implementation Details
- Code structure
  - Event handler orchestrates: blob download → extraction → mapping → artifacts → embeddings → upload.
  - Uses a shared status store: `shared.status_store.get_status_store()`; stages: `downloading_file`, `extracting`, `documents_created`, `generating_embeddings`, `uploading_to_index`, `indexing_completed`.
- Extraction
  - Primary path: Azure Form Recognizer (model `prebuilt-document`) with paging.
  - Fallback path: library-based parsing if FR is unavailable, extracting text per page.
- Chunking
  - Page-level chunks with optional intra-page splitting based on character or token heuristics.
  - Each chunk gets a deterministic `id`; if a chunk is split, subsequent parts get the suffix `-pN` and add `original_id`.
- Mapping to schema
  - Allowed fields: `id`, `source_file`, `source_path`, `file_type`, `mime_type`, `ingestion_timestamp`, `page_number`, `section_title`, `subsection_title`, `type`, `text`, `extra_metadata`, `contentVector`.
  - Non-allowed keys are dropped before upload.
- Embeddings
  - Lazy-loads `rag_shared` Retrieval; vectors are written to field `contentVector`, with the dimension matching config (default 1536).
  - Context-window fallback rechunks and batches the embedding requests.
- Index upload
  - Uploads batches to `unified_text_index`; collects per-item results.
  - Writes `pdf_index_upload_failures.json` with `id`, `status_code`, `error_message`, `text_len`, and `vector_dim` for diagnosis.
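The chunk-splitting and schema-mapping rules above can be sketched as two small functions. This is a simplified illustration under stated assumptions, not the pipeline's code: `split_chunk` and `to_index_doc` are hypothetical names, splitting is by plain character count with no overlap, and numbering the second part `-p2` is a guess at what `N` means.

```python
from typing import Dict, List

ALLOWED_FIELDS = {
    "id", "source_file", "source_path", "file_type", "mime_type",
    "ingestion_timestamp", "page_number", "section_title",
    "subsection_title", "type", "text", "extra_metadata", "contentVector",
}


def split_chunk(doc: Dict, max_chars: int) -> List[Dict]:
    """Split a page-level document into parts of at most max_chars characters.

    The first part keeps the original id; later parts get "<id>-pN"
    (N starting at 2, an assumption) and carry original_id.
    """
    text = doc["text"]
    if len(text) <= max_chars:
        return [doc]
    parts = []
    for n, start in enumerate(range(0, len(text), max_chars), start=1):
        piece = dict(doc, text=text[start:start + max_chars])
        if n > 1:
            piece["id"] = f"{doc['id']}-p{n}"
            piece["original_id"] = doc["id"]
        parts.append(piece)
    return parts


def to_index_doc(doc: Dict) -> Dict:
    """Drop every key that is not in the allowed field list."""
    return {key: value for key, value in doc.items() if key in ALLOWED_FIELDS}
```

Note that `original_id` is not in the allowed field list as written, so this filter would remove it before upload. Whether the real pipeline preserves it elsewhere (for example inside `extra_metadata`) is not stated here.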
Artifacts and Schema Examples
- `pdf_summary.json`: `{ "file_name": "sample.pdf", "total_documents": 123, "pages": 10, "processed_at": "...Z" }`
- `pdf_documents_index_ready.jsonl` (one JSON object per line): `{ "id": "abc123", "source_file": "sample.pdf", "file_type": "pdf", "page_number": 5, "text": "...", "ingestion_timestamp": "...Z" }`
- `pdf_documents_index_ready_embedded.jsonl`
  - Same as index_ready plus `contentVector: [float, ...]`.
- `pdf_index_upload_failures.json`: `{ "failures": [ { "id": "...", "error_message": "...", "vector_dim": 1536 } ] }`
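For local inspection, artifacts in these shapes can be written and read back with the standard library. The sketch below is an assumption-laden illustration: `write_artifacts` and `read_jsonl` are hypothetical helpers, output goes to a local directory rather than blob storage, and the timestamp format (ISO 8601 UTC ending in `Z`) is a guess consistent with the elided `"...Z"` values above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List


def write_artifacts(out_dir: str, file_name: str, docs: List[Dict], pages: int) -> Dict:
    """Write the index-ready JSONL file and the summary JSON; return the summary."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # JSONL: exactly one JSON object per line.
    with open(out / "pdf_documents_index_ready.jsonl", "w", encoding="utf-8") as fh:
        for doc in docs:
            fh.write(json.dumps(doc, ensure_ascii=False) + "\n")
    summary = {
        "file_name": file_name,
        "total_documents": len(docs),
        "pages": pages,
        # Assumed format; the source only shows "...Z".
        "processed_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    (out / "pdf_summary.json").write_text(json.dumps(summary, indent=2), encoding="utf-8")
    return summary


def read_jsonl(path: str) -> List[Dict]:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```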
Configuration
- `form_recognizer.endpoint`, `form_recognizer.api_key`, `form_recognizer.model_id`, `form_recognizer.pages_per_call`
- Embeddings are configured globally under `app.models.embeddings` (deployment, api_version, endpoint/auth).
- Chunking envs (if exposed similarly to JSON): `PDF_EMBED_TEXT_MAX_CHARS`, `PDF_EMBED_CHUNK_OVERLAP`, `PDF_EMBED_FALLBACK_*`
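If the chunking environment variables are exposed, they could be read and sanity-checked along these lines. This is a sketch only: `load_chunking_settings` is a hypothetical helper, and the default values (8000 characters, 200 overlap) are placeholders chosen for illustration, not the pipeline's real defaults.

```python
import os
from typing import Dict, Mapping, Optional


def load_chunking_settings(env: Optional[Mapping[str, str]] = None) -> Dict:
    """Read PDF chunking settings from an environment-like mapping."""
    env = os.environ if env is None else env
    # Placeholder defaults; the real values are not documented here.
    max_chars = int(env.get("PDF_EMBED_TEXT_MAX_CHARS", "8000"))
    overlap = int(env.get("PDF_EMBED_CHUNK_OVERLAP", "200"))
    if max_chars <= 0:
        raise ValueError("PDF_EMBED_TEXT_MAX_CHARS must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("PDF_EMBED_CHUNK_OVERLAP must be >= 0 and smaller than max chars")
    # Collect any PDF_EMBED_FALLBACK_* variables verbatim, without interpreting them.
    fallback = {k: v for k, v in env.items() if k.startswith("PDF_EMBED_FALLBACK_")}
    return {"max_chars": max_chars, "overlap": overlap, "fallback": fallback}
```

Requiring the overlap to be smaller than the chunk size guards against a splitter that would never advance through the text.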
Error Modes
- Download failures → status `file_download_error`.
- Form Recognizer errors → `extraction_error` with FR diagnostic.
- Embedding context window → automatic retry with fallback chunking.
- Missing `contentVector` on upload → failures categorized as `missing_vector`.
- Schema mismatches → failures categorized as `schema_related`.
- Duplicate IDs → avoided by the `-pN` suffix; collisions are recorded if present.
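When triaging `pdf_index_upload_failures.json`, the upload categories above can be approximated from the recorded fields. The rules in this sketch are assumptions made for illustration: it treats a missing or zero `vector_dim` as `missing_vector` and an HTTP 400 `status_code` as `schema_related`, and the `other` bucket is hypothetical. The pipeline's real categorization logic may differ.

```python
from collections import Counter
from typing import Dict, Iterable


def categorize_failure(failure: Dict) -> str:
    """Assign one failure record to a category using simple heuristics."""
    if not failure.get("vector_dim"):
        # No vector (None or 0 dimensions) was attached to the document.
        return "missing_vector"
    if failure.get("status_code") == 400:
        # Assumption: the index rejects schema mismatches as bad requests.
        return "schema_related"
    return "other"  # hypothetical catch-all, not named in this document


def summarize_failures(failures: Iterable[Dict]) -> Counter:
    """Count failures per category."""
    return Counter(categorize_failure(f) for f in failures)
```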
Observability
- Status table fields: `state`, `progress`, `processing_stage`, `documents_created`, `documents_uploaded`, `error_message`.
- Logs include counts, sizes (bytes/MB), and index upload summaries.
Testing
- Event Grid: upload a PDF to `conversionfiles/pdf/raw/` and watch the logs.
- HTTP (if available): `/api/pdf/process?blob_url=<encoded>` or a similar route.
- Verify that `conversionfiles/pdf/parsed/<filename>/` contains the artifacts; open `pdf_index_upload_failures.json` if present.
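For the HTTP path, the `blob_url` query parameter has to be percent-encoded. A minimal sketch of building the request URL follows; `build_process_url` is a hypothetical helper, the route is the one mentioned above and may differ in a given deployment, and the base URL is whatever host the function app is served from.

```python
from urllib.parse import quote


def build_process_url(base_url: str, blob_url: str, route: str = "/api/pdf/process") -> str:
    """Return the processing URL with blob_url fully percent-encoded."""
    # safe="" also encodes ":" and "/" so the blob URL survives as one parameter value.
    return f"{base_url.rstrip('/')}{route}?blob_url={quote(blob_url, safe='')}"
```

The resulting URL can then be requested with any HTTP client, for example against a locally running function host.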