Python SDK
Complete reference for the ClicheFactory Python SDK — extraction, conversion, batch processing, and trained models.
Installation
```shell
# pip
pip install clichefactory
# with local parsing dependencies (BYOK mode)
pip install "clichefactory[local]"

# uv
uv add clichefactory
uv add "clichefactory[local]"
```
The core package is all you need for service mode. The [local] extra adds OCR engines and parsers for running extraction on your own machine.
Authentication
Service Mode
Pass your ClicheFactory API key directly:
```python
from clichefactory import factory

client = factory(api_key="cliche-...")
```
Local Mode (BYOK)
Provide your own LLM key — extraction runs on your machine, ClicheFactory handles parsing and orchestration:
```python
from clichefactory import Endpoint, factory

client = factory(
    mode="local",
    model=Endpoint(provider_model="gemini/gemini-3-flash-preview", api_key="your-llm-key"),
)
```
Config File
Run the interactive setup to save credentials to ~/.clichefactory/config.toml:
Precedence: explicit factory() args > environment variables > config file > defaults. See Execution Modes for details on local vs. service.
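The precedence order can be sketched as a simple resolver. This is illustrative only, not the SDK's internals, and the environment variable name `CLICHEFACTORY_API_KEY` is an assumption, not a documented setting:

```python
import os

def resolve_api_key(explicit=None, config=None, default=None):
    """Illustrates the documented precedence:
    explicit factory() args > environment variables > config file > defaults."""
    if explicit is not None:
        return explicit
    env = os.environ.get("CLICHEFACTORY_API_KEY")  # assumed env var name
    if env:
        return env
    if config and config.get("api_key"):
        return config["api_key"]
    return default
```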
Extraction
Define a schema, create a cliche (extraction pipeline), and extract. Schemas can be Pydantic models or plain dicts.
With Pydantic
```python
from pydantic import BaseModel

from clichefactory import factory

class Invoice(BaseModel):
    invoice_number: str
    total: float
    vendor: str

client = factory(api_key="cliche-...")
cliche = client.cliche(Invoice)
result = cliche.extract(file="invoice.pdf")
print(result.invoice_number, result.total)
```
With Dict Schema
No Pydantic required — pass a JSON Schema dict:
```python
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "vendor": {"type": "string"}
    }
}

cliche = client.cliche(schema)
result = cliche.extract(file="invoice.pdf")
print(result["total"])
```
See Schemas in Core Concepts for detailed schema patterns including nested objects and arrays.
Extraction Modes
Control the accuracy/cost tradeoff per call. See Core Concepts for what each mode does.
| Mode | Code |
|---|---|
| Default (balanced) | `cliche.extract(file=...)` |
| Fast | `cliche.extract(file=..., mode="fast")` |
| Robust | `cliche.extract(file=..., mode="robust")` |
| Trained | `cliche.extract(file=..., artifact_id="art_xxx")` |
| Robust + Trained | `cliche.extract(file=..., mode="robust-trained", artifact_id="art_xxx")` |
When `artifact_id` is provided, the pipeline mode is defined by the artifact, so you don't need to set `mode` explicitly. For maximum accuracy, use `mode="robust-trained"`, which combines the trained artifact with verification automatically.
Trained Models
Use a trained extraction pipeline by passing the artifact_id from a completed training run:
```python
result = cliche.extract(file="invoice.pdf", artifact_id="art_xxx")
```
The trained pipeline knows its data model — the schema you pass is used for local validation only. See Training → Using Trained Models for the full workflow.
Document Conversion
Convert any supported document to markdown text:
```python
doc = client.to_markdown("document.pdf")
print(doc.get_markdown())

# Fast mode: skip OCR, send file directly to a VLM (service mode only)
doc = client.to_markdown("scan.pdf", conversion_mode="fast")
```
conversion_mode="fast" skips OCR and sends the file directly to a vision model — useful for speed when the document is image-heavy and the LLM can read it natively. Supported values: "default", "fast".
Supports PDF, images, DOCX, XLSX, CSV, EML, and more — see Supported File Types.
Long Documents (chunk + merge)
The single-call extract pipeline is bounded by the LLM context window (roughly 100 pages per file). For longer documents, use extract_long: it converts the file to markdown once, chunks it, extracts each chunk in parallel, and merges the per-chunk results into a single Pydantic model.
```python
result = cliche.extract_long("200_page_contract.pdf")
```
List-valued fields are concatenated across chunks; scalar fields pick the first non-null value by default. Override the merge rule per-field with the resolvers= argument on cliche() or extract_long() — e.g. {"total": "sum_numeric", "line_items": "concat_dedupe_by=sku"}.
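To make the default merge policy concrete, here is a minimal pure-Python sketch of the rules just described, not the SDK's actual implementation: list fields are concatenated across chunks, scalar fields take the first non-null value.

```python
def merge_chunk_results(chunks: list[dict]) -> dict:
    """Sketch of the default extract_long merge policy."""
    merged: dict = {}
    for chunk in chunks:
        for field, value in chunk.items():
            if isinstance(value, list):
                # List-valued fields: concatenate across chunks
                merged.setdefault(field, []).extend(value)
            elif merged.get(field) is None:
                # Scalar fields: first non-null value wins
                merged[field] = value
    return merged
```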
Pass include_chunk_results=True to get a LongExtractionResult with per-chunk outputs, resolution traces, warnings and aggregated cost. v1 limitation: extract_long runs the BYOK one-shot path only; mode="trained" | "robust" is rejected.
Full reference (chunkers, resolver aliases, default policy): SDK README — Long documents.
Batch Operations
Process multiple files concurrently:
```python
# Batch extraction
results = cliche.extract_batch(["inv1.pdf", "inv2.pdf", "inv3.pdf"], max_concurrency=5)

# Batch document conversion
docs = client.to_markdown_batch(["a.pdf", "b.pdf"], max_concurrency=5)
```
max_concurrency controls how many files are processed in parallel. Defaults to 5. All keyword arguments are forwarded to the underlying extract() or to_markdown() call.
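Bounded parallelism like this is typically implemented with a semaphore; a minimal asyncio sketch (illustrative, not the SDK's internals):

```python
import asyncio

async def bounded_gather(func, items, max_concurrency=5):
    """Run an async func over items with at most max_concurrency calls in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def worker(item):
        async with sem:
            return await func(item)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(worker(i) for i in items))
```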
Async API
Every synchronous method has an _async variant for use in async frameworks like FastAPI, asyncio scripts, or Jupyter notebooks:
```python
import asyncio

from clichefactory import factory

client = factory(api_key="cliche-...")
cliche = client.cliche(Invoice)  # Invoice: the Pydantic model defined above

async def main():
    # Single async extraction
    result = await cliche.extract_async(file="invoice.pdf")

    # Batch with concurrency control
    results = await cliche.extract_batch_async(["a.pdf", "b.pdf"], max_concurrency=10)

    # Async document conversion
    doc = await client.to_markdown_async("document.pdf")

asyncio.run(main())
```
| Sync | Async |
|---|---|
| `cliche.extract(...)` | `await cliche.extract_async(...)` |
| `cliche.extract_batch(...)` | `await cliche.extract_batch_async(...)` |
| `client.to_markdown(...)` | `await client.to_markdown_async(...)` |
| `client.to_markdown_batch(...)` | `await client.to_markdown_batch_async(...)` |
All async variants accept the same arguments as their sync counterparts.
Advanced
Partial Results
Return partial data on validation failure instead of raising an error:
```python
from clichefactory import PartialExtraction  # import path assumed: package root

result = cliche.extract(file="messy.pdf", allow_partial=True)

if isinstance(result, PartialExtraction):
    print("Partial result:", result.raw)
    print("Validation errors:", result.validation_errors)
else:
    print("Full result:", result)
```
When validation passes, a normal result is returned. When it fails, PartialExtraction is returned with .raw (the coerced dict) and .validation_errors (list of Pydantic error dicts) so you can inspect and recover what the model extracted.
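As an illustration of working with .validation_errors, here is a hedged sketch (not part of the SDK) that strips the failing top-level fields from .raw. It assumes Pydantic's error-dict shape, where the `loc` tuple starts with the field name:

```python
def recover_valid_fields(raw: dict, validation_errors: list) -> dict:
    """Drop top-level fields named in Pydantic-style error dicts; keep the rest."""
    failing = {err["loc"][0] for err in validation_errors if err.get("loc")}
    return {k: v for k, v in raw.items() if k not in failing}
```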
Include Document & Costs
Get the parsed document or cost breakdown alongside the extraction result:
```python
# Get the parsed document alongside the result
result, doc = cliche.extract(file="invoice.pdf", include_doc=True)
print(doc.get_markdown())  # inspect what the model saw

# Get a cost breakdown
result = cliche.extract(file="invoice.pdf", include_costs=True)
```
When include_doc=True, the return value is a (result, doc) tuple. doc.get_markdown() returns the parsed document text that was sent to the LLM — useful for debugging extraction quality.
Postprocess Hooks
Transform extraction results before they're returned:
```python
def fix_dates(result: dict) -> dict:
    # parse_european_date: your own helper function
    result["date"] = parse_european_date(result["date"])
    return result

cliche = client.cliche(Invoice, postprocess=fix_dates)
```
Pipeline order: raw LLM dict → system coerce (EU decimals, currency, accounting negatives) → postprocess → Pydantic validation. The function receives and must return a dict.
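To make the ordering concrete, a toy version of the pipeline (the coercion rule shown is illustrative, not the SDK's exact behavior):

```python
def coerce(raw: dict) -> dict:
    """Toy system-coerce step: normalize an EU-formatted decimal like '1.234,56'."""
    out = dict(raw)
    total = out.get("total")
    if isinstance(total, str):
        out["total"] = float(total.replace(".", "").replace(",", "."))
    return out

def run_pipeline(raw: dict, postprocess) -> dict:
    # Documented order: raw LLM dict -> system coerce -> postprocess -> validation
    return postprocess(coerce(raw))
```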
Parser & OCR Configuration (Local Mode)
Fine-tune parsing at the client or cliche level via ParsingOptions. ParsingOptions has no effect on service-mode extraction. See Processing & OCR for background.
PDF image parser — controls how image-based PDF pages are processed:
```python
from clichefactory import ParsingOptions, factory

# Default: Docling OCR + table structure recognition
client = factory(..., parsing=ParsingOptions(pdf_image_parser="docling"))

# Enhanced: Docling OCR + per-page VLM refinement for complex layouts
client = factory(..., parsing=ParsingOptions(pdf_image_parser="docling_vlm"))

# vision_layout: service-only, uses proprietary layout model
client = factory(..., parsing=ParsingOptions(pdf_image_parser="vision_layout"))
```
docling_vlm requires ocr_model to be configured (it uses a VLM per page). vision_layout is service-only and will raise an error in local mode.
PDF OCR engine — the engine used when OCR is applied to PDF pages:
```python
# rapidocr (default): pure Python, no system dependencies
client = factory(..., parsing=ParsingOptions(pdf_ocr_engine="rapidocr"))

# tesseract: higher accuracy, requires the tesseract binary
client = factory(..., parsing=ParsingOptions(pdf_ocr_engine="tesseract", pdf_ocr_lang="deu"))

# easyocr: GPU-accelerated, downloads models on first use
client = factory(..., parsing=ParsingOptions(pdf_ocr_engine="easyocr"))
```
Image parser — controls OCR for standalone image files (PNG, JPG, etc.):
| Value | Description | Default |
|---|---|---|
| `"rapidocr"` | Pure Python OCR, no system deps | Yes |
| `"pytesseract"` | Tesseract-backed, requires tesseract binary | — |
| `"docling"` | Docling layout-aware OCR | — |
| `"ocr_llm"` | Send image directly to VLM (requires `ocr_model`) | — |
```python
client = factory(
    ...,
    parsing=ParsingOptions(image_parser="pytesseract", image_parser_lang="deu")
)
```
Full ParsingOptions reference:
| Field | Type | Default | Description |
|---|---|---|---|
| `pdf_image_parser` | `"docling"` \| `"docling_vlm"` \| `"vision_layout"` | `"docling"` | Parser for image-based PDF pages |
| `pdf_ocr_engine` | `"rapidocr"` \| `"tesseract"` \| `"easyocr"` | `"rapidocr"` | OCR engine for PDF pages |
| `pdf_ocr_lang` | `str` | `"eng"` | Language code for PDF OCR (e.g. `"deu"`, `"fra"`) |
| `pdf_fallback_to_ocr_llm` | `bool` | `True` | Fall back to VLM when OCR confidence is low |
| `pdf_structured_fallback_to_image` | `bool` | `False` | Re-process native-text PDF pages through the image pipeline |
| `use_ocr_llm_body` | `bool` | `True` | Use VLM for body text in addition to tables |
| `image_parser` | `"rapidocr"` \| `"pytesseract"` \| `"docling"` \| `"ocr_llm"` | `"rapidocr"` | Parser for standalone image files |
| `image_parser_fallback` | `bool` | `True` | Fall back to VLM if image OCR confidence is low |
| `image_parser_lang` | `str` | `"eng"` | Language code for image OCR |
Model & OCR Model Override Per Call
Use a different LLM for a specific extraction, or route OCR to a separate (cheaper) model:
```python
from clichefactory import Endpoint

# Override the extraction model for one call
result = cliche.extract(
    file="contract.pdf",
    model=Endpoint(provider_model="openai/gpt-4o", api_key="sk-...")
)

# Use a separate model for OCR (e.g. a cheaper VLM for image parsing)
result = cliche.extract(
    file="scanned.pdf",
    model=Endpoint(provider_model="openai/gpt-4o", api_key="sk-..."),
    ocr_model=Endpoint(provider_model="gemini/gemini-3-flash-preview", api_key="...")
)
```
ocr_model is used for document parsing / VLM refinement passes (e.g. with pdf_image_parser="docling_vlm" or when OCR falls back to a VLM). If not set, the extraction model is also used for OCR.
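The fallback rule reduces to a one-line selection; an illustrative sketch, not SDK code:

```python
def pick_ocr_model(model, ocr_model=None):
    """If ocr_model is not set, the extraction model is also used for OCR."""
    return ocr_model if ocr_model is not None else model
```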
Endpoint Configuration
The Endpoint object configures any LLM call — on factory(), cliche(), or per extract():
| Field | Type | Description |
|---|---|---|
provider_model | str | Provider-prefixed model name, e.g. "openai/gpt-4o", "gemini/gemini-3-flash-preview", "ollama/llama3.2" |
api_key | str | API key for the provider. Omit or leave empty for Ollama. |
api_base | str | Custom base URL — required for self-hosted Ollama or Azure OpenAI endpoints |
max_tokens | int | Maximum output tokens (default: 10000) |
temperature | float | Sampling temperature 0.0–2.0 (default: 0.1 for extraction, 1.0 for OCR) |
num_retries | int | LLM call retries on transient failures (default: 8) |
```python
from clichefactory import Endpoint, factory

# Self-hosted Ollama: no API key needed, custom base URL required
client = factory(
    mode="local",
    model=Endpoint(
        provider_model="ollama/llama3.2",
        api_base="http://my-gpu-host:11434"
    )
)
```
Text Input
Extract structured data from raw text (no file needed):