Python SDK

Complete reference for the ClicheFactory Python SDK — extraction, conversion, batch processing, and trained models.

Installation

# pip — core package (service mode)
pip install clichefactory

# pip — with local parsing dependencies (BYOK mode)
pip install clichefactory[local]

# uv
uv add clichefactory
uv add clichefactory[local]

The core package is all you need for service mode. The [local] extra adds OCR engines and parsers for running extraction on your own machine.

Authentication

Service Mode

Pass your ClicheFactory API key directly:

from clichefactory import factory

client = factory(api_key="cliche-...")

Local Mode (BYOK)

Provide your own LLM key — extraction runs on your machine, ClicheFactory handles parsing and orchestration:

from clichefactory import factory, Endpoint

client = factory(
    mode="local",
    model=Endpoint(provider_model="gemini/gemini-3-flash-preview", api_key="your-llm-key")
)

Config File

Run the interactive setup to save credentials to ~/.clichefactory/config.toml:

clichefactory configure

Precedence: explicit factory() args > environment variables > config file > defaults. See Execution Modes for details on local vs. service.
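That precedence rule can be sketched in plain Python (resolve_setting is a hypothetical helper for illustration, not part of the SDK):

```python
# Hypothetical helper illustrating the precedence rule:
# explicit factory() args > environment variables > config file > defaults.
def resolve_setting(name, explicit=None, env=None, config=None, defaults=None):
    """Return the first value found, walking the sources in precedence order."""
    for candidate in (
        explicit,
        (env or {}).get(name),
        (config or {}).get(name),
        (defaults or {}).get(name),
    ):
        if candidate is not None:
            return candidate
    return None

# The explicit argument wins even though env and config also define the key.
api_key = resolve_setting(
    "api_key",
    explicit="cliche-explicit",
    env={"api_key": "cliche-env"},
    config={"api_key": "cliche-config"},
)
```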

Extraction

Define a schema, create a cliche (extraction pipeline), and extract. Schemas can be Pydantic models or plain dicts.

With Pydantic
from pydantic import BaseModel
from clichefactory import factory

class Invoice(BaseModel):
    invoice_number: str
    total: float
    vendor: str

client = factory(api_key="cliche-...")
cliche = client.cliche(Invoice)
result = cliche.extract(file="invoice.pdf")
print(result.invoice_number, result.total)

With Dict Schema

No Pydantic required — pass a JSON Schema dict:

schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "vendor": {"type": "string"}
    }
}

cliche = client.cliche(schema)
result = cliche.extract(file="invoice.pdf")
print(result["total"])

See Schemas in Core Concepts for detailed schema patterns including nested objects and arrays.

Extraction Modes

Control the accuracy/cost tradeoff per call. See Core Concepts for what each mode does.

| Mode | Code |
| --- | --- |
| Default (balanced) | cliche.extract(file=...) |
| Fast | cliche.extract(file=..., mode="fast") |
| Robust | cliche.extract(file=..., mode="robust") |
| Trained | cliche.extract(file=..., artifact_id="art_xxx") |
| Robust + Trained | cliche.extract(file=..., mode="robust-trained", artifact_id="art_xxx") |

When artifact_id is provided, the artifact defines the pipeline mode, so you don't need to set mode explicitly. For maximum accuracy, combine a trained artifact with verification via mode="robust-trained".

Trained Models

Training currently runs in BYOK mode only. Supply your model provider key (OpenAI, Gemini, or Anthropic) when training. Full-service training is on the roadmap.

Use a trained extraction pipeline by passing the artifact_id from a completed training run:

cliche = client.cliche(Invoice, artifact_id="art_abc123")
result = cliche.extract(file="invoice.pdf")

The trained pipeline knows its data model — the schema you pass is used for local validation only. See Training → Using Trained Models for the full workflow.

Document Conversion

Convert any supported document to markdown text:

doc = client.to_markdown("document.pdf")
print(doc.get_markdown())

# Fast mode: skip OCR, send file directly to a VLM (service mode only)
doc = client.to_markdown("scan.pdf", conversion_mode="fast")

conversion_mode="fast" skips OCR and sends the file directly to a vision model — useful for speed when the document is image-heavy and the LLM can read it natively. Supported values: "default", "fast".

Supports PDF, images, DOCX, XLSX, CSV, EML, and more — see Supported File Types.

Long Documents (chunk + merge)

The single-call extract pipeline is bounded by the LLM context window — roughly 100 pages per file. For longer documents, use extract_long: it converts the file to markdown once, chunks it, extracts each chunk in parallel, and merges the per-chunk results into a single Pydantic model.

cliche = client.cliche(Invoice)
result = cliche.extract_long("200_page_contract.pdf")

List-valued fields are concatenated across chunks; scalar fields pick the first non-null value by default. Override the merge rule per-field with the resolvers= argument on cliche() or extract_long() — e.g. {"total": "sum_numeric", "line_items": "concat_dedupe_by=sku"}.
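As an illustration of that merge policy (not the SDK's implementation; only the sum_numeric alias is modeled here), the logic might look like:

```python
# Illustrative merge of per-chunk results: lists concatenate, scalars take the
# first non-null value, and a resolver can override the rule for a field.
def merge_chunks(chunks, resolvers=None):
    resolvers = resolvers or {}
    merged = {}
    for key in {k for chunk in chunks for k in chunk}:
        values = [c[key] for c in chunks if c.get(key) is not None]
        if resolvers.get(key) == "sum_numeric":
            merged[key] = sum(values)                            # resolver override
        elif values and isinstance(values[0], list):
            merged[key] = [item for v in values for item in v]   # default: concat
        else:
            merged[key] = values[0] if values else None          # default: first non-null
    return merged

chunks = [
    {"vendor": "Acme", "total": 100.0, "line_items": ["a"]},
    {"vendor": None, "total": 50.0, "line_items": ["b", "c"]},
]
merged = merge_chunks(chunks, resolvers={"total": "sum_numeric"})
# merged == {"vendor": "Acme", "total": 150.0, "line_items": ["a", "b", "c"]}
```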

Pass include_chunk_results=True to get a LongExtractionResult with per-chunk outputs, resolution traces, warnings, and aggregated cost. v1 limitation: extract_long runs the BYOK one-shot path only; mode="trained" and mode="robust" are rejected.

Full reference (chunkers, resolver aliases, default policy): SDK README — Long documents.

Batch Operations

Process multiple files concurrently:

# Batch extraction
results = cliche.extract_batch(["inv1.pdf", "inv2.pdf", "inv3.pdf"], max_concurrency=5)

# Batch document conversion
docs = client.to_markdown_batch(["a.pdf", "b.pdf"], max_concurrency=5)

max_concurrency controls how many files are processed in parallel. Defaults to 5. All keyword arguments are forwarded to the underlying extract() or to_markdown() call.
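This kind of bounded fan-out is commonly built on a semaphore. A minimal self-contained sketch (fake_extract stands in for the real per-file call; none of these names are SDK APIs):

```python
import asyncio

async def run_batch(files, worker, max_concurrency=5):
    """Run worker(file) for every file, at most max_concurrency at a time."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(f):
        async with sem:          # blocks while max_concurrency workers are active
            return await worker(f)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(f) for f in files))

async def fake_extract(path):
    await asyncio.sleep(0)       # stand-in for real network / parsing I/O
    return {"file": path}

results = asyncio.run(
    run_batch(["a.pdf", "b.pdf", "c.pdf"], fake_extract, max_concurrency=2)
)
# results == [{"file": "a.pdf"}, {"file": "b.pdf"}, {"file": "c.pdf"}]
```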

Async API

Every synchronous method has an _async variant for use in async frameworks like FastAPI, asyncio scripts, or Jupyter notebooks:

import asyncio
from clichefactory import factory

client = factory(api_key="cliche-...")
cliche = client.cliche(Invoice)

async def main():
    # Single async extraction
    result = await cliche.extract_async(file="invoice.pdf")

    # Batch with concurrency control
    results = await cliche.extract_batch_async(["a.pdf", "b.pdf"], max_concurrency=10)

    # Async document conversion
    doc = await client.to_markdown_async("document.pdf")

asyncio.run(main())

| Sync | Async |
| --- | --- |
| cliche.extract(...) | await cliche.extract_async(...) |
| cliche.extract_batch(...) | await cliche.extract_batch_async(...) |
| client.to_markdown(...) | await client.to_markdown_async(...) |
| client.to_markdown_batch(...) | await client.to_markdown_batch_async(...) |

All async variants accept the same arguments as their sync counterparts.

Advanced

Partial Results

Return partial data on validation failure instead of raising an error:

from clichefactory import PartialExtraction

result = cliche.extract(file="messy.pdf", allow_partial=True)

if isinstance(result, PartialExtraction):
    print("Partial result:", result.raw)
    print("Validation errors:", result.validation_errors)
else:
    print("Full result:", result)

When validation passes, a normal result is returned. When it fails, PartialExtraction is returned with .raw (the coerced dict) and .validation_errors (list of Pydantic error dicts) so you can inspect and recover what the model extracted.

Include Document & Costs

Get the parsed document or cost breakdown alongside the extraction result:

result, doc = cliche.extract(file="invoice.pdf", include_doc=True)
print(doc.get_markdown()) # inspect what the model saw

result = cliche.extract(file="invoice.pdf", include_costs=True)

When include_doc=True, the return value is a (result, doc) tuple. doc.get_markdown() returns the parsed document text that was sent to the LLM — useful for debugging extraction quality.

Postprocess Hooks

Transform extraction results before they're returned:

def fix_dates(result: dict) -> dict:
    result["date"] = parse_european_date(result["date"])
    return result

cliche = client.cliche(Invoice, postprocess=fix_dates)

Pipeline order: raw LLM dict → system coerce (EU decimals, currency, accounting negatives) → postprocess → Pydantic validation. The function receives and must return a dict.
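A toy version of that order (the coercion step below handles only EU decimal strings and is illustrative, not the SDK's coercion logic):

```python
def coerce_eu_decimal(value):
    """Turn an EU-formatted amount like '1.234,56' into a float; pass others through."""
    if isinstance(value, str):
        try:
            return float(value.replace(".", "").replace(",", "."))
        except ValueError:
            return value
    return value

def run_pipeline(raw, postprocess=None):
    coerced = {k: coerce_eu_decimal(v) for k, v in raw.items()}  # system coerce
    if postprocess is not None:
        coerced = postprocess(coerced)   # user hook runs after coercion,
    return coerced                       # before validation would happen

result = run_pipeline(
    {"total": "1.234,56"},
    postprocess=lambda d: {**d, "reviewed": True},
)
# result == {"total": 1234.56, "reviewed": True}
```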

Parser & OCR Configuration (Local Mode)

Fine-tune parsing at the client or cliche level via ParsingOptions. ParsingOptions has no effect on service-mode extraction. See Processing & OCR for background.

PDF image parser — controls how image-based PDF pages are processed:

from clichefactory import ParsingOptions

# Default: Docling OCR + table structure recognition
client = factory(..., parsing=ParsingOptions(pdf_image_parser="docling"))

# Enhanced: Docling OCR + per-page VLM refinement for complex layouts
client = factory(..., parsing=ParsingOptions(pdf_image_parser="docling_vlm"))

# vision_layout: SaaS-only, uses proprietary layout model
client = factory(..., parsing=ParsingOptions(pdf_image_parser="vision_layout"))

docling_vlm requires ocr_model to be configured (it uses a VLM per page). vision_layout is service-only and will raise an error in local mode.

PDF OCR engine — the engine used when OCR is applied to PDF pages:

# rapidocr is the default (pure Python, no system deps)
client = factory(..., parsing=ParsingOptions(pdf_ocr_engine="rapidocr"))

# tesseract: higher accuracy, requires tesseract binary
client = factory(..., parsing=ParsingOptions(pdf_ocr_engine="tesseract", pdf_ocr_lang="deu"))

# easyocr: GPU-accelerated, downloads models on first use
client = factory(..., parsing=ParsingOptions(pdf_ocr_engine="easyocr"))

Image parser — controls OCR for standalone image files (PNG, JPG, etc.):

| Value | Description | Default |
| --- | --- | --- |
| "rapidocr" | Pure Python OCR, no system deps | Yes |
| "pytesseract" | Tesseract-backed, requires tesseract binary | |
| "docling" | Docling layout-aware OCR | |
| "ocr_llm" | Send image directly to VLM (requires ocr_model) | |

client = factory(
    ...,
    parsing=ParsingOptions(image_parser="pytesseract", image_parser_lang="deu")
)

Full ParsingOptions reference:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| pdf_image_parser | "docling" \| "docling_vlm" \| "vision_layout" | "docling" | Parser for image-based PDF pages |
| pdf_ocr_engine | "rapidocr" \| "tesseract" \| "easyocr" | "rapidocr" | OCR engine for PDF pages |
| pdf_ocr_lang | str | "eng" | Language code for PDF OCR (e.g. "deu", "fra") |
| pdf_fallback_to_ocr_llm | bool | True | Fall back to VLM when OCR confidence is low |
| pdf_structured_fallback_to_image | bool | False | Re-process native-text PDF pages through the image pipeline |
| use_ocr_llm_body | bool | True | Use VLM for body text in addition to tables |
| image_parser | "rapidocr" \| "pytesseract" \| "docling" \| "ocr_llm" | "rapidocr" | Parser for standalone image files |
| image_parser_fallback | bool | True | Fall back to VLM if image OCR confidence is low |
| image_parser_lang | str | "eng" | Language code for image OCR |

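The two fallback flags (pdf_fallback_to_ocr_llm, image_parser_fallback) follow the same confidence-gated pattern. A sketch, where the threshold and function shapes are assumptions rather than SDK internals:

```python
def choose_text(ocr_text, confidence, vlm_reader, fallback=True, threshold=0.6):
    """Keep the OCR output unless confidence is low and VLM fallback is enabled."""
    if fallback and confidence < threshold:
        return vlm_reader()   # re-read the page with the configured (ocr_)model
    return ocr_text

page = choose_text("g@rbl3d t3xt", confidence=0.3, vlm_reader=lambda: "clean VLM text")
# page == "clean VLM text"
```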
Model & OCR Model Override Per Call

Use a different LLM for a specific extraction, or route OCR to a separate (cheaper) model:

# Override extraction model for one call
result = cliche.extract(
    file="contract.pdf",
    model=Endpoint(provider_model="openai/gpt-4o", api_key="sk-...")
)

# Use a separate model for OCR (e.g. cheaper VLM for image parsing)
result = cliche.extract(
    file="scanned.pdf",
    model=Endpoint(provider_model="openai/gpt-4o", api_key="sk-..."),
    ocr_model=Endpoint(provider_model="gemini/gemini-3-flash-preview", api_key="...")
)

ocr_model is used for document parsing / VLM refinement passes (e.g. with pdf_image_parser="docling_vlm" or when OCR falls back to a VLM). If not set, the extraction model is also used for OCR.

Endpoint Configuration

The Endpoint object configures any LLM call — on factory(), cliche(), or per extract():

| Field | Type | Description |
| --- | --- | --- |
| provider_model | str | Provider-prefixed model name, e.g. "openai/gpt-4o", "gemini/gemini-3-flash-preview", "ollama/llama3.2" |
| api_key | str | API key for the provider. Omit or leave empty for Ollama. |
| api_base | str | Custom base URL — required for self-hosted Ollama or Azure OpenAI endpoints |
| max_tokens | int | Maximum output tokens (default: 10000) |
| temperature | float | Sampling temperature 0.0–2.0 (default: 0.1 for extraction, 1.0 for OCR) |
| num_retries | int | LLM call retries on transient failures (default: 8) |

# Ollama with a custom host
client = factory(
    mode="local",
    model=Endpoint(
        provider_model="ollama/llama3.2",
        api_base="http://my-gpu-host:11434"
    )
)

Text Input

Extract structured data from raw text (no file needed):

result = cliche.extract(text="Invoice #123, total $50.00, from Acme Corp")