Python SDK
Complete reference for the ClicheFactory Python SDK — extraction, conversion, batch processing, and trained models.
Installation
```shell
# pip
pip install clichefactory
# with local parsing dependencies (BYOK mode)
pip install "clichefactory[local]"

# uv
uv add clichefactory
uv add "clichefactory[local]"
```
The core package is all you need for service mode. The [local] extra adds OCR engines and parsers for running extraction on your own machine.
Authentication
Service Mode
Pass your ClicheFactory API key directly:
```python
from clichefactory import factory

client = factory(api_key="cliche-...")
```
Local Mode (BYOK)
Provide your own LLM key — extraction runs on your machine, ClicheFactory handles parsing and orchestration:
```python
from clichefactory import Endpoint, factory

client = factory(
    mode="local",
    model=Endpoint(provider_model="gemini/gemini-3-flash-preview", api_key="your-llm-key"),
)
```
Config File
Run the interactive setup to save credentials to ~/.clichefactory/config.toml:
Precedence: explicit factory() args > environment variables > config file > defaults. See Execution Modes for details on local vs. service.
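The precedence order can be sketched as a simple resolver. This is illustrative only, not the SDK's internals, and the environment variable name `CLICHEFACTORY_API_KEY` is an assumption, not a documented setting:

```python
import os

def resolve_api_key(explicit=None, config=None, default=None):
    """Illustrates the documented precedence:
    explicit factory() args > environment variables > config file > defaults."""
    if explicit is not None:
        return explicit
    env = os.environ.get("CLICHEFACTORY_API_KEY")  # assumed env var name
    if env:
        return env
    if config and config.get("api_key"):
        return config["api_key"]
    return default
```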
Extraction
Define a schema, create a cliche (extraction pipeline), and extract. Schemas can be Pydantic models or plain dicts.
With Pydantic
```python
from pydantic import BaseModel

from clichefactory import factory

class Invoice(BaseModel):
    invoice_number: str
    total: float
    vendor: str

client = factory(api_key="cliche-...")
cliche = client.cliche(Invoice)
result = cliche.extract(file="invoice.pdf")
print(result.invoice_number, result.total)
```
With Dict Schema
No Pydantic required — pass a JSON Schema dict:
```python
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "vendor": {"type": "string"}
    }
}

cliche = client.cliche(schema)
result = cliche.extract(file="invoice.pdf")
print(result["total"])
```
See Schemas in Core Concepts for detailed schema patterns including nested objects and arrays.
Extraction Modes
Control the accuracy/cost tradeoff per call. See Core Concepts for what each mode does.
| Mode | Code |
|---|---|
| Default (balanced) | `cliche.extract(file=...)` |
| Fast | `cliche.extract(file=..., mode="fast")` |
| Robust | `cliche.extract(file=..., mode="robust")` |
| Trained | `cliche.extract(file=..., artifact_id="art_xxx")` |
| Robust + Trained | `cliche.extract(file=..., mode="robust-trained", artifact_id="art_xxx")` |
When `artifact_id` is provided, the pipeline mode is defined by the artifact, so you don't need to set `mode` explicitly. For maximum accuracy, use `mode="robust-trained"`, which combines the trained artifact with verification automatically.
Trained Models
Use a trained extraction pipeline by passing the artifact_id from a completed training run:
```python
result = cliche.extract(file="invoice.pdf", artifact_id="art_xxx")
```
The trained pipeline knows its data model — the schema you pass is used for local validation only. See Training → Using Trained Models for the full workflow.
Document Conversion
Convert any supported document to markdown text:
```python
doc = client.to_markdown("document.pdf")
print(doc.get_markdown())

# Fast mode: skip OCR, send file directly to a VLM (service mode only)
doc = client.to_markdown("scan.pdf", conversion_mode="fast")
```
conversion_mode="fast" skips OCR and sends the file directly to a vision model — useful for speed when the document is image-heavy and the LLM can read it natively. Supported values: "default", "fast".
Supports PDF, images, DOCX, XLSX, CSV, EML, and more — see Supported File Types.
Long Documents (chunk + merge)
The single-call extract pipeline is bounded by the LLM context window (roughly 100 pages per file). For longer documents, use extract_long: it converts the file to markdown once, chunks it, extracts each chunk in parallel, and merges the per-chunk results into a single Pydantic model.
```python
result = cliche.extract_long("200_page_contract.pdf")
```
List-valued fields are concatenated across chunks; scalar fields pick the first non-null value by default. Override the merge rule per-field with the resolvers= argument on cliche() or extract_long() — e.g. {"total": "sum_numeric", "line_items": "concat_dedupe_by=sku"}.
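To make the default merge policy concrete, here is a minimal pure-Python sketch of the rules just described, not the SDK's actual implementation: list fields are concatenated across chunks, scalar fields take the first non-null value.

```python
def merge_chunk_results(chunks: list[dict]) -> dict:
    """Sketch of the default extract_long merge policy."""
    merged: dict = {}
    for chunk in chunks:
        for field, value in chunk.items():
            if isinstance(value, list):
                # List-valued fields: concatenate across chunks
                merged.setdefault(field, []).extend(value)
            elif merged.get(field) is None:
                # Scalar fields: first non-null value wins
                merged[field] = value
    return merged
```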
Pass include_chunk_results=True to get a LongExtractionResult with per-chunk outputs, resolution traces, warnings and aggregated cost. v1 limitation: extract_long runs the BYOK one-shot path only; mode="trained" | "robust" is rejected.
Full reference (chunkers, resolver aliases, default policy): SDK README — Long documents.
Batch Operations
Process multiple files concurrently:
```python
# Batch extraction
results = cliche.extract_batch(["inv1.pdf", "inv2.pdf", "inv3.pdf"], max_concurrency=5)

# Batch document conversion
docs = client.to_markdown_batch(["a.pdf", "b.pdf"], max_concurrency=5)
```
max_concurrency controls how many files are processed in parallel. Defaults to 5. All keyword arguments are forwarded to the underlying extract() or to_markdown() call.
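Bounded parallelism like this is typically implemented with a semaphore; a minimal asyncio sketch (illustrative, not the SDK's internals):

```python
import asyncio

async def bounded_gather(func, items, max_concurrency=5):
    """Run an async func over items with at most max_concurrency calls in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def worker(item):
        async with sem:
            return await func(item)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(worker(i) for i in items))
```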
Async API
Every synchronous method has an _async variant for use in async frameworks like FastAPI, asyncio scripts, or Jupyter notebooks:
```python
import asyncio

from clichefactory import factory

client = factory(api_key="cliche-...")
cliche = client.cliche(Invoice)  # Invoice: the Pydantic model defined above

async def main():
    # Single async extraction
    result = await cliche.extract_async(file="invoice.pdf")

    # Batch with concurrency control
    results = await cliche.extract_batch_async(["a.pdf", "b.pdf"], max_concurrency=10)

    # Async document conversion
    doc = await client.to_markdown_async("document.pdf")

asyncio.run(main())
```
| Sync | Async |
|---|---|
| `cliche.extract(...)` | `await cliche.extract_async(...)` |
| `cliche.extract_batch(...)` | `await cliche.extract_batch_async(...)` |
| `client.to_markdown(...)` | `await client.to_markdown_async(...)` |
| `client.to_markdown_batch(...)` | `await client.to_markdown_batch_async(...)` |
All async variants accept the same arguments as their sync counterparts.
Advanced
Partial Results
Return partial data on validation failure instead of raising an error:
```python
from clichefactory import PartialExtraction  # import path assumed: package root

result = cliche.extract(file="messy.pdf", allow_partial=True)

if isinstance(result, PartialExtraction):
    print("Partial result:", result.raw)
    print("Validation errors:", result.validation_errors)
else:
    print("Full result:", result)
```
When validation passes, a normal result is returned. When it fails, PartialExtraction is returned with .raw (the coerced dict) and .validation_errors (list of Pydantic error dicts) so you can inspect and recover what the model extracted.
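As an illustration of working with .validation_errors, here is a hedged sketch (not part of the SDK) that strips the failing top-level fields from .raw. It assumes Pydantic's error-dict shape, where the `loc` tuple starts with the field name:

```python
def recover_valid_fields(raw: dict, validation_errors: list) -> dict:
    """Drop top-level fields named in Pydantic-style error dicts; keep the rest."""
    failing = {err["loc"][0] for err in validation_errors if err.get("loc")}
    return {k: v for k, v in raw.items() if k not in failing}
```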
Include Document & Costs
Get the parsed document or cost breakdown alongside the extraction result:
```python
# Get the parsed document alongside the result
result, doc = cliche.extract(file="invoice.pdf", include_doc=True)
print(doc.get_markdown())  # inspect what the model saw

# Get a cost breakdown
result = cliche.extract(file="invoice.pdf", include_costs=True)
```
When include_doc=True, the return value is a (result, doc) tuple. doc.get_markdown() returns the parsed document text that was sent to the LLM — useful for debugging extraction quality.
Postprocess Hooks
Transform extraction results before they're returned:
```python
def fix_dates(result: dict) -> dict:
    # parse_european_date: your own helper function
    result["date"] = parse_european_date(result["date"])
    return result

cliche = client.cliche(Invoice, postprocess=fix_dates)
```
Pipeline order: raw LLM dict → system coerce (EU decimals, currency, accounting negatives) → postprocess → Pydantic validation. The function receives and must return a dict.
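To make the ordering concrete, a toy version of the pipeline (the coercion rule shown is illustrative, not the SDK's exact behavior):

```python
def coerce(raw: dict) -> dict:
    """Toy system-coerce step: normalize an EU-formatted decimal like '1.234,56'."""
    out = dict(raw)
    total = out.get("total")
    if isinstance(total, str):
        out["total"] = float(total.replace(".", "").replace(",", "."))
    return out

def run_pipeline(raw: dict, postprocess) -> dict:
    # Documented order: raw LLM dict -> system coerce -> postprocess -> validation
    return postprocess(coerce(raw))
```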
Parser & OCR Configuration (Local Mode)
Fine-tune parsing at the client or cliche level via ParsingOptions. ParsingOptions has no effect on service-mode extraction. See Processing & OCR for background.
PDF image parser — controls how image-based PDF pages are processed:
```python
from clichefactory import ParsingOptions, factory

# Default: Docling OCR + table structure recognition
client = factory(..., parsing=ParsingOptions(pdf_image_parser="docling"))

# Enhanced: Docling OCR + per-page VLM refinement for complex layouts
client = factory(..., parsing=ParsingOptions(pdf_image_parser="docling_vlm"))

# vision_layout: service-only, uses proprietary layout model
client = factory(..., parsing=ParsingOptions(pdf_image_parser="vision_layout"))
```
docling_vlm requires ocr_model to be configured (it uses a VLM per page). vision_layout is service-only and will raise an error in local mode.
PDF OCR engine — the engine used when OCR is applied to PDF pages:
```python
# rapidocr (default): pure Python, no system dependencies
client = factory(..., parsing=ParsingOptions(pdf_ocr_engine="rapidocr"))

# tesseract: higher accuracy, requires the tesseract binary
client = factory(..., parsing=ParsingOptions(pdf_ocr_engine="tesseract", pdf_ocr_lang="deu"))

# easyocr: GPU-accelerated, downloads models on first use
client = factory(..., parsing=ParsingOptions(pdf_ocr_engine="easyocr"))
```
Image parser — controls OCR for standalone image files (PNG, JPG, etc.):
| Value | Description | Default |
|---|---|---|
| `"rapidocr"` | Pure Python OCR, no system deps | Yes |
| `"pytesseract"` | Tesseract-backed, requires tesseract binary | — |
| `"docling"` | Docling layout-aware OCR | — |
| `"ocr_llm"` | Send image directly to VLM (requires `ocr_model`) | — |
```python
client = factory(
    ...,
    parsing=ParsingOptions(image_parser="pytesseract", image_parser_lang="deu")
)
```
Full ParsingOptions reference:
| Field | Type | Default | Description |
|---|---|---|---|
| `pdf_image_parser` | `"docling"` \| `"docling_vlm"` \| `"vision_layout"` | `"docling"` | Parser for image-based PDF pages |
| `pdf_ocr_engine` | `"rapidocr"` \| `"tesseract"` \| `"easyocr"` | `"rapidocr"` | OCR engine for PDF pages |
| `pdf_ocr_lang` | `str` | `"eng"` | Language code for PDF OCR (e.g. `"deu"`, `"fra"`) |
| `pdf_fallback_to_ocr_llm` | `bool` | `True` | Fall back to VLM when OCR confidence is low |
| `pdf_structured_fallback_to_image` | `bool` | `False` | Re-process native-text PDF pages through the image pipeline |
| `use_ocr_llm_body` | `bool` | `True` | Use VLM for body text in addition to tables |
| `image_parser` | `"rapidocr"` \| `"pytesseract"` \| `"docling"` \| `"ocr_llm"` | `"rapidocr"` | Parser for standalone image files |
| `image_parser_fallback` | `bool` | `True` | Fall back to VLM if image OCR confidence is low |
| `image_parser_lang` | `str` | `"eng"` | Language code for image OCR |
Model & OCR Model Override Per Call
Use a different LLM for a specific extraction, or route OCR to a separate (cheaper) model:
```python
from clichefactory import Endpoint

# Override the extraction model for one call
result = cliche.extract(
    file="contract.pdf",
    model=Endpoint(provider_model="openai/gpt-4o", api_key="sk-...")
)

# Use a separate model for OCR (e.g. a cheaper VLM for image parsing)
result = cliche.extract(
    file="scanned.pdf",
    model=Endpoint(provider_model="openai/gpt-4o", api_key="sk-..."),
    ocr_model=Endpoint(provider_model="gemini/gemini-3-flash-preview", api_key="...")
)
```
ocr_model is used for document parsing / VLM refinement passes (e.g. with pdf_image_parser="docling_vlm" or when OCR falls back to a VLM). If not set, the extraction model is also used for OCR.
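The fallback rule reduces to a one-line selection; an illustrative sketch, not SDK code:

```python
def pick_ocr_model(model, ocr_model=None):
    """If ocr_model is not set, the extraction model is also used for OCR."""
    return ocr_model if ocr_model is not None else model
```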
Endpoint Configuration
The Endpoint object configures any LLM call — on factory(), cliche(), or per extract():
| Field | Type | Description |
|---|---|---|
provider_model | str | Provider-prefixed model name, e.g. "openai/gpt-4o", "gemini/gemini-3-flash-preview", "ollama/llama3.2" |
api_key | str | API key for the provider. Omit or leave empty for Ollama. |
api_base | str | Custom base URL — required for self-hosted Ollama or Azure OpenAI endpoints |
max_tokens | int | Maximum output tokens (default: 10000) |
temperature | float | Sampling temperature 0.0–2.0 (default: 0.1 for extraction, 1.0 for OCR) |
num_retries | int | LLM call retries on transient failures (default: 8) |
```python
from clichefactory import Endpoint, factory

# Self-hosted Ollama: no API key needed, custom base URL required
client = factory(
    mode="local",
    model=Endpoint(
        provider_model="ollama/llama3.2",
        api_base="http://my-gpu-host:11434"
    )
)
```
Text Input
Extract structured data from raw text (no file needed):