# Training
Train custom extraction pipelines optimized for your specific document types — higher accuracy, lower cost at scale.
## Overview
Training creates a custom extraction pipeline that learns from your labeled examples. Instead of relying on a generic LLM prompt, the trained pipeline is optimized for your document type, which typically yields higher accuracy, better consistency, and often lower cost than standard or robust modes.
When to train:
- You have a recurring document type (e.g., always the same invoice format from a vendor)
- Generic extraction gets 80–90% right but you need 95%+
- You process high volumes and want consistent, reproducible results
## Concepts
| Term | Meaning |
|---|---|
| Project | A workspace for a document type (e.g., "Vendor Invoices"). |
| Task | A labeling batch within a project — your training data. |
| Schema | The JSON schema / Pydantic model defining what to extract. See Schemas. |
| Training run | An optimization job that produces an artifact. |
| Artifact | The trained pipeline — a versioned binary with an artifact_id. |
| Deployment | Activating an artifact for a project + environment. |
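Concretely, a schema for the "Vendor Invoices" example above could look like the following JSON-schema fragment. The field names here are illustrative, not prescribed by the service — see Schemas for the supported schema formats:

```python
# Illustrative JSON schema for a "Vendor Invoices" project.
# Field names are examples only; define whatever your documents contain.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "vendor_name": {"type": "string"},
        "total_amount": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
    "required": ["invoice_number", "total_amount"],
}
```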
## Workflow
- Create a project in the web UI at clichefactory.com.
- Define your schema — the fields you want to extract from this document type.
- Upload documents — PDFs, images, DOCX, etc. to your project.
- Label ground truth — manually verify and correct AI-assisted extractions to build training data.
- Start a training run — select your data, choose an optimizer tier, and launch.
- Monitor progress — via the web UI. The web app shows real-time progress, metrics, and results.
- Use the trained model — once training completes, you get an artifact_id. Use it in any extraction call.
## Using Trained Models
Pass the artifact_id from a completed training run to any extraction call:
```python
result = cliche.extract(file="new_invoice.pdf", artifact_id="art_abc123")
```

Or via the HTTP API (the endpoint URL is omitted here):

```bash
curl <extract-endpoint> \
  -H "X-API-KEY: cliche-..." \
  -F "file=@invoice.pdf" \
  -F "artifact_id=art_abc123"
```
In MCP, the LLM passes artifact_id to the extract tool. The trained pipeline knows its data model — the schema is used for local validation only.
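Since the trained pipeline carries its own data model, the client-side schema only needs to check the returned result. A minimal sketch of such a local check, with illustrative field names (in practice you would use your schema library's own validation, e.g. a Pydantic model):

```python
# Minimal local validation sketch. `result` stands in for the dict
# returned by the extract call; field names are illustrative.
REQUIRED_FIELDS = {"invoice_number": str, "total_amount": float}

def validate_locally(result: dict) -> list:
    """Return a list of validation errors (empty list = valid)."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in result:
            errors.append(f"missing field: {field}")
        elif not isinstance(result[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors

print(validate_locally({"invoice_number": "INV-042", "total_amount": 129.5}))  # []
```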
## Optimizer Tiers
The optimizer is selected automatically based on dataset size. Both tiers are available on every BYOK training run.
| Tier | When to Use | What it does |
|---|---|---|
| Tiny | < 75 examples, quick iteration and testing. | Lightweight optimizer pass tuned for low-data regimes. Fast turnaround, useful for validating that the schema and labels make sense before scaling. |
| MIPRO | ≥ 75 examples, production-quality pipelines. | Multi-step prompt + few-shot optimization for production accuracy. Recommended once you have ≥ 75 well-labeled examples. |
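The size threshold above amounts to a one-line rule. Selection happens automatically on the server; this is only a restatement of the documented behavior, not client code:

```python
def select_tier(num_examples: int) -> str:
    """Restates the documented threshold: MIPRO at 75+ labeled examples."""
    return "MIPRO" if num_examples >= 75 else "Tiny"
```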
Pricing: 3000 credits flat per training run, regardless of optimizer tier or dataset size. You also pay your LLM provider directly for tokens consumed by the optimizer.
See Pricing for credit-to-dollar conversion.
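Because the charge is flat, training cost in credits depends only on how many runs you launch, never on dataset size or tier. A trivial restatement (LLM token costs are billed separately by your provider and are not included):

```python
TRAINING_RUN_CREDITS = 3000  # flat rate per run, any tier, any dataset size

def training_credit_cost(num_runs: int) -> int:
    """Total credits for a series of training runs; tokens billed separately."""
    return num_runs * TRAINING_RUN_CREDITS
```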
## API
Training runs are managed through the web app. To use a trained artifact programmatically, pass artifact_id to the extract endpoint — the pipeline resolves its own schema and mode automatically.
## Tips
- Start with 20–30 labeled examples and the Tiny tier for a quick validation.
- Accuracy improves significantly between 50 and 100 examples.
- For documents where errors are costly, combine a trained artifact with robust mode for an additional verification pass.
- You can assign different weights to fields in your schema to prioritize accuracy on must-have fields. High-weight fields are optimized first during training.
- Artifacts are immutable and versioned — you can always roll back to a previous artifact.