Training

Train custom extraction pipelines optimized for your specific document types — higher accuracy, lower cost at scale.

Training currently runs in BYOK mode only. Supply your model provider key (OpenAI, Gemini, or Anthropic) when starting a run. Full-service training is on the roadmap.

Overview

Training creates a custom extraction pipeline that learns from your labeled examples. Instead of relying on a generic LLM prompt, a trained pipeline achieves higher accuracy, greater consistency, and often lower cost than the standard or robust modes.

When to train:

  • You have a recurring document type (e.g., always the same invoice format from a vendor)
  • Generic extraction gets 80–90% right but you need 95%+
  • You process high volumes and want consistent, reproducible results

Concepts

  • Project: A workspace for a document type (e.g., "Vendor Invoices").
  • Task: A labeling batch within a project — your training data.
  • Schema: The JSON schema / Pydantic model defining what to extract. See Schemas.
  • Training run: An optimization job that produces an artifact.
  • Artifact: The trained pipeline — a versioned binary with an artifact_id.
  • Deployment: Activating an artifact for a project + environment.

Workflow

  1. Create a project in the web UI at clichefactory.com.
  2. Define your schema — the fields you want to extract from this document type.
  3. Upload documents — PDFs, images, DOCX, etc. to your project.
  4. Label ground truth — manually verify and correct AI-assisted extractions to build training data.
  5. Start a training run — select your data, choose an optimizer tier, and launch.
  6. Monitor progress — via the web UI. The web app shows real-time progress, metrics, and results.
  7. Use the trained model — once training completes, you get an artifact_id. Use it in any extraction call.

Using Trained Models

Pass the artifact_id from a completed training run to any extraction call:

Python SDK:

cliche = client.cliche(Invoice, artifact_id="art_abc123")
result = cliche.extract(file="new_invoice.pdf")

CLI:

clichefactory extract invoice.pdf --schema schema.json --artifact-id art_abc123

cURL:

curl -X POST "https://api.clichefactory.com/v1/extract" \
  -H "X-API-KEY: cliche-..." \
  -F "file=@invoice.pdf" \
  -F "artifact_id=art_abc123"

In MCP, the LLM passes artifact_id to the extract tool. The trained pipeline knows its data model — the schema is used for local validation only.

Optimizer Tiers

The optimizer is selected automatically based on dataset size. Both tiers are available on every BYOK training run.

  • Tiny (< 75 examples): a lightweight optimizer pass tuned for low-data regimes. Fast turnaround; useful for quick iteration and for validating that the schema and labels make sense before scaling.
  • MIPRO (≥ 75 examples): multi-step prompt + few-shot optimization for production-quality pipelines. Recommended once you have at least 75 well-labeled examples.
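The automatic selection described above reduces to a dataset-size threshold. A minimal sketch — the 75-example cutoff comes from this page, while the function name is our own:

```python
def select_tier(num_labeled_examples: int) -> str:
    """Pick the optimizer tier by dataset size.

    Cutoff per the tier table above: MIPRO at 75+ labeled
    examples, Tiny below that.
    """
    return "MIPRO" if num_labeled_examples >= 75 else "Tiny"

select_tier(30)   # → "Tiny"
select_tier(120)  # → "MIPRO"
```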

Pricing: 3000 credits flat per training run, regardless of optimizer tier or dataset size. You also pay your LLM provider directly for tokens consumed by the optimizer.

See Pricing for credit-to-dollar conversion.
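Because the price is a flat 3000 credits per run, budgeting is simple multiplication. A sketch (the credit figure comes from this page; provider token costs are excluded because they are billed separately):

```python
CREDITS_PER_TRAINING_RUN = 3000  # flat, regardless of tier or dataset size

def training_credits(num_runs: int) -> int:
    """Total clichefactory credits for a series of training runs.

    Excludes LLM provider token costs, which BYOK users pay
    directly to their provider.
    """
    return num_runs * CREDITS_PER_TRAINING_RUN

training_credits(4)  # → 12000
```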

API

Training runs are managed through the web app. To use a trained artifact programmatically, pass artifact_id to the extract endpoint — the pipeline resolves its own schema and mode automatically.

Tips

  • Start with 20–30 labeled examples and the Tiny tier for a quick validation.
  • Accuracy improves significantly between 50 and 100 examples.
  • For documents where errors are costly, combine a trained artifact with robust mode for an additional verification pass.
  • You can assign different weights to fields in your schema to prioritize accuracy on must-have fields. High-weight fields are optimized first during training.
  • Artifacts are immutable and versioned — you can always roll back to a previous artifact.
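This page does not specify how per-field weights are attached to a schema. One plausible shape uses a custom extension key on each property; the "x-weight" key and the weight values below are our assumption, not a documented property:

```python
# Hypothetical: annotate must-have fields with a higher weight so the
# optimizer prioritizes them. The "x-weight" extension key is
# illustrative only.
invoice_schema = {
    "type": "object",
    "properties": {
        "total": {"type": "number", "x-weight": 3},           # must be right
        "invoice_number": {"type": "string", "x-weight": 2},
        "notes": {"type": "string", "x-weight": 1},           # nice to have
    },
}

# Fields sorted highest-weight first, mirroring "high-weight fields
# are optimized first during training".
priority = sorted(
    invoice_schema["properties"],
    key=lambda field: invoice_schema["properties"][field]["x-weight"],
    reverse=True,
)
```

With these weights, `priority` puts `total` ahead of `invoice_number` and `notes`.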