Troubleshooting

Common issues, diagnostic tools, and performance tips.

clichefactory doctor

The doctor command checks your configuration, dependencies, and system binaries. Run it first when something isn't working.

# CLI
clichefactory doctor

# Python SDK
from clichefactory import factory
client = factory(api_key="your-key")
client.doctor()

# MCP
# In Cursor or Claude Desktop, call the doctor tool.
# The MCP server exposes it as a tool with no parameters.

Example Output

Config: ~/.clichefactory/config.toml ✓
  mode: service
  api_key: cliche-...redacted
Dependencies:
  clichefactory: 0.4.2 ✓
  pydantic: 2.9.2 ✓
  docling: 2.31.0 ✓
System binaries:
  tesseract: not found (optional)
  pandoc: 3.1.12 ✓
  soffice: not found (optional)

Common Errors

401 Unauthorized — Invalid API key

Cause: The API key is missing, incorrect, or has been revoked.

Fix: Check that your key starts with cliche- and matches the one in your account settings. Ensure it's set in the right config file or environment variable.
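
The prefix check in the fix above can be run before sending any request. This is a minimal sketch using only the standard library; check_key_shape is an illustrative helper, and CLICHEFACTORY_API_KEY is a hypothetical variable name, since the actual environment variable is not named here. It only inspects the shape of the key and cannot tell whether a key has been revoked.

```python
import os

KEY_PREFIX = "cliche-"  # prefix stated in the fix above


def check_key_shape(key):
    """Return a short diagnosis for common copy/paste problems with a key."""
    if not key:
        return "missing"
    if key != key.strip():
        return "has leading or trailing whitespace"
    if not key.startswith(KEY_PREFIX):
        return f"does not start with {KEY_PREFIX}"
    return "ok"


if __name__ == "__main__":
    # CLICHEFACTORY_API_KEY is a hypothetical name, used for illustration only.
    print(check_key_shape(os.environ.get("CLICHEFACTORY_API_KEY")))
```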

403 Forbidden — Insufficient credits

Cause: Your credit balance is too low for the requested operation.

Fix: Check your balance and top up credits. Each extraction mode has a different per-page cost.

400 Bad Request — Invalid schema

Cause: The JSON schema is malformed or contains unsupported types.

Fix: Validate your schema at jsonschema.dev or check the Schemas reference. Ensure the JSON string is properly escaped if passing via CLI or curl.
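
Two of the failure modes above (malformed JSON and broken shell escaping) can be caught locally. The sketch below uses only the standard library; check_schema_text and shell_safe are illustrative names, and the requirement that the top level be a JSON object is an assumption, not something the service documents here. It does not check for unsupported types.

```python
import json
import shlex


def check_schema_text(text):
    """Parse a schema string and report the first obvious problem, if any."""
    try:
        schema = json.loads(text)
    except json.JSONDecodeError as exc:
        return f"not valid JSON: {exc.msg} at line {exc.lineno}, column {exc.colno}"
    if not isinstance(schema, dict):
        # Assumption: the schema is expected to be a JSON object.
        return "top level must be a JSON object"
    return "ok"


def shell_safe(schema):
    """Serialize a schema dict and quote it as one POSIX shell argument."""
    return shlex.quote(json.dumps(schema))


if __name__ == "__main__":
    print(check_schema_text("{'type': 'object'}"))  # single quotes are not JSON
    print(shell_safe({"type": "object"}))
```

Building the string with json.dumps and quoting it with shlex.quote avoids hand-escaping double quotes when the schema is passed on a command line.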

File not found

Cause: The file path passed to the SDK, CLI, or MCP tool doesn't exist.

Fix: Use an absolute path. For MCP, the AI assistant must pass the full file path — relative paths may resolve against the wrong working directory.
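
One way to apply this is to resolve the path yourself before handing it to any tool. A minimal sketch with pathlib; absolute_existing_path is an illustrative helper, not part of clichefactory.

```python
from pathlib import Path


def absolute_existing_path(raw):
    """Expand ~ and resolve to an absolute path; raise if no file is there."""
    path = Path(raw).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"No file at {path} (from input {raw!r})")
    return path
```

The error message includes the resolved path, which makes it easy to see which working directory a relative path was resolved against.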

Empty or garbled extraction results

Cause: Poor OCR quality from low-resolution scans, complex table layouts, or handwriting.

Fix: Try the enhanced parser for complex documents (local mode). Use robust extraction mode for a verification pass. For scanned documents, ensure the original is at least 200 DPI.
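
The 200 DPI guideline can be checked from a scan's pixel dimensions when the physical page size is known. A rough sketch, assuming you already know both; effective_dpi, MIN_DPI, and is_sharp_enough are illustrative names.

```python
MIN_DPI = 200  # threshold from the fix above


def effective_dpi(width_px, height_px, width_in, height_in):
    """Return the lower of the horizontal and vertical pixels per inch."""
    return min(width_px / width_in, height_px / height_in)


def is_sharp_enough(width_px, height_px, width_in, height_in):
    """True when the scan meets the minimum resolution on both axes."""
    return effective_dpi(width_px, height_px, width_in, height_in) >= MIN_DPI


if __name__ == "__main__":
    # A US Letter page is 8.5 x 11 inches.
    print(effective_dpi(1700, 2200, 8.5, 11))
    print(is_sharp_enough(1275, 1650, 8.5, 11))
```

For example, a US Letter page scanned at 1700 x 2200 pixels works out to 200 DPI, while 1275 x 1650 pixels is only 150 DPI and falls below the guideline.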

BYOK: model API errors

Cause: Your LLM API key is invalid, rate-limited, or the model is unavailable.

Fix: Verify your model API key and check the provider's status page. Run doctor to validate the configuration. See BYOK for supported providers.

MCP: tools not appearing

Cause: The MCP server config is incorrect, or clichefactory-mcp isn't installed.

Fix: Verify the JSON config is valid and the command/args are correct. Restart the IDE after changes. See MCP Setup.
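
A quick local check of the config file can rule out the most common mistakes. The sketch below assumes the mcpServers layout (a command string plus an args list per server) used by Claude Desktop and Cursor for locally launched servers; confirm the exact shape against MCP Setup. check_mcp_config is an illustrative helper.

```python
import json
import shutil


def check_mcp_config(text):
    """Return a list of problems found in an MCP client config string."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(config, dict):
        return ["top level is not a JSON object"]
    # Assumption: servers are listed under the "mcpServers" key.
    servers = config.get("mcpServers")
    if not isinstance(servers, dict) or not servers:
        return ["no mcpServers object found"]
    problems = []
    for name, entry in servers.items():
        command = entry.get("command") if isinstance(entry, dict) else None
        if not isinstance(command, str):
            problems.append(f"{name}: missing command")
        elif shutil.which(command) is None:
            problems.append(f"{name}: command {command!r} not found on PATH")
        if isinstance(entry, dict) and not isinstance(entry.get("args", []), list):
            problems.append(f"{name}: args must be a list")
    return problems
```

An empty list means the file parsed and every listed command resolves on PATH. It does not prove that the IDE has reloaded the config, so a restart is still needed after changes.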

System Dependencies

Some features require system binaries. All of them are optional: the default OCR engine (rapidocr) and the default parser work without any system dependencies.
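
To see which of these binaries are already on your PATH without running doctor, a short standard-library check is enough. This is a sketch, not how doctor itself is implemented; the binary names are the ones shown in the example output above, and find_binaries is an illustrative helper.

```python
import shutil

# Binary names taken from the doctor example output; all three are optional.
OPTIONAL_BINARIES = ("tesseract", "pandoc", "soffice")


def find_binaries(names=OPTIONAL_BINARIES):
    """Map each binary name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_binaries().items():
        print(f"{name}: {path or 'not found (optional)'}")
```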

Tesseract (optional, for the Tesseract OCR engine)

# macOS
brew install tesseract

# Ubuntu / Debian
sudo apt install tesseract-ocr

# Windows (via scoop)
scoop install tesseract

OCR Language Packs (optional, for non-English OCR with Tesseract)

Tesseract needs traineddata files for each language. English is included by default.

# macOS — install all languages
brew install tesseract-lang

# Ubuntu / Debian: install a specific language pack
# (replace <lang> with a language code, e.g. tesseract-ocr-deu for German)
sudo apt install tesseract-ocr-<lang>

OCR engines support multiple languages. See your OCR engine's documentation for available language packs.
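
To confirm which language packs Tesseract can actually see, you can ask the binary itself with its --list-langs flag. A minimal sketch; parse_list_langs and installed_languages are illustrative helpers, and the header line format is assumed from recent Tesseract releases.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first line is a header such as: List of available languages ... (3):
    return [line for line in lines if not line.startswith("List of available")]


def installed_languages():
    """Ask the local tesseract binary which traineddata files it can see."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Some older releases print the list on stderr instead of stdout.
    return parse_list_langs(result.stdout or result.stderr)


if __name__ == "__main__":
    # Raises FileNotFoundError if tesseract is not installed.
    print(installed_languages())
```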

Pandoc (optional, for .doc and .odt files)

# macOS
brew install pandoc

# Ubuntu / Debian
sudo apt install pandoc

LibreOffice (optional, fallback for .doc and .odt)

# macOS
brew install --cask libreoffice

# Ubuntu / Debian
sudo apt install libreoffice

LibreOffice is used as a fallback when Pandoc is not available. It provides the soffice binary for document conversion.

Performance Tips

Goal                Recommendation
Fastest extraction  Use fast mode for one-shot extraction, skipping the full parsing step.
Best accuracy       Use robust mode for a verification pass on high-stakes documents.
Complex scans       Use the enhanced parser (local mode) for documents with complex tables, mixed layouts, or handwriting.
High volume         Use extract_batch / extract-batch with max_concurrency tuned to your rate limits.
Non-English docs    Use Tesseract with the appropriate language pack. See your OCR engine's docs for available packs.
Repeated formats    Train a model (BYOK); trained pipelines are faster and more accurate than generic extraction on recurring document types.
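
To illustrate what a concurrency cap does in the high-volume case, the sketch below limits in-flight work with an asyncio semaphore. It is a generic pattern, not how extract_batch is implemented; run_bounded and fake_extract are hypothetical names, and fake_extract only simulates a request.

```python
import asyncio


async def run_bounded(items, worker, max_concurrency):
    """Run worker(item) for every item, never more than max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # gather preserves input order in its results.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_extract(path):
    """Hypothetical stand-in for one extraction request."""
    await asyncio.sleep(0.01)
    return f"extracted:{path}"


if __name__ == "__main__":
    print(asyncio.run(run_bounded(["a.pdf", "b.pdf", "c.pdf"], fake_extract, 2)))
```

Setting the cap too high tends to trade throughput for rate-limit errors, so it is usually worth starting low and raising it while watching for throttling.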

Getting Help

If you're stuck: