Stop hand-checking PDFs every Friday.
parakh is the OCR / extraction evaluation framework you wish existed — field-level metrics, confidence calibration, human-in-the-loop correction, and a one-command CI gate. Self-hosted OSS. Hosted version coming.
Your OCR is a black box that gaslights you.
Six commands. One framework. No more Friday hand-checks.
Wire your model
BYO-model. Adapter pattern, ~30 lines. Ships with adapters for all major OCR engines and leading vision-language models — bring whatever you run in prod.
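For a feel of what an adapter of that size might look like, here is a minimal sketch. It is not parakh's actual interface: `ExtractionAdapter`, `CallableAdapter`, `run_model` and `fake_model` are hypothetical names, and the only assumption is that an adapter turns a document path into a flat dict of field strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class ExtractionAdapter(Protocol):
    """Hypothetical interface (not parakh's real API): document path in, fields out."""

    def extract(self, document_path: str) -> Dict[str, str]: ...


@dataclass
class CallableAdapter:
    """Wraps any function that returns raw model output for a document."""

    run_model: Callable[[str], Dict[str, object]]

    def extract(self, document_path: str) -> Dict[str, str]:
        raw = self.run_model(document_path)
        # Normalize: drop fields the model left empty, stringify and trim the rest.
        return {
            name: str(value).strip()
            for name, value in raw.items()
            if value is not None
        }


def fake_model(path: str) -> Dict[str, object]:
    """Stand-in for whatever OCR engine or VLM runs in prod (illustrative output)."""
    return {"invoice_no": " INV-001 ", "total": 1250.5, "po_number": None}


adapter = CallableAdapter(run_model=fake_model)
fields = adapter.extract("contract_47.pdf")
print(fields)
```

Swapping models then means swapping the function passed as `run_model`; the evaluation side only ever sees the normalized dict.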
Point at your golden set
Whatever format you already have. JSONL, CSV, BIO tags, bounding boxes. parakh meets you where you are.
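As an illustration of the JSONL case, the sketch below reads a golden set into a dict keyed by document. The record schema (`doc_id`, `fields`) and the sample values are assumptions made up for this example, not a format parakh prescribes.

```python
import io
import json
from typing import Dict, TextIO


def load_golden_jsonl(stream: TextIO) -> Dict[str, Dict[str, str]]:
    """Read one JSON record per line; the schema used here is an assumption."""
    golden: Dict[str, Dict[str, str]] = {}
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        golden[record["doc_id"]] = record["fields"]
    return golden


# Illustrative records only; a real golden set would be read from a file.
sample = io.StringIO(
    '{"doc_id": "contract_47.pdf", "fields": {"party": "Acme Ltd", "date": "2024-01-31"}}\n'
    "\n"
    '{"doc_id": "contract_48.pdf", "fields": {"party": "Globex", "date": "2024-02-15"}}\n'
)
golden = load_golden_jsonl(sample)
print(sorted(golden))
```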
Get honest metrics
Field-level character accuracy, exact match, fuzzy match, IoU. Reading order scored separately via taul.
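To make those metric names concrete, here is one common way to compute each of them, using only the Python standard library. The definitions are assumptions for illustration (character accuracy as one minus normalized edit distance, fuzzy match as a `difflib` similarity ratio above a cutoff, boxes as `(x1, y1, x2, y2)`); parakh's own definitions may differ.

```python
from difflib import SequenceMatcher
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed convention


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_accuracy(pred: str, gold: str) -> float:
    """1 minus edit distance normalized by the longer string (one possible definition)."""
    if not pred and not gold:
        return 1.0
    return 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold))


def exact_match(pred: str, gold: str) -> bool:
    return pred == gold


def fuzzy_match(pred: str, gold: str, threshold: float = 0.9) -> bool:
    """True when the difflib similarity ratio reaches the (assumed) cutoff."""
    return SequenceMatcher(None, pred, gold).ratio() >= threshold


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# A classic OCR slip: letter O read in place of digit 0.
print(char_accuracy("INV-0O1", "INV-001"))
print(exact_match("INV-0O1", "INV-001"))
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

The example shows why exact match alone is too blunt: a single confused character fails it outright, while character accuracy and fuzzy match still register how close the prediction was.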
Gate your PRs
One YAML block in GitHub Actions. PR fails if accuracy regresses past your threshold. Sleep at night.
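The gate itself boils down to a comparison and an exit code. The sketch below assumes the check is on mean accuracy across documents; that rule, the `gate` function and the sample scores are illustrative and not taken from the parakh action.

```python
from statistics import mean
from typing import Dict, Tuple


def gate(scores: Dict[str, float], threshold: float) -> Tuple[bool, int]:
    """Return (overall pass, number of documents at or above the threshold).

    Assumption: the PR passes when the mean score meets the threshold.
    """
    passing = sum(1 for score in scores.values() if score >= threshold)
    ok = mean(scores.values()) >= threshold
    return ok, passing


# Illustrative per-document scores.
scores = {"a.pdf": 0.97, "b.pdf": 0.95, "c.pdf": 0.70}
ok, passing = gate(scores, threshold=0.92)
exit_code = 0 if ok else 1  # a non-zero exit code is what fails a CI step

print(f"{passing}/{len(scores)} documents pass threshold (0.92).")
print("PR check:", "PASS" if ok else "FAIL")
```

In a CI job the script would finish with `sys.exit(exit_code)`, which is how any GitHub Actions step reports failure.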
Correct what's wrong
Local web UI for review and correction. Your golden set compounds. Slack alerts when drift sets in.
Compare models honestly
Side-by-side on your corpus. Stop picking models because the vendor said so.
Run it on your laptop.
$ pip install parakh
$ parakh init my-eval/
$ parakh eval --model your-vlm --config my-eval/config.yaml
Document: contract_47.pdf
field-level accuracy ......... 0.94
reading-order accuracy ....... 0.61 ⚠
worst spans:
  [page 3] col-2 block 4 -> read before col-1 block 7
  [page 5] footnote orphaned (1.2 KB before parent ref)
recommended: 2-column with explicit footnote linking
15/20 documents pass threshold (0.90).
PR check: FAIL ✗
Block bad PRs.
# .github/workflows/eval.yml
- uses: sarcascoder/parakh-action@v1
  with:
    model: your-vlm
    config: ./eval/config.yaml
    threshold: 0.92
The OSS package + GitHub Action is free forever. parakh Cloud is the hosted multi-team layer on top: dashboards, history, Slack alerts, RBAC.
Free for one team. Paid when you outgrow YAML.
Self-hosted, single team. Apache-2.0.
- Field-level metrics
- Reading-order eval (taul)
- All model adapters
- GitHub Action (free tier)
- Local dashboard
- Community support
Small ML teams. Up to 5 users.
- Everything in OSS
- Hosted dashboard + history
- Side-by-side model comparison
- Slack alerts on regression
- Email + Discord support
- EU + US data residency
Growing teams + RBAC.
- Everything in Team
- Up to 25 users + RBAC
- Audit log
- Dataset versioning
- Webhooks + API
- Priority support
Self-hosted or VPC. Compliance-grade.
- Self-hosted or VPC deployment
- SSO + SCIM
- DPA / BAA available
- Custom adapters
- Dedicated Slack channel
- SLA
Get early access.
First 20 teams free for 3 months. I want feedback more than dollars right now.
Honest answers.
- How is this different from LangSmith / Braintrust / HumanLoop?
- Those are LLM-evaluation tools. parakh is built for OCR / VLM / document-extraction, where you need field-level metrics, reading-order scoring, and bounding-box IoU — not chat-quality grading.
- Is the OSS actually usable, or is it a crippleware demo?
- The OSS is the full single-team product, indefinitely free, Apache-2.0. The hosted version adds multi-team RBAC, history, alerts, and side-by-side comparison.
- Which models do you support out of the box?
- All major OCR engines and the leading open-source vision-language models. Adding a new one is ~30 lines of adapter code — bring your own.
- Can I run this on-prem / air-gapped?
- Yes, that's the OSS. The Enterprise tier adds a self-hosted deployment of the hosted dashboard for compliance-bound teams.
- Do you handle PII?
- Hosted: EU + US data residency, SOC-2-friendly architecture, no model traffic touches our infra (we evaluate your own outputs). Enterprise: DPA / BAA available.
- Who is behind this?
- Anupam Deep Tripathi, Founding AI Engineer at Hashteelab, IIT Tirupati '25. Ships production OCR / extraction pipelines for legal, manufacturing, automotive, and cement clients. Reimplemented ICLR 2026 TurboQuant from scratch.