/ OCR · VLM · Document Extraction · Evals That Don't Lie /

Stop hand-checking PDFs every Friday.

parakh is the OCR / extraction evaluation framework you wish existed — field-level metrics, confidence calibration, human-in-the-loop correction, and a one-command CI gate. Self-hosted OSS. Hosted version coming.

Used in production by AI/ML teams across legal · manufacturing · automotive · cement.
/ The problem /

Your OCR is a black box that gaslights you.

/ 01
Vendor benchmarks say 98% accuracy.
Your corpus gets 71%. You find out from a customer.
/ 02
"Eval" = three engineers checking 20 PDFs each Friday.
Now your golden set is a Slack thread that compounds nothing.
/ 03
Model vendor pushes an update.
Field-level accuracy drops 9 points. No alert. No history.
/ 04
Each team rolls their own metric.
Nobody agrees on a number. The CFO loses faith.
/ How it works /

Six commands. One framework. No more Friday hand-checks.

Wire your model

BYO-model. Adapter pattern, ~30 lines. Ships with adapters for all major OCR engines and leading vision-language models — bring whatever you run in prod.
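
This page does not show parakh's actual adapter interface, so the following is only a minimal sketch of what an adapter of roughly that size might look like. The names `Extraction`, `ExtractionAdapter`, `KeyValueTextAdapter`, and `fake_model` are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Extraction:
    """One document's extracted fields, keyed by field name."""
    fields: dict[str, str]


class ExtractionAdapter(Protocol):
    """Hypothetical adapter interface; parakh's real one may differ."""

    def extract(self, document_path: str) -> Extraction: ...


class KeyValueTextAdapter:
    """Toy adapter: wraps any callable that returns 'key: value' text lines."""

    def __init__(self, run_model):
        self.run_model = run_model  # callable: document path -> raw text

    def extract(self, document_path: str) -> Extraction:
        raw = self.run_model(document_path)
        fields = {}
        for line in raw.splitlines():
            key, sep, value = line.partition(":")
            if sep:  # skip lines that are not shaped like 'key: value'
                fields[key.strip()] = value.strip()
        return Extraction(fields)


def fake_model(path: str) -> str:
    # Stand-in for a real OCR / VLM call.
    return "invoice_no: INV-001\ntotal: 1,250.00\nnoise line"


adapter = KeyValueTextAdapter(fake_model)
result = adapter.extract("contract_47.pdf")
print(result.fields)
```

The point of the pattern is that the evaluator only ever sees `extract()`, so swapping one engine for another should not touch the scoring code.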

Point at your golden set

Whatever format you already have. JSONL, CSV, BIO tags, bounding boxes. parakh meets you where you are.
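
As an illustration of the JSONL case, here is a small sketch that reads a golden set into a dictionary. The record layout (`doc` and `fields` keys) is an assumption made for this example, not parakh's documented schema.

```python
import io
import json


def load_golden_jsonl(stream):
    """Read one JSON object per line into {document name: {field: value}}."""
    golden = {}
    for line in stream:
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        record = json.loads(line)
        golden[record["doc"]] = record["fields"]
    return golden


# In practice this would be an open file; StringIO keeps the sketch self-contained.
sample = io.StringIO(
    '{"doc": "contract_47.pdf", "fields": {"party": "Acme Ltd", "date": "2024-03-01"}}\n'
    "\n"
    '{"doc": "invoice_12.pdf", "fields": {"total": "1,250.00"}}\n'
)
golden = load_golden_jsonl(sample)
print(sorted(golden))
```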

Get honest metrics

Field-level character accuracy, exact match, fuzzy match, IoU. Reading order scored separately via taul.
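
To make these metric names concrete, here is a sketch of common textbook definitions. It is not parakh's implementation: character accuracy is taken here as one minus normalized edit distance, fuzzy match uses the standard library's `difflib` ratio with an arbitrary threshold, and boxes are assumed to be `(x0, y0, x1, y1)` tuples.

```python
from difflib import SequenceMatcher


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def char_accuracy(pred: str, gold: str) -> float:
    """1 - edit distance / longer length; two empty strings count as perfect."""
    if not pred and not gold:
        return 1.0
    return 1.0 - edit_distance(pred, gold) / max(len(pred), len(gold))


def exact_match(pred: str, gold: str) -> bool:
    return pred == gold


def fuzzy_match(pred: str, gold: str, threshold: float = 0.9) -> bool:
    """True when difflib's similarity ratio reaches the (assumed) threshold."""
    return SequenceMatcher(None, pred, gold).ratio() >= threshold


def iou(a, b) -> float:
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# A classic OCR confusion: letter O read in place of digit 0.
print(char_accuracy("INV-0O1", "INV-001"), exact_match("INV-0O1", "INV-001"))
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Scoring per field rather than per document is what lets a single misread digit show up as a near miss instead of vanishing into a page-level average.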

Gate your PRs

One YAML block in GitHub Actions. PR fails if accuracy regresses past your threshold. Sleep at night.
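
The gating rule itself is not spelled out on this page. One plausible reading, sketched below, is that the check fails when mean accuracy across the corpus drops below the threshold. `GateResult` and `gate` are hypothetical names, and the real action may aggregate differently.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class GateResult:
    mean_accuracy: float
    failing_docs: list[str]
    exit_code: int  # 0 = pass, 1 = fail, as CI runners expect


def gate(scores: dict[str, float], threshold: float) -> GateResult:
    """Fail when mean per-document accuracy falls below the threshold."""
    if not scores:
        raise ValueError("no documents were scored")
    avg = mean(scores.values())
    failing = sorted(doc for doc, score in scores.items() if score < threshold)
    return GateResult(avg, failing, 0 if avg >= threshold else 1)


result = gate({"contract_47.pdf": 0.94, "invoice_12.pdf": 0.81}, threshold=0.92)
print(f"mean accuracy {result.mean_accuracy:.3f}, below threshold: {result.failing_docs}")
# In a CI script you would finish with: sys.exit(result.exit_code)
```

A non-zero exit code is all a CI runner needs to mark the check as failed.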

Correct what's wrong

Local web UI for review and correction. Your golden set compounds. Slack alerts when drift sets in.

Compare models honestly

Side-by-side on your corpus. Stop picking models because the vendor said so.
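
A side-by-side comparison comes down to scoring every model against the same golden set. The sketch below uses exact-match rate only; the data, the model names, and the `exact_match_rate` helper are invented for illustration.

```python
def exact_match_rate(golden, predicted):
    """Share of golden fields the model reproduced exactly (missing = wrong)."""
    total = hits = 0
    for doc, fields in golden.items():
        for name, gold_value in fields.items():
            total += 1
            hits += predicted.get(doc, {}).get(name) == gold_value
    return hits / total if total else 0.0


golden = {
    "a.pdf": {"party": "Acme Ltd", "date": "2024-03-01"},
    "b.pdf": {"total": "1,250.00"},
}
outputs = {
    "model-a": {
        "a.pdf": {"party": "Acme Ltd", "date": "2024-03-01"},
        "b.pdf": {"total": "1.250.00"},  # wrong thousands separator
    },
    "model-b": {
        "a.pdf": {"party": "Acme Ltd.", "date": "2024-03-01"},  # b.pdf missing
    },
}

# Best model first; every model is scored on the identical corpus.
ranking = sorted(
    ((exact_match_rate(golden, out), name) for name, out in outputs.items()),
    reverse=True,
)
for score, name in ranking:
    print(f"{name}: {score:.2f}")
```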

/ Local /

Run it on your laptop.

$ pip install parakh
$ parakh init my-eval/
$ parakh eval --model your-vlm --config my-eval/config.yaml

Document: contract_47.pdf
  field-level accuracy ......... 0.94
  reading-order accuracy ....... 0.61  ⚠
  worst spans:
    [page 3] col-2 block 4 -> read before col-1 block 7
    [page 5] footnote orphaned (1.2 KB before parent ref)
  recommended: 2-column with explicit footnote linking

15/20 documents pass threshold (0.90).
PR check: FAIL  ✗
/ CI /

Block bad PRs.

# .github/workflows/eval.yml
- uses: sarcascoder/parakh-action@v1
  with:
    model: your-vlm
    config: ./eval/config.yaml
    threshold: 0.92

The OSS package + GitHub Action is free forever. parakh Cloud is the hosted multi-team layer on top — dashboards, history, Slack alerts, RBAC.

/ Pricing /

Free for one team. Paid when you outgrow YAML.

OSS
$0 forever

Self-hosted, single team. Apache-2.0.

  • Field-level metrics
  • Reading-order eval (taul)
  • All model adapters
  • GitHub Action (free tier)
  • Local dashboard
  • Community support
View on GitHub →
Team
$99/month

Small ML teams. Up to 5 users.

  • Everything in OSS
  • Hosted dashboard + history
  • Side-by-side model comparison
  • Slack alerts on regression
  • Email + Discord support
  • EU + US data residency
Join waitlist
Scale
$499/month

Growing teams + RBAC.

  • Everything in Team
  • Up to 25 users + RBAC
  • Audit log
  • Dataset versioning
  • Webhooks + API
  • Priority support
Join waitlist
Enterprise
$1,499+/month

Self-hosted or VPC. Compliance-grade.

  • Self-hosted or VPC deployment
  • SSO + SCIM
  • DPA / BAA available
  • Custom adapters
  • Dedicated Slack channel
  • SLA
Talk to Anupam

Get early access.

First 20 teams free for 3 months. I want feedback more than dollars right now.

/ FAQ /

Honest answers.

How is this different from LangSmith / Braintrust / Humanloop?
Those are LLM-evaluation tools. parakh is built for OCR / VLM / document-extraction, where you need field-level metrics, reading-order scoring, and bounding-box IoU — not chat-quality grading.
Is the OSS actually usable, or is it a crippleware demo?
The OSS is the full single-team product, indefinitely free, Apache-2.0. The hosted version adds multi-team RBAC, history, alerts, and side-by-side comparison.
Which models do you support out of the box?
All major OCR engines and the leading open-source vision-language models. Adding a new one is ~30 lines of adapter code — bring your own.
Can I run this on-prem / air-gapped?
Yes, that's the OSS. The Enterprise tier also lets compliance-bound teams run the hosted dashboard on their own infrastructure.
Do you handle PII?
Hosted: EU + US data residency, SOC-2-friendly architecture, no model traffic touches our infra (we evaluate your own outputs). Enterprise: DPA / BAA available.
Who is behind this?
Anupam Deep Tripathi, Founding AI Engineer at Hashteelab, IIT Tirupati '25. Shipping production OCR / extraction pipelines for legal, manufacturing, automotive, and cement clients. Reimplemented ICLR 2026 TurboQuant from scratch.