Benchmark Results
How We Compare
Character Error Rate (CER)
Lower is better — percentage of incorrectly recognised characters
Word Error Rate (WER)
Lower is better — percentage of incorrectly recognised words
Exact Line Match
Higher is better — percentage of lines recognised perfectly
Throughput (lines/second)
Higher is better — processing speed on the same hardware
Methodology
Dataset — MarathiLine 2.5K
2,500 line images covering four source types: clean printed text, degraded archival material, synthetic multi-font renderings, and mixed real-world scans. The set is balanced across difficulty levels.
Controlled Conditions
All models run on identical hardware with the same image preprocessing. No cherry-picking — every line in the dataset is evaluated.
Four Evaluation Metrics
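Character Error Rate, Word Error Rate, Exact Line Match, and Throughput. CER and WER are industry-standard metrics computed from standard edit distance at the character and word level, respectively. Exact Line Match counts lines whose output equals the ground truth, and Throughput measures lines processed per second.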
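As a rough illustration of how the three accuracy metrics can be computed, here is a minimal Python sketch. It is not the project's evaluation script, which has not been published yet. The function names (edit_distance, score_lines) are hypothetical, and several choices are assumptions: Levenshtein distance with unit costs as the "standard edit distance", characters counted as Unicode code points (a Devanagari-aware evaluation might count grapheme clusters instead), words split on whitespace, and errors pooled over the whole dataset rather than averaged per line.

```python
from typing import Dict, List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, substitutions all cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def score_lines(refs: List[str], hyps: List[str]) -> Dict[str, float]:
    """Return CER, WER and exact line match as percentages over all lines."""
    if len(refs) != len(hyps) or not refs:
        raise ValueError("refs and hyps must be non-empty and the same length")

    char_errors = char_total = word_errors = word_total = exact = 0
    for ref, hyp in zip(refs, hyps):
        # Assumption: characters are Unicode code points, not grapheme clusters.
        char_errors += edit_distance(ref, hyp)
        char_total += len(ref)
        # Assumption: words are whitespace-separated tokens.
        ref_words, hyp_words = ref.split(), hyp.split()
        word_errors += edit_distance(ref_words, hyp_words)
        word_total += len(ref_words)
        exact += int(ref == hyp)

    if char_total == 0 or word_total == 0:
        raise ValueError("reference text must contain at least one word")

    return {
        "cer": 100.0 * char_errors / char_total,
        "wer": 100.0 * word_errors / word_total,
        "exact_line_match": 100.0 * exact / len(refs),
    }


if __name__ == "__main__":
    # Toy input only; these are not benchmark results.
    print(score_lines(["abc def", "xyz"], ["abd def", "xyz"]))
```

Because errors are divided by the reference length, CER and WER computed this way can exceed 100% when a model inserts a lot of spurious text. Pooling errors across the dataset weights long lines more heavily than a per-line average would.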
Reproducibility
Dataset, evaluation scripts, and model weights will be published openly. Any researcher or institution can independently verify results.
Coming Soon
Model on Hugging Face
Pre-trained Marathi OCR model weights — download, fine-tune, or deploy. Coming soon.
MarathiLine Benchmark
The full 2,500-line evaluation dataset with ground truth — for independent verification. Coming soon.
Review Pipeline
See how documents flow from raw scans to verified, AI-ready text through our quality control system.