Best-in-Class Marathi OCR

Benchmark Results

Independently benchmarked on 2,500 Marathi document line images. Our purpose-built model outperforms the three open-source OCR systems tested (PaddleOCR, EasyOCR, and Tesseract) on character accuracy, word accuracy, exact line matching, and speed.
6.23% Character Error Rate
Lowest error rate of any tested system on the MarathiLine benchmark. PaddleOCR, the next best, scores 8.34%: about 34% more character errors.
21.91% Word Error Rate
About half the word error rate of PaddleOCR (44.69%), the best competitor on this metric. We attribute the gap to our model predicting word boundaries more reliably than the other systems tested.
35.8% Exact Line Match
Nearly 5x the exact match rate of the best competitor (PaddleOCR 7.4%; EasyOCR and Tesseract 5.6% each). Over a third of all lines are recognised perfectly.
124.5 Lines Per Second
13–20x faster than every competitor. Enables processing of large government document archives in hours, not weeks.
MarathiLine 2.5K Benchmark

How We Compare

All models were tested on the same 2,500-line Marathi dataset, balanced across clean printed, degraded, synthetic multi-font, and mixed real sources. Same hardware, same preprocessing, character-level and word-level evaluation.

Character Error Rate (CER)

Lower is better — percentage of incorrectly recognised characters

Incubrain: 6.23%
PaddleOCR: 8.34%
EasyOCR: 15.31%
Tesseract: 16.72%

Word Error Rate (WER)

Lower is better — percentage of incorrectly recognised words

Incubrain: 21.91%
PaddleOCR: 44.69%
EasyOCR: 52.75%
Tesseract: 48.75%

Exact Line Match

Higher is better — percentage of lines recognised perfectly

Incubrain: 35.8%
PaddleOCR: 7.4%
EasyOCR: 5.6%
Tesseract: 5.6%

Throughput (lines/second)

Higher is better — processing speed on the same hardware

Incubrain: 124.5
PaddleOCR: 6.2
EasyOCR: 9.5
Tesseract: 7.7
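
The headline multiples quoted for this benchmark can be rechecked directly from the four sets of figures above. The following is a minimal Python sketch of that arithmetic; the numbers are copied from the comparison figures, and the helper name `relative` is purely illustrative.

```python
# Figures copied from the comparison charts; nothing here is newly measured.
CER = {"Incubrain": 6.23, "PaddleOCR": 8.34, "EasyOCR": 15.31, "Tesseract": 16.72}
WER = {"Incubrain": 21.91, "PaddleOCR": 44.69, "EasyOCR": 52.75, "Tesseract": 48.75}
EXACT = {"Incubrain": 35.8, "PaddleOCR": 7.4, "EasyOCR": 5.6, "Tesseract": 5.6}
LPS = {"Incubrain": 124.5, "PaddleOCR": 6.2, "EasyOCR": 9.5, "Tesseract": 7.7}


def relative(table, baseline="Incubrain"):
    """Map every other system to (its value / the baseline's value)."""
    base = table[baseline]
    return {name: value / base for name, value in table.items() if name != baseline}


# Fraction of additional character errors PaddleOCR makes relative to Incubrain.
extra_errors = relative(CER)["PaddleOCR"] - 1
# Incubrain's WER as a fraction of PaddleOCR's WER.
wer_fraction = 1 / relative(WER)["PaddleOCR"]
# Incubrain's exact-match rate as a multiple of the best competitor's.
exact_gain = 1 / relative(EXACT)["PaddleOCR"]
# Incubrain's throughput as a multiple of each competitor's.
speedups = {name: 1 / ratio for name, ratio in relative(LPS).items()}

print(f"PaddleOCR extra character errors: {extra_errors:.1%}")
print(f"Incubrain WER / PaddleOCR WER: {wer_fraction:.2f}")
print(f"Exact-match multiple vs PaddleOCR: {exact_gain:.1f}x")
print(f"Speedup range: {min(speedups.values()):.1f}x to {max(speedups.values()):.1f}x")
```

On these figures the exact-match multiple over PaddleOCR comes out a little under 5x, and the throughput multiples run from roughly 13x (EasyOCR) to roughly 20x (PaddleOCR).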

Methodology

How the MarathiLine benchmark works — designed to be a robust, independently verifiable Marathi text recognition benchmark.
1. Dataset: MarathiLine 2.5K

2,500 line images drawn from four source types: clean printed documents, degraded archival material, synthetic multi-font renderings, and mixed real sources. Balanced across difficulty levels.

2. Controlled Conditions

All models run on identical hardware with the same image preprocessing. No cherry-picking — every line in the dataset is evaluated.
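
As a rough illustration of what "controlled conditions" can look like in code, here is a minimal Python sketch of an evaluation loop. It is not the benchmark's actual harness: the `run_benchmark` function, the `preprocess` hook, and the shape of the `models` mapping are all assumptions made for this example, as is the choice to apply preprocessing once, outside the timed region.

```python
import time


def run_benchmark(models, images, preprocess=lambda image: image):
    """Run every model over every image and record predictions and throughput.

    models:     dict mapping a model name to a callable(image) -> str (hypothetical interface)
    images:     sequence of line images in any representation the callables accept
    preprocess: one shared preprocessing step applied identically for all models
    """
    # Preprocess once so every model receives exactly the same inputs.
    prepared = [preprocess(image) for image in images]

    report = {}
    for name, recognise in models.items():
        start = time.perf_counter()
        # No filtering or sampling: every line in the dataset is evaluated.
        predictions = [recognise(image) for image in prepared]
        elapsed = time.perf_counter() - start
        report[name] = {
            "predictions": predictions,
            "lines_per_second": len(prepared) / elapsed if elapsed > 0 else float("inf"),
        }
    return report
```

A real run would also pin the hardware, batch size, and warm-up policy, none of which are specified here.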

3. Four Evaluation Metrics

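Character Error Rate, Word Error Rate, Exact Line Match, and Throughput. These are industry-standard metrics: CER and WER are computed with standard edit distance at the character and word levels respectively, Exact Line Match is the share of lines recognised with no errors, and Throughput is the number of lines processed per second.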
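
The three accuracy metrics can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the benchmark's published evaluation script: it uses Levenshtein distance as the "standard edit distance", treats each Unicode code point as one character (so Devanagari vowel signs count separately), splits words on whitespace, applies no Unicode normalisation, and aggregates over the whole corpus rather than averaging per line. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def score(pairs):
    """Compute CER, WER, and exact line match (all in percent) over (reference, prediction) pairs."""
    char_edits = char_total = word_edits = word_total = exact = 0
    for ref, hyp in pairs:
        char_edits += edit_distance(ref, hyp)
        char_total += len(ref)
        ref_words, hyp_words = ref.split(), hyp.split()
        word_edits += edit_distance(ref_words, hyp_words)
        word_total += len(ref_words)
        exact += ref == hyp
    return {
        "cer": 100 * char_edits / char_total,
        "wer": 100 * word_edits / word_total,
        "exact": 100 * exact / len(pairs),
    }


# Two copies of the same reference line: one recognised perfectly,
# one with the space between the two words dropped.
reference = "मराठी भाषा"
pairs = [(reference, reference), (reference, reference.replace(" ", ""))]
metrics = score(pairs)
print(metrics)
```

In this toy example the single dropped space costs one character edit but two word edits, which shows why word-boundary mistakes weigh much more heavily on WER than on CER.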

4. Reproducibility

Dataset, evaluation scripts, and model weights will be published openly. Any researcher or institution can independently verify results.