Instead of building generalist models, we fine-tune for high-value domains — starting with 200,000+ Government Resolutions. Purpose-built tools that outperform general-purpose AI on the problems that matter most.
Enhancing India's AI ecosystem
The problem
Millions of documents across government, judiciary, agriculture, and education sit inaccessible: stored as scanned images, trapped in legacy font encodings, or digital but unstructured. General-purpose AI tools fail on Marathi. Quality Indic data is the acknowledged global bottleneck.
Decades of government records, court documents, and manuscripts exist only as scanned images. Full OCR is required. Generic tools produce 8-17% error rates on Devanagari.
Documents look like Devanagari but use proprietary ASCII mappings — Shree Dev, Kruti Dev, Shusha. Text extraction produces garbled output. Each font family has its own mapping; a conversion sketch follows below.
Even natively digital documents are not searchable, indexed, or cross-referenced. Without entity extraction and metadata, they are invisible to analysis.
At the Delhi AI Summit, OpenAI, Google, and Sarvam identified quality Indic training data as the single biggest barrier to multilingual AI performance.
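To make the legacy-encoding problem concrete, here is a minimal sketch of how such text can be converted back to Unicode Devanagari, assuming a per-font lookup table. The table entries and the function name are illustrative placeholders, not the real Shree Dev or Kruti Dev tables, which run to hundreds of glyph rules per font.

```python
# Illustrative placeholder rules only -- NOT the actual Kruti Dev mapping.
KRUTI_DEV_MAP = {
    "kk": "क्क",   # placeholder conjunct rule
    "k": "क",      # placeholder single-glyph rule
    "s": "ा",      # placeholder matra rule
}

def convert_legacy(text: str, mapping: dict[str, str]) -> str:
    """Convert legacy font-encoded text to Unicode via longest-match lookup."""
    keys = sorted(mapping, key=len, reverse=True)   # try longer sequences first
    out, i = [], 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(mapping[k])
                i += len(k)
                break
        else:
            out.append(text[i])   # pass through unmapped characters
            i += 1
    return "".join(out)
```

A real converter also has to reorder glyphs, for example the short-i matra that legacy fonts place before the consonant but Unicode places after it, which is why each font family needs its own carefully built table.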
The approach
One model trying to handle Marathi alongside English, Hindi, and Tamil will always compromise. Instead, we build focused models — each one deeply tuned to a single domain's script, layout, and vocabulary.
Generalist model
Adequate on all. Excellent at none.
Specialist — GR model
Purpose-built for one thing. Excellent at it.
Domain-specific AI
Instead of building one model for every language, we target specific domains where precision matters most — then fine-tune until we outperform every general-purpose alternative.
The platform
Purpose-built for Devanagari script — shirorekha, matras, conjuncts. Fine-tuned per domain for accuracy that generalist models cannot match.
AI pre-verifies every line. High-confidence output is auto-promoted. Uncertain cases go to trained Marathi-fluent reviewers via a purpose-built application; the routing logic is sketched below.
Raw documents become searchable databases with entity extraction, metadata tagging, and cross-referencing. Open source, state-owned, no vendor lock-in.
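The auto-promote split comes down to a per-line confidence threshold. A minimal sketch, assuming the OCR engine reports a score for each line; the threshold value and names are illustrative, not the platform's actual configuration.

```python
from dataclasses import dataclass

# Illustrative cut-off; in practice this would be tuned against reviewer corrections.
AUTO_PROMOTE_THRESHOLD = 0.95

@dataclass
class OcrLine:
    text: str
    confidence: float   # per-line score from the OCR model, 0.0 to 1.0

def route(lines: list[OcrLine]) -> tuple[list[OcrLine], list[OcrLine]]:
    """Split OCR output into auto-promoted lines and lines queued for human review."""
    promoted = [l for l in lines if l.confidence >= AUTO_PROMOTE_THRESHOLD]
    review_queue = [l for l in lines if l.confidence < AUTO_PROMOTE_THRESHOLD]
    return promoted, review_queue

# Example: the first line is promoted, the second goes to a reviewer.
promoted, review_queue = route([
    OcrLine("महाराष्ट्र शासन", 0.99),
    OcrLine("क्र. संकीर्ण-२०२४", 0.71),
])
```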
What you actually get
Generic OCR stops at "extract text." We keep going — through structure, entity extraction, and a queryable knowledge graph. Each layer unlocks something the layer below cannot; a sketch of the graph layer follows the list below.
The starting point — a PDF or scanned page, image-only, untouched.
Characters extracted from Devanagari, legacy encodings, and mixed-source documents.
Headers, body, tables, stamps — document anatomy preserved, not flattened.
Departments, officials, laws, dates, budgets — every meaningful thing named and tagged.
Every GR becomes a node; every entity becomes a connection. Queryable, traceable, public.
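To illustrate that final layer, here is a sketch of how a verified GR could become a graph node linked to the entities it mentions. The field names and entity kinds are assumptions for illustration, not the platform's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    kind: str    # e.g. "department", "official", "law", "date", "budget"
    value: str

@dataclass
class GRNode:
    gr_id: str                       # the GR's reference number
    text: str                        # verified, structured text
    entities: list[Entity] = field(default_factory=list)

def link(graph: dict[str, set[str]], gr: GRNode) -> None:
    """Add an edge from every mentioned entity (keyed as kind:value) to the GR."""
    for e in gr.entities:
        graph.setdefault(f"{e.kind}:{e.value}", set()).add(gr.gr_id)

# Query example: every GR that names a given department becomes a single lookup.
graph: dict[str, set[str]] = {}
link(graph, GRNode("GR-2024-0113", "(verified text)", [Entity("department", "Water Resources")]))
related = graph.get("department:Water Resources", set())
```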
The next generation of Marathi AI starts with data that's accurate, verified, and owned by the state.