OCR for Indian Scripts — Extract Text from Images and Scanned Documents
How OCR works for Indian language scripts at TranslitHub — supported formats, accuracy by script, digitizing old books and documents, and practical workflows for text extraction.
Somewhere in most Indian households there's a stack of documents — old letters in a grandmother's handwriting, photocopied government forms, newspaper clippings, printed railway receipts — that contain text you need but can't easily work with digitally. You could retype it all. Or you could photograph it and let OCR extract the text for you in seconds.
OCR for Indian scripts is harder than OCR for Latin scripts, and for a long time the quality was poor enough that manual retyping was often genuinely faster. That's changed significantly. TranslitHub includes OCR for Indian scripts that handles printed text — typed or typeset documents — well, and handles handwriting passably for clean, standard scripts.
This guide covers what to expect from Indian language OCR, which scripts work best, how to get clean extraction from different source materials, and practical use cases.
Why Indian Script OCR Is Harder
The basic Latin alphabet has 26 letters, with a few diacritics added in some languages. Most Indian scripts have 40-60 base characters, plus vowel diacritics that appear above, below, before, and after those characters, plus conjunct consonants that merge multiple characters into a single glyph. Devanagari alone has hundreds of distinct conjunct forms in common use.
Add to this:
- The headline stroke (shirorekha) in Devanagari that connects the characters in a word: when printed at low resolution or with slight ink bleed, the headline can merge with the top strokes of characters
- Circular/curved letterforms in Odia, Telugu, Kannada, and Malayalam that require high image quality to distinguish
- The complex ligature system in Malayalam (old orthography) that produces unique joined forms for consonant clusters
- Tamil's curved, open letterforms that look similar to each other at small sizes or with print artifacts
Modern OCR handles most of these with high accuracy when image quality is good. Image quality is the dominant factor: poor scans produce poor results no matter how good the recognition model is.
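To make the conjunct point concrete, here is a small Python sketch using only the standard library. The helper name `describe` is made up for illustration. It shows that a conjunct which renders as one merged glyph is stored as several Unicode code points, and the OCR engine has to map the single shape back to that sequence.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return the Unicode name of every code point in the string."""
    return [unicodedata.name(ch) for ch in text]


# The Devanagari conjunct "क्ष" is drawn as one glyph,
# but it is encoded as KA + VIRAMA + SSA.
conjunct = "\u0915\u094D\u0937"

for name in describe(conjunct):
    print(name)

print(len(conjunct))  # 3 code points behind one visible shape
```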
Supported Input Formats
TranslitHub OCR accepts:
| Format | Notes |
|---|---|
| JPG / JPEG | Standard photo format; 150-300 DPI works well |
| PNG | Lossless compression; better for screenshots and digital images |
| PDF | Extracts text from all pages; uses embedded text if available, falls back to OCR |
| TIFF | High-quality scanned documents; preferred for archival material |
| WebP | Modern web format; supported |
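The PDF row describes a simple fallback rule: use the embedded text layer when there is one, otherwise run OCR. The sketch below illustrates that decision only. The function `text_for_page` and the `run_ocr` callback are hypothetical names, not part of TranslitHub, and the real tool may decide differently (for example per page region rather than per page).

```python
from typing import Callable, Optional


def text_for_page(embedded_text: Optional[str], run_ocr: Callable[[], str]) -> str:
    """Prefer a page's embedded text layer; run OCR only when it is missing or blank."""
    if embedded_text is not None and embedded_text.strip():
        return embedded_text
    return run_ocr()


# A digitally created page has a text layer, so the OCR callback is never called.
print(text_for_page("नमस्ते", lambda: "text from OCR"))

# A scanned page has no text layer, so the OCR callback supplies the text.
print(text_for_page(None, lambda: "text from OCR"))
```

Passing OCR as a callback keeps the expensive step lazy: it only runs for pages that actually need it.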
Accuracy by Script and Condition
Here's an honest breakdown of what to expect:
| Script | Printed/Typeset | Handwritten | Notes |
|---|---|---|---|
| Devanagari (Hindi/Marathi) | Excellent | Fair | Best results for modern fonts; older typefaces need higher DPI |
| Bengali | Very good | Fair | Curved forms need clean scans |
| Tamil | Very good | Poor | Printed Tamil extracts cleanly; handwriting accuracy is low |
| Telugu | Good | Poor | Similar to Tamil |
| Kannada | Good | Poor | Complex circular letterforms need high DPI |
| Malayalam | Good | Poor | Old orthography (used before the 1971 script reform) is less accurate |
| Gujarati | Good | Fair | Related to Devanagari; benefits from same model quality |
| Punjabi (Gurmukhi) | Good | Poor | Printed Gurmukhi works well |
| Odia | Moderate | Poor | Unique rounded letterforms still being improved |
Getting Good Results: Image Quality Guidelines
OCR accuracy is almost entirely determined by image quality. Here's what matters:
Resolution
- Minimum acceptable: 150 DPI for large, clear text (24pt+)
- Recommended for body text: 300 DPI
- For small text or fine details: 400-600 DPI
- Phone photos: Most modern phones shoot at sufficient resolution — the issue is usually focus, not pixel count
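For a photo, DPI is not a camera setting but a ratio: pixels across the page divided by the page's physical width. The following sketch applies the thresholds listed above. The function names and tier labels are illustrative, and it assumes the page fills the full width of the frame; if the page covers only part of the photo, use the page's width in pixels instead.

```python
def effective_dpi(pixels: int, inches: float) -> float:
    """Dots per inch when `pixels` span a physical length of `inches`."""
    if inches <= 0:
        raise ValueError("inches must be positive")
    return pixels / inches


def resolution_tier(dpi: float) -> str:
    """Classify a DPI value using the guideline thresholds from this section."""
    if dpi < 150:
        return "too low"
    if dpi < 300:
        return "large text only"
    if dpi < 400:
        return "body text"
    return "small text"


A4_WIDTH_IN = 8.27  # A4 paper is 210 mm wide

# A 3000-pixel-wide photo that frames a full A4 page edge to edge.
dpi = effective_dpi(3000, A4_WIDTH_IN)
print(round(dpi))            # 363
print(resolution_tier(dpi))  # body text
```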
Lighting and Contrast
- Even lighting is better than dramatic lighting. Avoid shadows across the text.
- High contrast between text and background is crucial. Black ink on white paper is ideal.
- Faded, yellowed, or water-stained paper significantly hurts accuracy.
- Backlit documents (where you can see through the paper) are difficult — if the document is thin, place a black sheet behind it before photographing.
Image Orientation
TranslitHub's OCR automatically deskews (straightens) slightly rotated images, but extreme rotation (more than 10-15 degrees) reduces accuracy. Photograph or scan documents flat.
For phone photography of documents, enable the document scanning mode if your camera app has one — it flattens perspective distortion and applies contrast enhancement automatically.
Compression Artifacts
JPEG compression at low quality settings introduces block artifacts that interfere with small characters and fine strokes. If using JPEG, use high quality settings (80%+ quality). PNG avoids this entirely for screenshots and digital documents.
Uploading and Running OCR
- Open TranslitHub and navigate to the OCR tool (or use the OCR icon in the editor toolbar)
- Upload your image or PDF
- Select the source language — if you're unsure or the document contains multiple scripts, select "Auto-detect"
- Click Extract Text
- The extracted text appears in an editable panel within seconds for images; PDF extraction takes longer based on page count
Auto-Detection vs. Manual Language Selection
The auto-detect option works reliably for standard modern scripts. It analyzes the character shapes in the image to identify the script, then applies the appropriate recognition model. For a document that's clearly in one language, auto-detect is usually correct.
Manual language selection is better when:
- The document is partially faded or damaged (character recognition is ambiguous)
- You have a document that mixes two languages (auto-detect may pick the wrong primary)
- The document uses an uncommon font or typeface (manual language setting helps the model narrow its search)
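After extraction, you can sanity-check the language choice from the text side. The sketch below counts which Unicode block each extracted character falls in and reports the dominant script. This is not how TranslitHub's auto-detect works (that operates on character shapes in the image); it is a separate, assumed post-check, and the function names are illustrative. A result that disagrees with the language you expected is a hint to rerun with manual selection.

```python
from collections import Counter
from typing import Optional

# Unicode block ranges for the scripts covered in this guide.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Odia": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}

# The danda and double danda live in the Devanagari block but are shared
# punctuation across several scripts, so they are not counted.
SHARED_PUNCTUATION = {0x0964, 0x0965}


def script_of(ch: str) -> Optional[str]:
    """Return the script name for one character, or None if it is not counted."""
    cp = ord(ch)
    if cp in SHARED_PUNCTUATION:
        return None
    for name, (low, high) in SCRIPT_RANGES.items():
        if low <= cp <= high:
            return name
    if ch.isascii() and ch.isalpha():
        return "Latin"
    return None


def script_counts(text: str) -> Counter:
    """Count characters per script, ignoring digits, spaces, and punctuation."""
    return Counter(s for s in map(script_of, text) if s is not None)


def dominant_script(text: str) -> Optional[str]:
    """Return the most frequent script in the text, or None if nothing was counted."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else None


print(dominant_script("यह file हिंदी में है"))  # Devanagari
```

Looking at `script_counts` rather than only the winner also shows when a document is genuinely mixed, which is the case where auto-detect is most likely to pick the wrong primary language.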
Use Case: Digitizing Old Handwritten Letters
Old family letters in Hindi or Marathi handwriting are among the most emotionally valuable and practically challenging OCR targets. The challenges:
- Pre-independence and mid-century handwriting often uses letterforms that differ from modern standard forms
- Ink has faded and paper has yellowed
- Writing was dense with minimal spacing between words
- Ligatures and personal script variations don't match any standard typeface
Tips for handwritten documents:
- Photograph in full daylight (outdoors or near a bright window) rather than under artificial light
- Use the highest camera resolution available
- Shoot multiple times and choose the sharpest
- If the document is folded, gently flatten it completely before photographing
Use Case: Newspaper and Magazine Clippings
Printed newspaper content in Indian languages typically extracts very well — 90%+ accuracy for well-preserved clippings. Challenges include:
- Multi-column layouts: The OCR tool handles these by analyzing the layout and extracting text column by column
- Text wrapping around images: Irregular text shapes are harder; select specific text regions for better accuracy
- Low-quality newsprint: Old newspapers, especially from the 1970s-1990s, were printed on low-grade paper and often show ink bleed
A workflow that works well for clippings:
- Scan at 400 DPI (newsprint has fine text)
- Use the region selection tool to define individual articles
- Extract text from each article separately
- Save as TXT or DOCX
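To check an accuracy figure like the 90%+ quoted for clippings against your own material, retype a short passage by hand and compare it with the OCR output. The sketch below assumes accuracy means character-level accuracy, computed as one minus the edit distance divided by the reference length; the guide does not say how its figures were measured, so treat this as one common convention rather than the method behind those numbers.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters recovered, measured in code points."""
    # Normalize both sides so equivalent encodings of a character compare equal.
    reference = unicodedata.normalize("NFC", reference)
    ocr_output = unicodedata.normalize("NFC", ocr_output)
    if not reference:
        raise ValueError("reference text must not be empty")
    distance = edit_distance(reference, ocr_output)
    return max(0.0, 1 - distance / len(reference))


# The OCR output dropped one vowel sign from a 10-code-point reference.
accuracy = character_accuracy("भारत सरकार", "भारत सरकर")
print(f"{accuracy:.0%}")  # 90%
```

Because Indian scripts store vowel signs and the virama as separate code points, a single dropped diacritic counts as one error here even though it may change the whole syllable on the page.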
Use Case: Scanned Government Documents
Government forms, certificates, and official correspondence often need to be digitized for records or further processing. The challenge is that government printing in India is not always high quality — forms are often photocopied multiple times, reducing clarity.
For photocopied government documents:
- If you have a scanner available, scan the document to PDF rather than photographing it
- Select the specific language manually rather than auto-detect
- After extraction, verify key information (names, dates, ID numbers) character by character
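Part of that verification can be scripted. Dates and ID numbers on forms may be printed in native digits (for example Devanagari १५ for 15), so a first step is to normalize digits, then compare the extracted value with the known-good one position by position. This is a sketch using the standard library; the function names and sample values are made up for illustration.

```python
import unicodedata
from itertools import zip_longest


def to_ascii_digits(text: str) -> str:
    """Replace any Unicode decimal digit (Devanagari, Bengali, Tamil, ...) with 0-9."""
    converted = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digit characters
        converted.append(str(value) if value is not None else ch)
    return "".join(converted)


def mismatches(expected: str, extracted: str) -> list[tuple]:
    """List (position, expected_char, extracted_char) wherever the strings differ."""
    return [
        (i, e, x)
        for i, (e, x) in enumerate(zip_longest(expected, extracted))
        if e != x
    ]


print(to_ascii_digits("१५-०८-१९४७"))  # 15-08-1947

# A typical OCR confusion: the letter B read as the digit 8.
print(mismatches("AB1234", "A81234"))  # [(1, 'B', '8')]
```

An empty list from `mismatches` means the field matched exactly; anything else points at the position to recheck against the original document.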
Batch OCR
For processing multiple documents:
- Upload multiple files (JPG, PNG, PDF, or a mix) at once
- Set a common language (or leave on auto-detect)
- Run batch processing
- Download results as a ZIP containing one TXT or DOCX file per input
Batch processing is available for paid accounts. Free accounts process one file at a time.
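Once the ZIP is downloaded, the per-file results can be loaded for further processing with Python's `zipfile` module. The sketch below makes two assumptions that are not stated above: that the TXT files are UTF-8 encoded, and that member names end in `.txt`. DOCX members are skipped because reading them needs a separate library.

```python
import io
import zipfile


def read_txt_results(zip_bytes: bytes) -> dict[str, str]:
    """Map each .txt member of the archive to its decoded text."""
    results = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in sorted(archive.namelist()):
            if name.lower().endswith(".txt"):
                results[name] = archive.read(name).decode("utf-8")
    return results


# Build a small in-memory archive standing in for a downloaded batch result.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("page1.txt", "पहला पृष्ठ")
    archive.writestr("page2.txt", "दूसरा पृष्ठ")
    archive.writestr("notes.docx", b"binary content, skipped by the reader")

texts = read_txt_results(buffer.getvalue())
print(list(texts))  # ['page1.txt', 'page2.txt']
```

For a real download, pass the file's bytes instead, for example `read_txt_results(open("results.zip", "rb").read())`, where `results.zip` is a placeholder file name.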
Converting OCR Output to Transliterated Roman
Once you've extracted text in an Indian script, you might need it in Roman transliteration — for data processing, URL slugs, or sharing with someone who can't read the script. The OCR result feeds directly into TranslitHub's transliteration feature:
- Run OCR to get Indian script text
- Click "Transliterate to Roman" in the editor toolbar
- Get the Roman phonetic equivalent of the extracted text
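For the URL slug use case, the Roman output usually needs one more step: removing diacritics and punctuation. The sketch below assumes the transliteration contains diacritics of the kind used in scholarly romanization (ā, ś, ṭ); the exact scheme TranslitHub produces is not specified here, and `slugify_roman` is an illustrative name.

```python
import re
import unicodedata


def slugify_roman(text: str) -> str:
    """Turn romanized text into a lowercase ASCII slug separated by hyphens."""
    # Decompose accented letters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", stripped.lower()).strip("-")


print(slugify_roman("Bhārat kā Itihās"))  # bharat-ka-itihas
```

Stripping diacritics loses distinctions such as long versus short vowels, which is acceptable for a slug but not for text that has to be converted back to the original script.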
Limitations to Be Aware Of
- Mathematical and tabular content: OCR for tables within Indian language documents is less reliable. Numbers and table structures extract, but column alignment may not be preserved.
- Styled/decorative text: Fancy fonts, drop shadows, text on textures — common in movie posters, invitations, and wedding cards — have poor OCR accuracy.
- Mixed language documents: A document that switches between Hindi and English in the same paragraph is harder than a purely monolingual document.
- Very old scripts: Pre-colonial manuscripts and early printed books often use letterforms that diverge significantly from modern usage.
Related Tools
- Transliteration Editor — edit and format the text after extraction
- Bulk Transliteration Tool — process multiple documents' text at once
- Document Export — save your digitized documents as PDF or Word