March 26, 2026 · 9 min read

OCR for Indian Scripts — Extract Text from Images and Scanned Documents

How OCR works for Indian language scripts at TranslitHub — supported formats, accuracy by script, digitizing old books and documents, and practical workflows for text extraction.

Tags: OCR, image to text, Indian scripts, scanning, digitization

Somewhere in most Indian households there's a stack of documents — old letters in a grandmother's handwriting, photocopied government forms, newspaper clippings, printed railway receipts — that contain text you need but can't easily work with digitally. You could retype it all. Or you could photograph it and let OCR extract the text for you in seconds.

OCR for Indian scripts is harder than OCR for Latin scripts, and for a long time the quality was poor enough that manual retyping was often genuinely faster. That's changed significantly. TranslitHub includes OCR for Indian scripts that handles printed text — typed or typeset documents — well, and handles handwriting passably for clean, standard scripts.

This guide covers what to expect from Indian language OCR, which scripts work best, how to get clean extraction from different source materials, and practical use cases.

Why Indian Script OCR Is Harder

The basic Latin alphabet has 26 letters, plus a few diacritics in some languages. Most Indian scripts have 40-60 base characters, plus vowel diacritics that appear above, below, before, and after those characters, plus conjunct consonants that merge multiple characters into a single glyph. Devanagari alone has hundreds of distinct conjunct forms in common use.

Add to this:

The headline stroke (shirorekha) in Devanagari that connects all characters in a word: when printed at low resolution or with slight ink bleed, the headline can merge with the top strokes of characters

Circular/curved letterforms in Odia, Telugu, Kannada, and Malayalam that require high image quality to distinguish

The complex ligature system in Malayalam (old orthography) that produces unique joined forms for consonant clusters

Tamil's curved, open letterforms that look similar to each other at small sizes or with print artifacts

Modern OCR handles most of these with high accuracy when image quality is good. Image quality is the dominant factor: poor scans produce poor results no matter how good the OCR engine is.
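To make the conjunct point concrete, here is a small Python sketch (an illustration, not part of TranslitHub) that lists the Unicode code points behind a single Devanagari conjunct. The helper name `describe` is made up for this example; it relies only on the standard `unicodedata` module.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# The conjunct "क्ष" renders as one glyph but is stored as three code points:
# KA + VIRAMA + SSA.
conjunct = "\u0915\u094D\u0937"

for code_point, name in describe(conjunct):
    print(code_point, name)
```

An OCR engine has to map one printed shape back to this whole sequence, which is part of why conjunct-heavy text is harder to recognize than Latin text.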

Supported Input Formats

TranslitHub OCR accepts:

Format	Notes
JPG / JPEG	Standard photo format; 150-300 DPI works well
PNG	Lossless compression; better for screenshots and digital images
PDF	Extracts text from all pages; uses embedded text if available, falls back to OCR
TIFF	High-quality scanned documents; preferred for archival material
WebP	Modern web format; supported

Maximum file size: 20 MB per image, 50 MB per PDF. For PDFs longer than 30 pages, use the batch processing option to avoid timeout issues.
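If you are preparing many files, a quick local check against these limits can save failed uploads. The sketch below is a hypothetical helper, not a TranslitHub API: the function name `check_upload`, the extension list (including `.tif`), and the choice of 1 MB = 1,048,576 bytes are all assumptions made for illustration.

```python
from pathlib import Path

MB = 1024 * 1024  # assumption: 1 MB = 1,048,576 bytes

# Extensions assumed to correspond to the formats in the table above.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".webp"}
IMAGE_LIMIT = 20 * MB
PDF_LIMIT = 50 * MB


def check_upload(filename, size_bytes):
    """Return (ok, reason) for a file name and its size in bytes."""
    ext = Path(filename).suffix.lower()
    if ext == ".pdf":
        limit = PDF_LIMIT
    elif ext in IMAGE_EXTENSIONS:
        limit = IMAGE_LIMIT
    else:
        return False, f"unsupported format: {ext or '(no extension)'}"
    if size_bytes > limit:
        return False, f"file is over the {limit // MB} MB limit"
    return True, "ok"


print(check_upload("letter.tiff", 5 * MB))
print(check_upload("old_book.pdf", 60 * MB))
```

For a real file on disk, `Path(path).stat().st_size` gives the size in bytes to pass in.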

Accuracy by Script and Condition

Here's an honest breakdown of what to expect:

Script	Printed/Typeset	Handwritten	Notes
Devanagari (Hindi/Marathi)	Excellent	Fair	Best results for modern fonts; older typefaces need higher DPI
Bengali	Very good	Fair	Curved forms need clean scans
Tamil	Very good	Poor	Printed Tamil extracts cleanly; handwriting accuracy is low
Telugu	Good	Poor	Similar to Tamil
Kannada	Good	Poor	Complex circular letterforms need high DPI
Malayalam	Good	Poor	Old orthography (before the 1971 script reform) is less accurate
Gujarati	Good	Fair	Related to Devanagari; benefits from same model quality
Punjabi (Gurmukhi)	Good	Poor	Printed Gurmukhi works well
Odia	Moderate	Poor	Unique rounded letterforms still being improved

"Excellent" means you'll get 95%+ accuracy on well-scanned printed material. "Moderate" means expect 80-85% accuracy and budget time for correction. "Poor" for handwriting means the feature is experimental — useful as a starting point but not a finished product.

Getting Good Results: Image Quality Guidelines

OCR accuracy is almost entirely determined by image quality. Here's what matters:

Resolution

Minimum acceptable: 150 DPI for large, clear text (24pt+)
Recommended for body text: 300 DPI
For small text or fine details: 400-600 DPI
Phone photos: Most modern phones shoot at sufficient resolution — the issue is usually focus, not pixel count

When scanning with a flatbed scanner, set it to 300 DPI for documents and 400 DPI for anything with small text or complex scripts.
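For phone photos there is no DPI setting, but you can estimate the effective resolution from the pixel dimensions and the physical page size. This is a rough sketch that assumes the page fills the whole frame; the function names and the wording of the advice strings are invented for this example, while the thresholds come from the list above.

```python
def effective_dpi(width_px, height_px, width_in, height_in):
    """Pixels per inch covering the page, taking the lower of the two axes."""
    return min(width_px / width_in, height_px / height_in)


def resolution_advice(dpi):
    """Map an effective DPI to the guideline tiers listed above."""
    if dpi < 150:
        return "below the 150 DPI minimum; rescan or move closer"
    if dpi < 300:
        return "fine for large, clear text only"
    if dpi < 400:
        return "suitable for body text"
    return "suitable for small text and fine details"


# A 3000 x 4000 pixel photo of an A4 page (8.27 x 11.69 inches), portrait.
dpi = effective_dpi(3000, 4000, 8.27, 11.69)
print(round(dpi), resolution_advice(dpi))
```

If the page occupies only part of the photo, the real figure is lower, so crop to the page first and use the cropped pixel dimensions.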

Lighting and Contrast

Even lighting is better than dramatic lighting. Avoid shadows across the text.
High contrast between text and background is crucial. Black ink on white paper is ideal.
Faded, yellowed, or water-stained paper significantly hurts accuracy.
Backlit documents (where you can see through the paper) are difficult — if the document is thin, place a black sheet behind it before photographing.

Image Orientation

TranslitHub's OCR automatically deskews (straightens) slightly rotated images, but extreme rotation (more than 10-15 degrees) reduces accuracy. Photograph or scan documents flat.

For phone photography of documents, enable the document scanning mode if your camera app has one — it flattens perspective distortion and applies contrast enhancement automatically.
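If you want to check rotation before uploading, you can estimate the skew from two points picked along a single line of text (for example, the two ends of a Devanagari headline). This is a simple geometric sketch, not TranslitHub's deskew algorithm; the function names are hypothetical and the 10-degree default is the lower end of the range mentioned above.

```python
import math


def skew_degrees(x1, y1, x2, y2):
    """Angle from horizontal of the line through two points on a text line."""
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def needs_rescan(angle, limit=10.0):
    """True when the rotation exceeds the limit in either direction."""
    return abs(angle) > limit


# Two points on the same text line, in pixel coordinates.
angle = skew_degrees(120, 400, 1120, 450)
print(round(angle, 1), needs_rescan(angle))
```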

Compression Artifacts

JPEG compression at low quality settings introduces block artifacts that interfere with small characters and fine strokes. If using JPEG, use high quality settings (80%+ quality). PNG avoids this entirely for screenshots and digital documents.

Uploading and Running OCR

Open TranslitHub and navigate to the OCR tool (or use the OCR icon in the editor toolbar)
Upload your image or PDF
Select the source language — if you're unsure or the document contains multiple scripts, select "Auto-detect"
Click Extract Text
The extracted text appears in an editable panel within seconds for images; PDF extraction takes longer based on page count

The extracted text is editable immediately. You can correct errors, then use any of the editor's export options (PDF, DOCX, TXT) to save the clean digital version.

Auto-Detection vs. Manual Language Selection

The auto-detect option works reliably for standard modern scripts. It analyzes the character shapes in the image to identify the script, then applies the appropriate recognition model. For a document that's clearly in one language, auto-detect is usually correct.

Manual language selection is better when:

The document is partially faded or damaged (character recognition is ambiguous)

You have a document that mixes two languages (auto-detect may pick the wrong primary)

The document uses an uncommon font or typeface (manual language setting helps the model narrow its search)
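After extraction, you can sanity-check which script the output actually contains by counting characters per Unicode block. Note that this works on the extracted text, not on the image, so it is not how the auto-detect feature itself operates; it is only a hedged sketch for spotting a wrong language choice or a mixed-script result. The names `script_of` and `script_counts` are made up for this example.

```python
from collections import Counter

# Unicode block ranges for the scripts covered in this guide.
SCRIPT_RANGES = [
    (0x0900, 0x097F, "Devanagari"),
    (0x0980, 0x09FF, "Bengali"),
    (0x0A00, 0x0A7F, "Gurmukhi"),
    (0x0A80, 0x0AFF, "Gujarati"),
    (0x0B00, 0x0B7F, "Odia"),
    (0x0B80, 0x0BFF, "Tamil"),
    (0x0C00, 0x0C7F, "Telugu"),
    (0x0C80, 0x0CFF, "Kannada"),
    (0x0D00, 0x0D7F, "Malayalam"),
]


def script_of(ch):
    """Return the script name for one character, or None if it is not counted."""
    cp = ord(ch)
    for start, end, name in SCRIPT_RANGES:
        if start <= cp <= end:
            return name
    if ch.isascii() and ch.isalpha():
        return "Latin"
    return None  # digits, spaces, punctuation, other scripts


def script_counts(text):
    """Count characters per script, ignoring anything script_of skips."""
    return Counter(s for s in map(script_of, text) if s is not None)


counts = script_counts("नमस्ते hello")
print(counts.most_common())
```

If the dominant script differs from the language you selected, or a second script makes up a large share, rerun the extraction with a manual language setting.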

Use Case: Digitizing Old Handwritten Letters

Old family letters in Hindi or Marathi handwriting are among the most emotionally valuable and practically challenging OCR targets. The challenges:

Pre-independence and mid-century handwriting often uses letterforms that differ from modern standard forms
Ink has faded and paper has yellowed
Writing was dense with minimal spacing between words
Ligatures and personal script variations don't match any standard typeface

For this use case, set your expectations accordingly. OCR will give you a starting draft — perhaps 60-70% accurate for older, faded handwriting — and you'll complete the rest manually. This is still faster than typing the entire document from scratch, and the output is editable text that you can correct character by character.
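The article does not define how accuracy percentages are measured, so treat the following as one common convention rather than TranslitHub's metric: character accuracy as 1 minus the edit distance between the OCR output and a hand-corrected reference, divided by the reference length. The function names and the 1,500-character letter are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in code points."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # drop ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def char_accuracy(ocr_text, reference):
    """Share of the reference recovered correctly, between 0.0 and 1.0."""
    if not reference:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, reference)
    return max(0.0, 1.0 - errors / len(reference))


def expected_corrections(num_chars, accuracy):
    """Rough number of characters to fix by hand at a given accuracy."""
    return round((1 - accuracy) * num_chars)


reference = "नमस्ते दुनिया"
ocr_output = "नभस्ते दुनिया"  # one wrong letter in the first word
print(f"{char_accuracy(ocr_output, reference):.1%}")

# Correction budget for a hypothetical 1,500-character letter.
print(expected_corrections(1500, 0.70), expected_corrections(1500, 0.60))
```

Counting code points means a wrong vowel sign and a wrong consonant each count as one error; counting whole syllables instead would give a lower figure for the same page.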

Tips for handwritten documents:

Photograph in full daylight (outdoors or near a bright window) rather than under artificial light

Use the highest camera resolution available

Shoot multiple times and choose the sharpest

If the document is folded, gently flatten it completely before photographing

Use Case: Newspaper and Magazine Clippings

Printed newspaper content in Indian languages typically extracts very well — 90%+ accuracy for well-preserved clippings. Challenges include:

Multi-column layouts: The OCR tool handles these by analyzing the layout and extracting text column by column
Text wrapping around images: Irregular text shapes are harder; select specific text regions for better accuracy
Low-quality newsprint: Old newspapers, especially those from the 1970s-1990s, were printed on low-grade paper prone to ink bleed, which blurs fine strokes and lowers accuracy

For newspaper archives, the recommended workflow:

Scan at 400 DPI (newsprint has fine text)
Use the region selection tool to define individual articles
Extract text from each article separately
Save as TXT or DOCX

Use Case: Scanned Government Documents

Government forms, certificates, and official correspondence often need to be digitized for records or further processing. The challenge is that government printing in India is not always high quality — forms are often photocopied multiple times, reducing clarity.

For photocopied government documents:

If you have a scanner available, scan the document to PDF rather than photographing it

Select the specific language manually rather than auto-detect

After extraction, verify key information (names, dates, ID numbers) character by character
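One thing that helps when checking dates and ID numbers: a form may print numerals in native digits while your records use ASCII digits. The sketch below normalizes any Unicode decimal digit to ASCII so the two can be compared directly. It assumes that mismatch exists in your documents, the function names are hypothetical, and it supplements the character-by-character check rather than replacing it.

```python
import unicodedata


def to_ascii_digits(text):
    """Replace every Unicode decimal digit with its ASCII equivalent."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Nd":
            out.append(str(unicodedata.decimal(ch)))
        else:
            out.append(ch)
    return "".join(out)


def same_number(extracted, expected):
    """Compare two values after digit normalization, ignoring outer spaces."""
    return to_ascii_digits(extracted).strip() == to_ascii_digits(expected).strip()


year_in_devanagari = "\u0967\u096F\u096A\u096D"  # the digits 1, 9, 4, 7
print(to_ascii_digits(year_in_devanagari))
print(same_number(year_in_devanagari, "1947"))
```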

Batch OCR

For processing multiple documents:

Upload multiple files (JPG, PNG, PDF, or a mix) at once

Set a common language (or leave on auto-detect)

Run batch processing

Download results as a ZIP containing one TXT or DOCX file per input

Batch processing is available for paid accounts. Free accounts process one file at a time.
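Once you have the ZIP of results, the standard `zipfile` module is enough to load every TXT file for review or further processing. This is a sketch under assumptions: the article does not state the archive layout or text encoding, so UTF-8 and a flat list of `.txt` entries are guesses, and `read_batch_results` is a hypothetical helper.

```python
import io
import zipfile


def read_batch_results(zip_source):
    """Return {entry name: text} for every .txt entry in a results ZIP.

    zip_source may be a file path or a file-like object.
    Assumes the text files are UTF-8 encoded.
    """
    texts = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in sorted(archive.namelist()):
            if name.lower().endswith(".txt"):
                texts[name] = archive.read(name).decode("utf-8")
    return texts


# Build a small in-memory ZIP to stand in for a downloaded results file.
demo_zip = io.BytesIO()
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("letter_01.txt", "नमस्ते")
    archive.writestr("letter_02.docx", b"not plain text")
demo_zip.seek(0)

results = read_batch_results(demo_zip)
print(sorted(results))
```

With a real download, pass the path to the ZIP file instead of the in-memory buffer. DOCX entries are skipped here because they need a separate parser.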

Converting OCR Output to Transliterated Roman

Once you've extracted text in an Indian script, you might need it in Roman transliteration — for data processing, URL slugs, or sharing with someone who can't read the script. The OCR result feeds directly into TranslitHub's transliteration feature:

Run OCR to get Indian script text
Click "Transliterate to Roman" in the editor toolbar
Get the Roman phonetic equivalent of the extracted text

This is useful for building searchable databases of Indian language content where you want to index both the native script and the phonetic representation.
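A minimal sketch of such a dual index, using an in-memory SQLite table with one column for the native script and one for the Roman form. The table layout and function names are assumptions for illustration, and the Roman strings below were typed by hand as sample values rather than produced by TranslitHub.

```python
import sqlite3


def build_index(records):
    """Create an in-memory table from (native, roman) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE docs (id INTEGER PRIMARY KEY, native TEXT, roman TEXT)"
    )
    conn.executemany("INSERT INTO docs (native, roman) VALUES (?, ?)", records)
    conn.commit()
    return conn


def search(conn, query):
    """Return ids of rows whose native or roman text contains the query."""
    pattern = f"%{query}%"
    rows = conn.execute(
        "SELECT id FROM docs WHERE native LIKE ? OR roman LIKE ? ORDER BY id",
        (pattern, pattern),
    )
    return [row[0] for row in rows]


index = build_index([
    ("नमस्ते दुनिया", "namaste duniya"),
    ("धन्यवाद", "dhanyavad"),
])

print(search(index, "namaste"))
print(search(index, "धन्यवाद"))
```

The same query function serves both readers who type in the native script and those who type phonetically. Because `%` and `_` are wildcards in `LIKE`, escape them if user queries may contain those characters.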

Limitations to Be Aware Of

Mathematical and tabular content: OCR for tables within Indian language documents is less reliable. Numbers and table structures extract, but column alignment may not be preserved.
Styled/decorative text: Fancy fonts, drop shadows, text on textures — common in movie posters, invitations, and wedding cards — have poor OCR accuracy.
Mixed language documents: A document that switches between Hindi and English in the same paragraph is harder than a purely monolingual document.
Very old scripts: Pre-colonial manuscripts and early printed books often use letterforms that diverge significantly from modern usage.

Transliteration Editor — edit and format the text after extraction
Bulk Transliteration Tool — process multiple documents' text at once
Document Export — save your digitized documents as PDF or Word