March 26, 2026 · 9 min read

OCR for Indian Scripts — Extract Text from Images and Scanned Documents

How OCR works for Indian language scripts at TranslitHub — supported formats, accuracy by script, digitizing old books and documents, and practical workflows for text extraction.

Tags: ocr, image to text, indian scripts, scanning, digitization

Somewhere in most Indian households there's a stack of documents — old letters in a grandmother's handwriting, photocopied government forms, newspaper clippings, printed railway receipts — that contain text you need but can't easily work with digitally. You could retype it all. Or you could photograph it and let OCR extract the text for you in seconds.

OCR for Indian scripts is harder than OCR for Latin scripts, and for a long time the quality was poor enough that manual retyping was often genuinely faster. That's changed significantly. TranslitHub includes OCR for Indian scripts that handles printed text — typed or typeset documents — well, and handles handwriting passably for clean, standard scripts.

This guide covers what to expect from Indian language OCR, which scripts work best, how to get clean extraction from different source materials, and practical use cases.

Why Indian Script OCR Is Harder

The basic Latin alphabet has 26 letters, plus a few diacritics in some languages. Most Indian scripts have 40-60 base characters, plus vowel diacritics that appear above, below, before, and after those characters, plus conjunct consonants that merge multiple characters into a single glyph. Devanagari alone has hundreds of distinct conjunct forms in common use.

Add to this:


  • The headline stroke (shirorekha) in Devanagari that connects the characters in a word: when printed at low resolution or with slight ink bleed, the headline can merge with the top strokes of characters

  • Circular/curved letterforms in Odia, Telugu, Kannada, and Malayalam that require high image quality to distinguish

  • The complex ligature system in Malayalam (old orthography) that produces unique joined forms for consonant clusters

  • Tamil's curved, open letterforms that look similar to each other at small sizes or with print artifacts


Modern OCR handles most of these with high accuracy when image quality is good. Image quality is the dominant factor: poor scans produce poor results no matter how good the OCR engine is.
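
To make the conjunct point above concrete, here is a small Python sketch (an illustration only, not part of TranslitHub) that lists the Unicode code points behind one common Devanagari conjunct. The helper name `describe` is made up for this example. What renders as a single joined glyph on the page is stored as three separate code points, which is part of why recognition has to map one shape back to several characters.

```python
import unicodedata

def describe(text):
    """Return the Unicode name of every code point in `text`."""
    return [unicodedata.name(ch) for ch in text]

# The conjunct "ksha": KA + VIRAMA + SSA, written with escapes so the
# exact code points are unambiguous.
conjunct = "\u0915\u094d\u0937"

print(conjunct, "is", len(conjunct), "code points:")
for name in describe(conjunct):
    print("  ", name)
```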

Supported Input Formats

TranslitHub OCR accepts:

Format | Notes
JPG / JPEG | Standard photo format; 72-300 DPI works well
PNG | Lossless compression; better for screenshots and digital images
PDF | Extracts text from all pages; uses embedded text if available, falls back to OCR
TIFF | High-quality scanned documents; preferred for archival material
WebP | Modern web format; supported

Maximum file size: 20MB per image, 50MB for PDF. For PDFs longer than 30 pages, use the batch processing option to avoid timeout issues.
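
If you are preparing many files, it can help to check them against these limits before uploading. The sketch below is a local pre-check under stated assumptions: the function name `check_upload` is hypothetical, 1 MB is taken to be 1024 × 1024 bytes (the article does not say which definition applies), and `.tif` is accepted as an alternative TIFF extension. It does not call any TranslitHub API.

```python
from pathlib import Path

MB = 1024 * 1024                      # assumption: binary megabytes
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".webp"}
IMAGE_LIMIT = 20 * MB                 # per-image limit from the article
PDF_LIMIT = 50 * MB                   # PDF limit from the article
BATCH_PAGE_THRESHOLD = 30             # longer PDFs: batch is suggested

def check_upload(filename, size_bytes, pdf_pages=None):
    """Return a list of notes; an empty list means no problem was found."""
    notes = []
    ext = Path(filename).suffix.lower()
    if ext == ".pdf":
        limit = PDF_LIMIT
        if pdf_pages is not None and pdf_pages > BATCH_PAGE_THRESHOLD:
            notes.append("long PDF: consider batch processing")
    elif ext in IMAGE_EXTENSIONS:
        limit = IMAGE_LIMIT
    else:
        return [f"unsupported format: {ext or '(none)'}"]
    if size_bytes > limit:
        notes.append(f"file exceeds {limit // MB} MB limit")
    return notes

print(check_upload("scan.TIFF", 5 * MB))
print(check_upload("book.pdf", 60 * MB, pdf_pages=45))
```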

Accuracy by Script and Condition

Here's an honest breakdown of what to expect:

Script | Printed/Typeset | Handwritten | Notes
Devanagari (Hindi/Marathi) | Excellent | Fair | Best results for modern fonts; older typefaces need higher DPI
Bengali | Very good | Fair | Curved forms need clean scans
Tamil | Very good | Poor | Printed Tamil extracts cleanly; handwriting accuracy is low
Telugu | Good | Poor | Similar to Tamil
Kannada | Good | Poor | Complex circular letterforms need high DPI
Malayalam | Good | Poor | Old orthography (before the 1971 script reform) is less accurate
Gujarati | Good | Fair | Related to Devanagari; benefits from same model quality
Punjabi (Gurmukhi) | Good | Poor | Printed Gurmukhi works well
Odia | Moderate | Poor | Unique rounded letterforms still being improved

"Excellent" means you'll get 95%+ accuracy on well-scanned printed material. "Moderate" means expect 80-85% accuracy and budget time for correction. "Poor" for handwriting means the feature is experimental: useful as a starting point but not a finished product.

Getting Good Results: Image Quality Guidelines

OCR accuracy is almost entirely determined by image quality. Here's what matters:

Resolution

  • Minimum acceptable: 150 DPI for large, clear text (24pt+)
  • Recommended for body text: 300 DPI
  • For small text or fine details: 400-600 DPI
  • Phone photos: Most modern phones shoot at sufficient resolution — the issue is usually focus, not pixel count
When scanning with a flatbed scanner, set it to 300 DPI for documents and 400 DPI for anything with small text or complex scripts.
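
For phone photos, you can estimate the effective DPI by dividing the pixel dimensions by the physical size of the page. The sketch below assumes the page fills the whole frame, which is the best case; if the page covers only part of the photo, the real figure is lower. The function names are illustrative, and the thresholds are the ones listed above.

```python
def effective_dpi(width_px, height_px, width_in, height_in):
    """Effective DPI when a page of the given size fills the whole frame."""
    return min(width_px / width_in, height_px / height_in)

def suitable_for(dpi):
    """Map a DPI value to the resolution guidance in this section."""
    if dpi < 150:
        return "below the minimum; rescan or move closer"
    if dpi < 300:
        return "large, clear text only"
    if dpi < 400:
        return "body text"
    return "small text and fine details"

# An A4 page (8.27 x 11.69 inches) in a 3000 x 4000 pixel photo.
dpi = effective_dpi(3000, 4000, 8.27, 11.69)
print(round(dpi), "DPI:", suitable_for(dpi))
```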

Lighting and Contrast

  • Even lighting is better than dramatic lighting. Avoid shadows across the text.
  • High contrast between text and background is crucial. Black ink on white paper is ideal.
  • Faded, yellowed, or water-stained paper significantly hurts accuracy.
  • Backlit documents (where you can see through the paper) are difficult — if the document is thin, place a black sheet behind it before photographing.

Image Orientation

TranslitHub's OCR automatically deskews (straightens) slightly rotated images, but extreme rotation (more than 10-15 degrees) reduces accuracy. Photograph or scan documents flat.

For phone photography of documents, enable the document scanning mode if your camera app has one — it flattens perspective distortion and applies contrast enhancement automatically.

Compression Artifacts

JPEG compression at low quality settings introduces block artifacts that interfere with small characters and fine strokes. If using JPEG, use high quality settings (80%+ quality). PNG avoids this entirely for screenshots and digital documents.

Uploading and Running OCR

  1. Open TranslitHub and navigate to the OCR tool (or use the OCR icon in the editor toolbar)
  2. Upload your image or PDF
  3. Select the source language — if you're unsure or the document contains multiple scripts, select "Auto-detect"
  4. Click Extract Text
  5. The extracted text appears in an editable panel within seconds for images; PDF extraction takes longer based on page count
The extracted text is editable immediately. You can correct errors, then use any of the editor's export options (PDF, DOCX, TXT) to save the clean digital version.

Auto-Detection vs. Manual Language Selection

The auto-detect option works reliably for standard modern scripts. It analyzes the character shapes in the image to identify the script, then applies the appropriate recognition model. For a document that's clearly in one language, auto-detect is usually correct.

Manual language selection is better when:


  • The document is partially faded or damaged (character recognition is ambiguous)

  • The document mixes two languages (auto-detect may pick the wrong primary language)

  • The document uses an uncommon font or typeface (manual language setting helps the model narrow its search)
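
TranslitHub's auto-detection works on character shapes in the image, and its internals are not described here. As a separate sanity check on the extracted text, you can count which Unicode script blocks the characters fall into. The sketch below is one way to do that; `script_counts` is an illustrative name. A result dominated by an unexpected script, or split between two, suggests rerunning with a manual language selection.

```python
from collections import Counter

# Unicode block ranges for the scripts covered in this guide.
SCRIPT_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Odia": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}

def script_counts(text):
    """Count code points per script; ASCII letters are counted as Latin."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for name, (low, high) in SCRIPT_BLOCKS.items():
            if low <= cp <= high:
                counts[name] += 1
                break
        else:
            if ch.isascii() and ch.isalpha():
                counts["Latin"] += 1
    return counts

print(script_counts("नमस्ते hello").most_common())
```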


Use Case: Digitizing Old Handwritten Letters

Old family letters in Hindi or Marathi handwriting are among the most emotionally valuable and practically challenging OCR targets. The challenges:

  • Pre-independence and mid-century handwriting often uses letterforms that differ from modern standard forms
  • Ink has faded and paper has yellowed
  • Writing was dense with minimal spacing between words
  • Ligatures and personal script variations don't match any standard typeface
For this use case, set your expectations accordingly. OCR will give you a starting draft — perhaps 60-70% accurate for older, faded handwriting — and you'll complete the rest manually. This is still faster than typing the entire document from scratch, and the output is editable text that you can correct character by character.

Tips for handwritten documents:


  • Photograph in full daylight (outdoors or near a bright window) rather than under artificial light

  • Use the highest camera resolution available

  • Shoot multiple times and choose the sharpest

  • If the document is folded, gently flatten it completely before photographing


Use Case: Newspaper and Magazine Clippings

Printed newspaper content in Indian languages typically extracts very well — 90%+ accuracy for well-preserved clippings. Challenges include:

  • Multi-column layouts: The OCR tool handles these by analyzing the layout and extracting text column by column
  • Text wrapping around images: Irregular text shapes are harder; select specific text regions for better accuracy
  • Low-quality newsprint: Old newspapers, especially from the 1970s-1990s, were printed on low-grade paper with ink bleed

For newspaper archives, the recommended workflow:
  1. Scan at 400 DPI (newsprint has fine text)
  2. Use the region selection tool to define individual articles
  3. Extract text from each article separately
  4. Save as TXT or DOCX

Use Case: Scanned Government Documents

Government forms, certificates, and official correspondence often need to be digitized for records or further processing. The challenge is that government printing in India is not always high quality — forms are often photocopied multiple times, reducing clarity.

For photocopied government documents:


  • Scan to PDF rather than photographing the document if you have a scanner available

  • Select the specific language manually rather than auto-detect

  • After extraction, verify key information (names, dates, ID numbers) character by character
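
For the verification step, one approach is to pull every number-like field out of the extracted text so you can compare each against the original. The sketch below is an assumption about how you might do this locally, with illustrative function names: it converts native digits (for example Devanagari १२३) to ASCII, then lists digit runs, keeping separators such as "/" so dates stay in one piece. Names still need to be checked by eye.

```python
import re
import unicodedata

def to_ascii_digits(text):
    """Replace any Unicode decimal digit with its ASCII equivalent."""
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in text
    )

def fields_to_verify(text):
    """Return digit runs (dates, ID numbers) found in the OCR text."""
    normalized = to_ascii_digits(text)
    # A digit, optionally followed by digits or separators, ending in a digit.
    return re.findall(r"\d[\d/\-.]*\d|\d", normalized)

sample = "जन्म तिथि: १२/०३/१९९८, क्रमांक ४५६७"
for field in fields_to_verify(sample):
    print("check against original:", field)
```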


Batch OCR

For processing multiple documents:


  1. Upload multiple files (JPG, PNG, PDF, or a mix) at once

  2. Set a common language (or leave on auto-detect)

  3. Run batch processing

  4. Download results as a ZIP containing one TXT or DOCX file per input
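
Once you have the results ZIP, a short script can load every TXT file for further processing. The sketch below assumes the text files are UTF-8 encoded and uses made-up file names; the actual naming and encoding of TranslitHub's output are not documented here, so treat both as assumptions. DOCX entries are skipped.

```python
import io
import zipfile

def read_txt_results(zip_source):
    """Map each .txt entry in the archive to its decoded text.

    `zip_source` may be a file path or a file-like object.
    """
    results = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".txt"):
                # Assumption: results are UTF-8 encoded.
                results[name] = archive.read(name).decode("utf-8")
    return results

# Build a small in-memory archive to stand in for a downloaded ZIP.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("letter1.txt", "नमस्ते")
    archive.writestr("form.docx", b"not plain text")
demo.seek(0)

for name, text in read_txt_results(demo).items():
    print(name, len(text), "characters")
```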


Batch processing is available for paid accounts. Free accounts process one file at a time.

Converting OCR Output to Transliterated Roman

Once you've extracted text in an Indian script, you might need it in Roman transliteration — for data processing, URL slugs, or sharing with someone who can't read the script. The OCR result feeds directly into TranslitHub's transliteration feature:

  1. Run OCR to get Indian script text
  2. Click "Transliterate to Roman" in the editor toolbar
  3. Get the Roman phonetic equivalent of the extracted text
This is useful for building searchable databases of Indian language content where you want to index both the native script and the phonetic representation.
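
A minimal sketch of such a dual index, using Python's built-in `sqlite3` module. The table layout, function names, and sample rows are assumptions for illustration; the Roman strings are hand-written placeholders, not actual TranslitHub output.

```python
import sqlite3

def build_index(rows):
    """Create an in-memory index of (source, native, roman) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "id INTEGER PRIMARY KEY, source TEXT, native TEXT, roman TEXT)"
    )
    conn.executemany(
        "INSERT INTO documents (source, native, roman) VALUES (?, ?, ?)",
        rows,
    )
    return conn

def search(conn, query):
    """Return sources whose native or Roman text contains the query."""
    pattern = f"%{query}%"
    cursor = conn.execute(
        "SELECT source FROM documents "
        "WHERE native LIKE ? OR roman LIKE ? ORDER BY id",
        (pattern, pattern),
    )
    return [row[0] for row in cursor]

conn = build_index([
    ("letter.jpg", "नमस्ते", "namaste"),
    ("form.pdf", "धन्यवाद", "dhanyavad"),
])
print(search(conn, "Namaste"))   # Roman query; LIKE ignores ASCII case
print(search(conn, "धन्य"))       # native-script query
```

Storing both columns lets a reader who cannot type the script still find documents by their phonetic spelling. For larger collections, SQLite's full-text search extension would be a natural next step.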

Limitations to Be Aware Of

  • Mathematical and tabular content: OCR for tables within Indian language documents is less reliable. Numbers and table structures extract, but column alignment may not be preserved.
  • Styled/decorative text: Fancy fonts, drop shadows, text on textures — common in movie posters, invitations, and wedding cards — have poor OCR accuracy.
  • Mixed language documents: A document that switches between Hindi and English in the same paragraph is harder than a purely monolingual document.
  • Very old scripts: Pre-colonial manuscripts and early printed books often use letterforms that diverge significantly from modern usage.