Unicode and Indian Scripts — Why Your Hindi Text Sometimes Looks Broken
Why Hindi, Tamil, and Bengali text sometimes renders as boxes, question marks, or garbled characters — and how to fix it. A plain explanation of Unicode, encoding, font rendering, and copy-paste problems.
You paste a Hindi paragraph into a document and some words render fine while others show up as strange boxes. You receive a WhatsApp message in Tamil that displays perfectly on your phone but shows question marks when you open it on your old laptop. You copy text from a 2005-era Hindi website and it comes out as random Roman letters that make no sense.
All of these are encoding problems, and they have specific causes and fixes. Understanding a little about how Indian text works under the hood will save you a lot of frustration.
The Short Version
Every character displayed on any digital device is actually a number. The question is: which number refers to which character? That's what encoding defines.
There are two major encoding worlds for Indian text:
- Unicode — the universal modern standard, used everywhere since the mid-2000s
- Legacy encodings — older standards (ISCII, Krutidev, Shree-Lipi, and dozens of font-based encodings) that predate Unicode and are still present in older documents and systems
What Unicode Actually Is
Unicode is a standard that assigns a unique number (called a code point) to every character in every writing system on Earth. The current standard defines roughly 150,000 characters across more than 150 scripts.
For Indian scripts:
| Script | Unicode Block | Range |
|---|---|---|
| Devanagari (Hindi, Marathi, Sanskrit) | Devanagari | U+0900–U+097F |
| Bengali | Bengali | U+0980–U+09FF |
| Gurmukhi (Punjabi) | Gurmukhi | U+0A00–U+0A7F |
| Gujarati | Gujarati | U+0A80–U+0AFF |
| Odia | Oriya | U+0B00–U+0B7F |
| Tamil | Tamil | U+0B80–U+0BFF |
| Telugu | Telugu | U+0C00–U+0C7F |
| Kannada | Kannada | U+0C80–U+0CFF |
| Malayalam | Malayalam | U+0D00–U+0D7F |
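The ranges in the table can be checked programmatically. Here is a minimal sketch in Python; the block names and ranges mirror the table above, and the helper name `script_of` is invented for illustration:

```python
# Map a character to its Indic Unicode block using the ranges above.
INDIC_BLOCKS = [
    ("Devanagari", 0x0900, 0x097F),
    ("Bengali",    0x0980, 0x09FF),
    ("Gurmukhi",   0x0A00, 0x0A7F),
    ("Gujarati",   0x0A80, 0x0AFF),
    ("Oriya",      0x0B00, 0x0B7F),
    ("Tamil",      0x0B80, 0x0BFF),
    ("Telugu",     0x0C00, 0x0C7F),
    ("Kannada",    0x0C80, 0x0CFF),
    ("Malayalam",  0x0D00, 0x0D7F),
]

def script_of(ch: str) -> str:
    """Return the Unicode block name for a single character."""
    cp = ord(ch)
    for name, lo, hi in INDIC_BLOCKS:
        if lo <= cp <= hi:
            return name
    return "Other"

print(script_of("क"))   # Devanagari (क is U+0915)
print(script_of("த"))   # Tamil (த is U+0BA4)
print(script_of("a"))   # Other
```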
Why Text Breaks: The Three Main Causes
1. Legacy Encoding vs Unicode Mismatch
Before Unicode became standard, Indian language computing used custom encodings. Krutidev was dominant for Hindi desktop publishing. Shree-Lipi covered many scripts. ISCII was the government standard. In all these systems, the same byte values used for English letters were repurposed to represent Indian characters.
When you open a Krutidev-encoded document in a Unicode-aware application, the application reads the bytes correctly but interprets them as Unicode — which puts them in completely different code points. The result looks like random Roman letters and symbols (often: "fgkjfkaj" or similar) rather than Devanagari.
The text isn't corrupted — it's just being read with the wrong key.
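Krutidev itself can't be reproduced without its font tables, but the "wrong key" effect is easy to demonstrate with any encoding mismatch. A sketch in Python, reading UTF-8 bytes with a legacy 8-bit codec (cp1252 here stands in for whichever wrong encoding the application assumes):

```python
# The same bytes, read with two different "keys" (encodings).
text = "हिंदी"
raw = text.encode("utf-8")      # how the text is actually stored: 15 bytes

print(raw.decode("utf-8"))      # the right key recovers हिंदी
print(raw.decode("cp1252"))     # wrong key: Roman letters and symbols (mojibake)
```

The bytes never changed; only the interpretation did, which is why conversion (not repair) is the fix.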
Fix: The document needs to be converted from its original encoding to Unicode. Search for "Krutidev to Unicode converter" or "ISCII to Unicode converter" — there are free web tools that handle this. Open the file in the correct encoding first, then convert.
2. Missing Font Support
Unicode defines the code points, but the visual rendering of characters depends on fonts. If your system doesn't have a font that includes Devanagari glyphs, the operating system substitutes a "tofu" character — a small box or rectangle — for each character it can't display.
This is the "boxes everywhere" problem. The text is correct Unicode; your device just doesn't have a font to draw those characters.
Fix on Windows: Install a Devanagari font. Mangal is pre-installed on most Windows systems. If it's missing, download it from the Microsoft site or install Noto Sans Devanagari from Google Fonts.
Fix on older Android: Go to Settings → Language & Input → check if your language is in the installed list. Installing the language pack usually installs the required system fonts.
Fix in web browsers: Modern browsers (Chrome, Firefox, Edge) handle Unicode font rendering automatically for most scripts, falling back to system fonts. If you're seeing boxes in a browser, update your browser — very old versions had poor Unicode support.
3. Incomplete Text Shaping Support
This is the subtlest problem. Indian scripts don't just display characters independently — characters combine. Matras (vowel diacritics) attach to preceding consonants. Consonant conjuncts merge into combined glyphs. A vowel that appears to the left of its consonant in visual rendering is actually stored after it in Unicode order.
This combining behavior is called text shaping, and it requires both a capable font AND a capable rendering engine. If either is missing, you get:
- Matras appearing in the wrong position (floating above/below instead of attached to the right character)
- Conjuncts appearing as separate characters rather than combined glyphs
- Vowel signs appearing after the consonant instead of before
Fix: Use modern software. Text shaping for Indian scripts requires HarfBuzz (used in Chrome, Firefox, LibreOffice, modern Word) or a platform-level shaping engine (Apple's CoreText, Windows' DirectWrite). Old software from before 2010 often lacked this. If you're seeing shaping errors in a specific application, update it or try a different one.
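The logical-versus-visual ordering described above is easy to verify. In कि ("ki"), the vowel sign ि renders to the left of क but is stored after it:

```python
import unicodedata

# Inspect the stored order of code points in "कि".
word = "कि"   # rendered: the vowel sign appears LEFT of the consonant
for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+0915 DEVANAGARI LETTER KA
# U+093F DEVANAGARI VOWEL SIGN I   <- stored AFTER the consonant
```

The shaping engine, not the storage order, is responsible for drawing the vowel sign on the left.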
The Copy-Paste Problem: What Actually Happens
When you copy Indian text from a webpage:
- The browser copies the Unicode code points (the numbers that represent each character)
- The clipboard stores these code points
- When you paste into another application, that application reads the code points and renders them using available fonts
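Problems appear when any application in that chain converts the text to a legacy 8-bit encoding it assumes by default. The "question marks on an old laptop" symptom can be simulated in Python; ASCII here stands in for whatever legacy encoding the old application uses:

```python
# Unicode text forced through an encoding that cannot represent it.
text = "தமிழ்"                                  # Tamil, real Unicode
lossy = text.encode("ascii", errors="replace")  # the old app's narrow encoding
print(lossy)                  # b'?????' — every character replaced
print(lossy.decode("ascii"))  # what the old laptop displays: ?????
```

Once the replacement has happened, the original characters are gone; re-copying from the Unicode source is the only recovery.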
The Krutidev / Font-Based Encoding Problem in Depth
Krutidev deserves special mention because it's still extremely common in older Hindi documents, printed materials that were digitized, and some government documents.
Krutidev works by using a specially designed font where the glyphs at standard ASCII positions are replaced with Devanagari character forms. So when you see the Roman letter "d" in Krutidev font, the font draws ड instead. The underlying byte stored in the file is the ASCII code for "d" — 100 — but the visual output is ड.
This is a font trick, not real Unicode. The text file literally contains "fdkjfkjak" and only looks like Hindi because of a specific font rendering it.
Consequences:
- Can't be searched by search engines (Google searches for Hindi words won't find Krutidev files)
- Breaks when opened in any application that uses a different font
- Can't be read by screen readers (accessibility failure)
- Doesn't sort, compare, or process correctly in databases
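One practical consequence: you can tell whether a "Hindi" file is real Unicode or a font trick just by inspecting its code points. A heuristic sketch — the function name `looks_like_font_encoded_hindi` is invented here, and this is a rough check, not a full detector:

```python
def looks_like_font_encoded_hindi(text: str) -> bool:
    """Heuristic: text claimed to be Hindi that contains no Devanagari
    code points (U+0900-U+097F) is probably Krutidev-style font-encoded."""
    has_devanagari = any(0x0900 <= ord(ch) <= 0x097F for ch in text)
    return not has_devanagari

print(looks_like_font_encoded_hindi("fdkjfkjak"))  # True  — ASCII bytes only
print(looks_like_font_encoded_hindi("हिंदी"))       # False — real Unicode
```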
Unicode Normalization: NFC vs NFD
This is an edge case but it causes genuinely mysterious bugs. Some characters in Unicode can be represented in multiple ways. For example, a nukta consonant such as क़, or one of the two-part vowel signs found in scripts like Bengali, could be stored as:
- A single precomposed character (NFC — Normalized Form Composed)
- A base character + combining character (NFD — Normalized Form Decomposed)
This is rare in practice but comes up in programming contexts, database queries, and occasionally when copying text between different platforms (Apple systems have historically favored decomposed forms, while Windows tends to produce NFC).
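A minimal Python sketch of the mismatch, using क़ (qa), which exists both as a precomposed code point (U+0958) and as क (U+0915) plus a combining nukta (U+093C). One wrinkle worth knowing: Unicode's composition exclusions mean NFC leaves this particular character in decomposed form; what matters is that both spellings normalize to the same string:

```python
import unicodedata

precomposed = "\u0958"          # क़ as a single code point
decomposed  = "\u0915\u093C"    # क + combining nukta

print(precomposed == decomposed)   # False — the stored code points differ
print(precomposed, decomposed)     # yet they render identically

same = (unicodedata.normalize("NFC", precomposed)
        == unicodedata.normalize("NFC", decomposed))
print(same)                        # True — equal after normalization
```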
Fix: Unicode normalization functions exist in every programming language. If you're building software that handles Indian text, normalize all input to NFC before storing or comparing.
Quick Diagnostics
| Symptom | Likely Cause | Fix |
|---|---|---|
| Boxes/rectangles where text should be | Missing font | Install Noto or Mangal font |
| Roman letters where Indian script should be | Legacy encoding (Krutidev etc.) | Use a legacy-to-Unicode converter |
| Correct characters but wrong positions (matras floating) | Old software / poor text shaping | Update application or browser |
| Text looks right on screen but garbled in PDF | Font not embedded in PDF | Export PDF with embedded fonts |
| Copy-paste loses some characters | Clipboard encoding issue | Paste as plain text, then reformat |
| Same text looks different on two phones | Different fallback fonts | Use explicit font specification |