March 26, 2026 · 9 min read

Unicode and Indian Scripts — Why Your Hindi Text Sometimes Looks Broken

Why Hindi, Tamil, and Bengali text sometimes renders as boxes, question marks, or garbled characters — and how to fix it. A plain explanation of Unicode, encoding, font rendering, and copy-paste problems.

unicode encoding indian scripts fonts rendering

You paste a Hindi paragraph into a document and some words render fine while others show up as strange boxes. You receive a WhatsApp message in Tamil that displays perfectly on your phone but shows question marks when you open it on your old laptop. You copy text from a 2005-era Hindi website and it comes out as random Roman letters that make no sense.

All of these are encoding problems, and they have specific causes and fixes. Understanding a little about how Indian text works under the hood will save you a lot of frustration.

The Short Version

Every character displayed on any digital device is actually a number. The question is: which number refers to which character? That's what encoding defines.

There are two major encoding worlds for Indian text:

  1. Unicode — the universal modern standard, used everywhere since the mid-2000s
  2. Legacy encodings — older standards (ISCII, Krutidev, Shree-Lipi, and dozens of font-based encodings) that predate Unicode and are still present in older documents and systems
When text encoded in one system is read by software expecting the other, it breaks. The characters exist — but they're the wrong ones.
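The "wrong key" effect is easy to reproduce. A minimal Python sketch: the same bytes, decoded once with the right encoding and once with a wrong one.

```python
# One sequence of bytes, two interpretations ("keys").
data = "हिंदी".encode("utf-8")  # the word "Hindi" in Devanagari, as UTF-8 bytes

print(data.decode("utf-8"))    # हिंदी -- the right key
print(data.decode("latin-1"))  # garbled accented Latin letters -- wrong key
```

The bytes never change; only the table used to turn them back into characters does.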

What Unicode Actually Is

Unicode is a standard that assigns a unique number (called a code point) to every character in every writing system on Earth. Current versions of the standard define well over 149,000 characters across more than 150 scripts.

For Indian scripts:

| Script | Unicode Block | Range |
| --- | --- | --- |
| Devanagari (Hindi, Marathi, Sanskrit) | Devanagari | U+0900–U+097F |
| Bengali | Bengali | U+0980–U+09FF |
| Gurmukhi (Punjabi) | Gurmukhi | U+0A00–U+0A7F |
| Gujarati | Gujarati | U+0A80–U+0AFF |
| Odia | Oriya | U+0B00–U+0B7F |
| Tamil | Tamil | U+0B80–U+0BFF |
| Telugu | Telugu | U+0C00–U+0C7F |
| Kannada | Kannada | U+0C80–U+0CFF |
| Malayalam | Malayalam | U+0D00–U+0D7F |

When you type Hindi using a modern transliteration tool or IME, you're producing Devanagari Unicode characters — U+0900 range. These characters are universal: any software, any device, any browser that supports Unicode will render them correctly, assuming an appropriate font is available.
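You can check this yourself in a few lines of Python: every character of a Devanagari word has a code point inside U+0900–U+097F.

```python
# Every character of a Devanagari word falls in the U+0900–U+097F block.
word = "नमस्ते"  # "namaste"
for ch in word:
    print(f"U+{ord(ch):04X}")  # e.g. U+0928 for न

devanagari = all(0x0900 <= ord(ch) <= 0x097F for ch in word)
print(devanagari)  # True
```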

Why Text Breaks: The Three Main Causes

1. Legacy Encoding vs Unicode Mismatch

Before Unicode became standard, Indian language computing used custom encodings. Krutidev was dominant for Hindi desktop publishing. Shree-Lipi covered many scripts. ISCII was the government standard. In all these systems, the same byte values used for English letters were repurposed to represent Indian characters.

When you open a Krutidev-encoded document in a Unicode-aware application, the application reads the bytes correctly but interprets them as Unicode — which puts them in completely different code points. The result looks like random Roman letters and symbols (often: "fgkjfkaj" or similar) rather than Devanagari.

The text isn't corrupted — it's just being read with the wrong key.

Fix: The document needs to be converted from its original encoding to Unicode. Search for "Krutidev to Unicode converter" or "ISCII to Unicode converter" — there are free web tools that handle this. Open the file in the correct encoding first, then convert.

2. Missing Font Support

Unicode defines the code points, but the visual rendering of characters depends on fonts. If your system doesn't have a font that includes Devanagari glyphs, the operating system substitutes a "tofu" character — a small box or rectangle — for each character it can't display.

This is the "boxes everywhere" problem. The text is correct Unicode; your device just doesn't have a font to draw those characters.

Fix on Windows: Install a Devanagari font. Mangal is pre-installed on most Windows systems. If it's missing, download it from the Microsoft site or install Noto Sans Devanagari from Google Fonts.

Fix on older Android: Go to Settings → Language & Input and check whether your language is in the installed list. Installing the language pack usually installs the required system fonts.

Fix in web browsers: Modern browsers (Chrome, Firefox, Edge) handle Unicode font rendering automatically for most scripts, falling back to system fonts. If you're seeing boxes in a browser, update your browser — very old versions had poor Unicode support.

3. Incomplete Text Shaping Support

This is the subtlest problem. Indian scripts don't just display characters independently — characters combine. Matras (vowel diacritics) attach to preceding consonants. Consonant conjuncts merge into combined glyphs. A vowel that appears to the left of its consonant in visual rendering is actually stored after it in Unicode order.
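The stored order is visible if you inspect the code points directly. A small Python sketch using the standard unicodedata module:

```python
import unicodedata

# "कि" (ki): the vowel sign ि is drawn to the LEFT of क,
# but it is stored AFTER the consonant in Unicode order.
for ch in "कि":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+0915 DEVANAGARI LETTER KA
# U+093F DEVANAGARI VOWEL SIGN I

# The conjunct क्ष is three stored characters shaped into one glyph:
print([f"U+{ord(c):04X}" for c in "क्ष"])
# ['U+0915', 'U+094D', 'U+0937']
```

Turning that stored sequence into the correct visual form is the shaping engine's job.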

This combining behavior is called text shaping, and it requires both a capable font AND a capable rendering engine. If either is missing, you get:


  • Matras appearing in the wrong position (floating above/below instead of attached to the right character)

  • Conjuncts appearing as separate characters rather than combined glyphs

  • Vowel signs appearing after the consonant instead of before


Fix: Use modern software. Text shaping for Indian scripts requires HarfBuzz (used in Chrome, Firefox, LibreOffice, modern Word) or a platform-level shaping engine (Apple's CoreText, Windows' DirectWrite). Old software from before 2010 often lacked this. If you're seeing shaping errors in a specific application, update it or try a different one.

The Copy-Paste Problem: What Actually Happens

When you copy Indian text from a webpage:

  1. The browser copies the Unicode code points (the numbers that represent each character)
  2. The clipboard stores these code points
  3. When you paste into another application, that application reads the code points and renders them using available fonts
This process is reliable between modern applications. Where it breaks:

Copy from a legacy-encoded website: Some older Hindi news sites still serve text in Krutidev or similar. Your browser may render it correctly (if it detects the encoding), but the copied text will be the raw bytes — which paste as garbage in Unicode-native applications.

Copy into a PDF editor that lacks Indian font embedding: The PDF editor may have the characters but not embed the font in the saved PDF. The recipient's PDF viewer then substitutes a fallback font that may not cover Indian scripts.

Copy into plain-text email: Email plain-text fields are usually UTF-8 (Unicode), so this works fine. But some legacy email clients enforce ASCII encoding and strip or mangle high-byte characters.

Practical test: Copy your Indian text and paste it into a plain text editor (Notepad on Windows, TextEdit in plain-text mode on Mac). If it looks correct there, the text itself is fine. If it breaks, the source was using a legacy encoding.
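That practical test can also be run programmatically. A rough heuristic sketch (the function name and categories are my own, not from any library): real Unicode Hindi contains code points in the Devanagari block, while Krutidev-style legacy text is pure ASCII.

```python
def diagnose(text: str) -> str:
    """Rough triage: is this real Unicode Devanagari? (heuristic only)"""
    if any("\u0900" <= ch <= "\u097f" for ch in text):
        return "unicode-devanagari"
    if text.isascii():
        # Roman letters where Hindi should be -- likely legacy bytes
        return "possibly-legacy-encoded"
    return "other"

print(diagnose("नमस्ते"))      # unicode-devanagari
print(diagnose("fdkjfkjak"))  # possibly-legacy-encoded
```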

The Krutidev / Font-Based Encoding Problem in Depth

Krutidev deserves special mention because it's still extremely common in older Hindi documents, printed materials that were digitized, and some government documents.

Krutidev works by using a specially designed font where the glyphs at standard ASCII positions are replaced with Devanagari character forms. So when you see the Roman letter "d" in Krutidev font, the font draws ड instead. The underlying byte stored in the file is the ASCII code for "d" — 100 — but the visual output is ड.

This is a font trick, not real Unicode. The text file literally contains "fdkjfkjak" and only looks like Hindi because of a specific font rendering it.

Consequences:
  • Can't be searched by search engines (Google searches for Hindi words won't find Krutidev files)
  • Breaks when opened in any application that uses a different font
  • Can't be read by screen readers (accessibility failure)
  • Doesn't sort, compare, or process correctly in databases
Converting Krutidev to Unicode: Several free tools handle this. CopyPaste-it, e-tools.in, and others offer batch conversion. You paste Krutidev text and get back proper Unicode Devanagari. The result is real, searchable, transferable text.
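Under the hood, such a converter is a mapping from legacy byte positions to Unicode code points. The sketch below shows only the shape of one; the mapping contains the single "d" → ड example from above, and a real converter needs the complete Krutidev table, multi-character sequences, and reordering rules for left-attaching matras like ि.

```python
# Skeleton of a legacy-font-to-Unicode converter (illustrative only --
# not the real, complete Krutidev mapping table).
KRUTIDEV_TO_UNICODE = {
    "d": "\u0921",  # ड, the example mapping described above
}

def krutidev_to_unicode(legacy: str) -> str:
    # Characters without a mapping pass through unchanged.
    return "".join(KRUTIDEV_TO_UNICODE.get(ch, ch) for ch in legacy)

print(krutidev_to_unicode("d"))  # ड
```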

Unicode Normalization: NFC vs NFD

This is an edge case but it causes genuinely mysterious bugs. Some characters in Unicode can be represented in multiple ways. For example, a consonant with a vowel matra could be stored as:

  • A single precomposed character (NFC — Normalized Form Composed)
  • A base character + combining character (NFD — Normalized Form Decomposed)
Both are valid Unicode. Both render identically. But they are byte-for-byte different, so string comparison fails: searching for a word finds no results because your search term is NFC and the document text is NFD.

This is rare in practice but comes up in programming contexts, database queries, and occasionally when copying text between different platforms (iOS tends to produce NFD, Windows tends to produce NFC).

Fix: Unicode normalization functions exist in every programming language. If you're building software that handles Indian text, normalize all input to NFC before storing or comparing.
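In Python this is the standard-library unicodedata module. A small sketch with a Devanagari example: क़ (qa) can be stored as the single precomposed U+0958 or as क plus a combining nukta, and the two compare unequal until normalized.

```python
import unicodedata

qa_precomposed = "\u0958"       # क़ as one code point
qa_decomposed = "\u0915\u093C"  # क + combining nukta

# They render identically but compare unequal byte-for-byte:
print(qa_precomposed == qa_decomposed)  # False

# Normalizing both sides before comparing fixes the match. (U+0958 is
# a composition exclusion, so even NFC yields the decomposed form.)
a = unicodedata.normalize("NFC", qa_precomposed)
b = unicodedata.normalize("NFC", qa_decomposed)
print(a == b)  # True
```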

Quick Diagnostics

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Boxes/rectangles where text should be | Missing font | Install Noto or Mangal font |
| Roman letters where Indian script should be | Legacy encoding (Krutidev etc.) | Use a legacy-to-Unicode converter |
| Correct characters but wrong positions (matras floating) | Old software / poor text shaping | Update application or browser |
| Text looks right on screen but garbled in PDF | Font not embedded in PDF | Export PDF with embedded fonts |
| Copy-paste loses some characters | Clipboard encoding issue | Paste as plain text, then reformat |
| Same text looks different on two phones | Different fallback fonts | Use explicit font specification |

The good news: any text you produce today using a modern transliteration tool — including TranslitHub — is standard Unicode. It's searchable, portable, and future-proof. The encoding nightmare described in this article is largely a legacy problem from before Unicode became universal. You'll mostly encounter it when dealing with older documents, older websites, and systems built before 2005.