Text Encoding Explained: UTF-8, ASCII, Unicode & Character Sets
12 min read
Table of Contents
- What Is Character Encoding?
- ASCII: The Foundation of Text Encoding
- Unicode: A Universal Character Set
- UTF-8: The Internet's Standard Encoding
- UTF-8 vs UTF-16 vs UTF-32: Choosing the Right Encoding
- Mojibake and Encoding Problems
- Encoding in HTML and Web Development
- Encoding in Programming Languages
- Base64 Encoding: Binary Data as Text
- Best Practices and Common Pitfalls
- Frequently Asked Questions
- Related Articles
Every time you type a message, save a document, or browse a website, character encoding works behind the scenes to translate human-readable text into binary data that computers understand. Despite being fundamental to all digital communication, encoding remains one of the most misunderstood aspects of computing.
This comprehensive guide explains everything you need to know about text encoding, from the basics of ASCII to the complexities of Unicode and UTF-8. Whether you're a developer debugging encoding issues or simply curious about how computers handle text, you'll find practical insights and solutions here.
What Is Character Encoding?
Character encoding is the system that maps characters—letters, numbers, symbols, and special characters—to numeric values that computers can store and process. When you type the letter "A" on your keyboard, your computer doesn't store the letter itself. Instead, it stores a number (in ASCII, that's 65) and uses the encoding scheme to convert that number back into "A" when displaying it.
Think of character encoding as a translation dictionary between human language and computer language. Without this dictionary, text would be meaningless sequences of bytes with no way to interpret them correctly.
The encoding process works in two directions:
- Encoding: Converting characters into bytes (what happens when you save a file)
- Decoding: Converting bytes back into characters (what happens when you open a file)
Problems arise when the encoding and decoding use different schemes. Imagine if you encrypted a message with one cipher and tried to decrypt it with a different one—you'd get gibberish. The same thing happens with text encoding mismatches, resulting in corrupted characters or the infamous "mojibake" (more on that later).
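These two directions, and what goes wrong when they disagree, can be demonstrated with Python's built-in str.encode and bytes.decode (a minimal sketch):

```python
text = "café"

# Encoding: characters -> bytes (what happens when you save a file)
data = text.encode("utf-8")
print(data)  # b'caf\xc3\xa9'

# Decoding: bytes -> characters (what happens when you open a file)
print(data.decode("utf-8"))  # café

# Decoding with the wrong scheme produces garbled text (mojibake)
print(data.decode("iso-8859-1"))  # cafÃ©
```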
Pro tip: Use our Text Encoder tool to see exactly how different encoding schemes represent the same text. This hands-on approach helps demystify the encoding process.
ASCII: The Foundation of Text Encoding
ASCII (American Standard Code for Information Interchange) was developed in 1963 and became the foundation for modern text encoding. It uses 7 bits to represent 128 characters, which was sufficient for English text and basic computing needs of the era.
The ASCII character set is divided into several ranges, each serving a specific purpose:
| Range | Characters | Count | Purpose |
|---|---|---|---|
| 0-31 | Control characters | 32 | Non-printable commands (tab, newline, carriage return) |
| 32-47 | Punctuation & symbols | 16 | Space, !, ", #, $, %, &, ', (, ), *, +, comma, -, ., / |
| 48-57 | Digits | 10 | 0-9 |
| 58-64 | Punctuation | 7 | :, ;, <, =, >, ?, @ |
| 65-90 | Uppercase letters | 26 | A-Z |
| 91-96 | Punctuation | 6 | [, \, ], ^, _, ` |
| 97-122 | Lowercase letters | 26 | a-z |
| 123-126 | Punctuation | 4 | {, |, }, ~ |
| 127 | Delete | 1 | DEL control character |
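You can explore these numeric values directly with Python's built-in ord() and chr():

```python
# ord() gives the numeric code for a character; chr() reverses it
print(ord("A"))  # 65
print(chr(65))   # A

# Uppercase and lowercase letters differ by a fixed offset of 32
print(chr(ord("A") + 32))  # a

# Control characters occupy codes 0-31
print(ord("\t"))  # 9  (tab)
print(ord("\n"))  # 10 (newline)
```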
The Limitations of ASCII
ASCII works perfectly for English text, but it has severe limitations for international communication:
- No accented characters (é, ñ, ü, ø)
- No characters from non-Latin scripts (Chinese, Arabic, Hebrew, Cyrillic)
- No currency symbols beyond the dollar sign
- No emoji or modern symbols
- No mathematical or scientific notation beyond basic operators
These limitations led to the creation of "extended ASCII" variants like ISO-8859-1 (Latin-1), which used the 8th bit to add 128 more characters. However, different regions created incompatible extensions, causing the same byte values to represent different characters depending on which code page was in use.
ASCII's Lasting Impact
Despite its limitations, ASCII remains relevant today. The first 128 characters of UTF-8 (the dominant modern encoding) are identical to ASCII, ensuring backward compatibility. This means any valid ASCII text is also valid UTF-8, making migration seamless.
ASCII's simplicity also makes it ideal for protocols, file formats, and systems where only basic English text is needed. Programming languages, command-line interfaces, and network protocols still rely heavily on ASCII characters.
Unicode: A Universal Character Set
Unicode was created in 1991 to solve the fundamental problem that ASCII and its extensions couldn't address: representing all the world's writing systems in a single, unified standard. Rather than having dozens of incompatible encoding schemes, Unicode provides one system that works for everyone.
Unicode is not an encoding itself—it's a character set that assigns a unique number called a code point to every character. As of Unicode 15.1 (released in 2023), the standard includes over 149,000 characters covering 161 scripts and symbol sets.
Understanding Code Points
Code points are written in the format U+XXXX, where XXXX is a hexadecimal number. Here are some examples:
- U+0041 = A (Latin capital letter A)
- U+00E9 = é (Latin small letter e with acute)
- U+4E2D = 中 (Chinese character for "middle")
- U+0628 = ب (Arabic letter beh)
- U+1F600 = 😀 (grinning face emoji)
- U+03B1 = α (Greek small letter alpha)
The Unicode code space ranges from U+0000 to U+10FFFF, providing room for 1,114,112 possible code points. These are organized into 17 planes of 65,536 code points each.
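In Python, ord() returns a character's code point, and the plane is just the code point divided by 0x10000 (65,536). A quick sketch, with a helper name chosen for illustration:

```python
def describe(char):
    cp = ord(char)            # the character's Unicode code point
    plane = cp // 0x10000     # each plane holds 65,536 code points
    return f"U+{cp:04X} (plane {plane})"

print(describe("A"))   # U+0041 (plane 0)
print(describe("中"))  # U+4E2D (plane 0)
print(describe("😀"))  # U+1F600 (plane 1)
```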
Unicode Planes
The most important planes include:
- Plane 0 (BMP - Basic Multilingual Plane): U+0000 to U+FFFF. Contains the most commonly used characters from all modern scripts, including Latin, Chinese, Arabic, Hebrew, Cyrillic, and many others. About 55,000 code points are assigned in this plane.
- Plane 1 (SMP - Supplementary Multilingual Plane): U+10000 to U+1FFFF. Contains historic scripts, musical notation, mathematical symbols, and emoji. This is where most emoji live.
- Plane 2 (SIP - Supplementary Ideographic Plane): U+20000 to U+2FFFF. Contains additional Chinese, Japanese, and Korean (CJK) ideographs that didn't fit in the BMP.
- Planes 3-13: Currently unassigned, reserved for future expansion.
- Plane 14 (SSP - Supplementary Special-purpose Plane): Contains special-purpose characters like variation selectors and tags.
- Planes 15-16: Private use areas for custom characters.
Quick tip: Characters in the BMP (Plane 0) can be represented with 16 bits, while characters in other planes require more bits. This distinction is important when choosing between UTF-8, UTF-16, and UTF-32.
Unicode Normalization
One complexity of Unicode is that some characters can be represented in multiple ways. For example, the character "é" can be encoded as:
- A single code point: U+00E9 (precomposed form)
- Two code points: U+0065 (e) + U+0301 (combining acute accent)
Both representations look identical but have different byte sequences. Unicode normalization forms (NFD, NFC, NFKD, NFKC) provide standard ways to convert between these representations, ensuring consistent comparison and searching.
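Python's standard unicodedata module implements these normalization forms, which makes the "é" example easy to verify:

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
combining = "e\u0301"    # e followed by a combining acute accent

# The strings render identically but compare as different
print(precomposed == combining)  # False

# NFC composes; NFD decomposes
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == combining)  # True
```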
UTF-8: The Internet's Standard Encoding
UTF-8 (Unicode Transformation Format - 8-bit) is the most widely used character encoding on the internet, accounting for over 98% of all web pages. It was designed by Ken Thompson and Rob Pike in 1992 and has become the de facto standard for text encoding.
UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character. This clever design provides several advantages, which we'll cover after looking at how it works.
How UTF-8 Works
UTF-8 encodes characters using the following scheme:
| Code Point Range | Bytes | Byte Pattern | Example Characters |
|---|---|---|---|
| U+0000 to U+007F | 1 | 0xxxxxxx | ASCII characters (A, 5, $) |
| U+0080 to U+07FF | 2 | 110xxxxx 10xxxxxx | Latin extended, Greek, Cyrillic (é, α, Ж) |
| U+0800 to U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx | Most Asian scripts, symbols (中, ह, €) |
| U+10000 to U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | Emoji, rare scripts (😀, 𝕳, 𐐷) |
The "x" positions in the byte patterns hold the actual character data. The leading bits indicate how many bytes the character uses, allowing decoders to synchronize correctly even if they start reading mid-stream.
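You can inspect those leading bits yourself by encoding a character and printing each byte in binary (a quick sketch; the helper name is illustrative):

```python
def show_bits(char):
    # One binary string per UTF-8 byte, so the leading-bit pattern is visible
    return " ".join(f"{b:08b}" for b in char.encode("utf-8"))

print(show_bits("A"))   # 01000001  (1 byte: 0xxxxxxx)
print(show_bits("é"))   # 11000011 10101001  (2 bytes: 110xxxxx 10xxxxxx)
print(show_bits("中"))  # 11100100 10111000 10101101  (3 bytes)
```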
Advantages of UTF-8
UTF-8's dominance comes from several key benefits:
- Backward compatibility: ASCII text is valid UTF-8 without any conversion. The first 128 characters use identical byte values.
- Space efficiency: English and code use only 1 byte per character, while still supporting all Unicode characters.
- Self-synchronizing: You can find character boundaries by looking at byte patterns, making error recovery easier.
- No byte order issues: Unlike UTF-16 and UTF-32, UTF-8 doesn't require a byte order mark (BOM) to indicate endianness.
- Null-byte safe: The null byte (0x00) only appears as the NULL character, not as part of multi-byte sequences, making it compatible with C-style strings.
UTF-8 in Practice
Let's see how UTF-8 encodes different characters:
- "A" (U+0041): 1 byte → 0x41
- "é" (U+00E9): 2 bytes → 0xC3 0xA9
- "中" (U+4E2D): 3 bytes → 0xE4 0xB8 0xAD
- "😀" (U+1F600): 4 bytes → 0xF0 0x9F 0x98 0x80
This variable-length approach means that a document containing mostly English text uses far less space than UTF-16 or UTF-32, while still supporting the full Unicode range when needed.
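The size difference is easy to measure by encoding the same string in each scheme (the explicit-endian codecs below are used so no BOM is included in the byte counts):

```python
text_en = "Hello, world!"  # 13 ASCII characters
text_cjk = "中文文本"       # 4 CJK characters

for label, s in [("English", text_en), ("CJK", text_cjk)]:
    print(label,
          len(s.encode("utf-8")),
          len(s.encode("utf-16-le")),
          len(s.encode("utf-32-le")))
# English: 13 bytes in UTF-8, 26 in UTF-16, 52 in UTF-32
# CJK:     12 bytes in UTF-8,  8 in UTF-16, 16 in UTF-32
```

Note that for the CJK sample UTF-16 actually wins, which is exactly the trade-off discussed in the next section.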
Pro tip: Always specify UTF-8 encoding in your HTML documents with <meta charset="UTF-8"> and in HTTP headers with Content-Type: text/html; charset=UTF-8. This prevents browsers from guessing the encoding incorrectly.
UTF-8 vs UTF-16 vs UTF-32: Choosing the Right Encoding
While UTF-8 dominates web content, UTF-16 and UTF-32 have their own use cases. Understanding the differences helps you choose the right encoding for your specific needs.
UTF-16: The Middle Ground
UTF-16 uses 2 or 4 bytes per character. Characters in the BMP (U+0000 to U+FFFF) use 2 bytes, while characters outside the BMP use 4 bytes through a mechanism called surrogate pairs.
Advantages:
- More space-efficient than UTF-8 for Asian languages (Chinese, Japanese, Korean)
- Used internally by Windows, Java, JavaScript, and .NET
- Constant 2-byte width for most common characters simplifies some string operations
Disadvantages:
- Not backward compatible with ASCII
- Requires byte order mark (BOM) or explicit endianness specification
- Less space-efficient for English and code
- Variable-length encoding (due to surrogate pairs) complicates string indexing
- Contains null bytes in normal text, breaking C-style string functions
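The surrogate-pair mechanism is easy to see by encoding a non-BMP character as UTF-16 and splitting the result into 16-bit code units (a small sketch):

```python
# 😀 (U+1F600) lies outside the BMP, so UTF-16 represents it as two
# code units: a high surrogate (0xD800-0xDBFF) and a low one (0xDC00-0xDFFF)
data = "😀".encode("utf-16-be")
units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
print([hex(u) for u in units])  # ['0xd83d', '0xde00']
```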
UTF-32: Fixed Width Simplicity
UTF-32 uses exactly 4 bytes for every character, making it a fixed-width encoding. Each code point maps directly to a 32-bit integer.
Advantages:
- Constant width simplifies string indexing and length calculations
- Direct mapping between code points and encoded values
- No complex decoding logic needed
Disadvantages:
- Extremely space-inefficient (4 bytes per character, even for ASCII)
- Rarely used for storage or transmission
- Not backward compatible with ASCII
- Requires byte order specification
Comparison Table
| Feature | UTF-8 | UTF-16 | UTF-32 |
|---|---|---|---|
| Bytes per character | 1-4 (variable) | 2-4 (variable) | 4 (fixed) |
| ASCII compatibility | Yes | No | No |
| Space efficiency (English) | Excellent | Poor | Very poor |
| Space efficiency (Asian) | Good | Excellent | Poor |
| Byte order issues | None | Yes (BOM needed) | Yes (BOM needed) |
| String indexing | Complex | Moderate | Simple |
| Web usage | 98%+ | <1% | <0.1% |
| Best for | Web, files, interchange | Internal processing | Internal processing |
When to Use Each Encoding
Use UTF-8 for:
- Web pages and APIs
- File storage and data interchange
- Email and text protocols
- Configuration files
- Any content that will be transmitted over networks
Use UTF-16 for:
- Internal string representation in Windows applications
- Java and .NET string processing
- Applications primarily handling Asian languages
- When interfacing with APIs that require UTF-16
Use UTF-32 for:
- Internal processing when you need constant-time character indexing
- Text analysis algorithms that benefit from fixed-width characters
- Temporary buffers during encoding conversion
In practice, UTF-8 is the right choice for almost everything except internal string processing in specific programming environments.
Mojibake and Encoding Problems
Mojibake (文字化け, Japanese for "character transformation") refers to the garbled text that appears when text is decoded using the wrong character encoding. You've probably seen it: "café" becomes "cafÃ©" or "resumé" becomes "resumÃ©".
Common Causes of Encoding Problems
Encoding issues typically occur in these scenarios:
- Missing encoding declaration: When a file or web page doesn't specify its encoding, software must guess, often incorrectly.
- Encoding mismatch: Text saved in one encoding (like UTF-8) is opened with another (like ISO-8859-1).
- Double encoding: Text is encoded twice, such as UTF-8 text being treated as ISO-8859-1 and then re-encoded as UTF-8.
- Truncated multi-byte sequences: A multi-byte character is cut off, leaving incomplete data.
- Copy-paste between systems: Copying text from one application to another with different default encodings.
Recognizing Encoding Problems
Here are telltale signs of encoding issues:
- Accented characters appear as multiple strange characters (é → Ã©)
- Question marks or replacement characters (�) appear instead of text
- Asian characters display as random symbols or boxes
- Curly quotes and dashes become weird character sequences (" → â€œ)
- Emoji appear as multiple boxes or question marks
Fixing Encoding Problems
When you encounter mojibake, try these solutions:
- Identify the actual encoding: Use tools like Text Encoder or file command-line utilities to detect the encoding.
- Re-open with correct encoding: Most text editors let you specify encoding when opening files. Try UTF-8, ISO-8859-1, or Windows-1252.
- Convert the file: Use tools like iconv (command line) or online converters to change the encoding.
- Fix double encoding: If text was double-encoded, you may need to decode it twice or use specialized repair tools.
- Prevent future issues: Always specify UTF-8 encoding in your files, databases, and HTTP headers.
Pro tip: If you see "Ã©" instead of "é", the text is UTF-8 being interpreted as ISO-8859-1. If you see "â€™" instead of "'", that's a UTF-8 curly quote being misread. These patterns help diagnose the specific encoding mismatch.
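This particular UTF-8-read-as-Latin-1 mistake can often be reversed in code: re-encode with the wrong codec to recover the original bytes, then decode correctly. A sketch that works when no bytes were lost along the way:

```python
broken = "cafÃ©"  # UTF-8 bytes that were wrongly decoded as ISO-8859-1

# Undo the bad decode to get the raw bytes back, then decode as UTF-8
repaired = broken.encode("iso-8859-1").decode("utf-8")
print(repaired)  # café
```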
Prevention Strategies
Avoid encoding problems by following these best practices:
- Use UTF-8 everywhere: files, databases, APIs, web pages
- Always declare encoding explicitly in HTML, XML, and HTTP headers
- Configure your text editor to default to UTF-8
- Set database connections to use UTF-8 (utf8mb4 in MySQL)
- Test with international characters during development
- Use encoding-aware string functions in your programming language
Encoding in HTML and Web Development
Proper character encoding is critical for web development. Incorrect encoding causes display issues, breaks forms, and can even create security vulnerabilities.
Declaring Encoding in HTML
Always declare UTF-8 encoding in your HTML documents using the meta charset tag in the <head> section:
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Your Page Title</title>
</head>
<body>
  <!-- Your content -->
</body>
</html>
This meta tag should appear within the first 1024 bytes of your HTML document. Browsers use it to determine how to decode the page content.
HTTP Headers
The HTTP Content-Type header should also specify the encoding:
Content-Type: text/html; charset=UTF-8
If the HTTP header and HTML meta tag disagree, the HTTP header takes precedence. Always ensure they match to avoid confusion.
HTML Entities and Character References
HTML provides two ways to represent special characters:
- Named entities: &copy; for ©, &nbsp; for a non-breaking space, &lt; for <
- Numeric character references: &#169; (decimal) or &#xA9; (hexadecimal) for ©
With UTF-8 encoding, you can use most characters directly without entities. However, you must still escape these HTML special characters:
- < as &lt;
- > as &gt;
- & as &amp;
- " as &quot; (in attributes)
- ' as &#39; or &apos; (in attributes)
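Python's standard html module performs exactly this escaping, as a small illustration:

```python
import html

raw = '<a href="x">Fish & chips</a>'
print(html.escape(raw))
# &lt;a href=&quot;x&quot;&gt;Fish &amp; chips&lt;/a&gt;

# quote=False leaves quotes alone, which is fine for text content
# (but not for attribute values)
print(html.escape('say "hi"', quote=False))  # say "hi"
```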
URL Encoding
URLs can only contain ASCII characters. Non-ASCII characters must be percent-encoded using UTF-8 bytes. For example:
- "café" becomes "caf%C3%A9"
- "中文" becomes "%E4%B8%AD%E6%96%87"
- "hello world" becomes "hello%20world"
Modern browsers handle this automatically, but when constructing URLs programmatically, use proper encoding functions like encodeURIComponent() in JavaScript.
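In Python, urllib.parse provides the equivalent of JavaScript's encodeURIComponent for the examples above:

```python
from urllib.parse import quote, unquote

print(quote("café"))         # caf%C3%A9
print(quote("中文"))         # %E4%B8%AD%E6%96%87
print(quote("hello world"))  # hello%20world

# Decoding reverses the percent-encoding
print(unquote("caf%C3%A9"))  # café
```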
Form Submissions
HTML forms should specify the encoding used for submission:
<form action="/submit" method="POST" accept-charset="UTF-8">
<!-- form fields -->
</form>
The accept-charset attribute tells the browser to encode form data as UTF-8 before submission. Without this, some browsers may use legacy encodings, causing data corruption.
Quick tip: Use our URL Encoder tool to properly encode URLs with special characters, and our HTML Encoder to escape HTML entities correctly.
Encoding in Programming Languages
Different programming languages handle text encoding in different ways. Understanding your language's approach prevents bugs and data corruption.
Python
Python 3 uses Unicode strings by default. All string literals are Unicode, and you must explicitly encode/decode when working with bytes:
# String (Unicode)
text = "Hello, 世界"
# Encode to bytes
utf8_bytes = text.encode('utf-8') # b'Hello, \xe4\xb8\x96\xe7\x95\x8c'
# Decode from bytes
decoded = utf8_bytes.decode('utf-8') # "Hello, 世界"
# Reading files with encoding
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()
Always specify encoding when opening files. The default encoding varies by platform and can cause cross-platform issues.
JavaScript
JavaScript strings are sequences of UTF-16 code units. Characters outside the BMP (like emoji) are represented as surrogate pairs:
// String length counts UTF-16 code units, not characters
"😀".length // 2 (surrogate pair)
"A".length // 1
// Use spread operator or Array.from for correct character counting
[..."😀"].length // 1
Array.from("😀").length // 1
// Encoding/decoding
const encoder = new TextEncoder(); // Always UTF-8
const bytes = encoder.encode("Hello"); // Uint8Array
const decoder = new TextDecoder('utf-8');
const text = decoder.decode(bytes); // "Hello"
Java
Java uses UTF-16 internally for strings. The String class provides methods for encoding and decoding:
// String to bytes
String text = "Hello, 世界";
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
// Bytes to string
String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
// Reading files with encoding
Path path = Paths.get("file.txt");
String content = Files.readString(path, StandardCharsets.UTF_8);
Always use StandardCharsets constants instead of string literals to avoid typos and ensure the encoding is supported.
C/C++
C and C++ don't have built-in Unicode support. You must use libraries like ICU or platform-specific APIs: