Text Encoding Explained: UTF-8, ASCII, Unicode & Character Sets
What Is Character Encoding?
Character encoding is the system that maps characters (letters, numbers, symbols) to numbers that computers can store and process. When you type the letter "A", your computer stores the number 65. When it displays that number, it looks up 65 in the encoding table and shows "A".
Without encoding, text would be meaningless sequences of bytes. The encoding tells software how to interpret those bytes as human-readable characters. Problems arise when the sender and receiver use different encodings — the same bytes produce different characters.
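In Python, for instance, the built-ins `ord()` and `chr()` expose this character-to-number mapping directly:

```python
# ord() maps a character to its number; chr() maps the number back
assert ord("A") == 65   # typing "A" stores the number 65
assert chr(65) == "A"   # displaying 65 looks up "A" in the table
```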
ASCII
ASCII (American Standard Code for Information Interchange) was created in 1963. It uses 7 bits to represent 128 characters:
| Range | Characters | Count |
|---|---|---|
| 0-31 | Control characters (tab, newline, etc.) | 32 |
| 32-47 | Punctuation and symbols | 16 |
| 48-57 | Digits 0-9 | 10 |
| 65-90 | Uppercase A-Z | 26 |
| 97-122 | Lowercase a-z | 26 |
| 58-64, 91-96, 123-126 | Brackets, math symbols, etc. | 17 |
| 127 | DEL (control) | 1 |
ASCII works perfectly for English but cannot represent characters from other languages — no accented letters (é, ñ), no Chinese characters, no emoji. This limitation led to the creation of extended character sets and eventually Unicode.
Use our Text Encoder to see the ASCII values of any text.
Unicode
Unicode is a universal character set that assigns a unique number (called a code point) to every character in every writing system. As of Unicode 15.1, it includes over 149,000 characters covering 161 scripts.
Code points are written as U+XXXX. For example: U+0041 is "A", U+00E9 is "é", U+4E2D is "中", U+1F600 is "😀".
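The U+XXXX notation is just the code point in hexadecimal, which you can confirm in Python:

```python
# Print each character's code point in the standard U+XXXX notation
for ch in ["A", "é", "中", "😀"]:
    print(f"{ch} = U+{ord(ch):04X}")

# And go the other way, from code point to character
assert chr(0x1F600) == "😀"
```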
Unicode organizes characters into 17 planes of 65,536 code points each:
- Plane 0 (BMP) — Basic Multilingual Plane: most common characters (Latin, Chinese, Arabic, etc.)
- Plane 1 (SMP) — Supplementary: emoji, historic scripts, musical notation
- Plane 2 (SIP) — Supplementary Ideographic: rare CJK characters
- Plane 3 (TIP) — Tertiary Ideographic: additional rare and historic CJK characters
- Planes 4-13 — Unassigned, reserved for future use
- Plane 14 (SSP) — Supplementary Special-purpose: tag and variation-selector characters
- Planes 15-16 — Private Use: never assigned by Unicode
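Since every plane holds exactly 65,536 (0x10000) code points, a character's plane is its code point divided by 65,536. A quick sketch:

```python
def plane(ch: str) -> int:
    # Each Unicode plane spans 0x10000 (65,536) code points
    return ord(ch) // 0x10000

assert plane("A") == 0    # U+0041 is in the BMP
assert plane("中") == 0   # U+4E2D is also in the BMP
assert plane("😀") == 1   # U+1F600 is in the SMP
```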
Unicode defines WHAT characters exist. How they are stored as bytes is determined by the encoding (UTF-8, UTF-16, or UTF-32).
UTF-8
UTF-8 (Unicode Transformation Format - 8 bit) is the dominant encoding on the web, used by over 98% of websites. It is a variable-length encoding that uses 1 to 4 bytes per character:
| Bytes | Code Point Range | Characters |
|---|---|---|
| 1 byte | U+0000 to U+007F | ASCII (English, digits, basic punctuation) |
| 2 bytes | U+0080 to U+07FF | Latin extensions, Greek, Cyrillic, Arabic, Hebrew |
| 3 bytes | U+0800 to U+FFFF | Chinese, Japanese, Korean, most BMP characters |
| 4 bytes | U+10000 to U+10FFFF | Emoji, historic scripts, rare characters |
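Each row of the table can be checked with Python's `str.encode`, which returns the raw UTF-8 bytes:

```python
# One example character from each row of the UTF-8 table
for ch in ["A", "é", "中", "😀"]:
    print(ch, "->", len(ch.encode("utf-8")), "byte(s)")
```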
Key advantages of UTF-8:
- Backward compatible with ASCII — Any valid ASCII text is also valid UTF-8
- No byte order issues — Unlike UTF-16, no BOM needed
- Self-synchronizing — Continuation bytes are marked distinctly, so you can find the nearest character boundary from any byte position
- Space efficient for Latin text — English uses 1 byte per character, same as ASCII
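The self-synchronizing property follows from the byte layout: every continuation byte matches the bit pattern `10xxxxxx`, so any byte that does not match starts a new character. A minimal sketch:

```python
data = "a中😀".encode("utf-8")  # 1 + 3 + 4 = 8 bytes

# A byte is a continuation byte iff its top two bits are 10
starts = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]
print(starts)  # [0, 1, 4]: the first byte of each character
```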
UTF-8 vs UTF-16 vs UTF-32
| Feature | UTF-8 | UTF-16 | UTF-32 |
|---|---|---|---|
| Bytes per char | 1-4 | 2 or 4 | 4 |
| ASCII compatible | Yes | No | No |
| English text size | 1x | 2x | 4x |
| CJK text size | 3x | 2x | 4x |
| Byte order issues | No | Yes (BOM) | Yes (BOM) |
| Used by | Web, Linux, macOS | Windows, Java, .NET | Internal processing |
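The size ratios in the table are easy to verify in Python; the `-le` codec variants encode without a BOM, so only the character data is counted:

```python
# English text: 1x / 2x / 4x; CJK text: 3x / 2x / 4x
for label, text in [("English", "hello"), ("CJK", "中文字")]:
    for enc in ["utf-8", "utf-16-le", "utf-32-le"]:
        print(label, enc, len(text.encode(enc)), "bytes")
```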
Mojibake and Encoding Problems
Mojibake (文字化け, Japanese for "character garbling") is garbled text caused by encoding mismatches. Common examples:
| Original | Mojibake | Cause |
|---|---|---|
| café | café | UTF-8 read as Latin-1 |
| 日本語 | æ—¥æœ¬èªž | UTF-8 read as Latin-1 |
| naïve | na��ve | UTF-8 read as ASCII (invalid bytes replaced) |
To fix encoding issues: identify the original encoding, then re-decode correctly. In Python: `text.encode('latin-1').decode('utf-8')`. Use our Unicode Converter to inspect character code points.
Encoding in HTML
Always declare encoding in your HTML:
```html
<meta charset="UTF-8">
```
This must be within the first 1024 bytes of the document. Place it as the first element in <head>. Without it, browsers guess the encoding, which can cause mojibake.
For special characters, you can use HTML entities: `&amp;` for &, `&lt;` for <, `&copy;` for ©. But with UTF-8, you can type most characters directly.
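Python's standard `html` module produces these entities for the characters that must be escaped; a minimal sketch:

```python
import html

# Only markup-significant characters are escaped; é passes through as UTF-8
escaped = html.escape("café & <tags>")
print(escaped)  # café &amp; &lt;tags&gt;
```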
Encoding in Programming
Python
```python
# Python 3 strings are Unicode by default
text = "café"
encoded = text.encode('utf-8')     # b'caf\xc3\xa9'
decoded = encoded.decode('utf-8')  # 'café'

# Read a file with a specific encoding
with open('file.txt', encoding='utf-8') as f:
    content = f.read()
```
JavaScript
```javascript
// TextEncoder/TextDecoder API
const encoder = new TextEncoder();        // UTF-8 by default
const bytes = encoder.encode("café");     // Uint8Array
const decoder = new TextDecoder('utf-8');
const text = decoder.decode(bytes);       // "café"
```
Base64 Encoding
Base64 encodes binary data as ASCII text using 64 characters (A-Z, a-z, 0-9, +, /), with = as padding. It is not encryption; it is a reversible encoding for safe transport.
Common uses: email attachments (MIME), data URIs in CSS/HTML, JWT tokens, API payloads. Base64 increases size by approximately 33%.
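Python's standard `base64` module shows both the round trip and the roughly 33% size overhead (every 3 input bytes become 4 output characters):

```python
import base64

data = "café ☕".encode("utf-8")   # 9 bytes of binary payload
encoded = base64.b64encode(data)   # ASCII-safe bytes
decoded = base64.b64decode(encoded)

assert decoded == data             # lossless round trip
print(len(data), "->", len(encoded))  # 9 -> 12, ~33% larger
```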
Use our Base64 Text tool to encode and decode Base64.
Frequently Asked Questions
What is the difference between Unicode and UTF-8?
Unicode is a character set that assigns a unique number to every character. UTF-8 is an encoding that defines how those numbers are stored as bytes. Unicode defines WHAT; UTF-8 defines HOW.
Why should I use UTF-8?
UTF-8 is backward compatible with ASCII, supports all Unicode characters, is the dominant encoding on the web (98%+), and is space-efficient for Latin text.
What causes mojibake?
Mojibake occurs when text is decoded with a different encoding than it was encoded with. For example, UTF-8 text decoded as Latin-1 turns "café" into "café".
What is Base64 encoding used for?
Base64 encodes binary data as ASCII text for safe transport in text-based protocols like email, URLs, and JSON. It increases size by about 33%.
How do I set encoding in HTML?
Add `<meta charset="UTF-8">` as the first element in your `<head>` tag. Always use UTF-8 for new web pages.