Text Encoding Explained: UTF-8, ASCII, Unicode & Character Sets


What Is Character Encoding?

Character encoding is the system that maps characters (letters, numbers, symbols) to numbers that computers can store and process. When you type the letter "A", your computer stores the number 65. When it displays that number, it looks up 65 in the encoding table and shows "A".
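The lookup described above can be tried directly in Python. This is a minimal sketch using the built-in ord() and chr() functions; the variable names are only illustrative.

```python
# ord() maps a character to its number; chr() maps the number back.
letter = "A"
number = ord(letter)      # 65
restored = chr(number)    # "A"

print(number, restored)
```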

Without encoding, text would be meaningless sequences of bytes. The encoding tells software how to interpret those bytes as human-readable characters. Problems arise when the sender and receiver use different encodings — the same bytes produce different characters.
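As a small illustration of that mismatch, the sketch below decodes one and the same byte with two different single-byte encodings. The choice of Latin-1 and Windows-1251 is just an example pairing, not something specific to this article.

```python
# One byte, two interpretations: the stored value never changes,
# only the table used to look it up.
raw = bytes([0xE9])

as_latin1 = raw.decode("latin-1")   # Western European table
as_cp1251 = raw.decode("cp1251")    # Cyrillic table

print(as_latin1)  # é
print(as_cp1251)  # й
```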

ASCII

ASCII (American Standard Code for Information Interchange) was created in 1963. It uses 7 bits to represent 128 characters:

Range | Characters | Count
0-31 | Control characters (tab, newline, etc.) | 32
32-47 | Punctuation and symbols | 16
48-57 | Digits 0-9 | 10
65-90 | Uppercase A-Z | 26
97-122 | Lowercase a-z | 26
Other | Brackets, math symbols, etc. | 18

ASCII works perfectly for English but cannot represent characters from other languages — no accented letters (é, ñ), no Chinese characters, no emoji. This limitation led to the creation of extended character sets and eventually Unicode.
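One way to see this limit in practice is to try encoding text as ASCII and catch the failure. The helper name fits_in_ascii below is hypothetical, chosen only for this sketch.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character can be encoded as 7-bit ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # Raised for any character with a code above 127.
        return False

print(fits_in_ascii("cafe"))   # True
print(fits_in_ascii("café"))   # False
```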

Use our Text Encoder to see the ASCII values of any text.

Unicode

Unicode is a universal character set that assigns a unique number (called a code point) to every character in every writing system. As of Unicode 15.1, it includes over 149,000 characters covering 161 scripts.

Code points are written as U+XXXX. For example: U+0041 is "A", U+00E9 is "é", U+4E2D is "中", U+1F600 is "😀".
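The U+XXXX notation is simply the code point written in hexadecimal, padded to at least four digits. A short sketch, assuming a hypothetical helper called code_point:

```python
def code_point(char: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(char):04X}"

for ch in ["A", "é", "中", "😀"]:
    print(ch, code_point(ch))
```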

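Unicode organizes characters into 17 planes of 65,536 code points each, for a total of 1,114,112 code points (U+0000 to U+10FFFF). Plane 0 is the Basic Multilingual Plane (BMP), which holds most characters in everyday use; planes 1 through 16 are the supplementary planes, where emoji and historic scripts live.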
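Because each plane spans exactly 65,536 (0x10000) code points, the plane number falls out of a simple bit shift. The function name plane_of is an assumption made for this sketch.

```python
def plane_of(char: str) -> int:
    """Return the Unicode plane (0-16) that a character belongs to."""
    # Each plane covers 0x10000 code points, so dropping the low 16 bits
    # of the code point leaves the plane number.
    return ord(char) >> 16

print(plane_of("A"))    # 0
print(plane_of("中"))   # 0
print(plane_of("😀"))   # 1
print(17 * 65_536)      # 1114112 code points in total
```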

Unicode defines WHAT characters exist. How they are stored as bytes is determined by the encoding (UTF-8, UTF-16, or UTF-32).
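To make the distinction concrete, the sketch below takes one code point and stores it with each of the three encodings. The big-endian codec variants are used here only so that no byte order mark is added to the output.

```python
char = "é"  # one code point: U+00E9

utf8 = char.encode("utf-8")
utf16 = char.encode("utf-16-be")   # big-endian, no BOM
utf32 = char.encode("utf-32-be")   # big-endian, no BOM

print(utf8.hex())    # c3a9
print(utf16.hex())   # 00e9
print(utf32.hex())   # 000000e9
```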

UTF-8

UTF-8 (Unicode Transformation Format - 8 bit) is the dominant encoding on the web, used by over 98% of websites. It is a variable-length encoding that uses 1 to 4 bytes per character:

Bytes | Code Point Range | Characters
1 byte | U+0000 to U+007F | ASCII (English, digits, basic punctuation)
2 bytes | U+0080 to U+07FF | Latin extensions, Greek, Cyrillic, Arabic, Hebrew
3 bytes | U+0800 to U+FFFF | Chinese, Japanese, Korean, most BMP characters
4 bytes | U+10000 to U+10FFFF | Emoji, historic scripts, rare characters
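The four ranges in the table can be turned into a hand-written encoder. This is an illustrative sketch, not how Python implements it internally: the function name utf8_encode is hypothetical, and it skips validation that a production encoder would need (for example, rejecting surrogate code points U+D800 to U+DFFF). The loop at the end checks the result against Python's built-in encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, following the four UTF-8 ranges."""
    if code_point <= 0x7F:          # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:      # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")

for ch in ["A", "é", "中", "😀"]:
    # Compare against the built-in encoder as a sanity check.
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print(ch, len(utf8_encode(ord(ch))), "byte(s)")
```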

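Key advantages of UTF-8: it is backward compatible with ASCII, it can represent every Unicode character, it has no byte-order issues, and it is space-efficient for Latin text.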

UTF-8 vs UTF-16 vs UTF-32

Feature | UTF-8 | UTF-16 | UTF-32
Bytes per character | 1-4 | 2 or 4 | 4
ASCII compatible | Yes | No | No
English text size | 1x | 2x | 4x
CJK text size | 3x | 2x | 4x
Byte order issues | No | Yes (BOM) | Yes (BOM)
Used by | Web, Linux, macOS | Windows, Java, .NET | Internal processing
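The size figures in the comparison can be checked by encoding a short English string and a short CJK string with each encoding. The sample strings are arbitrary, and the little-endian codec variants are chosen so that no BOM inflates the counts; the last lines show the BOM that Python's plain "utf-16" codec adds.

```python
english = "hello"
cjk = "日本語"

for label, text in [("English", english), ("CJK", cjk)]:
    for codec in ["utf-8", "utf-16-le", "utf-32-le"]:
        size = len(text.encode(codec))
        print(f"{label:8} {codec:10} {size} bytes")

# Without an explicit byte order, Python's "utf-16" codec prepends a 2-byte BOM.
with_bom = "A".encode("utf-16")
print(len(with_bom))  # 4: 2-byte BOM + 2-byte "A"
```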

Mojibake and Encoding Problems

Mojibake (文字化け, from Japanese "character transformation") is garbled text caused by encoding mismatches. Common examples:

Original | Mojibake | Cause
café | cafÃ© | UTF-8 read as Latin-1
日本語 | æ—¥æœ¬èªž | UTF-8 read as Windows-1252
naïve | naÃ¯ve | UTF-8 read as Latin-1

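To fix encoding issues, identify the encoding the text was originally written in, then re-decode the bytes with it. In Python, UTF-8 text that was mistakenly decoded as Latin-1 can usually be repaired with text.encode('latin-1').decode('utf-8'), where text is the garbled string. Use our Unicode Converter to inspect character code points.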
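The round trip can be reproduced end to end: first create the mojibake on purpose, then undo it. This sketch assumes the wrong decoder was Latin-1; if the text was read as Windows-1252 instead, "cp1252" would take its place, and the repair may fail if the garbled text was altered or truncated along the way.

```python
original = "café"

# Simulate the mistake: UTF-8 bytes decoded with the wrong table.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)   # cafÃ©

# Reverse it: recover the raw bytes, then decode them as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # café
```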

Encoding in HTML

Always declare encoding in your HTML:

<meta charset="UTF-8">

This must be within the first 1024 bytes of the document. Place it as the first element in <head>. Without it, browsers guess the encoding, which can cause mojibake.

For special characters, you can use HTML entities: &amp; for &, &lt; for <, &copy; for ©. But with UTF-8, you can type most characters directly.
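When HTML is generated from code, the escaping is usually left to a library. As one example, Python's standard html module escapes the characters that have special meaning in markup and leaves other characters, such as ©, untouched. The sample string is made up for this sketch.

```python
import html

raw = "AT&T <b>©</b>"

# quote=False escapes only &, < and >; quotes are left as-is.
escaped = html.escape(raw, quote=False)
print(escaped)   # AT&amp;T &lt;b&gt;©&lt;/b&gt;

# unescape() turns named entities back into characters.
print(html.unescape("&copy; &amp; &lt;"))  # © & <
```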

Encoding in Programming

Python

# Python 3 strings are Unicode by default
text = "café"
encoded = text.encode('utf-8')    # b'caf\xc3\xa9'
decoded = encoded.decode('utf-8') # 'café'

# Read file with specific encoding
with open('file.txt', encoding='utf-8') as f:
    content = f.read()

JavaScript

// TextEncoder/TextDecoder API
const encoder = new TextEncoder(); // UTF-8 by default
const bytes = encoder.encode("café"); // Uint8Array
const decoder = new TextDecoder('utf-8');
const text = decoder.decode(bytes); // "café"

Base64 Encoding

Base64 encodes binary data as ASCII text using 64 characters (A-Z, a-z, 0-9, +, /). It is not encryption — it is encoding for safe transport.

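Common uses: email attachments (MIME), data URIs in CSS/HTML, JSON Web Tokens (which use the URL-safe Base64url variant), and API payloads. Base64 increases size by approximately 33%, because every 3 bytes of input become 4 output characters.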
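The size overhead is easy to observe with Python's standard base64 module. The 30 input bytes below are arbitrary sample data; any binary input would behave the same way.

```python
import base64

data = bytes(range(30))                 # 30 arbitrary bytes
encoded = base64.b64encode(data)        # ASCII-only bytes

print(encoded.decode("ascii"))
print(len(data), "->", len(encoded))    # 30 -> 40

# Decoding restores the original bytes exactly.
assert base64.b64decode(encoded) == data
```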

Use our Base64 Text tool to encode and decode Base64.

Frequently Asked Questions

What is the difference between Unicode and UTF-8?

Unicode is a character set that assigns a unique number to every character. UTF-8 is an encoding that defines how those numbers are stored as bytes. Unicode defines WHAT; UTF-8 defines HOW.

Why should I use UTF-8?

UTF-8 is backward compatible with ASCII, supports all Unicode characters, is the dominant encoding on the web (98%+), and is space-efficient for Latin text.

What causes mojibake?

Mojibake occurs when text is decoded with a different encoding than it was encoded with. For example, UTF-8 text decoded as Latin-1 turns "café" into "cafÃ©".

What is Base64 encoding used for?

Base64 encodes binary data as ASCII text for safe transport in text-based protocols and formats such as email, URLs (using the URL-safe variant), and JSON. It increases size by about 33%.

How do I set encoding in HTML?

Add <meta charset="UTF-8"> as the first element in your <head> tag. Always use UTF-8 for new web pages.
