A concise, practical guide to ASCII, its limits, and how Unicode (UTF-8/16/32) solves them. Examples use C.
ASCII (American Standard Code for Information Interchange) encodes characters with 7 bits (0–127).

    +-- zone (3 bits) --+--- digit (4 bits) ---+
    |   b6   b5   b4    |   b3   b2   b1   b0  |

Historically, links and memory often used an extra 8th bit (parity or just 0).
The character set includes control characters, space, punctuation, digits 0–9, and letters A–Z and a–z.

Reading the classic table
'A': zone 100, digit 0001 → 1000001₂ → 0x41 → decimal 65.

| Group | Range (hex) | Notes |
|---|---|---|
| Control chars | 0x00–0x1F, 0x7F | NUL, BEL, CR, LF, ESC, DEL |
| Space | 0x20 | ' ' |
| Digits 0–9 | 0x30–0x39 | '0' = 0x30 |
| Uppercase A–Z | 0x41–0x5A | 'A' = 0x41 |
| Lowercase a–z | 0x61–0x7A | 'a' = 0x61 |
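The zone/digit reading of the table can be checked mechanically. The sketch below is only an illustration: the helper names `ascii_zone`, `ascii_digit`, and `print_zone_digit` are made up for this example and are not part of any standard library.

```c
#include <stdio.h>

/* Hypothetical helper: upper 3 bits (b6..b4) of a 7-bit ASCII code. */
unsigned ascii_zone(unsigned char c) {
    return (c >> 4) & 0x7u;
}

/* Hypothetical helper: lower 4 bits (b3..b0) of the code. */
unsigned ascii_digit(unsigned char c) {
    return c & 0xFu;
}

/* Print the split for one character, with zone and digit shown in decimal. */
void print_zone_digit(unsigned char c) {
    printf("'%c' = 0x%02X: zone %u, digit %u\n",
           c, (unsigned)c, ascii_zone(c), ascii_digit(c));
}
```

For 'A' (0x41) this gives zone 4 (100₂) and digit 1 (0001₂). For the characters '0' to '9' the digit field equals the numeric value, which is why `c - '0'` and `c & 0x0F` agree on ASCII digits.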
Case bit trick
Upper/lower differ by 0x20 (bit 5).
'a' - 'A' == 32, so c ^ 0x20 toggles case when c is an ASCII letter.

Print a character as decimal, hex, and binary:

    #include <stdio.h>

    int main(void) {
        char c = 'A';
        unsigned char u = (unsigned char)c;  // avoid sign extension where char is signed

        printf("char: %c, dec: %d, hex: 0x%02X, bin: ", c, u, (unsigned)u);
        for (int i = 7; i >= 0; --i) {       // most significant bit first
            putchar((u & (1u << i)) ? '1' : '0');
        }
        putchar('\n');
        return 0;
    }

ASCII-only case helpers built on the bit-5 trick:

    #include <stdbool.h>

    // Range checks work because A–Z and a–z are each contiguous in ASCII.
    static inline bool is_upper_ascii(char c) { return c >= 'A' && c <= 'Z'; }
    static inline bool is_lower_ascii(char c) { return c >= 'a' && c <= 'z'; }

    static inline char to_lower_ascii(char c) {
        return is_upper_ascii(c) ? (char)(c | 0x20) : c;   // set bit 5
    }

    static inline char to_upper_ascii(char c) {
        return is_lower_ascii(c) ? (char)(c & ~0x20) : c;  // clear bit 5
    }

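The XOR form of the case trick needs the same letter guard as the set/clear helpers. The following is a minimal sketch; `is_alpha_ascii`, `toggle_case_ascii`, and `toggle_case_str` are names chosen for this example only.

```c
#include <stdbool.h>

/* True only for the 52 ASCII letters. */
bool is_alpha_ascii(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Flip bit 5 (0x20) to swap case; anything that is not a letter passes through. */
char toggle_case_ascii(char c) {
    return is_alpha_ascii(c) ? (char)(c ^ 0x20) : c;
}

/* Toggle the case of every ASCII letter in a NUL-terminated string, in place. */
void toggle_case_str(char *s) {
    for (; *s != '\0'; ++s) {
        *s = toggle_case_ascii(*s);
    }
}
```

The guard matters: without it, '@' (0x40) would become '`' (0x60) and '[' (0x5B) would become '{' (0x7B), because those characters sit next to the letter ranges and also differ only in bit 5.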
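Print the printable ASCII range (32–126), eight entries per row:

    #include <stdio.h>

    int main(void) {
        for (int i = 32; i <= 126; ++i) {
            // Break the line after every eighth entry (32 is a multiple of 8).
            printf("%3d 0x%02X '%c'%s", i, (unsigned)i, (char)i, (i % 8 == 7) ? "\n" : " ");
        }
        printf("\n");
        return 0;
    }
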
ASCII covers 128 symbols. That’s fine for basic English, but it cannot represent accented letters (é, ñ), non-Latin scripts (한글, 日本語), or emoji.
Historic workarounds used the 8th bit to define regional 8-bit code pages with extra characters above 0x7F. These lacked global interoperability.
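One practical consequence of the 7-bit limit is that pure-ASCII data is easy to detect: no byte has its top bit set. A minimal sketch, assuming the input is a raw byte buffer; the function name `is_all_ascii` is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if every byte fits in 7 bits (0x00-0x7F).
 * An empty buffer counts as ASCII. */
bool is_all_ascii(const unsigned char *bytes, size_t len) {
    for (size_t i = 0; i < len; ++i) {
        if (bytes[i] & 0x80u) {   /* bit 7 set: outside the ASCII range */
            return false;
        }
    }
    return true;
}
```

A buffer holding "cafe" passes, while one ending in the byte 0xE9 fails. Deciding what 0xE9 means is exactly the code-page problem: the byte alone does not say which 8-bit character set produced it.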
Unicode assigns a unique code point to each character:
U+0041 'A', U+AC00 '가', U+1F600 '😀', …
Unicode is a catalog of code points; you still need an encoding to store or transmit them: UTF-8, UTF-16, or UTF-32.
Rule of thumb: use UTF-8 unless a legacy interface requires otherwise.
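To make the code point versus encoding distinction concrete, here is a sketch of a UTF-8 encoder for a single code point. The name `utf8_encode` and its return convention (byte count, 0 on invalid input) are choices made for this example, not a standard API.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out[0..3].
 * Returns the number of bytes written (1-4), or 0 if cp cannot be encoded
 * (a surrogate U+D800-U+DFFF, or a value above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp <= 0x7F) {                       /* 0xxxxxxx: same byte as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;                       /* surrogates are not characters */
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond the Unicode range */
}
```

With the three code points listed above, U+0041 encodes as the single byte 41, U+AC00 as EA B0 80, and U+1F600 as F0 9F 98 80. The first case shows why ASCII text is already valid UTF-8.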
ASCII occupies Unicode code points U+0000–U+007F.

    ASCII   : 7-bit (0x00–0x7F). Often stored in 8 bits.
    EBCDIC  : IBM 8-bit alternative; different assignments.
    BCD     : Decimal digits in 4 bits (not a character set).
    Unicode : Universal code points (U+0000…); needs an encoding.
    UTF-8   : 1–4 bytes, ASCII-compatible. Default choice today.
    UTF-16  : 2/4 bytes with surrogates.
    UTF-32  : 4 bytes fixed width.

    0x41 → 0100 0001₂ → 'A'
    0x30–0x39 → '0'…'9'
    0x61 (0110 0001₂) → 'a'; toggle case: 0x61 ^ 0x20 = 0x41 ('A')
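
The "2/4 bytes with surrogates" entry for UTF-16 can be illustrated with a short sketch. The function name `utf16_encode` and its return convention (number of 16-bit units, 0 on invalid input) are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16 into out[0..1].
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp is a
 * surrogate (U+D800-U+DFFF) or lies above U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2]) {
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;                           /* reserved for surrogate pairs */
    }
    if (cp <= 0xFFFF) {                     /* one unit: the code point itself */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp <= 0x10FFFF) {                   /* two units: a surrogate pair */
        uint32_t v = cp - 0x10000;          /* 20 bits remain after the offset */
        out[0] = (uint16_t)(0xD800 | (v >> 10));    /* high surrogate: top 10 bits */
        out[1] = (uint16_t)(0xDC00 | (v & 0x3FF));  /* low surrogate: bottom 10 bits */
        return 2;
    }
    return 0;
}
```

U+0041 and U+AC00 each fit in one 16-bit unit (0x0041 and 0xAC00), while U+1F600 becomes the pair D83D DE00. UTF-32 needs no such step, since every code point is stored directly in one 32-bit unit.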