Representation of Data (3)

Liam · August 25, 2025

[Section 02] representation of data

ASCII to Unicode — How Text Is Stored and Moved


A concise, practical guide to ASCII, its limits, and how Unicode (UTF-8/16/32) solves them. Examples use C.


ASCII in One Page

ASCII (American Standard Code for Information Interchange) encodes characters with 7 bits (0–127).

+-- zone (3 bits) --+--- digit (4 bits) ---+
|  b6    b5    b4   |  b3   b2   b1   b0   |
  • Historically, links and memory often used an extra 8th bit (parity or just 0).

  • The character set includes:

    • Control characters: 0–31 and 127 (e.g., NUL, BEL, CR, LF, ESC, DEL)
    • Printable: space (32), punctuation, digits 0–9, A–Z, a–z

Reading the classic table

  • Column gives the zone (upper 3 bits), row gives the digit (lower 4 bits).
    Example for 'A': zone 100, digit 0001 → 1000001₂ → 0x41 → decimal 65.
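The zone/digit split can be sketched in C with a shift and a mask. The helper names below (ascii_zone, ascii_digit, ascii_join) are made up for illustration and are not part of any standard library.

```c
#include <stdint.h>

// Zone: the upper 3 bits (b6..b4) of a 7-bit ASCII code.
static inline uint8_t ascii_zone(unsigned char c) {
    return (uint8_t)((c >> 4) & 0x07);
}

// Digit: the lower 4 bits (b3..b0).
static inline uint8_t ascii_digit(unsigned char c) {
    return (uint8_t)(c & 0x0F);
}

// Recombine a zone and a digit into the 7-bit code.
static inline unsigned char ascii_join(uint8_t zone, uint8_t digit) {
    return (unsigned char)(((zone & 0x07) << 4) | (digit & 0x0F));
}
```

For 'A' (0x41), ascii_zone gives 4 (100₂) and ascii_digit gives 1 (0001₂); joining them returns 0x41.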

Common Ranges (you’ll use these daily)

Group            Range (hex)        Notes
Control chars    0x00–0x1F, 0x7F    NUL, BEL, CR, LF, ESC, DEL
Space            0x20               ' '
Digits 0–9       0x30–0x39          '0' = 0x30
Uppercase A–Z    0x41–0x5A          'A' = 0x41
Lowercase a–z    0x61–0x7A          'a' = 0x61
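Because each group occupies a contiguous range, classification and digit conversion reduce to range checks and a subtraction. A minimal sketch, with illustrative helper names (not standard functions), that assumes ASCII input:

```c
#include <stdbool.h>

// Digits occupy 0x30..0x39, so a range check is enough.
static inline bool is_digit_ascii(char c) {
    return c >= '0' && c <= '9';
}

// Control characters are 0x00..0x1F plus DEL (0x7F).
static inline bool is_control_ascii(char c) {
    unsigned char u = (unsigned char)c;
    return u <= 0x1F || u == 0x7F;
}

// '0' is 0x30, so subtracting '0' yields the numeric value; -1 if c is not a digit.
static inline int digit_value_ascii(char c) {
    return is_digit_ascii(c) ? c - '0' : -1;
}
```

For example, digit_value_ascii('7') evaluates to 7, and is_control_ascii('\n') is true because LF is 0x0A.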

Case bit trick
Uppercase and lowercase letters differ only in bit 5 (0x20).

  • 'a' - 'A' == 32
  • c ^ 0x20 toggles case when c is an ASCII letter.
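The XOR toggle can be sketched as below. The guard matters: applied to a non-letter, c ^ 0x20 would turn '1' into a control character, for instance. The function names are illustrative.

```c
#include <stdbool.h>

// True only for the ASCII letters A-Z and a-z.
static inline bool is_alpha_ascii(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

// Flip bit 5 to swap case; non-letters pass through unchanged.
static inline char toggle_case_ascii(char c) {
    return is_alpha_ascii(c) ? (char)(c ^ 0x20) : c;
}
```

toggle_case_ascii('a') returns 'A', toggle_case_ascii('Z') returns 'z', and toggle_case_ascii('1') returns '1'.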

C Snippets

Inspect a character (decimal, hex, binary)

#include <stdio.h>

int main(void) {
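    char c = 'A';
    unsigned char u = (unsigned char)c;  // avoid sign extension when char is signed

    // %d expects int and %X expects unsigned int, so cast explicitly.
    printf("char: %c, dec: %d, hex: 0x%02X, bin: ", c, (int)u, (unsigned)u);

    // Print the 8 bits from most significant (bit 7) to least significant (bit 0).
    for (int i = 7; i >= 0; --i) {
        putchar((u & (1u << i)) ? '1' : '0');
    }
    putchar('\n');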
    return 0;
}

Lowercase/uppercase (ASCII only)

#include <stdbool.h>

static inline bool is_upper_ascii(char c) { return c >= 'A' && c <= 'Z'; }
static inline bool is_lower_ascii(char c) { return c >= 'a' && c <= 'z'; }

static inline char to_lower_ascii(char c) {
    return is_upper_ascii(c) ? (char)(c | 0x20) : c;   // set bit 5
}

static inline char to_upper_ascii(char c) {
    return is_lower_ascii(c) ? (char)(c & ~0x20) : c;  // clear bit 5
}

Minimal ASCII table (printables only)

#include <stdio.h>

int main(void) {
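    // Printable ASCII is 32..126. The range starts at a multiple of 8,
    // so each row of 8 entries ends when i % 8 == 7.
    for (int i = 32; i <= 126; ++i) {
        printf("%3d 0x%02X '%c'%s", i, (unsigned)i, (char)i, (i % 8 == 7) ? "\n" : "   ");
    }
    printf("\n");  // terminate the final, partial row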
    return 0;
}

Why ASCII Wasn’t Enough

ASCII covers 128 symbols. That’s fine for basic English, but it cannot represent:

  • Accented Latin letters (é, ü), math symbols, emoji
  • Non-Latin scripts (한국어, 日本語, العربية, …)

Historic workarounds:

  • Extended ASCII (8-bit code pages): incompatible sets above 0x7F
  • EBCDIC (IBM mainframes): different 8-bit layout
  • BCD (Binary-Coded Decimal): encodes digits in 4 bits; not a character set

None of these interoperated globally: the same byte value could mean different characters on different systems.
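To make the BCD entry concrete: packed BCD stores one decimal digit per 4-bit nibble, so a byte holds two digits. The sketch below uses hypothetical helper names and assumes the input value is in the range 0..99.

```c
#include <stdint.h>

// Pack a value in 0..99 into one byte: tens in the high nibble, ones in the low nibble.
// Assumption: the caller guarantees value <= 99.
static inline uint8_t bcd_pack(unsigned value) {
    return (uint8_t)(((value / 10) << 4) | (value % 10));
}

// Reverse of bcd_pack: read each nibble as a decimal digit.
static inline unsigned bcd_unpack(uint8_t bcd) {
    return (unsigned)(bcd >> 4) * 10u + (unsigned)(bcd & 0x0F);
}
```

bcd_pack(42) yields 0x42, so the hex dump of a BCD byte reads like the decimal number. Contrast this with ASCII, where the text "42" is the two bytes 0x34 0x32.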


Unicode: One Code Space for All Writing Systems

Unicode assigns a unique code point to each character:
U+0041 'A', U+AC00 '가', U+1F600 '😀', …

Unicode is a catalog of code points; to store or transmit text you still need an encoding:

UTF-8

  • Variable length (1–4 bytes per code point)
  • ASCII 0x00–0x7F maps to exactly the same single byte → backward compatible
  • Dominant on the web and UNIX-like systems
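A minimal sketch of how a code point becomes 1 to 4 UTF-8 bytes. The function name utf8_encode is illustrative; it follows the standard UTF-8 bit layout and rejects surrogates and values above U+10FFFF, but it is not a full validation library.

```c
#include <stddef.h>
#include <stdint.h>

// Encode one code point into out[0..3].
// Returns the number of bytes written, or 0 if cp is not encodable.
static size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp <= 0x7F) {                       // 0xxxxxxx (identical to ASCII)
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      // 110xxxxx 10xxxxxx
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     // 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;                       // surrogates are not characters
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               // beyond the Unicode code space
}
```

With this sketch, U+0041 encodes to the single byte 0x41, U+AC00 ('가') to EA B0 80, and U+1F600 ('😀') to F0 9F 98 80.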

UTF-16

  • 2 or 4 bytes per code point (code points above U+FFFF use a surrogate pair)
  • Common in Windows and some language runtimes
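The surrogate-pair arithmetic can be sketched as follows: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The name utf16_encode is illustrative, and the output is in code units (uint16_t), leaving byte order to the caller.

```c
#include <stddef.h>
#include <stdint.h>

// Encode one code point into out[0..1] as UTF-16 code units.
// Returns the number of 16-bit units written, or 0 if cp is not encodable.
static size_t utf16_encode(uint32_t cp, uint16_t out[2]) {
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;                                   // reserved for surrogates
    }
    if (cp <= 0xFFFF) {                             // fits in one unit (2 bytes)
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp <= 0x10FFFF) {                           // needs a surrogate pair (4 bytes)
        uint32_t v = cp - 0x10000;                  // 20-bit offset
        out[0] = (uint16_t)(0xD800 | (v >> 10));    // high surrogate: top 10 bits
        out[1] = (uint16_t)(0xDC00 | (v & 0x3FF));  // low surrogate: bottom 10 bits
        return 2;
    }
    return 0;                                       // beyond the Unicode code space
}
```

U+AC00 stays a single unit (0xAC00), while U+1F600 becomes the pair 0xD83D 0xDE00.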

UTF-32

  • Fixed 4 bytes per code point (simple, larger memory footprint)
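To compare footprints, the byte count per code point in each encoding can be computed directly from the code point's value. This is a sketch with illustrative function names; a return value of 0 means the input is not a valid Unicode scalar value.

```c
#include <stddef.h>
#include <stdint.h>

// Valid scalar values: U+0000..U+10FFFF, excluding the surrogate range.
static int is_scalar_value(uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// UTF-8: 1 to 4 bytes depending on magnitude.
static size_t utf8_size(uint32_t cp) {
    if (!is_scalar_value(cp)) return 0;
    return cp <= 0x7F ? 1 : cp <= 0x7FF ? 2 : cp <= 0xFFFF ? 3 : 4;
}

// UTF-16: 2 bytes, or 4 when a surrogate pair is needed.
static size_t utf16_size(uint32_t cp) {
    if (!is_scalar_value(cp)) return 0;
    return cp <= 0xFFFF ? 2 : 4;
}

// UTF-32: always 4 bytes.
static size_t utf32_size(uint32_t cp) {
    return is_scalar_value(cp) ? 4 : 0;
}
```

For 'A' the sizes are 1, 2, and 4 bytes; for '가' (U+AC00) they are 3, 2, and 4; for '😀' (U+1F600) all three encodings need 4 bytes.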

Rule of thumb: use UTF-8 unless a legacy interface requires otherwise.


ASCII vs Unicode in Practice

  • ASCII is a subset of Unicode: U+0000–U+007F.
  • In UTF-8, those code points use the same bytes as ASCII.
  • The moment you need anything beyond ASCII (accents, CJK, emoji), store and serve text as UTF-8.
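One practical consequence: in UTF-8, every byte of a non-ASCII character has its high bit set, so checking whether a buffer is pure ASCII only requires testing bit 7 of each byte. A minimal sketch (the function name is illustrative):

```c
#include <stdbool.h>
#include <stddef.h>

// True if every byte is in 0x00..0x7F, i.e. the buffer is plain ASCII
// (and therefore also valid UTF-8 with one byte per character).
static bool is_all_ascii(const unsigned char *buf, size_t len) {
    for (size_t i = 0; i < len; ++i) {
        if (buf[i] & 0x80) {
            return false;   // high bit set: part of a multi-byte UTF-8 sequence
        }
    }
    return true;
}
```

The bytes of "hello" pass this check, while the two-byte UTF-8 sequence for 'é' (C3 A9) does not.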

Quick Reference

ASCII   : 7-bit (0x00–0x7F). Often stored in 8 bits.
EBCDIC  : IBM 8-bit alternative; different assignments.
BCD     : Decimal digits in 4 bits (not a character set).
Unicode : Universal code points (U+0000…); needs an encoding.
UTF-8   : 1–4 bytes, ASCII-compatible. Default choice today.
UTF-16  : 2/4 bytes with surrogates.
UTF-32  : 4 bytes fixed width.

Decode (practice)

  1. 0x41 (0100 0001₂) → 'A'
  2. 0x30–0x39 → '0'…'9'
  3. 0x61 (0110 0001₂) → 'a'; toggle case: 0x61 ^ 0x20 = 0x41 ('A')