Representation of Data (3)

Liam · August 25, 2025


Series: C로 배우는 쉬운 자료구조 (Easy Data Structures with C)


[Section 02] representation of data

ASCII to Unicode — How Text Is Stored and Moved

A concise, practical guide to ASCII, its limits, and how Unicode (UTF-8/16/32) solves them. Examples use C.

ASCII in One Page

ASCII (American Standard Code for Information Interchange) encodes characters with 7 bits (0–127).

+-- zone (3 bits) --+--- digit (4 bits) ---+
 b6      b5     b4      b3  b2  b1   b0

Historically, transmission links and memory stored each character in 8 bits, using the extra bit for parity or leaving it 0.
The character set includes:
- Control characters: 0–31 and 127 (e.g. NUL, BEL, CR, LF, ESC, DEL)
- Printable: space (32), punctuation, digits 0–9, A–Z, a–z

Reading the classic table

Column gives the zone (upper 3 bits), row gives the digit (lower 4 bits).
Example for 'A': zone 100, digit 0001 → 1000001₂ → 0x41 → decimal 65.
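The zone/digit split above can be reproduced with a shift and a mask. This is a minimal sketch; the helper names `ascii_zone`, `ascii_digit`, and `ascii_from_fields` are made up for illustration and are not part of any standard library.

```c
#include <stdint.h>

// Upper 3 bits (b6..b4) of a 7-bit ASCII code: the table column.
static inline uint8_t ascii_zone(unsigned char c) {
    return (uint8_t)((c >> 4) & 0x07);
}

// Lower 4 bits (b3..b0): the table row.
static inline uint8_t ascii_digit(unsigned char c) {
    return (uint8_t)(c & 0x0F);
}

// Rebuild the 7-bit code from its two fields.
static inline unsigned char ascii_from_fields(uint8_t zone, uint8_t digit) {
    return (unsigned char)(((zone & 0x07) << 4) | (digit & 0x0F));
}
```

For 'A' (0x41), `ascii_zone` gives 4 (binary 100) and `ascii_digit` gives 1 (binary 0001), matching the worked example.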

Common Ranges (you’ll use these daily)

Group	Range (hex)	Notes
Control chars	`0x00–0x1F`, `0x7F`	NUL, BEL, CR, LF, ESC, DEL
Space	`0x20`	`' '`
Digits `0–9`	`0x30–0x39`	`'0' = 0x30`
Uppercase `A–Z`	`0x41–0x5A`	`'A' = 0x41`
Lowercase `a–z`	`0x61–0x7A`	`'a' = 0x61`
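Because each group in the table is a contiguous range, classification and digit conversion reduce to simple comparisons and one subtraction. The sketch below assumes an ASCII execution character set; the function names are illustrative, not standard (the standard equivalents live in `<ctype.h>` and are locale-dependent).

```c
#include <stdbool.h>

// '0'..'9' occupy 0x30..0x39, so a range check is enough.
static inline bool is_digit_ascii(char c) { return c >= '0' && c <= '9'; }

// Numeric value of a digit character, or -1 if c is not a digit.
static inline int digit_value(char c) {
    return is_digit_ascii(c) ? c - '0' : -1;
}

// Control characters: 0x00..0x1F plus DEL (0x7F).
static inline bool is_control_ascii(unsigned char c) {
    return c <= 0x1F || c == 0x7F;
}

// Printable characters: space (0x20) through '~' (0x7E).
static inline bool is_printable_ascii(unsigned char c) {
    return c >= 0x20 && c <= 0x7E;
}
```

For example, `digit_value('7')` yields 7, and `is_control_ascii('\n')` is true because LF is 0x0A.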

Case bit trick
Upper/lower differ by 0x20 (bit 5).

'a' - 'A' == 32
c ^ 0x20 toggles case when c is an ASCII letter.
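The "when c is an ASCII letter" condition matters: XOR with 0x20 also changes non-letters (for example '@' becomes '`'). A minimal sketch with the guard built in, using a hypothetical helper name `toggle_case_ascii`:

```c
// Flip bit 5 only for A-Z and a-z; every other byte is returned unchanged.
static inline char toggle_case_ascii(char c) {
    int is_letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    return is_letter ? (char)(c ^ 0x20) : c;
}
```

Here `toggle_case_ascii('a')` gives 'A', `toggle_case_ascii('Q')` gives 'q', and digits or punctuation pass through untouched.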

C Snippets

Print ASCII code of a character

#include <stdio.h>

int main(void) {
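    char c = 'A';
    unsigned char u = (unsigned char)c;  // avoid sign extension if char is signed

    // %d expects an int and %X an unsigned int, so cast explicitly
    printf("char: %c, dec: %d, hex: 0x%02X, bin: ", c, (int)u, (unsigned)u);

    // print the 8 bits, most significant first
    for (int i = 7; i >= 0; --i) {
        putchar((u & (1u << i)) ? '1' : '0');
    }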
    putchar('\n');
    return 0;
}

Lowercase/uppercase (ASCII only)

#include <stdbool.h>

static inline bool is_upper_ascii(char c) { return c >= 'A' && c <= 'Z'; }
static inline bool is_lower_ascii(char c) { return c >= 'a' && c <= 'z'; }

static inline char to_lower_ascii(char c) {
    return is_upper_ascii(c) ? (char)(c | 0x20) : c;   // set bit 5
}

static inline char to_upper_ascii(char c) {
    return is_lower_ascii(c) ? (char)(c & ~0x20) : c;  // clear bit 5
}

Minimal ASCII table (printables only)

#include <stdio.h>

int main(void) {
    for (int i = 32; i <= 126; ++i) {
        printf("%3d 0x%02X '%c'%s", i, i, (char)i, (i % 8 == 7) ? "\n" : "   ");
    }
    printf("\n");
    return 0;
}

Why ASCII Wasn’t Enough

ASCII covers 128 symbols. That’s fine for basic English, but it cannot represent:

- Accented Latin letters (é, ü), math symbols, emoji
- Non-Latin scripts (한국어, 日本語, العربية, …)

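Historic workarounds and contemporaries:

- Extended ASCII (8-bit code pages): many mutually incompatible assignments for 0x80–0xFF
- EBCDIC (IBM mainframes): a different 8-bit layout, not ASCII-compatible
- BCD (Binary-Coded Decimal): encodes each decimal digit in 4 bits; a number format, not a character set

None of these interoperated globally: the same byte value could mean different characters on different systems.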
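To make the BCD entry concrete, here is a small sketch of the packed form, where one byte holds two decimal digits. Packed BCD is one common layout chosen here as an assumption; the helper names `bcd_pack` and `bcd_unpack` are illustrative.

```c
#include <stdint.h>

// Pack a value 0..99 into one byte: tens digit in the high nibble,
// ones digit in the low nibble. Values above 99 are not supported.
static inline uint8_t bcd_pack(unsigned value) {
    return (uint8_t)(((value / 10) << 4) | (value % 10));
}

// Recover the decimal value from a packed BCD byte.
static inline unsigned bcd_unpack(uint8_t bcd) {
    return (unsigned)(bcd >> 4) * 10u + (unsigned)(bcd & 0x0F);
}
```

The decimal value 42 packs to the byte 0x42, so the hex dump reads like the decimal number. That is convenient for digits but says nothing about letters, which is why BCD is not a character set.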

Unicode: One Code Space for All Writing Systems

Unicode assigns a unique code point to each character:
U+0041 'A', U+AC00 '가', U+1F600 '😀', …

Unicode is a catalog of code points; you still need an encoding to store/transmit:

UTF-8

- Variable length (1–4 bytes per code point)
- ASCII 0x00–0x7F maps to exactly the same single byte → backward compatible
- Dominant on the web and UNIX-like systems
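The 1–4 byte layout can be sketched as an encoder for a single code point. This is a minimal illustration, not a validated library routine; `utf8_encode` is a hypothetical name, and the function rejects surrogates and values above U+10FFFF by returning 0.

```c
#include <stddef.h>
#include <stdint.h>

// Encode one code point into out[0..3].
// Returns the number of bytes written, or 0 if cp is not encodable.
static size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp <= 0x7F) {                       // 0xxxxxxx (same byte as ASCII)
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      // 110xxxxx 10xxxxxx
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     // 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // surrogates are not characters
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               // beyond the Unicode code space
}
```

With this sketch, U+0041 'A' encodes to the single byte 41, U+AC00 '가' to EA B0 80, and U+1F600 '😀' to F0 9F 98 80.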

UTF-16

- 2 bytes for code points up to U+FFFF; 4 bytes (a surrogate pair) for anything above
- Common in Windows APIs and some language runtimes (e.g. Java, JavaScript)
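A surrogate pair is computed by subtracting 0x10000 and splitting the remaining 20 bits into two 10-bit halves. The sketch below shows that arithmetic; `utf16_encode` is an illustrative name, and the output is in 16-bit code units, independent of byte order.

```c
#include <stddef.h>
#include <stdint.h>

// Encode one code point into out[0..1] as UTF-16 code units.
// Returns the number of units written (1 or 2), or 0 if cp is not encodable.
static size_t utf16_encode(uint32_t cp, uint16_t out[2]) {
    if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // reserved for surrogates
        out[0] = (uint16_t)cp;                       // one 16-bit unit
        return 1;
    }
    if (cp <= 0x10FFFF) {
        uint32_t v = cp - 0x10000;                   // 20-bit value
        out[0] = (uint16_t)(0xD800 | (v >> 10));     // high surrogate: top 10 bits
        out[1] = (uint16_t)(0xDC00 | (v & 0x3FF));   // low surrogate: bottom 10 bits
        return 2;
    }
    return 0;                                        // beyond the Unicode code space
}
```

U+AC00 '가' stays a single unit (0xAC00), while U+1F600 '😀' becomes the pair 0xD83D 0xDE00.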

UTF-32

- Fixed 4 bytes per code point (simple, but the largest memory footprint)

Rule of thumb: use UTF-8 unless a legacy interface requires otherwise.

ASCII vs Unicode in Practice

- ASCII is a subset of Unicode: U+0000–U+007F.
- In UTF-8, those code points use the same bytes as ASCII.
- The moment you need anything beyond ASCII (accents, CJK, emoji), store and serve text as UTF-8.
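Two practical consequences follow: a string is pure ASCII exactly when no byte has its top bit set, and in UTF-8 the byte count and the character count diverge as soon as non-ASCII text appears. A minimal sketch of both checks, assuming NUL-terminated and already valid UTF-8 input (the function names are illustrative):

```c
#include <stdbool.h>
#include <stddef.h>

// True if every byte is below 0x80, i.e. the text is plain ASCII.
static bool is_all_ascii(const char *s) {
    for (; *s; ++s) {
        if ((unsigned char)*s >= 0x80) return false;
    }
    return true;
}

// Count code points by skipping continuation bytes (10xxxxxx).
// Does not validate the input.
static size_t utf8_count_code_points(const char *s) {
    size_t n = 0;
    for (; *s; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For the UTF-8 bytes of "café" (63 61 66 C3 A9), `is_all_ascii` returns false and `utf8_count_code_points` returns 4 even though the string occupies 5 bytes.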

Quick Reference

ASCII   : 7-bit (0x00–0x7F). Often stored in 8 bits.
EBCDIC  : IBM 8-bit alternative; different assignments.
BCD     : Decimal digits in 4 bits (not a character set).
Unicode : Universal code points (U+0000…); needs an encoding.
UTF-8   : 1–4 bytes, ASCII-compatible. Default choice today.
UTF-16  : 2/4 bytes with surrogates.
UTF-32  : 4 bytes fixed width.

Decode (practice)

0x41 → 0100 0001₂ → 'A'
0x30–0x39 → '0'…'9'
0x61 (0110 0001₂) → 'a'; toggle case: 0x61 ^ 0x20 = 0x41 ('A')

Liam

System Software Engineer
