[TIL] HTTP : The Definitive Guide "p370 ~ p372"

시윤·2025년 6월 4일

TIL http

[TIL] Two Pages Per Day

목록 보기

145/162

Chapter 16. Internationalization

(해석 또는 이해가 잘못된 부분이 있다면 댓글로 편하게 알려주세요.)

✏️ 원문 번역

Preface

Every day, billions of people write documents in hundreds of languages. To live up to the vision of a truly world-wide Web, HTTP needs to support the transport and processing of international documents, in many languages and alphabets.

매일 수십억 명의 사람들이 수백 가지의 언어로 된 문서를 작성하고 있습니다.
진정한 World-wide Web의 비전에 부응하기 위해 HTTP는 다양한 언어와 알파벳으로 구성된 국제적인 문서를 전송하고 처리할 수 있도록 지원할 필요가 있습니다.

This chapter covers two primary internationalization issues for the Web: character set encodings and language tags. HTTP applications use character set encodings to request and display text in different alphabets, and they use language tags to describe and restrict content to languages the user understands. We finish with a brief chat about multilingual URIs and dates.

이번 챕터에서는 웹에서의 주요한 두 가지 국제화 이슈에 대해 다룹니다.
바로 문자집합의 인코딩과 언어 태그에 대한 문제입니다.
HTTP 응용 프로그램은 문자 집합 인코딩을 사용하여 다양한 언어의 텍스트를 요청하고 표현합니다.
언어 태그를 사용하면 사용자가 이해할 수 있는 언어로 콘텐츠를 묘사하고 제한할 수 있습니다.
마지막으로는 다국어 URI와 날짜에 대해 짧게 이야기하고 마치도록 하겠습니다.

This chapter:

• Explains how HTTP interacts with schemes and standards for multilingual alphabets

• Gives a rapid overview of the terminology, technology, and standards to help HTTP programmers do things right (readers familiar with character encodings can skip this section)

• Explains the standard naming system for languages, and how standardized language tags describe and select content

• Outlines rules and cautions for international URIs

• Briefly discusses rules for dates and other internationalization issues

이번 챕터에서 다루는 내용은 다음과 같습니다.
- HTTP가 다양한 언어로 된 기술 및 표준과 상호작용하는 방식을 설명합니다.
- HTTP 개발자가 올바른 개발을 진행할 수 있도록 용어, 기술, 표준을 간단히 요약합니다(문자 인코딩에 익숙한 독자들은 해당 섹션을 건너뛰어도 무방합니다).
- 언어에 관한 표준 명명 시스템을 설명하고 표준화된 언어 태그가 콘텐츠를 묘사하고 선택하는 방식에 대해 설명합니다.
- 국제 URI 사용에 관한 규칙과 주의점을 설명합니다.
- 날짜와 기타 국제화 이슈에 대한 규칙을 간략히 논의합니다.

HTTP Support for International Content

HTTP messages can carry content in any language, just as it can carry images, movies, or any other kind of media. To HTTP, the entity body is just a box of bits.

HTTP 메시지는 임의의 언어로 된 콘텐츠를 전송할 수 있습니다.
콘텐츠는 이미지나 영화일 수도 있고, 그 밖의 다른 종류의 미디어일 수도 있습니다.
HTTP에서 엔티티 본문은 비트들의 상자일 뿐입니다.

To support international content, servers need to tell clients about the alphabet and languages of each document, so the client can properly unpack the document bits into characters and properly process and present the content to the user.

국제적인 콘텐츠를 지원하기 위하여 서버는 클라이언트에게 각 문서의 알파벳과 언어에 대해 전달할 필요가 있습니다.
클라이언트가 메시지를 열었을 때 문서의 비트를 적절한 문자로 변환하고 가공하여 사용자에게 콘텐츠를 표현할 수 있게 하기 위함입니다.

Servers tell clients about a document’s alphabet and language with the HTTP Content-Type charset parameter and Content-Language headers. These headers describe what’s in the entity body’s “box of bits,” how to convert the contents into the proper characters that can be displayed on screen, and what spoken language the words represent.

서버는 클라이언트에게 HTTP Content-Type charset 파라미터와 Content-Language 헤더를 통해 문서의 알파벳과 언어를 표현할 수 있습니다.
이러한 헤더들은 엔티티 본문의 "비트 상자"에 무엇이 들어있으며 표현되는 언어는 무엇이고, 어떻게 스크린상의 언어로 적절하게 변환할 수 있는지를 나타냅니다.

At the same time, the client needs to tell the server which languages the user understands and which alphabetic coding algorithms the browser has installed. The client sends Accept-Charset and Accept-Language headers to tell the server which character set encoding algorithms and languages the client understands, and which of them are preferred.

반대로 클라이언트는 사용자가 이해할 수 있는 언어가 무엇이고, 브라우저에 어떤 인코딩 알고리즘이 설치되어 있는지 서버에게 전달해야 합니다.
클라이언트는 Accept-Charset과 Accept-Language 헤더를 사용하여 자신이 이해할 수 있는 문자집합 인코딩 알고리즘과 언어, 선호하는 알고리즘과 언어에 대한 정보까지 서버에 전달할 수 있습니다.

The following HTTP Accept headers might be sent by a French speaker who prefers his native language (but speaks some English in a pinch) and who uses a browser that supports the iso-8859-1 West European charset encoding and the UTF-8 Unicode charset encoding:
Accept-Language: fr, en;q=0.8
Accept-Charset: iso-8859-1, utf-8

다음의 HTTP Accept 헤더는 모국어를 선호하는(하지만 아주 조금은 영어를 사용할 수 있는) 프랑스인에 의해 작성되었을 수 있습니다.
해당 클라이언트는 iso-885901 West European 문자집합 인코딩과 UTF-8 유니코드 문자집합 인코딩을 지원하는 브라우저를 사용하고 있습니다.

The parameter “q=0.8” is a quality factor, giving lower priority to English (0.8) than to French (1.0 by default).

"q=0.8" 파라미터는 Quality Factor입니다.
0.8의 파라미터를 가진 영어는 프랑스어에 비해 우선순위가 낮습니다.

Charater Sets and HTTP

So, let’s jump right into the most important (and confusing) aspects of web internationalization—international alphabetic scripts and their character set encodings.

이제 웹 국제화에서 가장 중요한 측면인 "국제 알파벳 스크립트와 문자집합 인코딩"에 대해 이야기해봅시다.

Web character set standards can be pretty confusing. Lots of people get frustrated when they first try to write international web software, because of complex and inconsistent terminology, standards documents that you have to pay to read, and unfamiliarity with foreign languages. This section and the next section should make it easier for you to use character sets with HTTP.

웹 문자집합 표준은 상당히 혼란스럽습니다.
처음 국제 웹 소프트웨어를 작성하려고 했을 무렵 많은 사람들이 복잡하고 일관되지 않은 용어와 대가를 지불해야 하는 표준 문서, 친숙하지 않은 외국어로 인해 고초를 겪었습니다.
이번 섹션과 다음 섹션을 통해 HTTP와 함께 문자집합을 쉽게 사용할 수 있게 되기를 바랍니다.

Charset Is a Character-to-Bits Encoding

The HTTP charset values tell you how to convert from entity content bits into characters in a particular alphabet. Each charset tag names an algorithm to translate bits to characters (and vice versa). The charset tags are standardized in the MIME character set registry, maintained by the IANA (see http://www.iana.org/assignments/character-sets). Appendix H summarizes many of them.

HTTP 문자집합의 값은 엔티티의 콘텐츠 비트를 특정한 알파벳의 문자로 변환하는 방식을 의미합니다.
각각의 문자집합 태그는 비트를 문자로 변환하기 위한 알고리즘을 명명합니다.
문자집합 태그는 MIME 문자 집합 레지스트리에 표준화되어 있으며 IANA에 의해 유지 및 관리됩니다. (http://www.iana.org/assignments/character-sets)
Appendix H는 이 중 일부를 요약하고 있습니다.

The following Content-Type header tells the receiver that the content is an HTML file, and the charset parameter tells the receiver to use the iso-8859-6 Arabic character set decoding scheme to decode the content bits into characters:
Content-Type: text/html; charset=iso-8859-6

다음의 Content-Type 헤더는 수신자에게 콘텐츠가 HTML 파일이라는 사실을 알립니다.
charset 파라미터는 iso-8859-6 아랍어 문자 집합을 사용하여 콘텐츠 비트를 문자로 디코딩해야 함을 의미합니다.

The iso-8859-6 encoding scheme maps 8-bit values into both the Latin and Arabic alphabets, including numerals, punctuation and other symbols.* For example, in Figure 16-1, the highlighted bit pattern has code value 225, which(under iso-8859-6) maps into the Arabic letter “FEH” (a sound like the English letter “F”).

iso-8859-6 인코딩 기법은 8비트의 값을 숫자, 구두점, 기타 기호를 포함하여 라틴어와 아랍어로 매핑합니다.
예를 들어, Figure 16-1에서 표시된 비트 패턴의 코드는 225입니다.
225는 iso-8859-6상에서 아랍어 문자 "FEH(영문자 "F"와 비슷한 소리를 내는 문자)"와 매핑됩니다.

Some character encodings (e.g., UTF-8 and iso-2022-jp) are more complicated, variable-length codes, where the number of bits per character varies. This type of coding lets you use extra bits to support alphabets with large numbers of characters(such as Chinese and Japanese), while using fewer bits to support standard Latin characters.

UTF-8과 iso-2022-jp과 같은 일부 문자 인코딩은 더욱 복잡한 가변 길이의 코드를 가지고 있습니다.
길이가 가변이라는 것은 문자 하나당 비트의 개수가 상이함을 의미합니다.
이러한 유형의 코드는 추가적인 비트를 사용하여 더 많은 알파벳 문자를 지원할 수 있게 합니다.
표준 라틴 문자는 적은 비트로 지원 가능하지만, 중국어, 일본어 등의 문자 인코딩에는 더 많은 비트가 사용됩니다.

✏️ 요약

Charset

: Character-to-Bits 인코딩

클라이언트 : 이해할 수 있는 언어와 인코딩 알고리즘 전달
- Quality Factor를 활용해 선호도를 표현할 수 있다 (default 1)
- Accept-Language : 허용하는 언어
- Accept-Charset : 인코딩 알고리즘
서버 : 엔티티 본문의 언어와 인코딩 알고리즘 전달
- Content-Language : 엔티티 본문의 언어
- Content-Type: charset : 엔티티 본문에 적용된 인코딩 알고리즘
HTTP Charset은 MIME 문자 집합 레지스트리에 표준화, IANA에서 관리

✏️ 감상

국제화 표준을 위한 노력

예전에 언어론 수업을 들을 때 유니코드 표준을 위한 합의를 보는 데 굉장히 오랜 시간이 걸렸다고 들었다. 대학에서 하는 간단한 팀플조차 의견을 맞추는 것이 어려운데 하물며 국제적인 수준의 합의는 어떻겠는가. 어떤 일이든 항상 공평한 표준을 정하는 것은 어려운 일인 것 같다.

후속 챕터에서 유니코드에 대해서 자세히 다룰지는 모르겠지만, 조만간 '감상' 부분에 유니코드에 대해서도 짧게 남겨보려고 한다!