Code Pages
The term code page refers to any of the many schemes and standards used to encode the character sets of different languages. The Unicode Standard, which is kept in step with ISO/IEC 10646, unifies the characters covered by these code pages into a single character set. This section lists the code pages, and their corresponding aliases, that are involved in the internationalisation process.
The following table lists examples of languages and their corresponding code pages.
| Language(s) | Code Page(s) |
|---|---|
| English | ASCII |
| French, German | Latin1, Windows1252 |
| Cyrillic | ISO-8859-5, Windows1251 |
| Chinese, Japanese, Korean, and Vietnamese | CJKV, Win950, Win932, Win949, Win1258 |
Code Pages and Aliases
When internationalising an application, you must ensure that data is presented in the character set that the end user expects. This depends on the code page capabilities of the input and output devices involved. Communication is expected to use UTF-8. Applications that handle other character sets must therefore convert their data to and from UTF-8.
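The conversion to and from UTF-8 described above can be sketched as follows. This is a minimal illustration that uses Python's built-in codecs as a stand-in for the product's converters; the `transcode` helper and the sample data are hypothetical and not part of any product API.

```python
def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode bytes from the source code page and re-encode them in the target."""
    text = data.decode(source)   # bytes in the source code page -> Unicode text
    return text.encode(target)   # Unicode text -> bytes in the target code page


legacy = b"caf\xe9"  # "café" as stored by a windows-1252 application

# Inbound: convert the legacy bytes to UTF-8 for communication.
as_utf8 = transcode(legacy, "windows-1252")

# Outbound: convert the UTF-8 bytes back for the legacy application.
round_trip = transcode(as_utf8, "utf-8", "windows-1252")

print(as_utf8, round_trip == legacy)
```

Both directions pass through Unicode text, so the source code page must be known in advance; it cannot be reliably inferred from the bytes alone.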
The International Components for Unicode (ICU) library package provides the following:
- Comprehensive character set conversion framework
- Mapping tables
- Implementation for encodings, including Unicode
The mapping tables originate mostly from the IBM code page repository. For non-IBM code pages, an equivalent code is configured. The textual data format is generic, and data for other code page mapping tables can be added if required.
There is no single, authoritative source of precise definitions for many of the encodings and their names. The following table lists the recommended sources for names and encodings.
| Source | Recommended for |
|---|---|
| Internet Assigned Numbers Authority (IANA) | Names |
| Character Set repository (ICU) | Encoding definitions for each platform |
The following table lists the UTF and ASCII code pages and their corresponding aliases.
| Code Page | Aliases |
|---|---|
| utf-8 | cp1208, ibm-1208, utf-8, UTF8 |
| utf-16be | utf-16be, UTF16_BigEndian |
| utf-16le | utf-16le, UTF16_LittleEndian |
| ISO-10646-UCS-2 | utf-16, ucs-2, cp1200, ibm-1200, utf-16, csUnicode, ISO-10646-UCS-2 |
| utf-32be | utf-32be, UTF32_BigEndian |
| utf-32le | utf-32le, UTF32_LittleEndian |
| ISO-10646-UCS-4 | utf-32, ucs-4, utf-32, csUCS4, ISO-10646-UCS-4 |
| ANSI_X3.4-1968 | us-ascii, iso-ir-6, 646, csASCII, us, iso646-us, ISO_646.irv:1991, ANSI_X3.4-1986, ANSI_X3.4-1968, US-ASCII, ascii-7, ascii, us-ascii, ibm-367 |
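To illustrate how aliases resolve to a single code page, the sketch below uses Python's `codecs.lookup`, which maps each recognised alias to one canonical codec name. Python's alias registry is its own and is not identical to the alias lists in this section, so treat this as an analogy only; `canonical_name` and `same_code_page` are hypothetical helper names.

```python
import codecs


def canonical_name(alias: str) -> str:
    """Return the canonical codec name that Python registers for an alias.

    Raises LookupError if Python does not recognise the alias.
    """
    return codecs.lookup(alias).name


def same_code_page(first: str, second: str) -> bool:
    """Report whether two aliases resolve to the same codec."""
    return canonical_name(first) == canonical_name(second)


for pair in [("utf-8", "UTF8"), ("us-ascii", "csASCII"), ("utf-8", "utf-16le")]:
    print(pair, same_code_page(*pair))
```

Comparing canonical names rather than raw strings avoids treating `UTF8` and `utf-8` as different encodings when configuration values come from different sources.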
The following table lists the Microsoft Windows code pages and their corresponding aliases.
| Code Page | Aliases |
|---|---|
| windows-1250 | windows-1250, cp1250, windows-1250, ibm-5346 |
| windows-1251 | windows-1251, cp1251, windows-1251, ibm-5347 |
| windows-1252 | windows-1252, cp1252, windows-1252, ibm-5348 |
| windows-1253 | windows-1253, cp1253, windows-1253, ibm-5349 |
| windows-1254 | windows-1254, cp1254, windows-1254, ibm-5350 |
| windows-1255 | windows-1255, cp1255, windows-1255, ibm-5351 |
| windows-1256 | windows-1256, cp1256, windows-1256, ibm-5352 |
| windows-1257 | windows-1257, cp1257, windows-1257, ibm-5353 |
| windows-1258 | windows-1258, cp1258, windows-1258, ibm-5354 |
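The Windows code pages are single-byte encodings that assign different characters to the same byte values, which is why the expected code page must be known before data is converted. A minimal sketch, again using Python's built-in codecs as a stand-in and an arbitrarily chosen sample byte:

```python
raw = b"\xe9"  # one byte, three different meanings

# Decode the same byte under three Windows code pages.
decoded = {
    code_page: raw.decode(code_page)
    for code_page in ("windows-1251", "windows-1252", "windows-1253")
}

for code_page, character in decoded.items():
    # Print the Unicode code point so the result is unambiguous on any console.
    print(f"{code_page}: U+{ord(character):04X}")
```

The byte 0xE9 decodes to a Cyrillic letter under windows-1251, to the Latin letter é under windows-1252, and to a Greek letter under windows-1253.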
The following table lists the Latin code pages and their corresponding aliases.
| Code Page | Aliases |
|---|---|
| ISO_8859-1:1987 | iso-8859-1, ANSI_X3.110-1983, l1, ISO_8859-1:1987, cp367, iso-ir-100, csisolatin1, 8859-1, latin1, cp819, ibm-819, iso-8859-1, LATIN_1 |
| ISO_8859-2:1987 | iso-8859-2, l2, ISO_8859-2:1987, iso-ir-101, csisolatin2, 8859-2, latin2, cp912, iso-8859-2, ibm-912 |
| ISO_8859-3:1988 | iso-8859-3, l3, ISO_8859-3:1988, iso-ir-109, csisolatin3, 8859-3, cp913, latin3, iso-8859-3, ibm-913 |
| ISO_8859-4:1988 | iso-8859-4, l4, ISO_8859-4:1988, iso-ir-110, csisolatin4, 8859-4, cp914, latin4, iso-8859-4, ibm-914 |
| ISO_8859-5:1988 | iso-8859-5, ISO_8859-5:1988, iso-ir-144, csisolatincyrillic, 8859-5, cp915, cyrillic, iso-8859-5, ibm-915 |
| ISO_8859-6:1987 | iso-8859-6, asmo-708, ecma-114, ISO_8859-6:1987, iso-ir-127, csisolatinarabic, 8859-6, cp1089, arabic, iso-8859-6, ibm-1089 |
| ISO_8859-7:1987 | iso-8859-7, ISO_8859-7:1987, iso-ir-126, csisolatingreek, 8859-7, ecma-118, elot_928, greek8, greek, iso-8859-7, cp813, ibm-4909 |
| ISO_8859-8:1988 | iso-8859-8, ISO_8859-8:1988, iso-ir-138, csisolatinhebrew, 8859-8, cp916, hebrew, iso-8859-8, ibm-916 |
| ISO_8859-9:1989 | iso-8859-9, l5, ISO_8859-9:1989, iso-ir-148, csisolatin5, 8859-9, cp920, latin5, ECMA-128, iso-8859-9, ibm-920 |
| ISO-8859-15 | csisolatin9, csisolatin0, latin0, 8859-15, cp923, latin9, iso-8859-15, ibm-923 |
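Converting from UTF-8 to one of these code pages can fail when the target has no slot for a character. For example, the euro sign exists in ISO-8859-15 but not in ISO-8859-1. The sketch below, which uses Python's built-in codecs and a hypothetical `can_encode` helper, shows one way to check before converting:

```python
def can_encode(text: str, code_page: str) -> bool:
    """Return True if every character in text exists in the target code page."""
    try:
        text.encode(code_page)
    except UnicodeEncodeError:
        return False
    return True


price = "\u20ac10"  # euro sign followed by "10"

print(can_encode(price, "iso-8859-15"))  # euro sign is available
print(can_encode(price, "iso-8859-1"))   # euro sign is not available

# One possible fallback: substitute "?" for unrepresentable characters.
lossy = price.encode("iso-8859-1", errors="replace")
print(lossy)
```

Substitution loses information, so whether to reject the data, substitute, or choose a different target code page is a design decision for the application.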
The following table lists Japanese, Chinese and Korean code pages and their corresponding aliases.
| Code Page | Aliases |
|---|---|
| ISO-2022 | cp2022, 2022, ISO-2022, ISO_2022 |
| ISO-2022-JP | ISO-2022-JP, csISO2022JP, ISO-2022-JP, ISO_2022,locale=ja,version=0 |
| ISO-2022-KR | ISO-2022-KR, csISO2022KR, ISO-2022-KR, ISO_2022,locale=ko |
| ISO-2022-CN | csISO2022CN, ISO-2022-CN, ISO_2022,locale=zh,version=0 |
| Shift_JIS | x-sjis, windows-31j, csshiftjis, ms_kanji, cp932, cp943, sjis, csWindows31J, Shift_JIS, ibm-943 |
| Big5 | cp950, x-big5, csBig5, Big5, ibm-1370 |
| GB_2312-80 | GB2312, zh_cn, cp936, gb2312-1980, GB2312, gb, chinese, gbk, csISO58GB231280, iso-ir-58, GB_2312-80, ibm-1386 |
| EUC-JP | X-EUC-JP, extended_unix_code_packed_format_for_japanese, eucjis, ibm-eucJP, EUC-JP, ibm-33722 |
| EUC-KR | EUC-KR, csEUCKR, ibm-eucKR, EUC-KR, ibm-970 |
| EUC-TW | cns11643, ibm-eucTW, EUC-TW, ibm-964 |
| EUC-CN | ibm-eucCN, EUC-CN, ibm-1383 |
| KOI8-R | KOI8-R, cskoi8r, koi8, cp878, KOI8-R, ibm-878 |
| korean | ksc, cp949, cp1363, ibm-1363 |
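The East Asian code pages are multi-byte encodings, so the same text occupies a different number of bytes in each, and byte counts cannot be used as character counts. A minimal sketch using Python's built-in codecs, whose definitions of these encodings may differ in detail from the mapping tables used by the product; the sample text is arbitrary:

```python
text = "日本語"  # three characters

# Encode the same text in UTF-8 and in three Japanese encodings.
encoded = {
    name: text.encode(name)
    for name in ("utf-8", "shift_jis", "euc-jp", "iso-2022-jp")
}

for name, data in encoded.items():
    print(f"{name}: {len(data)} bytes, round trip ok: {data.decode(name) == text}")
```

ISO-2022-JP is a stateful encoding that switches character sets with escape sequences, which is why its output is longer than the Shift_JIS and EUC-JP forms of the same text.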