Code page

A code page is another term for a character encoding. It consists of a table of values that describes the character set for a particular language. The term originated with IBM's EBCDIC-based mainframe systems,^[1] but many vendors use it, including Microsoft, SAP,^[2] and Oracle Corporation.^[3] Vendors often allocate their own code page number to a character encoding, even if it is better known by another name (for example, the UTF-8 character encoding has code page number 1208 at IBM, 65001 at Microsoft, and 4110 at SAP).

Contents

1 The code page numbering system

1.1 Relationship to ASCII

1.2 Relationship to Unicode

2 Noteworthy code pages

2.1 IBM PC (OEM) code pages

2.2 Code pages for DBCS character sets

2.3 Microsoft code page numbers for various other character encodings

2.4 Miscellaneous

2.5 Windows (ANSI) code pages

3 Criticism

4 Private code pages

5 See also

6 References

7 External links

The code page numbering system

IBM introduced the concept of systematically assigning a small but globally unique 16-bit number to each character encoding that a computer system or collection of computer systems might encounter. The IBM origin of the numbering scheme is reflected in the fact that the smallest (first) numbers are assigned to variations of IBM's EBCDIC encoding, and slightly larger numbers refer to variations of IBM's extended ASCII encoding as used in its PC hardware.

With the release of PC-DOS version 3.3 (and the near identical MS-DOS 3.3) IBM introduced the code page numbering system to regular PC users, as the code page numbers (and the phrase "code page") were used in new commands to allow the character encoding used by all parts of the OS to be set in a systematic way.^[4]

After IBM and Microsoft ceased to cooperate in the 1990s, the two companies maintained their lists of assigned code page numbers independently of each other, resulting in some conflicting assignments. At least one third-party vendor (Oracle) also has its own, different list of numeric assignments.^[5] IBM's current assignments are listed in its CCSID repository. Microsoft's assignments seem not to be documented anywhere, but a list of the names and approximate IANA abbreviations for the installed code pages on any given Windows machine can be found in the Registry on that machine (this information is used by Microsoft programs such as Internet Explorer).

Most well-known code pages, excluding those for the CJK languages and Vietnamese, fit all their code-points into 8 bits and do not involve anything more than mapping each code-point to a single character; furthermore, techniques such as combining characters, complex scripts, etc., are not involved.

The text mode of standard (VGA-compatible) PC graphics hardware is built around an 8-bit code page, though it is possible to use two at once at the cost of some color depth, and up to eight may be stored in the display adapter for easy switching. A selection of third-party code page fonts could be loaded into such hardware. It is now commonplace, however, for operating system vendors to provide their own character encoding and rendering systems that run in a graphics mode and bypass this hardware limitation entirely. The system of referring to character encodings by a code page number nevertheless remains applicable, as an efficient alternative to string identifiers such as those specified by the IETF and IANA for use in protocols such as e-mail and web pages.

Relationship to ASCII

The vast majority of code pages in current use are supersets of ASCII, a 7-bit code representing 128 control codes and printable characters. In the distant past, 8-bit implementations of the ASCII code set the top bit to zero or used it as a parity bit in network data transmissions. When the top bit was made available for representing character data, a total of 256 characters and control codes could be represented. Most vendors (including IBM) used this extended range to encode characters used by various languages and graphical elements that allowed the imitation of primitive graphics on text-only output devices. No formal standard existed for these ‘extended character sets’ and vendors referred to the variants as code pages, as IBM had always done for variants of EBCDIC encodings.
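The superset relationship can be checked directly. The following is a minimal sketch using Python's built-in codecs; the three code pages chosen and the sample byte 0xE9 are arbitrary illustrations, not part of any standard test.

```python
# Compare how a few ASCII-superset code pages decode the same bytes.
CODE_PAGES = ["cp437", "cp866", "cp1252"]

# The 128 byte values covered by 7-bit ASCII.
low_half = bytes(range(0x80))
ascii_text = low_half.decode("ascii")

# Every listed code page decodes the low half exactly as ASCII does.
low_half_agrees = all(low_half.decode(cp) == ascii_text for cp in CODE_PAGES)

# A byte with the top bit set means something different in each code page.
high_byte = b"\xe9"
high_byte_meanings = {cp: high_byte.decode(cp) for cp in CODE_PAGES}

print(low_half_agrees)
print(high_byte_meanings)
```

The same byte yields a Greek letter, a Cyrillic letter, or an accented Latin letter depending on which code page is assumed, which is why the code page must be known before 8-bit text can be interpreted.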

Relationship to Unicode

Unicode is an effort to include all characters from previous code pages in a single character enumeration that can be used with a number of encoding schemes. In the process, duplicate characters are eliminated and new variants are introduced, such as Fullwidth ASCII. Consistent use of any single Unicode encoding would theoretically eliminate the need to keep track of different code pages or character encodings, but multiple encodings of Unicode exist, and compatibility with existing documents and systems that use the older encodings must still be maintained. In practice the various Unicode character set encodings have simply been assigned their own code page numbers, and all the other code pages have been technically redefined as encodings for various subsets of Unicode.
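Treating each code page as an encoding of a Unicode subset means conversion between two code pages can go through Unicode as an intermediate form. A minimal sketch with Python's built-in codecs follows; the helper name `transcode` is an assumption made for this illustration.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode data from one code page to another by way of Unicode."""
    text = data.decode(source)   # bytes -> Unicode code points
    return text.encode(target)   # Unicode code points -> bytes


# The letter "ø" is byte 0x9B in code page 865 and byte 0xF8 in Windows-1252.
converted = transcode(b"\x9b", "cp865", "cp1252")

# Code page 437 has a Greek theta at 0xE9, which is outside the subset of
# Unicode that Windows-1252 covers, so the conversion fails.
try:
    transcode(b"\xe9", "cp437", "cp1252")
    theta_fits = True
except UnicodeEncodeError:
    theta_fits = False

print(converted, theta_fits)
```

The failure case shows the "subset" aspect: a conversion only succeeds when every character of the source text is also in the repertoire of the target code page.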

Noteworthy code pages

IBM PC (OEM) code pages

These code pages were originally embedded directly in the text mode hardware of the graphics adapters used with the IBM PC and its clones, including the original MDA and CGA adapters, whose character sets could only be changed by physically replacing a ROM chip that contained the font. The interface of those adapters (emulated by all later adapters such as VGA) was typically limited to single-byte character sets with only 256 characters in each font/encoding (although VGA added partial support for slightly larger character sets). Since the original IBM PC code page (number 437) was not really designed for international use, several partially compatible country-specific or region-specific variants emerged. Microsoft refers to these as the OEM code pages because they were defined by the OEMs who licensed MS-DOS for distribution with their hardware, not by Microsoft or a standards body. Examples include:

437 — The original IBM PC code page

720 — Arabic

737 — Greek

775 — Estonian, Lithuanian and Latvian

850 — "Multilingual (Latin-1)" (Western European languages)

852 — "Slavic (Latin-2)" (Central and Eastern European languages)

855 — Cyrillic

857 — Turkish

858 — "Multilingual" with euro symbol

860 — Portuguese

861 — Icelandic

862 — Hebrew

863 — French (Quebec French)

865 — Danish/Norwegian. Differs from 437 in three positions: ø replaces ¢, Ø replaces ¥, and ¤ replaces »

866 — Cyrillic

869 — Greek

874 — Thai^[6]

When dealing with older hardware, protocols and file formats, it is often necessary to support these code pages, but use of newer code pages, in particular Unicode, is encouraged for new designs.
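Because the regional variants are only partially compatible with code page 437, it can help to list exactly which byte values differ. The sketch below does this with Python's built-in codecs; the function name `differing_bytes` is an assumption for this illustration, and the code relies on both code pages defining all 256 byte values, as these two do.

```python
def differing_bytes(page_a: str, page_b: str) -> dict:
    """Map each byte value that decodes differently to its two characters."""
    result = {}
    for value in range(256):
        raw = bytes([value])
        char_a = raw.decode(page_a)
        char_b = raw.decode(page_b)
        if char_a != char_b:
            result[value] = (char_a, char_b)
    return result


diff = differing_bytes("cp437", "cp865")
for value, (char_437, char_865) in diff.items():
    print(f"0x{value:02X}: {char_437!r} in 437, {char_865!r} in 865")
```

The same function can be pointed at any other pair of single-byte code pages from the list, provided neither leaves byte values undefined.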

Code pages for DBCS character sets

These code pages represent DBCS character encodings for various CJK languages. In Microsoft operating systems, these are used as both the "OEM" and "ANSI" code page for the applicable locale.

932 — Supports Japanese

936 — GBK; supports Simplified Chinese

949 — Supports Korean

950 — Supports Traditional Chinese
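In these encodings an ASCII character still occupies one byte, while a CJK character typically occupies two. The following sketch measures this with Python's built-in codecs; the helper `byte_lengths` and the sample characters are assumptions chosen for illustration.

```python
def byte_lengths(text: str, code_page: str) -> list:
    """Return the encoded length in bytes of each character in text."""
    return [len(ch.encode(code_page)) for ch in text]


# One ASCII letter followed by one character from the relevant script.
samples = {
    "cp932": "Aあ",  # Japanese
    "cp936": "A中",  # Simplified Chinese
    "cp949": "A한",  # Korean
    "cp950": "A中",  # Traditional Chinese
}

for code_page, text in samples.items():
    print(code_page, byte_lengths(text, code_page), text.encode(code_page))
```

Because character width varies, code that walks such text byte by byte has to track whether it is currently at a lead byte or a trail byte.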

Microsoft code page numbers for various other character encodings

The following code page numbers are specific to Microsoft Windows. IBM may use different numbers for these code pages.

1200 — UTF-16LE Unicode little-endian

1201 — UTF-16BE Unicode big-endian

65000 — UTF-7 Unicode

65001 — UTF-8 Unicode

10000 — Macintosh Roman encoding (followed by several other Mac character sets)

10007 — Macintosh Cyrillic encoding

10029 — Macintosh Central European encoding

20127 — US-ASCII, the classic US 7-bit character set with no character value larger than 127

28591 — ISO-8859-1 (followed by ISO-8859-2 to ISO-8859-15)
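A program that receives a Microsoft code page number usually has to translate it into whatever name its own runtime uses. The table below is a minimal sketch that pairs the numbers listed above with Python codec names; the pairing is this example's assumption (in particular that `mac_latin2` corresponds to Macintosh Central European), and `decode_by_number` is a hypothetical helper.

```python
# Microsoft code page number -> Python codec name (assumed correspondence).
WINDOWS_CODE_PAGE_TO_CODEC = {
    1200: "utf-16-le",
    1201: "utf-16-be",
    65000: "utf-7",
    65001: "utf-8",
    10000: "mac_roman",
    10007: "mac_cyrillic",
    10029: "mac_latin2",
    20127: "ascii",
    28591: "iso-8859-1",
}


def decode_by_number(data: bytes, code_page: int) -> str:
    """Decode data using the codec registered for a code page number."""
    try:
        codec = WINDOWS_CODE_PAGE_TO_CODEC[code_page]
    except KeyError:
        raise LookupError(f"no codec known for code page {code_page}") from None
    return data.decode(codec)


print(decode_by_number(b"A\x00", 1200))  # little-endian: low byte first
print(decode_by_number(b"\x00A", 1201))  # big-endian: high byte first
```

Numbers that are not in the table raise `LookupError` rather than falling back to a default, since guessing a code page silently tends to corrupt text.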

Miscellaneous

(number missing) — ASMO449+ Supports Arabic

(number missing) — MIK Supports Bulgarian and Russian as well

Windows (ANSI) code pages

Microsoft defined a number of code pages known as the ANSI code pages (so called because the first one, 1252, was based on an apocryphal ANSI draft of what became ISO 8859-1). Code page 1252 is built on ISO 8859-1 but uses the range 0x80-0x9F for extra printable characters rather than the C1 control codes used in ISO 8859-1. Some of the others are based in part on other parts of ISO 8859 but are often rearranged to make them closer to 1252.

1250 — Central and East European Latin

1251 — Cyrillic

1252 — West European Latin

1253 — Greek

1254 — Turkish

1255 — Hebrew

1256 — Arabic

1257 — Baltic

1258 — Vietnamese

874 — Thai

Microsoft recommends applications use UTF-8 or UCS-2/UTF-16 instead of these code pages.^[7]
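The difference between code page 1252 and ISO 8859-1 in the range 0x80-0x9F can be inspected byte by byte. This is a sketch based on how Python's `cp1252` and `latin-1` codecs define the two encodings; Python treats a few positions in 1252 as undefined, which other implementations may handle differently. The function name `classify_high_control_range` is an assumption for this example.

```python
import unicodedata


def classify_high_control_range():
    """Split bytes 0x80-0x9F into those cp1252 defines and those it does not."""
    printable = {}
    undefined = []
    for value in range(0x80, 0xA0):
        raw = bytes([value])
        try:
            printable[value] = raw.decode("cp1252")
        except UnicodeDecodeError:
            undefined.append(value)
    return printable, undefined


printable, undefined = classify_high_control_range()

# In ISO 8859-1 the same 32 byte values are all control characters.
all_controls_in_latin1 = all(
    unicodedata.category(bytes([v]).decode("latin-1")) == "Cc"
    for v in range(0x80, 0xA0)
)

print(printable[0x80], [hex(v) for v in undefined], all_controls_in_latin1)
```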

Criticism

Many older character encodings, unlike Unicode, suffer from several problems.

Some code page vendors insufficiently document the meaning of all code point values. This decreases the reliability of handling textual data through various computer systems consistently.

Some vendors add proprietary extensions to some code pages to add or change certain code point values. For example, byte \x5C in Shift JIS can represent either a backslash or a yen currency symbol depending on the platform.

In order to support several languages in a program that does not use Unicode, the code page used for each string/document needs to be stored.

Due to Unicode's extensive documentation, vast repertoire of characters and stability policy of characters, these problems are rarely a concern for Unicode.
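The third problem listed above, storing the code page alongside each string, can be sketched as a small data structure. The class name `TaggedText` and its fields are hypothetical and shown only to illustrate the bookkeeping involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedText:
    """Raw bytes together with the code page needed to interpret them."""
    data: bytes
    code_page: str

    def text(self) -> str:
        return self.data.decode(self.code_page)


# The same word stored under two different code pages.
documents = [
    TaggedText(b"caf\x82", "cp437"),   # é is 0x82 in code page 437
    TaggedText(b"caf\xe9", "cp1252"),  # é is 0xE9 in Windows-1252
]

decoded = [doc.text() for doc in documents]

# Without the tag, the first document read as Windows-1252 comes out wrong.
misread = documents[0].data.decode("cp1252")

print(decoded, misread)
```

If the tag is lost or wrong, the bytes still decode without error but produce different characters, so the mistake is easy to miss.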

Applications may also mislabel text in Windows-1252 as ISO-8859-1. Fortunately, the only difference between these code pages is that the code point values used by ISO-8859-1 for control characters are instead used as additional printable characters in Windows-1252. Since control characters have no function in HTML, web browsers tend to use Windows-1252 rather than ISO-8859-1.
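A decoder that follows the browser behaviour described above might remap the label before decoding. The sketch below is a simplification: the override table, the function name `decode_labelled`, and the use of `errors="replace"` for byte values that Python's `cp1252` codec leaves undefined are all assumptions of this example, not a description of any particular browser.

```python
# Labels that are treated as Windows-1252 in this sketch.
LABEL_OVERRIDES = {
    "iso-8859-1": "cp1252",
    "latin1": "cp1252",
}


def decode_labelled(data: bytes, label: str) -> str:
    """Decode data, treating ISO-8859-1 labels as Windows-1252."""
    normalized = label.strip().lower()
    codec = LABEL_OVERRIDES.get(normalized, normalized)
    return data.decode(codec, errors="replace")


# Curly quotes produced by a Windows application but labelled ISO-8859-1.
payload = b"\x93Hi\x94"

lenient = decode_labelled(payload, "ISO-8859-1")
strict = payload.decode("latin-1")  # yields invisible C1 control characters

print(repr(lenient), repr(strict))
```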

Private code pages

When, early in the history of personal computers, users didn't find their character encoding requirements met, private or local code pages were created using Terminate and Stay Resident utilities or by re-programming BIOS EPROMs. In some cases, unofficial code page numbers were invented (e.g., cp895).

When more diverse character set support became available, most of those code pages fell into disuse, with some exceptions such as the Kamenický or KEYBCS2 encoding for the Czech and Slovak alphabets. Another such character set is the Iran System encoding standard, which was created by the Iran System corporation for Persian language support. This standard was used in Iran in DOS-based programs and became obsolete after the introduction of Microsoft code page 1256. However, some Windows and DOS programs using this encoding are still in use, and some Windows fonts with this encoding exist.
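A private code page of this kind amounts to a standard table with a few positions replaced. The sketch below builds one in Python from code page 437 using the `codecs` charmap helpers. The replaced positions and characters are invented for illustration and do not reproduce Kamenický, cp895, or any other real private code page.

```python
import codecs

# Start from the standard code page 437 table: one character per byte value.
base_table = bytes(range(256)).decode("cp437")

# Hypothetical local modifications (byte value -> replacement character).
# These are illustrative only and match no real private code page.
overrides = {0x9B: "č", 0x9D: "ř"}

private_table = "".join(
    overrides.get(value, char) for value, char in enumerate(base_table)
)
private_encoding_map = codecs.charmap_build(private_table)


def decode_private(data: bytes) -> str:
    text, _consumed = codecs.charmap_decode(data, "strict", private_table)
    return text


def encode_private(text: str) -> bytes:
    data, _consumed = codecs.charmap_encode(text, "strict", private_encoding_map)
    return data


print(decode_private(b"\x9b\x9d"), encode_private("čř"))
```

The characters that originally occupied the overridden positions can no longer be encoded, which mirrors the trade-off such local code pages made.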

See also

Windows code page

Character encoding

CCSID: IBM's official "code page" definitions and assignments.

References

[1] IBM i Globalization - EBCDIC Code Pages

[2] SAP ABAP glossary

[3] Oracle Hyperion glossary

[4] The MS-DOS Encyclopaedia, Microsoft Press

[5] Oracle Hyperion glossary

[6] PC CP874 at Unicode.org (PC Codepages)

[7] MSDN reference for code pages

External links

IBM CDRA glossary

IBM code pages

IBM code pages by encoding scheme

IBM/ICU Charset Information

Microsoft Code Page Identifiers (Microsoft's list contains only code pages actively used by normal apps on Windows. See also Torsten Mohrin's list for the full list of supported code pages)

Shorter Microsoft list containing only the ANSI and OEM code pages but with links to more detail on each

Character Sets And Code Pages At The Push Of A Button

Microsoft Chcp command: Display and set the console active code page

