Western Latin character sets (computing)
This article compares several binary encodings of character sets for common Western European languages. These encodings were designed to represent Italian, Spanish, Portuguese, French, German, Dutch, English, Danish, Swedish, Norwegian, and Icelandic, which use the Latin alphabet, a few additional letters, letters with precomposed diacritics, some punctuation, and various symbols (including some Greek letters). Although the encodings are called "Western European", many of these languages are spoken all over the world. The character sets also happen to support many other languages, such as Malay, Swahili, and Classical Latin.
Summary
The ISO-8859 series of 8-bit character sets covers all the Latin alphabets used in Europe, but the same code values stand for different characters in different parts of the series, which caused some difficulty. The arrival of Unicode, with a unique code point for every character, resolved these issues.
- ISO/IEC 8859-1, or Latin-1, is the most widely used, and its 256 codes are also the first 256 code points of Unicode.
- ISO/IEC 8859-15 modifies ISO-8859-1 to fully support Finnish and French and to add the euro sign.
- In terms of printable characters, Windows-1252 has everything that ISO-8859-1 and ISO-8859-15 have, and more.
- IBM CP437, being intended for English only, has very little in the way of accented letters but has far more graphics characters than the others and also some Greek characters that are useful as technical symbols.
- IBM CP850 has all the printable characters that ISO-8859-1 has (albeit arranged differently) and still manages to have enough graphics characters to build a usable text-mode user interface.
- IBM CP858 differs from CP850 in only one character: the rarely used dotless i (ı) was replaced by the euro sign (€).
- IBM code pages 037, 500, and 1047 are EBCDIC encodings that include all of the ISO-8859-1 characters.
- The Mac OS Roman character set (often referred to as MacRoman and known to the IANA simply as MACINTOSH) has most, but not all, of the characters of ISO-8859-1, in a very different arrangement. It also adds many technical and mathematical characters and more diacritics. When editing text from Web sites, older Macintosh web browsers were known to corrupt the few characters that were in ISO-8859-1 but not in their native Macintosh character set. Conversely, in Web material prepared on an older Macintosh, many characters were displayed incorrectly when read on other operating systems.
- The euro sign postdates these (ISO-8859) specifications: conflicting ways to retrofit it caused significant difficulty until Unicode became more widely adopted.
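A few of the claims in this summary can be checked with the codecs that ship with Python. The following is only a sketch: the codec names (`latin-1`, `cp850`, `cp858`, `mac_roman`, `cp037`, `cp500`) are Python's own labels rather than names used in this article, and Python's mapping tables are assumed to agree with the ones the article is based on.

```python
# Sketch: check some summary claims with Python's built-in codecs.
# Codec names are Python's labels; the variable names are illustrative.

# Latin-1 bytes 0x00-0xFF decode to Unicode code points U+0000-U+00FF.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("latin-1")
latin1_is_identity = all(ord(ch) == i for i, ch in enumerate(latin1_text))

# Collect every byte that CP850 and CP858 decode differently.
cp850_vs_cp858 = [
    (hex(b), bytes([b]).decode("cp850"), bytes([b]).decode("cp858"))
    for b in range(256)
    if bytes([b]).decode("cp850") != bytes([b]).decode("cp858")
]

# The same letter is stored as a different byte in different encodings.
e_acute = {
    name: "é".encode(name).hex()
    for name in ("latin-1", "iso8859-15", "cp1252", "cp437",
                 "cp850", "mac_roman", "cp037", "cp500")
}

print(latin1_is_identity)   # True
print(cp850_vs_cp858)       # [('0xd5', 'ı', '€')]
print(e_acute)
```

The last dictionary shows why text must carry a correct encoding label: the byte that stores é under the ISO and Windows sets is not the byte used by the IBM PC, Macintosh, or EBCDIC code pages.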
Notes
- The mappings for the IBM code pages are those supplied by Microsoft on the Unicode site. Refer to the Unicode Consortium's document on the differences between IBM's and Microsoft's mappings for these code pages.
- The old PC code pages actually defined printable characters for the control code ranges. These could not be used when printing text through DOS, because they were trapped before reaching the screen, but applications that wrote to screen memory directly could use them.
- Position F0 (hexadecimal) was used in the Macintosh character sets for the Apple logo. The Apple logo was not accepted into Unicode because it is a trademark, so Apple mapped it to a code point (U+F8FF) in the private use area. It may therefore not display correctly in the table.
- In Windows-1252, positions 81, 8D, 8F, 90, and 9D are unused according to the mapping tables on the Unicode site. However, the conversion routines in Windows seem to convert them to the C1 control codes that occupy those positions in ISO-8859-1.
- Web page tools for Windows commonly use Windows-1252 but label the page as ISO-8859-1. As a result, many non-Windows systems do not correctly display the extra characters of Windows-1252, such as € and the typographic quotation marks.
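The last two notes can be illustrated with a short Python sketch. It assumes Python's `cp1252` and `latin-1` codecs; note that Python's `cp1252` follows the published mapping table and rejects the five unused positions, which differs from the Windows conversion behaviour described above.

```python
# Sketch: what happens when Windows-1252 bytes are labelled as ISO-8859-1.
# Variable names are illustrative only.

# Bytes as a Windows tool might write them: curly quotes and a euro sign.
raw = "“€5”".encode("cp1252")

as_cp1252 = raw.decode("cp1252")    # correct label: quotes and euro survive
as_latin1 = raw.decode("latin-1")   # wrong label: C1 control characters

# Python's cp1252 codec treats the five unused positions as undefined.
undefined = []
for b in (0x81, 0x8D, 0x8F, 0x90, 0x9D):
    try:
        bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(b)

print(raw)                               # b'\x93\x805\x94'
print(as_cp1252)                         # “€5”
print([hex(ord(c)) for c in as_latin1])  # ['0x93', '0x80', '0x35', '0x94']
print([hex(b) for b in undefined])       # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
```

Decoding with the wrong label does not fail. It silently yields invisible control characters, which is why the mislabelled pages display without their quotation marks and euro signs rather than producing an error.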
History
The earlier seven-bit US-ASCII encoding has enough characters to properly represent only a few languages, such as US English, Latin, and Swahili. It lacks some letters and letter-diacritic combinations used in other Latin-alphabet languages. However, because most U.S.-supplied computer platforms offered no other choice, ASCII was unavoidable in most of the non-English-speaking world, and the limitations of early computing networks made seven-bit encoding a necessity. The ISO 646 group of encodings replaced some of the ASCII symbols with local characters, but space was very limited, and some of the replaced symbols were quite common in contexts such as programming languages.
Although seven-bit communication was the norm, most computers used eight-bit bytes internally, and most of them assigned characters of some kind to the upper 128 byte values. In the early days most of these assignments were system-specific, but a few standards were gradually settled on.
In recent years, as storage and memory costs have fallen, the problems caused by a single eight-bit code having several meanings (ISO/IEC 8859 alone defines ten Latin alphabets, Latin-1 to Latin-10) are no longer justified by the space saved. All major operating systems have moved to Unicode as their main internal representation. However, at least on Windows, many applications continue to use the non-Unicode versions of the API calls.
The euro sign
The introduction of the euro created significant pressure to support the euro sign (€), and most character sets had to be adapted in some way.
- MacRoman simply replaced the generic currency sign (¤) with the euro sign. This caused significant difficulty because organisations had found other uses for it, such as a company logo.
- ISO introduced ISO 8859-15, which replaced the generic currency sign with the euro sign and replaced some other symbols with letters with diacritics.
- Windows-1252 simply placed the euro sign in an unused position (80 hexadecimal) within the range where ISO-8859-1 has the C1 control codes.
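The different retrofits can be seen by encoding the euro sign with Python's built-in codecs. This is a sketch under the assumption that Python's tables match the current mappings (in particular, that `mac_roman` is the post-euro version of MacRoman); the codec labels are Python's, not the article's.

```python
# Sketch: where each character set put the euro sign (None = not available).
EURO = "\u20ac"

euro_bytes = {}
for name in ("latin-1", "iso8859-15", "cp1252", "cp850", "cp858", "mac_roman"):
    try:
        euro_bytes[name] = EURO.encode(name).hex().upper()
    except UnicodeEncodeError:
        euro_bytes[name] = None  # this character set has no euro sign

# ISO 8859-15 reuses the byte that means the generic currency sign in ISO 8859-1.
same_byte_two_meanings = (b"\xa4".decode("latin-1"), b"\xa4".decode("iso8859-15"))

print(euro_bytes)
print(same_byte_two_meanings)   # ('¤', '€')
```

The second result is the kind of conflict that made the retrofits troublesome: the byte A4 is a valid character under both labels, so a wrong label changes the meaning without any visible error.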
Comparison table
Code points U+0000 to U+007F are not shown in this table, because they map directly in all the character sets listed here. The ASCII standard defines the mapping of these first 128 codes (0 to 127).
The table is arranged by Unicode code point. Character sets are referred to here by their IANA names in upper case.
| Character | Code point | ISO-8859-1 | ISO-8859-15 | WINDOWS-1252 | IBM437 | IBM850 | MACINTOSH |
|---|---|---|---|---|---|---|---|
| NBSP | U+00A0 | A0 | A0 | A0 | FF | FF | CA |
| ¡ | U+00A1 | A1 | A1 | A1 | AD | AD | C1 |
| ¢ | U+00A2 | A2 | A2 | A2 | 9B | BD | A2 |
| £ | U+00A3 | A3 | A3 | A3 | 9C | 9C | A3 |
| ¤ | U+00A4 | A4 |  | A4 |  | CF |  |
| ¥ | U+00A5 | A5 | A5 | A5 | 9D | BE | B4 |
| ¦ | U+00A6 | A6 |  | A6 |  | DD |  |
| § | U+00A7 | A7 | A7 | A7 |  | F5 | A4 |
| ¨ | U+00A8 | A8 |  | A8 |  | F9 | AC |
| © | U+00A9 | A9 | A9 | A9 |  | B8 | A9 |
| ª | U+00AA | AA | AA | AA | A6 | A6 | BB |
| « | U+00AB | AB | AB | AB | AE | AE | C7 |
| ¬ | U+00AC | AC | AC | AC | AA | AA | C2 |
| SHY | U+00AD | AD | AD | AD |  | F0 |  |
| ® | U+00AE | AE | AE | AE |  | A9 | A8 |
| ¯ | U+00AF | AF | AF | AF |  | EE | F8 |
| ° | U+00B0 | B0 | B0 | B0 | F8 | F8 | A1 |
| ± | U+00B1 | B1 | B1 | B1 | F1 | F1 | B1 |
| ² | U+00B2 | B2 | B2 | B2 | FD | FD |  |
| ³ | U+00B3 | B3 | B3 | B3 |  | FC |  |
| ´ | U+00B4 | B4 |  | B4 |  | EF | AB |
| µ | U+00B5 | B5 | B5 | B5 | E6 | E6 | B5 |
| ¶ | U+00B6 | B6 | B6 | B6 |  | F4 | A6 |
| · | U+00B7 | B7 | B7 | B7 | FA | FA | E1 |
| ¸ | U+00B8 | B8 |  | B8 |  | F7 | FC |
| ¹ | U+00B9 | B9 | B9 | B9 |  | FB |  |
| º | U+00BA | BA | BA | BA | A7 | A7 | BC |
| » | U+00BB | BB | BB | BB | AF | AF | C8 |
| ¼ | U+00BC | BC |  | BC | AC | AC |  |
| ½ | U+00BD | BD |  | BD | AB | AB |  |
| ¾ | U+00BE | BE |  | BE |  | F3 |  |
| ¿ | U+00BF | BF | BF | BF | A8 | A8 | C0 |
| À | U+00C0 | C0 | C0 | C0 |  | B7 | CB |
| Á | U+00C1 | C1 | C1 | C1 |  | B5 | E7 |
| Â | U+00C2 | C2 | C2 | C2 |  | B6 | E5 |
| Ã | U+00C3 | C3 | C3 | C3 |  | C7 | CC |
| Ä | U+00C4 | C4 | C4 | C4 | 8E | 8E | 80 |
| Å | U+00C5 | C5 | C5 | C5 | 8F | 8F | 81 |
| Æ | U+00C6 | C6 | C6 | C6 | 92 | 92 | AE |
| Ç | U+00C7 | C7 | C7 | C7 | 80 | 80 | 82 |
| È | U+00C8 | C8 | C8 | C8 |  | D4 | E9 |
| É | U+00C9 | C9 | C9 | C9 | 90 | 90 | 83 |
| Ê | U+00CA | CA | CA | CA |  | D2 | E6 |
| Ë | U+00CB | CB | CB | CB |  | D3 | E8 |
| Ì | U+00CC | CC | CC | CC |  | DE | ED |
| Í | U+00CD | CD | CD | CD |  | D6 | EA |
| Î | U+00CE | CE | CE | CE |  | D7 | EB |
| Ï | U+00CF | CF | CF | CF |  | D8 | EC |
| Ð | U+00D0 | D0 | D0 | D0 |  | D1 |  |
| Ñ | U+00D1 | D1 | D1 | D1 | A5 | A5 | 84 |
| Ò | U+00D2 | D2 | D2 | D2 |  | E3 | F1 |
| Ó | U+00D3 | D3 | D3 | D3 |  | E0 | EE |
| Ô | U+00D4 | D4 | D4 | D4 |  | E2 | EF |
| Õ | U+00D5 | D5 | D5 | D5 |  | E5 | CD |
| Ö | U+00D6 | D6 | D6 | D6 | 99 | 99 | 85 |
| × | U+00D7 | D7 | D7 | D7 |  | 9E |  |
| Ø | U+00D8 | D8 | D8 | D8 |  | 9D | AF |
| Ù | U+00D9 | D9 | D9 | D9 |  | EB | F4 |
| Ú | U+00DA | DA | DA | DA |  | E9 | F2 |
| Û | U+00DB | DB | DB | DB |  | EA | F3 |
| Ü | U+00DC | DC | DC | DC | 9A | 9A | 86 |
| Ý | U+00DD | DD | DD | DD |  | ED |  |
| Þ | U+00DE | DE | DE | DE |  | E8 |  |
| ß | U+00DF | DF | DF | DF | E1 | E1 | A7 |
| à | U+00E0 | E0 | E0 | E0 | 85 | 85 | 88 |
| á | U+00E1 | E1 | E1 | E1 | A0 | A0 | 87 |
| â | U+00E2 | E2 | E2 | E2 | 83 | 83 | 89 |
| ã | U+00E3 | E3 | E3 | E3 |  | C6 | 8B |
| ä | U+00E4 | E4 | E4 | E4 | 84 | 84 | 8A |
| å | U+00E5 | E5 | E5 | E5 | 86 | 86 | 8C |
| æ | U+00E6 | E6 | E6 | E6 | 91 | 91 | BE |
| ç | U+00E7 | E7 | E7 | E7 | 87 | 87 | 8D |
| è | U+00E8 | E8 | E8 | E8 | 8A | 8A | 8F |
| é | U+00E9 | E9 | E9 | E9 | 82 | 82 | 8E |
| ê | U+00EA | EA | EA | EA | 88 | 88 | 90 |
| ë | U+00EB | EB | EB | EB | 89 | 89 | 91 |
| ì | U+00EC | EC | EC | EC | 8D | 8D | 93 |
| í | U+00ED | ED | ED | ED | A1 | A1 | 92 |
| î | U+00EE | EE | EE | EE | 8C | 8C | 94 |
| ï | U+00EF | EF | EF | EF | 8B | 8B | 95 |
| ð | U+00F0 | F0 | F0 | F0 |  | D0 |  |
| ñ | U+00F1 | F1 | F1 | F1 | A4 | A4 | 96 |
| ò | U+00F2 | F2 | F2 | F2 | 95 | 95 | 98 |
| ó | U+00F3 | F3 | F3 | F3 | A2 | A2 | 97 |
| ô | U+00F4 | F4 | F4 | F4 | 93 | 93 | 99 |
| õ | U+00F5 | F5 | F5 | F5 |  | E4 | 9B |
| ö | U+00F6 | F6 | F6 | F6 | 94 | 94 | 9A |
| ÷ | U+00F7 | F7 | F7 | F7 | F6 | F6 | D6 |
| ø | U+00F8 | F8 | F8 | F8 |  | 9B | BF |
| ù | U+00F9 | F9 | F9 | F9 | 97 | 97 | 9D |
| ú | U+00FA | FA | FA | FA | A3 | A3 | 9C |
| û | U+00FB | FB | FB | FB | 96 | 96 | 9E |
| ü | U+00FC | FC | FC | FC | 81 | 81 | 9F |
| ý | U+00FD | FD | FD | FD |  | EC |  |
| þ | U+00FE | FE | FE | FE |  | E7 |  |
| ÿ | U+00FF | FF | FF | FF | 98 | 98 | D8 |
| ı | U+0131 |  |  |  |  | D5 | F5 |
| Œ | U+0152 |  | BC | 8C |  |  | CE |
| œ | U+0153 |  | BD | 9C |  |  | CF |
| Š | U+0160 |  | A6 | 8A |  |  |  |
| š | U+0161 |  | A8 | 9A |  |  |  |
| Ÿ | U+0178 |  | BE | 9F |  |  | D9 |
| Ž | U+017D |  | B4 | 8E |  |  |  |
| ž | U+017E |  | B8 | 9E |  |  |  |
| ƒ | U+0192 |  |  | 83 | 9F | 9F | C4 |
| ˆ | U+02C6 |  |  | 88 |  |  | F6 |
| ˇ | U+02C7 |  |  |  |  |  | FF |
| ˘ | U+02D8 |  |  |  |  |  | F9 |
| ˙ | U+02D9 |  |  |  |  |  | FA |
| ˚ | U+02DA |  |  |  |  |  | FB |
| ˛ | U+02DB |  |  |  |  |  | FE |
| ˜ | U+02DC |  |  | 98 |  |  | F7 |
| ˝ | U+02DD |  |  |  |  |  | FD |
| Γ | U+0393 |  |  |  | E2 |  |  |
| Θ | U+0398 |  |  |  | E9 |  |  |
| Σ | U+03A3 |  |  |  | E4 |  |  |
| Φ | U+03A6 |  |  |  | E8 |  |  |
| Ω | U+03A9 |  |  |  | EA |  | BD |
| α | U+03B1 |  |  |  | E0 |  |  |
| δ | U+03B4 |  |  |  | EB |  |  |
| ε | U+03B5 |  |  |  | EE |  |  |
| π | U+03C0 |  |  |  | E3 |  | B9 |
| σ | U+03C3 |  |  |  | E5 |  |  |
| τ | U+03C4 |  |  |  | E7 |  |  |
| φ | U+03C6 |  |  |  | ED |  |  |
| – | U+2013 |  |  | 96 |  |  | D0 |
| — | U+2014 |  |  | 97 |  |  | D1 |
| ‗ | U+2017 |  |  |  |  | F2 |  |
| ‘ | U+2018 |  |  | 91 |  |  | D4 |
| ’ | U+2019 |  |  | 92 |  |  | D5 |
| ‚ | U+201A |  |  | 82 |  |  | E2 |
| “ | U+201C |  |  | 93 |  |  | D2 |
| ” | U+201D |  |  | 94 |  |  | D3 |
| „ | U+201E |  |  | 84 |  |  | E3 |
| † | U+2020 |  |  | 86 |  |  | A0 |
| ‡ | U+2021 |  |  | 87 |  |  | E0 |
| • | U+2022 |  |  | 95 |  |  | A5 |
| … | U+2026 |  |  | 85 |  |  | C9 |
| ‰ | U+2030 |  |  | 89 |  |  | E4 |
| ‹ | U+2039 |  |  | 8B |  |  | DC |
| › | U+203A |  |  | 9B |  |  | DD |
| ⁄ | U+2044 |  |  |  |  |  | DA |
| ⁿ | U+207F |  |  |  | FC |  |  |
| ₧ | U+20A7 |  |  |  | 9E |  |  |
| € | U+20AC |  | A4 | 80 |  |  | DB |
| ™ | U+2122 |  |  | 99 |  |  | AA |
| ∂ | U+2202 |  |  |  |  |  | B6 |
| ∆ | U+2206 |  |  |  |  |  | C6 |
| ∏ | U+220F |  |  |  |  |  | B8 |
| ∑ | U+2211 |  |  |  |  |  | B7 |
| ∙ | U+2219 |  |  |  | F9 |  |  |
| √ | U+221A |  |  |  | FB |  | C3 |
| ∞ | U+221E |  |  |  | EC |  | B0 |
| ∩ | U+2229 |  |  |  | EF |  |  |
| ∫ | U+222B |  |  |  |  |  | BA |
| ≈ | U+2248 |  |  |  | F7 |  | C5 |
| ≠ | U+2260 |  |  |  |  |  | AD |
| ≡ | U+2261 |  |  |  | F0 |  |  |
| ≤ | U+2264 |  |  |  | F3 |  | B2 |
| ≥ | U+2265 |  |  |  | F2 |  | B3 |
| ⌐ | U+2310 |  |  |  | A9 |  |  |
| ⌠ | U+2320 |  |  |  | F4 |  |  |
| ⌡ | U+2321 |  |  |  | F5 |  |  |
| ─ | U+2500 |  |  |  | C4 | C4 |  |
| │ | U+2502 |  |  |  | B3 | B3 |  |
| ┌ | U+250C |  |  |  | DA | DA |  |
| ┐ | U+2510 |  |  |  | BF | BF |  |
| └ | U+2514 |  |  |  | C0 | C0 |  |
| ┘ | U+2518 |  |  |  | D9 | D9 |  |
| ├ | U+251C |  |  |  | C3 | C3 |  |
| ┤ | U+2524 |  |  |  | B4 | B4 |  |
| ┬ | U+252C |  |  |  | C2 | C2 |  |
| ┴ | U+2534 |  |  |  | C1 | C1 |  |
| ┼ | U+253C |  |  |  | C5 | C5 |  |
| ═ | U+2550 |  |  |  | CD | CD |  |
| ║ | U+2551 |  |  |  | BA | BA |  |
| ╒ | U+2552 |  |  |  | D5 |  |  |
| ╓ | U+2553 |  |  |  | D6 |  |  |
| ╔ | U+2554 |  |  |  | C9 | C9 |  |
| ╕ | U+2555 |  |  |  | B8 |  |  |
| ╖ | U+2556 |  |  |  | B7 |  |  |
| ╗ | U+2557 |  |  |  | BB | BB |  |
| ╘ | U+2558 |  |  |  | D4 |  |  |
| ╙ | U+2559 |  |  |  | D3 |  |  |
| ╚ | U+255A |  |  |  | C8 | C8 |  |
| ╛ | U+255B |  |  |  | BE |  |  |
| ╜ | U+255C |  |  |  | BD |  |  |
| ╝ | U+255D |  |  |  | BC | BC |  |
| ╞ | U+255E |  |  |  | C6 |  |  |
| ╟ | U+255F |  |  |  | C7 |  |  |
| ╠ | U+2560 |  |  |  | CC | CC |  |
| ╡ | U+2561 |  |  |  | B5 |  |  |
| ╢ | U+2562 |  |  |  | B6 |  |  |
| ╣ | U+2563 |  |  |  | B9 | B9 |  |
| ╤ | U+2564 |  |  |  | D1 |  |  |
| ╥ | U+2565 |  |  |  | D2 |  |  |
| ╦ | U+2566 |  |  |  | CB | CB |  |
| ╧ | U+2567 |  |  |  | CF |  |  |
| ╨ | U+2568 |  |  |  | D0 |  |  |
| ╩ | U+2569 |  |  |  | CA | CA |  |
| ╪ | U+256A |  |  |  | D8 |  |  |
| ╫ | U+256B |  |  |  | D7 |  |  |
| ╬ | U+256C |  |  |  | CE | CE |  |
| ▀ | U+2580 |  |  |  | DF | DF |  |
| ▄ | U+2584 |  |  |  | DC | DC |  |
| █ | U+2588 |  |  |  | DB | DB |  |
| ▌ | U+258C |  |  |  | DD |  |  |
| ▐ | U+2590 |  |  |  | DE |  |  |
| ░ | U+2591 |  |  |  | B0 | B0 |  |
| ▒ | U+2592 |  |  |  | B1 | B1 |  |
| ▓ | U+2593 |  |  |  | B2 | B2 |  |
| ■ | U+25A0 |  |  |  | FE | FE |  |
| ◊ | U+25CA |  |  |  |  |  | D7 |
| (Apple logo) | U+F8FF |  |  |  |  |  | F0 |
| ﬁ | U+FB01 |  |  |  |  |  | DE |
| ﬂ | U+FB02 |  |  |  |  |  | DF |
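A row of the comparison table can be recomputed for any character with Python's built-in codecs. The sketch below is not how the table was produced: the `CODECS` mapping from IANA names to Python codec labels and the `table_row` helper are assumptions made for illustration, and Python's tables may differ in detail from the mapping files the article relies on.

```python
# Sketch: recompute one row of the comparison table.
# CODECS and table_row are hypothetical helpers, not part of any standard.

# IANA name (as used in the table header) -> Python codec label.
CODECS = {
    "ISO-8859-1": "latin-1",
    "ISO-8859-15": "iso8859-15",
    "WINDOWS-1252": "cp1252",
    "IBM437": "cp437",
    "IBM850": "cp850",
    "MACINTOSH": "mac_roman",
}


def table_row(ch):
    """Return [character, code point, then one hex byte (or '') per set]."""
    cells = []
    for codec in CODECS.values():
        try:
            cells.append(ch.encode(codec).hex().upper())
        except UnicodeEncodeError:
            cells.append("")  # the character is absent from this set
    return [ch, f"U+{ord(ch):04X}", *cells]


print(["Character", "Code point", *CODECS])
for ch in "é§€ƒ":
    print(table_row(ch))
```

For example, `table_row("€")` returns `['€', 'U+20AC', '', 'A4', '80', '', '', 'DB']`, which matches the euro row: present only in ISO-8859-15, Windows-1252, and the Macintosh set, each at a different position.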
Wikimedia Foundation. 2010.