Charset detection
Character encoding detection, charset detection, or code page detection is the process of heuristically guessing the character encoding of a series of bytes that represent text. The technique usually involves statistical analysis of byte patterns, which can require frequency distributions of trigraphs for each language, as encoded in each code page to be detected. The process is not foolproof because it depends on statistical data; for example, some versions of the Windows operating system would mis-detect the ASCII phrase "Bush hid the facts" as Chinese text in UTF-16LE.
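As a rough illustration of the statistical approach, the following Python sketch builds a byte-trigraph profile for each candidate code page from a training text and scores unknown bytes against each profile. The function names, the one-sentence Russian training text, and the choice of cp1251 and koi8-r as candidates are assumptions made for this example only; real detectors use large per-language frequency tables. The last lines show why the "Bush hid the facts" bytes can also be read as UTF-16LE.

```python
from collections import Counter


def byte_trigrams(data: bytes) -> Counter:
    """Count every overlapping 3-byte sequence in data."""
    return Counter(data[i:i + 3] for i in range(len(data) - 2))


def build_profiles(sample_text: str, encodings):
    """Encode one training text in each candidate code page and profile it."""
    return {enc: byte_trigrams(sample_text.encode(enc)) for enc in encodings}


def guess_encoding(data: bytes, profiles):
    """Return the best-scoring candidate and the score of every candidate."""
    observed = byte_trigrams(data)
    scores = {
        enc: sum(n for tri, n in observed.items() if tri in profile)
        for enc, profile in profiles.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy training text; far too small for real use.
TRAINING_TEXT = "съешь еще этих мягких французских булок да выпей чаю"
PROFILES = build_profiles(TRAINING_TEXT, ["cp1251", "koi8-r"])

unknown = "выпей чаю".encode("koi8-r")
best, scores = guess_encoding(unknown, PROFILES)
print(best, scores)

# The 18 ASCII bytes of this phrase also form 9 valid UTF-16LE code units,
# each of which lands in the CJK Unified Ideographs block.
misread = b"Bush hid the facts".decode("utf-16-le")
print(misread)
```

The scoring here only counts trigraphs already seen in the profile; it does not weight them by frequency, which a production detector would normally do.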
One of the few cases where charset detection works reliably is detecting UTF-8. A large proportion of possible byte sequences are invalid in UTF-8, so text in any other encoding that uses bytes with the high bit set is extremely unlikely to pass a UTF-8 validity test. Unfortunately, badly written charset detection routines do not run the reliable UTF-8 test first, and may decide that UTF-8 text is in some other encoding.
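A minimal sketch of running the UTF-8 validity test before any other guess might look like the following. It relies on Python's strict UTF-8 decoder as the validity check; the function names and the "cp1252" fallback are hypothetical choices for this example, standing in for whatever statistical detector would otherwise run.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without any error."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def detect_with_utf8_first(data: bytes, fallback: str = "cp1252") -> str:
    """Try the reliable UTF-8 test first; otherwise use the fallback guess.

    The fallback is an assumption of this sketch, not a recommendation.
    """
    if is_valid_utf8(data):
        return "utf-8"
    return fallback


text = "naïve café"
print(detect_with_utf8_first(text.encode("utf-8")))
print(detect_with_utf8_first(text.encode("latin-1")))
```

The Latin-1 bytes fail the test because a byte such as 0xEF announces a multi-byte UTF-8 sequence but is followed by a plain ASCII letter rather than a continuation byte.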
Due to the unreliability of charset detection, it is usually better to label data explicitly with its correct encoding. For example, HTML documents can declare their encoding in a meta element, thus:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
Alternatively, when documents are conveyed through HTTP, the same metadata can be conveyed out-of-band using the Content-Type header.
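On the receiving side, a declared encoding can be read back instead of guessed. The sketch below uses only the Python standard library to extract the charset parameter from a Content-Type value, whether it arrives as an HTTP header or inside a meta element. The helper names are hypothetical, and the HTML helper assumes the markup has already been decoded far enough to be scanned as text.

```python
from email.message import Message
from html.parser import HTMLParser
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the lower-cased charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


class _MetaCharsetFinder(HTMLParser):
    """Remember the charset of the first http-equiv Content-Type meta tag."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        content = attr.get("content")
        if (attr.get("http-equiv") or "").lower() == "content-type" and content:
            self.charset = charset_from_content_type(content)


def charset_from_html(markup: str) -> Optional[str]:
    """Scan markup for a meta declaration; None if no charset is declared."""
    finder = _MetaCharsetFinder()
    finder.feed(markup)
    return finder.charset


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_html(
    '<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">'
))
```

A caller would typically prefer the HTTP header when present, then the meta declaration, and fall back to detection only when neither is available.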
See also
- International Components for Unicode - A library that can perform charset detection.
External links
- Frequency distributions of English trigraphs
- IMultiLanguage2::DetectInputCodepage
- API reference for ICU charset detection
- Mozilla Charset Detectors
- Java port of Mozilla Charset Detectors
- Delphi/Pascal port of Mozilla Charset Detectors