Charset detection

Character encoding detection, charset detection, or code page detection is the process of heuristically guessing the character encoding of a series of bytes that represent text. Such algorithms usually rely on statistical analysis of byte patterns, for example the frequency distribution of trigrams (three-byte sequences) in text of the various languages encoded in each code page to be detected. Because the process depends on statistical data, it is not foolproof; for example, some versions of the Windows operating system would misdetect the plain-ASCII phrase "Bush hid the facts" as Chinese text encoded in UTF-16LE.
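To make the statistical approach concrete, the following is a minimal sketch in Python. The encodings, trigram profiles, and counts shown are hypothetical placeholders; a real detector would train its profiles on large corpora of text known to be in each candidate encoding.

import math
from collections import Counter

# Hypothetical trigram profiles; a real detector trains these counts
# on large corpora known to be in each candidate encoding.
PROFILES = {
    "windows-1252": Counter({b"th ": 120, b"he ": 95, b" th": 90}),
    "koi8-r": Counter({b"\xc5\xd4\xd3": 80, b"\xd3\xd4\xd7": 60}),
}

def trigrams(data: bytes):
    # Yield every overlapping three-byte sequence in the input.
    for i in range(len(data) - 2):
        yield data[i:i + 3]

def score(data: bytes, profile: Counter) -> float:
    # Sum log-frequencies of observed trigrams under one profile;
    # unseen trigrams contribute log(1) = 0.
    return sum(math.log(profile[t] + 1) for t in trigrams(data))

def guess_encoding(data: bytes) -> str:
    # Return the encoding whose profile best fits the byte sequence.
    return max(PROFILES, key=lambda enc: score(data, PROFILES[enc]))

For instance, guess_encoding(b"the three theories") would pick "windows-1252" here, simply because English trigrams dominate that toy profile.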

One of the few cases where charset detection works reliably is detecting UTF-8. A large proportion of possible byte sequences are invalid in UTF-8, so text in any other encoding that uses bytes with the high bit set is extremely unlikely to pass a UTF-8 validity test. Unfortunately, badly written charset detection routines do not run the reliable UTF-8 test first and may decide that UTF-8 text is in some other encoding.
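A minimal first-pass test along these lines can be written in Python using the built-in strict UTF-8 decoder; this sketch assumes a strict decode is an acceptable validity check:

def looks_like_utf8(data: bytes) -> bool:
    # Strict decoding raises UnicodeDecodeError on any invalid sequence.
    # Pure ASCII also passes, since ASCII is a subset of UTF-8.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

Running a check like this before any statistical guessing avoids misclassifying valid UTF-8 text.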

Due to the unreliability of charset detection, it is usually better to properly label datasets with the correct encoding. For example, HTML documents can declare their encoding in a meta element, thus:

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

Alternatively, when documents are conveyed over HTTP, the same metadata can be sent out-of-band using the Content-Type header.
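For example, a server might send the following response header:

Content-Type: text/html; charset=UTF-8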
