- Extended Unix Code
The structure of EUC is based on the ISO-2022 standard, which specifies a way to represent character sets containing a maximum of 94 characters, or 8836 (94²) characters, or 830584 (94³) characters, as sequences of 7-bit codes. Only ISO-2022 compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2, and G3 or as code sets 0, 1, 2, and 3) can be represented with the EUC scheme. G0 is almost always an ISO-646 compliant coded character set (e.g. US-ASCII/KS X 1003/ISO 646:KR in EUC-KR and US-ASCII/the "lower half" of JIS X 0201 in EUC-JP) that is invoked on GL (i.e. with the most significant bit cleared).
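The three set sizes quoted above follow from the 94 graphic positions (0x21 through 0x7E) available in a single 7-bit byte. A quick check of the arithmetic in Python; the variable names are illustrative only:

```python
# Graphic positions available in one 7-bit byte: 0x21 through 0x7E inclusive.
POSITIONS = 0x7E - 0x21 + 1

# Sizes of one-, two- and three-byte graphic character sets.
set_sizes = [POSITIONS ** n for n in (1, 2, 3)]
print(set_sizes)  # [94, 8836, 830584]
```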
To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code.
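The conversion described above can be sketched in a few lines of Python. The helper names `to_euc` and `is_euc_byte` are hypothetical, chosen for this illustration; the example assumes that Python's built-in `gb2312` codec decodes the EUC-CN form and uses the GB2312 7-bit code 0x56 0x50 for the character U+4E2D.

```python
def to_euc(seven_bit: bytes) -> bytes:
    """Set the most significant bit of every 7-bit byte (i.e. add 128)."""
    if any(b > 0x7F for b in seven_bit):
        raise ValueError("input must consist of 7-bit codes")
    return bytes(b | 0x80 for b in seven_bit)


def is_euc_byte(b: int) -> bool:
    """True if the byte has its high bit set, so it is not an ISO-646 byte."""
    return b >= 0x80


# GB2312 assigns U+4E2D to row 54, cell 48, i.e. the 7-bit code 0x56 0x50.
euc = to_euc(b"\x56\x50")
print(euc.hex())                         # d6d0
print(euc.decode("gb2312") == "\u4e2d")  # True
print(is_euc_byte(euc[0]), is_euc_byte(ord("A")))  # True False
```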
The most commonly used EUC codes are variable-width encodings with a character belonging to G0 (ISO-646 compliant coded character set) taking one byte and a character belonging to G1 (taken by a 94×94 coded character set) represented in two bytes. The EUC-CN form of GB2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes whereas a single character in EUC-TW can take up to four bytes.
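For the two-byte EUC codes, the high bit of a byte is enough to find character boundaries. The following is a minimal sketch under that assumption (only G0 and G1 in use, as in EUC-CN and EUC-KR); `split_two_byte_euc` is a hypothetical helper, not part of any library.

```python
from typing import List


def split_two_byte_euc(data: bytes) -> List[bytes]:
    """Split EUC text that uses only G0 (1 byte) and G1 (2 bytes)."""
    chars = []
    i = 0
    while i < len(data):
        # A set high bit marks the first byte of a two-byte G1 character.
        width = 2 if data[i] & 0x80 else 1
        chunk = data[i:i + width]
        if len(chunk) != width:
            raise ValueError("truncated two-byte character")
        chars.append(chunk)
        i += width
    return chars


# "A", the character U+4E2D, then "B", encoded as EUC-CN.
sample = "A\u4e2dB".encode("gb2312")
print(split_two_byte_euc(sample))  # [b'A', b'\xd6\xd0', b'B']
```

This sketch does not validate that the bytes fall in 0xA1 to 0xFE, and it would not handle the single-shift sequences used by EUC-JP and EUC-TW.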
EUC-CN is the usual way to use the GB2312 standard for simplified Chinese characters. Unlike the case of Japanese, the ISO-2022 form of GB2312 is not normally used, though a variant form called HZ was sometimes used on USENET.
EUC-CN can also be used to encode the Unicode-based GB18030 character encoding, which includes traditional characters, although GB18030 is more frequently used without EUC encoding, since GB18030 is already a Unicode encoding. However, GB18030 encoded in EUC-CN is a variable-width encoding, because GB18030 contains more than 8836 (94×94) characters.
Related encoding systems
An encoding related to EUC-CN is the "748" code used in the WITS typesetting system developed by Beijing's Founder Technology (now obsoleted by its newer FITS typesetting system). The 748 code contains all of GB2312, but is not ISO 2022–compliant and therefore not a true EUC code. (It uses an 8-bit lead byte but distinguishes between a second byte with its most significant bit set and one with its most significant bit cleared, and is therefore more similar in structure to Big5 and other non–ISO 2022–compliant DBCS encoding systems.) The non-GB2312 portion of the 748 code contains traditional and Hong Kong characters and other glyphs used in newspaper typesetting.
EUC-JP is a variable-width encoding used to represent the elements of three Japanese character set standards, namely JIS X 0208, JIS X 0212, and JIS X 0201.
* A character from JIS-X-0208 (code set 1) is represented by two bytes, both in the range 0xA1 – 0xFE.
* A character from JIS-X-0212 (code set 3) is represented by three bytes, the first being 0x8F, the following two in the range 0xA1 – 0xFE.
* A character from the "upper half" of JIS-X-0201 (half-width kana, code set 2) is represented by two bytes, the first being 0x8E, the second in the range 0xA1 – 0xDF.
* A character from the "lower half" of JIS-X-0201 (ASCII, code set 0) is represented by one byte, in the range 0x21 – 0x7E.
This encoding scheme allows the easy mixing of 7-bit ASCII and 8-bit Japanese without the need for the escape characters employed by ISO-2022-JP, which is based on the same character set standards.
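The four byte patterns listed above are enough to split an EUC-JP byte string into characters without any escape sequences. The sketch below is one possible way to do so; `split_euc_jp` and `in_range` are hypothetical helpers, and as a simplifying assumption every byte below 0x80 (including control characters and space) is passed through as code set 0.

```python
from typing import List, Tuple

SS2, SS3 = 0x8E, 0x8F  # lead bytes that introduce code sets 2 and 3


def in_range(bs: bytes, lo: int, hi: int) -> bool:
    """True if every byte of bs lies in lo..hi inclusive."""
    return all(lo <= b <= hi for b in bs)


def split_euc_jp(data: bytes) -> List[Tuple[int, bytes]]:
    """Return (code set, raw bytes) for each character in EUC-JP data."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:        # code set 0: one byte
            code_set, width, ok = 0, 1, True
        elif lead == SS2:      # code set 2: 0x8E + one byte in 0xA1-0xDF
            code_set, width = 2, 2
            trail = data[i + 1:i + 2]
            ok = len(trail) == 1 and in_range(trail, 0xA1, 0xDF)
        elif lead == SS3:      # code set 3: 0x8F + two bytes in 0xA1-0xFE
            code_set, width = 3, 3
            trail = data[i + 1:i + 3]
            ok = len(trail) == 2 and in_range(trail, 0xA1, 0xFE)
        else:                  # code set 1: two bytes in 0xA1-0xFE
            code_set, width = 1, 2
            pair = data[i:i + 2]
            ok = len(pair) == 2 and in_range(pair, 0xA1, 0xFE)
        if not ok:
            raise ValueError(f"invalid EUC-JP sequence at offset {i}")
        out.append((code_set, data[i:i + width]))
        i += width
    return out


# One hand-built character from each code set, in the order 0, 1, 2, 3.
sample = b"a" + b"\xa4\xa2" + b"\x8e\xb1" + b"\x8f\xb0\xa1"
print([code_set for code_set, _ in split_euc_jp(sample)])  # [0, 1, 2, 3]
```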
In Japan, the EUC-JP encoding is heavily used by Unix or Unix-like operating systems (except for HP-UX), while Shift_JIS or its extensions (Windows code page 932 and MacJapanese) are used on other platforms. Therefore, whether Japanese web sites use EUC-JP or Shift_JIS often depends on what OS the author uses.
EUC-JISX0213 is similar to EUC-JP but differs in that the two planes of JIS-X-0213 take the place of JIS-X-0208 and JIS-X-0212. There is a similar relationship between Shift_JIS and Shift-JISX0213.
EUC-KR is a variable-width encoding to represent Korean text using two coded character sets, KS X 1001 (formerly KS C 5601) [http://examples.oreilly.com/cjkvinfo/AppL/ksx1001.pdf KS X 1001:1992] [http://www.itscj.ipsj.or.jp/ISO-IR/149.pdf KS C 5601:1987, 1988-10-01] and KS X 1003 (formerly KS C 5636)/ISO 646:KR/US-ASCII. KS X 2901 (formerly KS C 5861) stipulates the encoding and RFC 1557 dubbed it as EUC-KR. A character drawn from KS X 1001 (G1, code set 1) is encoded as two bytes in GR (0xA1-0xFE) and a character from KS X 1003/US-ASCII (G0, code set 0) takes one byte in GL (0x21-0x7E).
It is the most widely used legacy character encoding in Korea on all three major platforms (Unix-like OS, Windows and Mac), but its use has been very slowly decreasing as UTF-8 gains popularity, especially on Linux and Mac OS X. It is usually referred to as Wansung (완성) in South Korea. The default Korean codepage for Windows (code page 949) is a proprietary but upward-compatible extension of EUC-KR referred to as Unified Hangeul Code (통합 완성형, Tonghab Wansunghyung). Mac Korean used in classic Mac OS is also compatible with EUC-KR.
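The byte ranges and the code page 949 compatibility described above can be observed with Python's built-in `euc_kr` and `cp949` codecs. This is only an illustrative check on one short string, not a statement about every character; the variable names are arbitrary.

```python
# The word "Hangul" written in Hangul (U+D55C U+AE00) followed by ASCII "OK".
text = "\ud55c\uae00OK"

encoded = text.encode("euc_kr")
print(encoded.hex(" "))  # c7 d1 b1 db 4f 4b

# Bytes with the high bit set belong to KS X 1001 (G1) and lie in GR;
# the remaining bytes belong to the ISO-646 set (G0) and lie in GL.
gr = [b for b in encoded if b >= 0x80]
gl = [b for b in encoded if b < 0x80]
print(all(0xA1 <= b <= 0xFE for b in gr))  # True
print(all(0x21 <= b <= 0x7E for b in gl))  # True

# Subtracting 0xA0 from each GR byte gives the KS X 1001 row and cell.
row, cell = encoded[0] - 0xA0, encoded[1] - 0xA0
print(row, cell)  # 39 49

# For characters within KS X 1001, code page 949 yields the same bytes.
print(text.encode("cp949") == encoded)  # True
```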
EUC-TW is a variable-width encoding that supports US-ASCII and 16 planes of CNS 11643, each of which is 94×94. It is a rarely used encoding for traditional Chinese characters as used in Taiwan. Big5 is much more common. A character in US-ASCII (G0, code set 0) is encoded as a single byte in GL (0x21-0x7E) and a character in CNS 11643 plane 1 (code set 1) is encoded as two bytes in GR (0xA1-0xFE). A character in planes 1 through 16 of CNS 11643 (code set 2) is encoded as four bytes, with the first byte always being 0x8E (Single Shift 2) and the second byte indicating the plane (the plane number is obtained by subtracting 0xA0 from the second byte). The third and fourth bytes are in GR (0xA1-0xFE). Note that plane 1 of CNS 11643 is encoded twice: as code set 1 and as part of code set 2.
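Python's standard library does not ship an EUC-TW codec, so the layout described above is sketched here by hand. `parse_euc_tw` is a hypothetical helper that returns a (plane, row, cell) triple per character; reporting ASCII as plane 0, row 0 is a convention invented for this example.

```python
from typing import List, Tuple

SS2 = 0x8E  # Single Shift 2 introduces a four-byte code set 2 character


def gr_index(b: int) -> int:
    """Map a GR byte (0xA1-0xFE) to a 1-based row or cell number."""
    if not 0xA1 <= b <= 0xFE:
        raise ValueError(f"byte {b:#04x} is outside GR")
    return b - 0xA0


def parse_euc_tw(data: bytes) -> List[Tuple[int, int, int]]:
    """Return (plane, row, cell) per character; ASCII is (0, 0, byte)."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                      # code set 0: one byte
            out.append((0, 0, lead))
            i += 1
        elif lead == SS2:                    # code set 2: four bytes
            if i + 4 > len(data):
                raise ValueError("truncated four-byte character")
            plane = data[i + 1] - 0xA0       # second byte selects the plane
            if not 1 <= plane <= 16:
                raise ValueError(f"invalid plane byte at offset {i + 1}")
            out.append((plane, gr_index(data[i + 2]), gr_index(data[i + 3])))
            i += 4
        else:                                # code set 1: plane 1, two bytes
            if i + 2 > len(data):
                raise ValueError("truncated two-byte character")
            out.append((1, gr_index(lead), gr_index(data[i + 1])))
            i += 2
    return out


print(parse_euc_tw(b"A\xa4\xa1\x8e\xa2\xa1\xa1"))
# [(0, 0, 65), (1, 4, 1), (2, 1, 1)]

# Plane 1 has two encodings: the two-byte form and the four-byte form.
print(parse_euc_tw(b"\xa4\xa1") == parse_euc_tw(b"\x8e\xa1\xa4\xa1"))  # True
```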
Japanese language and computers
Korean language and computers
Chinese character encoding
* [http://www.rikai.com/library/kanjitables/kanji_codes.euc.shtml EUC-JP codeset table] (non-ascii part)
* [http://developers.sun.com/dev/gadc/technicalpublications/articles/gb18030.html GB18030-2000 — The New Chinese National Standard]
* [http://www.jagat.or.jp/asia/report/China3.htm The New Generation of Pre-Press Software in China] — mentions the 748 code
* [http://www.cns11643.gov.tw/web/word.jsp#euc Description of the EUC-TW code] (in Chinese)
* [http://search.cpan.org/~dankogai/Encode-JIS2K-0.02/JIS2K.pm Manual page of EUC-JISX0213] in Perl Encode module
* [http://www.opengroup.or.jp/jvc/cde/euc-e.html EUC-JP code range chart] at Opengroup Japan
* [http://www.itscj.ipsj.or.jp/ISO-IR/2-4.htm International Register of Coded Character Sets] — The coded character sets of China, Japan, South Korea, North Korea and Taiwan (ISO/IEC)
* [http://examples.oreilly.com/cjkvinfo/doc/cjk.inf Chinese, Japanese, and Korean character set standards and encoding systems]
Wikimedia Foundation. 2010.