Extended Unix Code
Extended Unix Code (EUC) is a multibyte character encoding system used primarily for Japanese, Korean, and simplified Chinese.
The structure of EUC is based on the ISO-2022 standard, which specifies a way to represent character sets containing a maximum of 94 characters, or 8836 (94²) characters, or 830584 (94³) characters, as sequences of 7-bit codes. Only ISO-2022 compliant character sets can have EUC forms. Up to four coded character sets (referred to as G0, G1, G2, and G3, or as code sets 0, 1, 2, and 3) can be represented with the EUC scheme. G0 is almost always an ISO-646 compliant coded character set (e.g. US-ASCII/KS X 1003/ISO 646:KR in EUC-KR and US-ASCII/the "lower half" of JIS X 0201 in EUC-JP) that is invoked on GL (i.e. with the most significant bit cleared).
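The character counts quoted above follow from the 94 graphic positions, 0x21 through 0x7E, that a 7-bit code offers to each byte of a character. A minimal Python sketch of that arithmetic follows; the names `GL_FIRST`, `GL_LAST`, `POSITIONS`, and `capacity` are illustrative and not part of any standard.

```python
# Graphic positions available to one byte of a 94-character set
# in a 7-bit code: 0x21 through 0x7E inclusive.
GL_FIRST, GL_LAST = 0x21, 0x7E
POSITIONS = GL_LAST - GL_FIRST + 1


def capacity(n_bytes: int) -> int:
    """Number of characters addressable with n_bytes bytes per character."""
    return POSITIONS ** n_bytes


if __name__ == "__main__":
    for n in (1, 2, 3):
        # Prints 94, 8836 and 830584 for one, two and three bytes.
        print(n, capacity(n))
```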
To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO-2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code.
The most commonly used EUC codes are variable-width encodings, with a character belonging to G0 (an ISO-646 compliant coded character set) taking one byte and a character belonging to G1 (occupied by a 94×94 coded character set) represented in two bytes. The EUC-CN form of GB2312 and EUC-KR are examples of such two-byte EUC codes. EUC-JP includes characters represented by up to three bytes, whereas a single character in EUC-TW can take up to four bytes.
EUC-CN
EUC-CN is the usual way to use the GB2312 standard for simplified Chinese characters. Unlike the case of Japanese, the ISO-2022 form of GB2312 is not normally used, though a variant form called HZ was sometimes used on USENET.
EUC-CN can also be used to encode the Unicode-based GB18030 character encoding, which includes traditional characters, although GB18030 is more frequently used without EUC encoding, since GB18030 is already a Unicode encoding. However, GB18030 encoded in EUC-CN is a variable-width encoding, because GB18030 contains more than 8836 (94×94) characters.
Related encoding systems
An encoding related to EUC-CN is the "748" code used in the WITS typesetting system developed by Beijing's Founder Technology (now obsoleted by its newer FITS typesetting system). The 748 code contains all of GB2312, but is not ISO 2022–compliant and therefore not a true EUC code. (It uses an 8-bit lead byte but distinguishes between a second byte with its most significant bit set and one with its most significant bit cleared, and is therefore more similar in structure to Big5 and other non–ISO 2022–compliant DBCS encoding systems.) The non-GB2312 portion of the 748 code contains traditional and Hong Kong characters and other glyphs used in newspaper typesetting.
EUC-JP
EUC-JP is a variable-width encoding used to represent the elements of three Japanese character set standards, namely JIS X 0208, JIS X 0212, and JIS X 0201.
* A character from JIS X 0208 (code set 1) is represented by two bytes, both in the range 0xA1 – 0xFE.
* A character from JIS X 0212 (code set 3) is represented by three bytes, the first being 0x8F, the following two in the range 0xA1 – 0xFE.
* A character from the "upper half" of JIS X 0201 (half-width kana, code set 2) is represented by two bytes, the first being 0x8E, the second in the range 0xA1 – 0xDF.
* A character from the "lower half" of JIS X 0201 (ASCII, code set 0) is represented by one byte, in the range 0x21 – 0x7E.
This encoding scheme allows the easy mixing of 7-bit ASCII and 8-bit Japanese without the need for the escape sequences employed by ISO-2022-JP, which is based on the same character set standards.
In Japan, the EUC-JP encoding is heavily used by Unix or Unix-like operating systems (except for HP-UX), while Shift_JIS or its extensions (Windows code page 932 and MacJapanese) are used on other platforms. Therefore, whether Japanese web sites use EUC-JP or Shift_JIS often depends on what OS the author uses.
EUC-JISX0213 is similar to but different from EUC-JP in that two planes of JIS X 0213 take the place of JIS X 0208 and JIS X 0212. There is a similar relationship between Shift_JIS and Shift-JISX0213.
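The byte ranges listed earlier in this section for the four EUC-JP code sets are enough to split a byte string into characters by looking only at each lead byte. The following is a minimal Python sketch under stated assumptions, not a reference decoder: the function names `split_euc_jp` and `set_high_bits` are hypothetical, all 7-bit bytes (including control characters and space) are treated as code set 0, and only Python's standard `euc_jp` and `iso2022_jp` codecs are used for the demonstration.

```python
def split_euc_jp(data: bytes) -> list:
    """Split EUC-JP bytes into (code_set, raw_bytes) pairs."""
    pieces = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # Code set 0: one 7-bit byte. Control characters and space are
            # passed through as code set 0 too (an assumption of this sketch).
            code_set, length, trail_max = 0, 1, None
        elif lead == 0x8E:
            # Code set 2: half-width kana, 0x8E plus one byte in 0xA1-0xDF.
            code_set, length, trail_max = 2, 2, 0xDF
        elif lead == 0x8F:
            # Code set 3: JIS X 0212, 0x8F plus two bytes in 0xA1-0xFE.
            code_set, length, trail_max = 3, 3, 0xFE
        elif 0xA1 <= lead <= 0xFE:
            # Code set 1: JIS X 0208, two bytes in 0xA1-0xFE.
            code_set, length, trail_max = 1, 2, 0xFE
        else:
            raise ValueError(f"unexpected lead byte 0x{lead:02X} at offset {i}")
        chunk = data[i:i + length]
        if len(chunk) < length:
            raise ValueError(f"truncated sequence at offset {i}")
        for b in chunk[1:]:
            if not 0xA1 <= b <= trail_max:
                raise ValueError(f"invalid trail byte 0x{b:02X} at offset {i}")
        pieces.append((code_set, chunk))
        i += length
    return pieces


def set_high_bits(seven_bit: bytes) -> bytes:
    """Turn 7-bit ISO-2022 code bytes into their EUC form (add 0x80)."""
    return bytes(b | 0x80 for b in seven_bit)


if __name__ == "__main__":
    # "A", hiragana A (JIS X 0208), half-width katakana A (JIS X 0201).
    sample = "A\u3042\uff71".encode("euc_jp")
    print(split_euc_jp(sample))

    # ISO-2022-JP wraps the same JIS X 0208 code in escape sequences;
    # dropping the 3-byte escapes and setting the high bits gives EUC-JP.
    iso = "\u3042".encode("iso2022_jp")
    print(set_high_bits(iso[3:-3]) == "\u3042".encode("euc_jp"))
```

Because every non-ASCII byte has its most significant bit set, the segmenter never needs to track a shift state, which is the practical difference from ISO-2022-JP noted above.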
EUC-KR
EUC-KR is a variable-width encoding to represent Korean text using two coded character sets, KS X 1001 (formerly KS C 5601) [http://examples.oreilly.com/cjkvinfo/AppL/ksx1001.pdf KS X 1001:1992] [http://www.itscj.ipsj.or.jp/ISO-IR/149.pdf KS C 5601:1987, 1988-10-01] and KS X 1003 (formerly KS C 5636)/ISO 646:KR/US-ASCII. KS X 2901 (formerly KS C 5861) stipulates the encoding and RFC 1557 dubbed it EUC-KR. A character drawn from KS X 1001 (G1, code set 1) is encoded as two bytes in GR (0xA1-0xFE) and a character from KS X 1003/US-ASCII (G0, code set 0) takes one byte in GL (0x21-0x7E).
It is the most widely used legacy character encoding in Korea on all three major platforms (Unix-like OS, Windows and Mac), but its use has been very slowly decreasing as UTF-8 gains popularity, especially on Linux and Mac OS X. It is usually referred to as Wansung (완성) in South Korea. The default Korean codepage for Windows (code page 949) is a proprietary but upward-compatible extension of EUC-KR referred to as Unified Hangeul Code (통합 완성형, Tonghab Wansunghyung). Mac Korean, used in classic Mac OS, is also compatible with EUC-KR.
EUC-TW
EUC-TW is a variable-width encoding that supports US-ASCII and 16 planes of CNS 11643, each of which is 94×94. It is a rarely used encoding for traditional Chinese characters as used in Taiwan; Big5 is much more common.
A character in US-ASCII (G0, code set 0) is encoded as a single byte in GL (0x21-0x7E) and a character in CNS 11643 plane 1 (code set 1) is encoded as two bytes in GR (0xA1-0xFE). A character in planes 1 through 16 of CNS 11643 (code set 2) is encoded as four bytes, with the first byte always being 0x8E (Single Shift 2) and the second byte indicating the plane (the plane number is obtained by subtracting 0xA0 from the second byte). The third and fourth bytes are in GR (0xA1-0xFE). Note that plane 1 of CNS 11643 is encoded twice, as code set 1 and as a part of code set 2.
See also
* CJK
* Japanese language and computers
* Korean language and computers
* Chinese character encoding
External links
* [http://www.rikai.com/library/kanjitables/kanji_codes.euc.shtml EUC-JP codeset table] (non-ascii part)
* [http://developers.sun.com/dev/gadc/technicalpublications/articles/gb18030.html GB18030-2000 — The New Chinese National Standard]
* [http://www.jagat.or.jp/asia/report/China3.htm The New Generation of Pre-Press Software in China] — mentions the 748 code
* [http://www.cns11643.gov.tw/web/word.jsp#euc Description of the EUC-TW code] (in Chinese)
* [http://search.cpan.org/~dankogai/Encode-JIS2K-0.02/JIS2K.pm Manual page of EUC-JISX0213] in Perl Encode module
* [http://www.opengroup.or.jp/jvc/cde/euc-e.html EUC-JP code range chart] at Opengroup Japan
* [http://www.itscj.ipsj.or.jp/ISO-IR/2-4.htm International Register of Coded Character Sets] — The coded character sets of China, Japan, South Korea, North Korea and Taiwan (ISO/IEC)
* [http://examples.oreilly.com/cjkvinfo/doc/cjk.inf Chinese, Japanese, and Korean character set standards and encoding systems]
Wikimedia Foundation. 2010.