Punycode

Punycode is an encoding syntax by which a Unicode string of characters can be translated into the more limited character set permitted in network host names. The encoding syntax is published on the Internet in Request for Comments 3492.

The encoding is used as part of IDNA, a system that enables the use of internationalized domain names in all scripts supported by Unicode, where the burden of translation lies entirely with the user application (a web browser, for example). IDNA performs some significant pre- and post-processing in addition to its use of Punycode. For further information on IDNA, including the pre- and post-processing and the spoofing concerns, see the Internationalized domain name article.

Encoding procedure

This section demonstrates the procedure for Punycode encoding, showing how the string "bücher" is encoded as "bcher-kva".

Separation of ASCII characters

First, all basic (ASCII) characters in the string are copied directly from input to output, skipping over other characters (e.g. "bücher" → "bcher"). If and only if one or more basic characters were copied, an ASCII hyphen is then added to the output (e.g. "bücher" → "bcher-").
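The copying step can be sketched in a few lines of Python. This is only an illustration of the rule described above; the function name copy_basic is an assumption made for this example and does not come from the specification.

```python
def copy_basic(text):
    """Copy the basic (ASCII) characters and append the delimiter if any were copied."""
    # Keep only code points below 128, in their original order.
    basic = "".join(ch for ch in text if ord(ch) < 128)
    # The hyphen is added if and only if at least one basic character was copied.
    return basic + "-" if basic else basic


prefix = copy_basic("bücher")   # "bcher-"
no_basic = copy_basic("ü")      # "" (no basic characters, so no hyphen)
```

The second call shows the "if and only if" condition: a string with no ASCII characters produces no hyphen at all.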

Encoding of non-ASCII character insertions as code numbers

To understand the next part of the encoding process we first need to understand the behaviour of the decoder. The decoder is a state machine with two state variables "i" and "n". "i" is an index into the string ranging from zero (representing a potential insertion at the start) to the current length of the extended string (representing a potential insertion at the end).

"i" starts at zero while "n" starts at 128 (the first non-ASCII code point). The state progression is monotonic. A state change either increments "i" or if "i" is at its maximum resets "i" to zero and increments "n". At each state change either the code point denoted by "n" is inserted or it is not inserted.

The code numbers generated by the encoder encode how many possibilities the decoder should skip before an insertion is made. "ü" has code point 252. The basic string "bcher" has five characters, so there are six possible insertion positions (zero through five). Before reaching the possibility of inserting ü at position one, the decoder must skip over six potential insertions of each of the 124 preceding non-ASCII code points (252 − 128, where 128 is the first non-ASCII code point) and one possible insertion (at position zero) of code point 252. The decoder must therefore be told to skip a total of (6 × 124) + 1 = 745 possible insertions before reaching the required one.
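The state progression and the skip count can be checked with a small simulation. The sketch below follows the description above literally, one state change at a time; the names step and skip are assumptions chosen for this illustration, and a real decoder would use division and remainder rather than a loop.

```python
FIRST_NON_ASCII = 128  # starting value of "n"


def step(n, i, length):
    """One state change: increment i, or reset i to zero and increment n."""
    if i < length:
        return n, i + 1
    return n + 1, 0


def skip(n, i, length, count):
    """Pass over `count` possible insertions without inserting anything."""
    for _ in range(count):
        n, i = step(n, i, length)
    return n, i


basic = "bcher"
# Six positions for each of the 124 earlier code points, plus position zero of 252.
skips = (ord("ü") - FIRST_NON_ASCII) * (len(basic) + 1) + 1   # 745
n, i = skip(FIRST_NON_ASCII, 0, len(basic), skips)            # (252, 1)
result = basic[:i] + chr(n) + basic[i:]                       # "bücher"
```

The loop ends in the same state as the closed form n = 128 + 745 // 6 and i = 745 % 6, which is how the skip is normally applied in one step.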

Re-encoding of code numbers as ASCII sequences

Punycode uses generalized variable-length integers to represent these values. For example, this is how "kva" is used to represent the code number 745:

A little-endian number system is used, which allows variable-length codes without separate delimiters: a digit lower than a threshold value marks itself as the most significant digit, and hence the end of the number. The threshold value depends on the position in the number and also on previous insertions, to increase efficiency. Correspondingly, the weights of the digits vary.

In this case a number system with 36 "digits" is used, with the case-insensitive 'a' through 'z' equal to the numbers 0 through 25, and '0' through '9' equal to 26 through 35. Thus "kva" corresponds to "10 21 0".
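The digit mapping is simple enough to write down directly. The following is a minimal sketch; the function name digit_value is an assumption for this example.

```python
def digit_value(ch):
    """Map one Punycode "digit" to its numeric value (0 through 35)."""
    ch = ch.lower()  # letters are case-insensitive
    if "a" <= ch <= "z":
        return ord(ch) - ord("a")        # 'a'..'z' -> 0..25
    if "0" <= ch <= "9":
        return ord(ch) - ord("0") + 26   # '0'..'9' -> 26..35
    raise ValueError(f"not a Punycode digit: {ch!r}")


values = [digit_value(ch) for ch in "kva"]   # [10, 21, 0]
```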

To decode this string of "digits", both the threshold and the weight start out as 1. The first digit is the units digit; 10 with a weight of 1 equals 10. Since 10 is not less than the threshold of 1, there is more to come. The weight of each following "digit" is the previous weight times 36 minus the threshold just used, in this case 1 × (36 − 1) = 35. With the initial parameters of RFC 3492 the threshold for the second "digit" is still 1, so the sum of the first two "digits" is 10 × 1 + 21 × 35, and since 21 is not less than 1, the number continues. The weight of the third "digit" is 35 × (36 − 1) = 1225, and its threshold is 26. The third "digit" in this example is 0, which is less than 26, meaning that it is the last (most significant) part of the number. Therefore "kva" represents the number 10 × 1 + 21 × 35 + 0 × 1225 = 745.
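The decoding of a single number can be sketched as follows. The constants (base 36, minimum threshold 1, maximum threshold 26, initial bias 72) are taken from RFC 3492 and are not stated in the text above; with them, the thresholds for the three digit positions of the first number work out to 1, 1 and 26. The function names are assumptions, and the adjustment of the bias between successive numbers is deliberately left out.

```python
BASE, TMIN, TMAX, INITIAL_BIAS = 36, 1, 26, 72   # parameter values from RFC 3492
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"


def threshold(k, bias):
    """Threshold for the digit at position k (k = 36, 72, 108, ...)."""
    return min(max(k - bias, TMIN), TMAX)


def decode_number(digits, bias=INITIAL_BIAS):
    """Decode one variable-length integer; return (value, remaining digits)."""
    value, weight, k = 0, 1, BASE
    for pos, ch in enumerate(digits):
        d = DIGITS.index(ch.lower())
        value += d * weight
        t = threshold(k, bias)
        if d < t:
            # A digit below the threshold is the most significant one.
            return value, digits[pos + 1:]
        weight *= BASE - t
        k += BASE
    raise ValueError("input ended in the middle of a number")


value, rest = decode_number("kva")   # (745, "")
```

Because the terminating digit is recognised by its value alone, no separator is needed between consecutive numbers; the returned remainder is simply where the next number starts.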

For the insertion of a second special character in "bücher", the first possibility is "büücher" with code "bcher-kvaa", the second "bücüher" with code "bcher-kvab", etc. After "bücherü" with code "bcher-kvae" comes "ýbücher" with code "bcher-kvaf", etc.
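These codes can be reproduced with the "punycode" codec that ships with Python's standard library. The snippet is only a convenience check of the examples listed above, assuming that codec follows RFC 3492.

```python
examples = ["bücher", "büücher", "bücüher", "bücherü", "ýbücher"]

# Encode each string and keep the result as ordinary text for display.
encoded = {text: text.encode("punycode").decode("ascii") for text in examples}

for text, code in encoded.items():
    print(f"{text} -> {code}")

# The codec also works in the other direction.
round_trip = b"bcher-kvaf".decode("punycode")
```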

To keep the encoding and decoding algorithms simple, no attempt has been made to prevent some encoded values from encoding inadmissible Unicode values; however, these should be checked for and detected during decoding.
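A compact decoder with such a check might look like the sketch below. It is not a reference implementation: the constants and the adapt step are taken from RFC 3492, the function names are assumptions, and "inadmissible" is interpreted here (as an assumption) to mean code points above U+10FFFF or in the surrogate range. Some edge cases of the delimiter handling are simplified.

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700   # RFC 3492 parameters
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"


def adapt(delta, num_points, first_time):
    """Recompute the bias after each insertion (RFC 3492, section 6.1)."""
    delta = delta // DAMP if first_time else delta // 2
    delta += delta // num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + (BASE - TMIN + 1) * delta // (delta + SKEW)


def decode(encoded):
    # Everything before the last hyphen is the literal basic part.
    basic, _, extended = encoded.rpartition("-")
    if any(ord(ch) >= 128 for ch in basic):
        raise ValueError("non-basic character before the delimiter")
    output = list(basic)
    n, i, bias = INITIAL_N, 0, INITIAL_BIAS
    pos = 0
    while pos < len(extended):
        old_i, weight, k = i, 1, BASE
        while True:  # read one variable-length integer into i
            if pos >= len(extended):
                raise ValueError("input ended in the middle of a number")
            digit = DIGITS.index(extended[pos].lower())
            pos += 1
            i += digit * weight
            t = min(max(k - bias, TMIN), TMAX)
            if digit < t:
                break
            weight *= BASE - t
            k += BASE
        positions = len(output) + 1
        bias = adapt(i - old_i, positions, old_i == 0)
        n += i // positions
        i %= positions
        # Assumed admissibility check: reject out-of-range values and surrogates.
        if n > 0x10FFFF or 0xD800 <= n <= 0xDFFF:
            raise ValueError(f"inadmissible code point: {n:#x}")
        output.insert(i, chr(n))
        i += 1
    return "".join(output)


decoded = decode("bcher-kva")   # "bücher"
```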

As an example, the ASCII "punycoded" URL http://xn--tdali-d8a8w.lv/ is the encoded form of http://tūdaliņ.lv, which contains the Latvian "u with a macron" and "n with cedilla" instead of the unmarked base characters.
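The relationship between the two forms of this host name can be shown with Python's standard codecs. Note that the "xn--" prefix is added by IDNA rather than by Punycode itself, so the "punycode" codec returns the label without it. The "idna" codec in the standard library implements the older IDNA 2003 rules, which is an assumption that suffices for this particular name.

```python
label = "tūdaliņ"

# Punycode alone yields the bare encoded label.
bare = label.encode("punycode")        # b"tdali-d8a8w"

# IDNA marks an encoded label with the "xn--" prefix.
ace_label = b"xn--" + bare             # b"xn--tdali-d8a8w"

# The "idna" codec handles a whole host name, label by label.
host = "tūdaliņ.lv".encode("idna")     # b"xn--tdali-d8a8w.lv"
```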

Punycode is designed to work across all scripts, and to be self-optimizing by attempting to adapt to the character set ranges within the string as it operates. It is optimized for the case where the string is composed of zero or more ASCII characters and in addition characters from only one other script system, but will cope with any arbitrary Unicode string. Note that for DNS use, the domain name string is assumed to have been normalized using Nameprep and (for top-level domains) filtered against an officially registered language table before being punycoded, and that the DNS protocol sets limits on the acceptable lengths of the output Punycode string.

External links

* RFC 3492 (see also a slightly clarified draft)
* [http://rfc-ref.org/RFC-TEXTS/3492/kw-algorithm.html Punycode encoding and decoding]
* [http://www.nameisp.com/puny.asp Punycode converter]
* [http://www.motobit.com/util/punycode-decoder-encoder.asp Online Punycode/IDN Decoder/Encoder]
* [http://www.gnu.org/software/libidn/ GNU IDN Library—Libidn]
* [http://demo.icu-project.org/icu-bin/idnbrowser ICU IDNA Demonstration] An online demonstration of how ICU performs IDN operations
* [http://punycode.bluerider.com/idn/ Punycode for Domains] Convert Unicode to Punycode
* [http://www.mozilla.org/projects/security/tld-idn-policy-list.html List of TLDs considered by the Mozilla developers to have an effective anti-spoofing policy for name registration]
* [http://blogs.msdn.com/ie/archive/2006/07/31/684337.aspx IDN and Punycode in IE7]
* [http://webnavi.nida.or.kr/idnconv/ Punycode converter for Korean]


Wikimedia Foundation. 2010.
