Collating sequence

The term collating sequence refers to the order in which individual characters should be taken when sorting a collection of character strings using dictionary order. This article is concerned with the order of the alphabetical characters comprising variants of the Latin alphabet in various languages. For other writing systems, see Collation.

Contents

1 General issues
2 The basic collating sequence of the Latin alphabet
3 Collating sequences in various languages that use a Latin-derived alphabet
4 See also
5 External links and references
6 Notes

General issues

In a computer system, each character is assigned a unique numeric code (as in the ASCII or Unicode character set), but the proper and customary ordering of strings is not performed by a simple numeric comparison of those codes. Rather, the ordering is determined by reference to the collating sequence.
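As a rough illustration of why comparing raw codes falls short, the following Python sketch sorts a few words with the built-in sorted(), which compares strings code point by code point. The word list is made up for this example.

```python
# Illustrative word list (an assumption, not taken from any dictionary).
words = [
    "zebra",
    "Zurich",
    "\u00e9cole",  # "école", written with a precomposed é (U+00E9)
    "apple",
]

# sorted() on strings compares Unicode code points, not dictionary order.
by_code_point = sorted(words)

# The code point of each first character explains the result.
first_codes = {word: ord(word[0]) for word in words}

print(by_code_point)
print(first_codes)
```

Here "Zurich" lands before "apple" because Z (90) precedes a (97), and "école" lands last because é (233) follows every unaccented letter.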

A general issue in sorting in dictionary order is whether two characters having different shapes are considered the same letter or different letters. In particular:

Majuscules (capital letters) and minuscules (lower-case letters): an upper-case A and a lower-case a are usually considered to be the same letter, so when sorting, the name Abraham comes between aardvark and accu.
Diacritics: various languages use marks over and around letters, but again for sorting purposes the characters may be considered to be the same letter. For example, in French dictionaries the word école comes between ecmnésie and ectoplasme, and in German dictionaries the word ökonomisch comes between offenbar and olfaktorisch. On the other hand, Turkish dictionaries treat o and ö as different letters, and oyun comes before öbür.
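The two points above can be sketched in Python as a sort key that folds case and drops diacritics. The function name base_letters_key is an assumption made for this example; it relies only on the standard unicodedata module and is a simplification, not a full collation algorithm.

```python
import unicodedata

def base_letters_key(word: str) -> str:
    """Fold case and drop combining marks so that 'É' and 'e' compare equal."""
    decomposed = unicodedata.normalize("NFD", word.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

english = sorted(["accu", "Abraham", "aardvark"], key=base_letters_key)
french = sorted(["ectoplasme", "école", "ecmnésie"], key=base_letters_key)
german = sorted(["olfaktorisch", "ökonomisch", "offenbar"], key=base_letters_key)

print(english)  # Abraham falls between aardvark and accu
print(french)   # école falls between ecmnésie and ectoplasme
print(german)   # ökonomisch falls between offenbar and olfaktorisch
```

The same key would place öbür before oyun, which is not the Turkish order, so a key of this kind only suits languages that treat accented letters as variants of the base letter.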

In some cases a digraph or trigraph is considered a single letter; for example, in Welsh the combination ch is one letter, and in dictionaries cymal comes before chwaer. Conversely, sometimes single characters may be sorted as if they were a sequence of other characters.

In several languages the rules have changed over time, and so older dictionaries may use a different order than modern ones. Furthermore, collation may depend on use. For example, German dictionaries and telephone directories use different approaches.

The basic collating sequence of the Latin alphabet

The collating sequence of the standard 26-letter Latin alphabet is as follows:

A · B · C · D · E · F · G · H · I · J · K · L · M · N · O · P · Q · R · S · T · U · V · W · X · Y · Z

Collating sequences in various languages that use a Latin-derived alphabet

Some languages use a Latin-derived alphabet that includes modified letters, ligatures, or digraphs for orthographic and collation purposes. Their treatment varies from language to language, and sometimes from symbol to symbol within the same language. Listed below are the collation orders in various languages.

In Azerbaijani, there are eight additional letters: five vowels (i, ı, ö, ü, ə) and three consonants (ç, ş, ğ). The alphabet is the same as the Turkish alphabet, with the same sounds written with the same letters, except for three additional letters, q, x and ə, for sounds that do not exist in Turkish. All the "Turkish letters" are collated in their "normal" alphabetical order as in Turkish, while the three extra letters are collated after letters whose sounds approach theirs: q just after k, x (pronounced like a German ch) just after h, and ə (pronounced roughly like an English short a) just after e.
In Breton, there is no "c" but there are the digraphs "ch" and "c'h", which are collated between "b" and "d". For example: « buzhugenn, chug, c'hoar, daeraouenn » (earthworm, juice, sister, teardrop).
In Bosnian, Croatian, Serbian and other related South Slavic languages, the five accented characters and three conjoined characters are sorted after the originals: ..., C, Č, Ć, D, DŽ, Đ, E, ..., L, LJ, M, N, NJ, O, ..., S, Š, T, ..., Z, Ž.
In Czech and Slovak, accented vowels have secondary collating weight: compared to other letters, they are treated as their unaccented forms (A-Á, E-É-Ě, I-Í, O-Ó-Ô, U-Ú-Ů, Y-Ý), but then they are sorted after the unaccented letters (for example, the correct lexicographic order is baa, baá, báa, bab, báb, bac, bác, bač, báč). Accented consonants (the ones with a caron) have primary collating weight and are collated immediately after their unaccented counterparts, with the exception of Ď, Ň and Ť, which again have secondary weight. CH is considered to be a separate letter and goes between H and I. In Slovak, DZ and DŽ are also considered separate letters and are positioned between Ď and E (A-Á-Ä-B-C-Č-D-Ď-DZ-DŽ-E-É…).
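The distinction between primary and secondary weight can be sketched as a two-part sort key: the first part compares base letters only, and the second part breaks ties by accent. The Python below is a simplified sketch for Czech only; the names PRIMARY, SECONDARY and czech_key, and the numeric weights, are assumptions made for this example, and input containing characters outside the listed alphabet is not handled.

```python
import unicodedata

# Letters with primary weight, in Czech order; "ch" sits between h and i.
PRIMARY = ["a", "b", "c", "č", "d", "e", "f", "g", "h", "ch", "i", "j", "k",
           "l", "m", "n", "o", "p", "q", "r", "ř", "s", "š", "t", "u", "v",
           "w", "x", "y", "z", "ž"]
RANK = {letter: i for i, letter in enumerate(PRIMARY)}

# Letters that differ from their base letter only at the secondary level.
# The weights 1 and 2 are illustrative tie-breakers.
SECONDARY = {"á": ("a", 1), "é": ("e", 1), "ě": ("e", 2), "í": ("i", 1),
             "ó": ("o", 1), "ú": ("u", 1), "ů": ("u", 2), "ý": ("y", 1),
             "ď": ("d", 1), "ň": ("n", 1), "ť": ("t", 1)}

def czech_key(word):
    """Return (primary ranks, secondary weights) for a lower-cased word."""
    text = unicodedata.normalize("NFC", word.lower())
    primary, secondary = [], []
    i = 0
    while i < len(text):
        if text.startswith("ch", i):
            letter, i = "ch", i + 2
        else:
            letter, i = text[i], i + 1
        base, weight = SECONDARY.get(letter, (letter, 0))
        primary.append(RANK[base])   # KeyError for characters not in PRIMARY
        secondary.append(weight)
    return (primary, secondary)

words = ["báč", "bab", "baá", "bac", "báa", "bač", "baa", "bác", "báb"]
print(sorted(words, key=czech_key))
```

Because the whole list of primary ranks is compared before any secondary weight is consulted, an accent only matters between words whose base letters are identical, which reproduces the order given above.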
In the Danish and Norwegian alphabets, the same extra vowels as in Swedish (see below) are also present but in a different order and with different glyphs (..., X, Y, Z, Æ, Ø, Å). Also, "Aa" collates as an equivalent to "Å". The Danish alphabet has traditionally seen "W" as a variant of "V", but today "W" is considered a separate letter.
In Dutch the combination IJ (representing Ĳ) was formerly to be collated as Y (or sometimes, as a separate letter Y < IJ < Z), but is currently mostly collated as 2 letters (II < IJ < IK). Exceptions are phone directories; IJ is always collated as Y here because in many Dutch family names Y is used where modern spelling would require IJ. Note that a word starting with ij that is written with a capital I is also written with a capital J, for example, the town IJmuiden and the river IJssel.
In English, diacritics may occur in loanwords, such as the word rôle. Increasingly, however, these are omitted in modern orthography. When the mark is written, the word is nevertheless sorted as if it were absent: rôle comes between rock and rose.
In Esperanto, consonants with circumflex accents (ĉ, ĝ, ĥ, ĵ, ŝ), as well as ŭ (u with breve), are counted as separate letters and collated separately (c, ĉ, d, e, f, g, ĝ, h, ĥ, i, j, ĵ ... s, ŝ, t, u, ŭ, v, z).
In Estonian õ, ä, ö and ü are considered separate letters and collate after w. Letters š, z and ž appear in loanwords and foreign proper names only and follow the letter s in the Estonian alphabet, which otherwise does not differ from the basic Latin alphabet.
The Faroese alphabet also has some of the Danish, Norwegian, and Swedish extra letters, namely Æ and Ø. Furthermore, the Faroese alphabet uses the Icelandic eth, which follows the D. Five of the six vowels (A, I, O, U and Y) can take an accent, and the accented forms are then considered separate letters. The consonants C, Q, X, W and Z are not found. Therefore the first five letters are A, Á, B, D and Ð, and the last five are V, Y, Ý, Æ, Ø.
In Filipino (Tagalog) and other Philippine languages, the letter Ng is treated as a separate letter. It is pronounced as in sing, ping-pong, etc. By itself, it is pronounced nang, but in general Filipino orthography, it is spelled as if it were two separate letters (n and g). Also, letter derivatives (such as Ñ) immediately follow the base letter. Filipino is also written with diacritics, but their use is very rare (except the tilde). (Philippine orthography also includes spelling.)
The Finnish alphabet and collating rules are the same as in Swedish, except for the addition of the letters Š and Ž, which are considered variants of S and Z.
In French, the last accent in a given word determines the order [1]. For example, the following four words are sorted this way: cote < côte < coté < côté.
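One way to picture this rule is a two-part key: base letters first, then the accents compared from the end of the word backwards. The Python sketch below follows that idea; split_accents, french_key and forward_key are names invented for this example, and the relative order of different accents is simply the code point of the combining mark, which is an arbitrary choice made here.

```python
import unicodedata

def split_accents(word):
    """Return (base letters, list of accent marks per letter) for a word."""
    bases, accents = [], []
    for ch in unicodedata.normalize("NFD", word.lower()):
        if unicodedata.combining(ch):
            if accents:              # attach the mark to the preceding letter
                accents[-1] += ch
        else:
            bases.append(ch)
            accents.append("")       # "" means no accent on this letter
    return "".join(bases), accents

def french_key(word):
    """Base letters first, then accents compared from the last letter back."""
    base, accents = split_accents(word)
    return (base, accents[::-1])

def forward_key(word):
    """Same key with accents compared left to right, for contrast."""
    base, accents = split_accents(word)
    return (base, accents)

words = ["côté", "coté", "côte", "cote"]
print(sorted(words, key=french_key))   # last accent decides
print(sorted(words, key=forward_key))  # first accent decides
```

With the reversed accent list the result matches the order above, while the left-to-right variant swaps côte and coté.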
In German, letters with an umlaut (Ä, Ö, Ü) are generally treated just like their non-umlauted versions; ß is always sorted as ss. This makes the alphabetic order Arg, Ärgerlich, Arm, Assistent, Aßlar, Assoziation. For phone directories and similar lists of names, the umlauts are to be collated like the letter combinations "ae", "oe", "ue". This makes the alphabetic order Udet, Übelacker, Uell, Ülle, Ueve, Üxküll, Uffenbach.
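The two German conventions differ only in how the umlauts are rewritten before comparison, which the following Python sketch makes explicit. The table names DICTIONARY and PHONEBOOK and the function german_key are assumptions for this example; ties between words that fold to the same key are left in input order.

```python
import unicodedata

# Dictionary style: umlauts count as their base vowels, ß as ss.
DICTIONARY = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
# Phone-directory style: umlauts expand to ae, oe, ue.
PHONEBOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def german_key(word, table):
    """Lower-case the word and rewrite umlauts and ß using the given table."""
    text = unicodedata.normalize("NFC", word.lower())
    return text.translate(table)

dictionary_words = ["Assoziation", "Arm", "Aßlar", "Ärgerlich", "Assistent", "Arg"]
directory_names = ["Uffenbach", "Üxküll", "Ueve", "Ülle", "Uell", "Übelacker", "Udet"]

print(sorted(dictionary_words, key=lambda w: german_key(w, DICTIONARY)))
print(sorted(directory_names, key=lambda w: german_key(w, PHONEBOOK)))
```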
The Hungarian vowels have accents, umlauts, and double accents, while consonants are written with single, double (digraph) or triple (trigraph) characters. In collating, accented vowels are equivalent to their non-accented counterparts, and double and triple characters follow their single originals. Hungarian alphabetic order is: A=Á, B, C, CS, D, DZ, DZS, E=É, F, G, GY, H, I=Í, J, K, L, LY, M, N, NY, O=Ó, Ö=Ő, P, Q, R, S, SZ, T, TY, U=Ú, Ü=Ű, V, W, X, Y, Z, ZS. (For example, the correct lexicographic order is baa, baá, bab, báb, bac, bác, bacs, bács, bad, bád, ...). (Before approximately 1988, dz and dzs were not considered single letters for collation, but two letters each: d+z and d+zs.)
In Icelandic, Þ is added, and D is followed by Ð. Each vowel (A, E, I, O, U, Y) is followed by its correspondent with acute: Á, É, Í, Ó, Ú, Ý. There is no Z, so the alphabet ends: ... X, Y, Ý, Þ, Æ, Ö.
- Þ (called thorn; lowercase þ) is also a Runic letter.
- Ð (called eth; lowercase ð) is the letter D with an added stroke.
- Both letters were also used by Anglo-Saxon scribes, who also used the Runic letter Wynn to represent /w/.
In Lithuanian, specifically Lithuanian letters go after their Latin originals. Another change is that Y comes just before J: ... G, H, I, Į, Y, J, K...
In Polish, specifically Polish letters derived from the Latin alphabet are collated after their originals: A, Ą, B, C, Ć, D, E, Ę, ..., L, Ł, M, N, Ń, O, Ó, P, ..., S, Ś, T, ..., Z, Ź, Ż. The digraphs for collation purposes are treated as if they were two separate letters.
In Portuguese, the collating order is just like in English, including the three letters not native to Portuguese: A, B, C, D, E, F, G, H, I, J, (K), L, M, N, O, P, Q, R, S, T, U, V, (W), X, (Y), Z. Digraphs and letters with diacritics are not included in the alphabet.
In Romanian, special characters derived from the Latin alphabet are collated after their originals: A, Ă, Â, ..., I, Î, ..., S, Ș, T, Ț, ..., Z.
In the Swedish alphabet, there are three extra vowels placed at its end (..., X, Y, Z, Å, Ä, Ö), similar to the Danish and Norwegian alphabets, but with different glyphs and a different collating order. The letter "W" was traditionally treated as a variant of "V", but in the 13th edition of Svenska Akademiens ordlista (2006) "W" was considered a separate letter.
Until 1994, Spanish treated "CH" and "LL" as single letters, giving an ordering of cinco, credo, chispa and lomo, luz, llama. This is no longer the case: in 1994 the RAE adopted the more conventional usage, so LL is now collated between LK and LM, and CH between CG and CI. The six accented or umlauted characters Á, É, Í, Ó, Ú, Ü are treated as the original letters A, E, I, O, U, for example: radio, ráfaga, rana, rápido, rastrillo. The only Spanish-specific collation rule is that Ñ (eñe) is a separate letter collated after N.
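The Spanish rules, both before and after the change, can be approximated with one key function and a switch. In the Python sketch below, spanish_key and its traditional parameter are names assumed for this example. Each letter becomes a pair (base letter, 0 or 1), where 1 places Ñ after N and, in the traditional mode, CH after C and LL after L.

```python
import unicodedata

def spanish_key(word, traditional=False):
    """Sort key: accents are ignored, ñ follows n; optionally ch and ll are letters."""
    letters = []
    for ch in unicodedata.normalize("NFD", word.lower()):
        if ch == "\u0303" and letters and letters[-1] == "n":
            letters[-1] = "ñ"            # keep the tilde on n as a distinct letter
        elif not unicodedata.combining(ch):
            letters.append(ch)           # every other accent mark is dropped
    key = []
    i = 0
    while i < len(letters):
        pair = "".join(letters[i:i + 2])
        if traditional and pair in ("ch", "ll"):
            key.append((pair[0], 1))     # sorts after every plain c... or l...
            i += 2
        else:
            key.append(("n", 1) if letters[i] == "ñ" else (letters[i], 0))
            i += 1
    return key

words = ["chispa", "credo", "cinco", "llama", "luz", "lomo"]
print(sorted(words, key=spanish_key))
print(sorted(words, key=lambda w: spanish_key(w, traditional=True)))
print(sorted(["rastrillo", "rápido", "rana", "ráfaga", "radio"], key=spanish_key))
print(sorted(["oso", "ñandú", "nube", "nota"], key=spanish_key))
```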
In the Turkish alphabet there are six additional letters: ç, ğ, ı, ö, ş, and ü (but no q, w, and x). They are collated with ç after c, ğ after g, ı before i, ö after o, ş after s, and ü after u. Originally, when the alphabet was introduced in 1928, ı was collated after i, but the order was changed later so that letters having shapes containing dots, cedillas or other adorning marks always follow the letters with corresponding bare shapes. Note that in Turkish orthography the letter I is the majuscule of dotless ı, whereas İ is the majuscule of dotted i.
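Since every Turkish letter has its own position, a simple approach is to rank characters by their index in the alphabet. The Python sketch below does this; ALPHABET, turkish_lower and turkish_key are names assumed for this example. The I/ı and İ/i pairs are mapped by hand because Python's str.lower() is not locale-aware.

```python
import unicodedata

# The 29 letters of the modern Turkish alphabet, in collating order.
ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"
RANK = {letter: i for i, letter in enumerate(ALPHABET)}

def turkish_lower(word):
    """Lower-case with Turkish rules: I -> ı and İ -> i."""
    text = unicodedata.normalize("NFC", word)
    return text.replace("I", "ı").replace("İ", "i").lower()

def turkish_key(word):
    """Rank each letter by alphabet position; other characters sort last."""
    return [RANK.get(ch, len(ALPHABET) + ord(ch)) for ch in turkish_lower(word)]

words = ["İzmir", "iz", "dağ", "oyun", "Isparta", "çay", "öbür", "ırmak", "cam"]
print(sorted(words, key=turkish_key))
```

A plain code point sort would instead put dağ before çay and iz before ırmak.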
In many Turkic languages (such as Azeri or the Jaŋalif orthography for Tatar), there used to be the letter Gha (Ƣƣ), which came between G and H. It has now fallen into disuse.
Welsh also has complex rules: the combinations CH, DD, FF, NG, LL, PH, RH and TH are all considered single letters, and each is listed after the letter which is the first character in the combination, with the exception of NG which is listed after G. However, the situation is further complicated by these combinations not always being single letters. An example ordering is LAWR, LWCUS, LLONG, LLOM, LLONGYFARCH: the last of these words is a juxtaposition of LLON and GYFARCH, and, unlike LLONG, does not contain the letter NG.
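The Welsh case shows why digraph handling cannot be purely mechanical. The Python sketch below splits words greedily into letters of the Welsh alphabet and ranks them. Because the code cannot know where a compound is joined, it relies on a hypothetical convention invented for this example: the caller writes "|" at a boundary where two characters must not be merged, as in LLON|GYFARCH. The names LETTERS and welsh_key are likewise assumptions, and accented vowels are not handled.

```python
# Welsh letters in collating order; digraphs count as single letters.
LETTERS = ["a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng", "h",
           "i", "j", "l", "ll", "m", "n", "o", "p", "ph", "r", "rh", "s",
           "t", "th", "u", "w", "y"]
RANK = {letter: i for i, letter in enumerate(LETTERS)}

def welsh_key(word):
    """Greedy key; a "|" in the input blocks a digraph across that point."""
    text = word.lower()
    key = []
    i = 0
    while i < len(text):
        if text[i] == "|":
            i += 1
            continue
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            key.append(RANK[pair])       # digraph treated as one letter
            i += 2
        else:
            # Characters outside the alphabet sort after it, by code point.
            key.append(RANK.get(text[i], len(LETTERS) + ord(text[i])))
            i += 1
    return key

marked = ["LLON|GYFARCH", "LLOM", "LAWR", "LLONG", "LWCUS"]
print(sorted(marked, key=welsh_key))

unmarked = ["LLONGYFARCH", "LLOM", "LLONG"]
print(sorted(unmarked, key=welsh_key))
```

Without the boundary marker the greedy split reads NG in LLONGYFARCH as a single letter and sorts the word before LLOM, which is why practical implementations need word-specific knowledge.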

The Unicode Collation Algorithm can be used to get any of the collation sequences described above, by tailoring its default collation table. Several such tailorings are collected in the Common Locale Data Repository.

See also

External links and references

ICU Locale Explorer: an online demonstration of sorting in different languages that uses the Unicode Collation Algorithm with International Components for Unicode.

Notes

[1] "Unicode Technical Standard #10". Unicode, Inc. (unicode.org). 2008-03-28. http://www.unicode.org/unicode/reports/tr10/. Retrieved 2008-08-27.

Wikimedia Foundation. 2010.
