Unicode scripts

In Unicode, a script is a collection of letters and other written signs used to represent textual information in one or more writing systems. [Glossary of Unicode Terms, http://unicode.org/glossary/] Some scripts support one and only one writing system and language, for example Armenian. Other scripts support many: the Latin script is used to write English, French, German, Italian, Vietnamese, Latin and many other languages. Some languages have also used more than one script over time. Turkish, for example, used the Arabic script before the 20th century and transitioned to Latin in the early part of that century. For a list of languages supported by each script, see the list of languages by writing system.

When multiple languages make use of the same script, there are frequently some differences, particularly in diacritics and other marks. For example, Swedish and English both use the Latin script. However, Swedish includes the character ‘å’ (sometimes called a “Swedish O”) while English has no such character, nor does English use the diacritic combining ring above (U+030A) with any character. In general, languages sharing the same script share many of the same characters. Despite these peripheral differences, the Swedish and English writing systems are both said to use the Latin script. The Unicode abstraction of scripts is thus a basic organizing technique. The differences between alphabets or writing systems remain, and are supported through Unicode’s flexible scripts, combining marks and collation algorithms.
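As a small illustration of how combining marks support a language-specific letter such as ‘å’, the sketch below uses Python's standard `unicodedata` module. It assumes the diacritic in question is U+030A COMBINING RING ABOVE and shows that the two-code-point sequence and the precomposed letter are interchangeable under normalization. The variable names are chosen for this example only.

```python
import unicodedata

# 'a' followed by U+030A COMBINING RING ABOVE: two code points, one visible letter.
decomposed = "a\u030A"

# NFC normalization composes the pair into the single precomposed letter U+00E5.
composed = unicodedata.normalize("NFC", decomposed)

# NFD normalization goes the other way, splitting U+00E5 into base + mark.
round_trip = unicodedata.normalize("NFD", composed)

print(len(decomposed), len(composed))   # code point counts before and after composing
print(unicodedata.name(composed))       # official character name of the composed letter
```

Both forms belong to the same Latin writing tradition; only the encoding of the diacritic differs.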

While all characters have the property of belonging to a script, many characters, such as symbols, have “common” or “inherited” as their script property. The unified diacritical characters and unified punctuation characters frequently have the “common” or “inherited” script property. However, the individual scripts often have their own punctuation and diacritics. As a result, many scripts include not only letters but also diacritics and other marks, punctuation, numerals and even their own idiosyncratic symbols and space characters.

Unicode already includes over 60 scripts supporting hundreds or even thousands of languages throughout the world. Unicode is actively working on many more, as indicated by its roadmap.

Writing system

Main article: Writing system

Writing system is sometimes treated as a synonym for script. However, it can also refer to the specific, concrete writing system supported by a script. For example, the Vietnamese writing system is supported by the Latin script. A writing system may also cover more than one script: the Japanese writing system, for example, makes use of the Han, Hiragana and Katakana scripts.
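The point that one writing system can span several scripts can be sketched in Python. The standard library does not expose the Unicode script property, so this example falls back on a rough heuristic: it guesses the script from the start of each character's official name. The prefix table and both function names are hypothetical, and the heuristic is not reliable in general (some characters whose names begin with a script name are actually assigned to the common script).

```python
import unicodedata

# Hypothetical prefix table: maps the start of a character name to a script name.
NAME_PREFIX_TO_SCRIPT = {
    "CJK UNIFIED IDEOGRAPH": "Han",
    "HIRAGANA": "Hiragana",
    "KATAKANA": "Katakana",
    "LATIN": "Latin",
}

def guess_script(ch):
    """Guess a script from the character name; returns 'Unknown' if no prefix matches."""
    name = unicodedata.name(ch, "")
    for prefix, script in NAME_PREFIX_TO_SCRIPT.items():
        if name.startswith(prefix):
            return script
    return "Unknown"

def scripts_used(text):
    """Return the set of guessed scripts that occur in the text."""
    return {guess_script(ch) for ch in text} - {"Unknown"}

# A short Japanese phrase mixing ideographs, hiragana and katakana.
print(sorted(scripts_used("日本語のテキスト")))
```

A production implementation would read the script assignments from the Unicode Character Database file Scripts.txt rather than from character names.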

Most writing systems can be broadly divided into several categories: logographic, syllabic, alphabetic (or segmental), abugida, abjad and featural; however, all features of any of these may be found in any given writing system in varying proportions, often making it difficult to purely categorize a system. The term complex system is sometimes used to describe those where the admixture makes classification problematic.

See also: phonemic and phonetic orthography.

Unicode supports all of these types of writing systems through its numerous scripts. Unicode also adds further properties to characters to help differentiate the various characters and the ways they behave within Unicode text processing algorithms.

Table of Unicode scripts

The following table lists the 75 scripts that are defined in Unicode 5.1. [Unicode Character Database: Scripts, http://www.unicode.org/Public/UNIDATA/Scripts.txt]

Common and inherited scripts

Unicode assigns every character in the UCS to exactly one script. However, many characters may be used in more than one script: those that are not part of a formal natural language writing system, or that are unified across many writing systems (for example most symbols, including music notation and currency signs, as well as some numerals and many punctuation marks). In these cases Unicode defines them as belonging to the common script.

In addition, many diacritics and non-spacing combining characters may be applied to characters from more than one script. Unicode assigns these to the inherited script, which means that they take the script of the base character with which they combine, and so may be treated as belonging to different scripts in different contexts. For example, U+0308 Combining Diaeresis may combine with either U+0065 Latin Small Letter E (ë) or U+0435 Cyrillic Small Letter IE (ё). In the former case it inherits the Latin script of the preceding base character, whereas in the latter case it inherits the Cyrillic script.
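The inheritance rule can be sketched as a single pass over the text that remembers the script of the most recent non-inherited character. The `SCRIPT_OF` table below is a hand-written excerpt covering only the characters in the example above, and `resolve_scripts` is a hypothetical helper name; real code would load the full assignments from the Unicode Character Database.

```python
# Hand-written excerpt of script assignments, for illustration only.
SCRIPT_OF = {
    "\u0065": "Latin",      # LATIN SMALL LETTER E
    "\u0435": "Cyrillic",   # CYRILLIC SMALL LETTER IE
    "\u0308": "Inherited",  # COMBINING DIAERESIS
    "\u0020": "Common",     # SPACE
}

def resolve_scripts(text):
    """Return (char, script) pairs, replacing Inherited with the preceding character's script."""
    resolved = []
    previous_script = "Common"  # assumption: a leading combining mark falls back to Common
    for ch in text:
        script = SCRIPT_OF.get(ch, "Unknown")
        if script == "Inherited":
            script = previous_script
        else:
            previous_script = script
        resolved.append((ch, script))
    return resolved

# e + diaeresis, a space, then Cyrillic ie + the same diaeresis.
for ch, script in resolve_scripts("\u0065\u0308 \u0435\u0308"):
    print(f"U+{ord(ch):04X} -> {script}")
```

The same code point U+0308 is reported as Latin in the first cluster and as Cyrillic in the second, which is the behaviour the inherited script is meant to capture.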

Character categories within scripts

Unicode provides a general category property for each character, so in addition to belonging to a script every character also has a general category. Scripts typically include letter characters: uppercase letters, lowercase letters and modifier letters. A few precomposed ligatures, such as Dz (U+01F2), are categorized as titlecase letters. Such titlecase ligatures are all in the Latin and Greek scripts and are all compatibility characters, so Unicode discourages their use by authors. It is unlikely that new titlecase letters will be added in the future.

Most writing systems do not differentiate between uppercase and lowercase letters. For those scripts all letters are categorized as “other letter” or “modifier letter”. Ideographs such as Unihan ideographs are also categorized as “other letter”. A few scripts do differentiate between uppercase and lowercase, however: Latin, Cyrillic, Greek, Armenian, Georgian and Deseret. Even in these scripts there are some letters that are neither uppercase nor lowercase.
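The letter categories described above can be inspected with `unicodedata.category` from the Python standard library, which returns the two-letter general category abbreviation (Lu, Ll, Lt, Lm, Lo). The sample characters are chosen for this illustration; U+01C0 is included as an example of a Latin letter that is neither uppercase nor lowercase.

```python
import unicodedata

# Sample letters, one per letter category discussed in the text.
SAMPLES = [
    "A",       # uppercase letter
    "a",       # lowercase letter
    "\u01F2",  # titlecase ligature: capital D with small z
    "\u02B0",  # modifier letter small h
    "\u4E2D",  # a Han ideograph
    "\u01C0",  # Latin letter dental click, which has no case
]

categories = {ch: unicodedata.category(ch) for ch in SAMPLES}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}")
```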

Scripts can also contain characters of any other general category, such as marks (diacritic and otherwise), numbers (numerals), punctuation, separators (word separators such as spaces), symbols and non-graphical format characters. These are included in a particular script when they are unique to that script. Other such characters are generally unified and included in the punctuation or diacritic blocks. However, the bulk of characters in any script (other than the common and inherited scripts) are letters.
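The claim that letters make up the bulk of a script can be checked roughly by tallying general categories over a code point range. The sketch below uses the Armenian block (U+0530 to U+058F) as a stand-in for the Armenian script; a block is not the same thing as a script, so this is only an approximation, and the exact counts depend on the Unicode version bundled with the Python interpreter. The function name `category_tally` is hypothetical.

```python
import unicodedata
from collections import Counter

def category_tally(first, last):
    """Count assigned code points in [first, last] by major category (L, M, N, P, S, Z, C)."""
    tally = Counter()
    for cp in range(first, last + 1):
        cat = unicodedata.category(chr(cp))
        if cat != "Cn":          # skip unassigned code points
            tally[cat[0]] += 1   # first letter is the major class
    return tally

# Armenian block, used here as an approximation of the Armenian script.
armenian = category_tally(0x0530, 0x058F)
total = sum(armenian.values())

for major, count in sorted(armenian.items()):
    print(f"{major}: {count} of {total}")
```

Letters dominate the tally, while the block also carries a handful of its own punctuation marks and symbols.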

See also

* Mapping of Unicode characters

References

* The Unicode Standard 5.0, http://www.unicode.org/versions/Unicode5.0.0/
* Unicode Standard Annex #24: Unicode Script Property, http://www.unicode.org/reports/tr24/


Wikimedia Foundation. 2010.
