Unicode control characters

Unicode control characters

Many characters are used to control the interpretation or display of text, but these characters themselves have no visual or spatial representation. For example, the null character (U+0000, aka NUL) is used in C-programming application environments to indicate the end of a string of characters. In this way, these programs only require a single starting memory address for a string (as opposed to a starting address and a length), since the string ends once the program reads the null character.

ISO 6429 control characters

The control characters U+0000 to U+001F and U+007F come from ASCII, additionally U+0080 to U+009F were used in conjunction with ISO 8859 character sets (among others). They are specified in ISO 6429 and often referred to as C0 and C1 control codes respectively. Most of these characters play no explicit role in Unicode text handling. The null character (U+0000), horizontal tabulation or tab (U+0009), linefeed (U+000A), carriage return (U+000D), and newline (U+0085) arecommonly used in text processing.

Unicode introduced separators

In an attempt to simplify the several newline characters used in legacy text, UCS introduces its own newline characters to separate either lines or paragraphs: the line separator (U+2028) and paragraph separator (U+2029) characters.

Language tags

Unicode includes 128 characters as language tags. The characters essentially mirror the 128 ASCII characters except, when used they identify the subsequent text as belonging to a particular language according to BCP 47. For example, for indicating subsequent text as the variant of English as written in the United States, the initiating ‘Language Tag character’ (U+E0001) followed by the sequence ‘Tag Small Letter e’ (U+E0065), ‘Tag Small Letter n’ (U+E006E), “Tag Hyphen-minus’ (U+E002D), ‘Tag Small Letter u’ (U+E0075) and ‘Tag Small Letter s’ (U+E0073).

These language tag characters would not be displayed themselves. However, they would provide information for text processing or even for the display of other characters. For example the display of Unihan ideographs might substitute different glyphs if the language tags indicated Korean than if the tags indicated Japanese. Another example, might influence the display of decimal digits 0 through 9 differently depending on the language they appeared in.

As of 2008 (Unicode 5.1) deprecates the tag characters [See the Unicode 5.1.0 [http://www.unicode.org/Public/5.1.0/ucd/PropList.txt property list] ]

Interlinear annotation

Three formatting characters provide support for interlinear annotation (U+FFF9, U+FFFA, U+FFFB). This may be used for providing notes that would typically be displayed between the lines of other text. Unicode considers such annotation to be rich text and recommends using other protocols for such annotation. The W3C ruby markup recommendation is an example of an alternate protocol supporting more advanced interlinear annotation.

Bidirectional text control

Unicode supports standard bidirectional text without any special characters. In other words Unicode conforming software should display right-to-left characters such as Hebrew letters as right-to-left simply from the properties of those characters. Similarly, the Unicode handles the mixture of left-to-right-text alongside right-to-left text without any special characters. For example, one can quote Arabic (“بسملة”) right alongside English and the Arabic letters will flow from right-to-left and the Latin letters left-to-right. However, support for bidirectional text becomes more complicated when text flowing in opposite directions is embedded hierarchically, for example if one quotes an Arabic phrase that in turn quotes an English phrase. Other situations may also complicate this, such as when an author wants the left-to-right characters overridden so that they flow from right-to-left. While these situations are fairly rare, Unicode provides seven characters (U+200E, U+200F, U+202A, U+202B, U+202C, U+202D, U+202E) to help control these embedded bidirectional text levels up to 61 levels deep.

Variation selectors

Many characters map to alternate glyphs depending on the context. For example Arabic and Latin cursive characters substitute different glyphs to connect glyphs together depending on whether the character is the initial character in a word, the final character, a medial character or an isolated character. These types of glyph substitution are easily handled by the context of the character with no other authoring input involved. Authors may also use special-purpose characters such as joiners and non-joiners to force an alternate form of glyph where it would not otherwise appear. Ligatures are similar instances where glyphs may be substituted simply by turning ligatures on or off as a rich text attribute.

However, for other glyph substitution, the author's intent may need to be encoded with the text and cannot be determined contextually. This is the case with character/glyphs referred to as gaiji where different glyphs are used for the same character either historically or for ideographs for family names. This is one of the gray areas in distinguishing between a glyph and a character. If a family name differs slightly from the ideograph character it derives from, then is that a simple glyph variant or a character variant. Well as of Unicode 3.2, and 4.0 the character set now includes 256 variation selectors so that these combining mark characters can select from 256 possible character/glyph variations for the preceding character.

References


Wikimedia Foundation. 2010.

Игры ⚽ Поможем написать курсовую

Look at other dictionaries:

  • Control character — In computing and telecommunication, a control character or non printing character is a code point (a number) in a character set, that does not in itself represent a written symbol. It is in band signaling in the context of character encoding. All …   Wikipedia

  • Unicode character property — Unicode assigns character properties to each code point.[1] These properties can be used to handle characters (code points) in processes, like in line breaking, script direction right to left or applying controls. Slightly inconsequently, some… …   Wikipedia

  • Mapping of Unicode characters — Unicode’s Universal Character Set has a potential capacity to support over 1 million characters. Each UCS character is mapped to a code point which is an integer between 0 and 1,114,111 used to represent each character within the internal logic… …   Wikipedia

  • Unicode — For the 1889 Universal Telegraphic Phrase book, see Commercial code (communications). The Unicode official logo since October 2009 …   Wikipedia

  • Unicode symbols — v · Character Types Scripts Unihan ideographs, etc. Phonetic characters Punctuation and separators Diacritics and other marks Symbols Numerals Compatibility characters …   Wikipedia

  • Control-Z — In computing, Control+Z is a control character in ASCII code, also known as the substitute (SUB) character or a keyboard shortcut. Strictly speaking …   Wikipedia

  • control character —    A nonprinting character with a special meaning. Control characters, such as Carriage Return, Line Feed, Bell, or Escape, perform a specific operation on a terminal, printer, or communications line. They are grouped together as the first 32… …   Dictionary of networking

  • Control-V — In computing, Control V is a control character in ASCII code, also known as the synchronous idle (SYN) character. It is generated by pressing the V key while holding down the …   Wikipedia

  • UNICODE — Юникод, или Уникод (англ. Unicode)  стандарт кодирования символов, позволяющий представить знаки практически всех письменных языков. Стандарт предложен в 1991 году некоммерческой организацией «Консорциум Юникода» (англ. Unicode Consortium,… …   Википедия

  • Unicode Consortium — Юникод, или Уникод (англ. Unicode)  стандарт кодирования символов, позволяющий представить знаки практически всех письменных языков. Стандарт предложен в 1991 году некоммерческой организацией «Консорциум Юникода» (англ. Unicode Consortium,… …   Википедия

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”