Numeric character reference

A numeric character reference (NCR) is a common markup construct used in SGML and SGML-derived markup languages such as HTML and XML. It consists of a short sequence of characters that represents a single character from the Universal Character Set (UCS) of Unicode. NCRs are typically used to represent characters that cannot be encoded directly in a particular document. When the document is interpreted by a markup-aware reader, each NCR is treated as if it were the character it represents.

Contents

1 Example

2 Discussion

3 Restrictions

4 See also

Example

In SGML, HTML, and XML, the following are all valid numeric character references for the Greek capital letter Sigma:

U+03A3 Σ Greek capital letter sigma (3A3₁₆ = 931₁₀)

Unicode character	Numerical base	Numerical reference in markup	Effect
U+03A3	Decimal	&#931;	Σ
U+03A3	Decimal	&#0931;	Σ
U+03A3	Hexadecimal	&#x3A3;	Σ
U+03A3	Hexadecimal	&#x03A3;	Σ
U+03A3	Hexadecimal	&#x3a3;	Σ
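As a quick check, five spellings of a reference to U+03A3 (two decimal, three hexadecimal) can be decoded with Python's standard `html.unescape` function. This is only an illustrative sketch; the variable names are arbitrary.

```python
import html

# Five spellings of a reference to U+03A3: two decimal, three hexadecimal.
references = ["&#931;", "&#0931;", "&#x3A3;", "&#x03A3;", "&#x3a3;"]

# Leading zeros and the case of the hexadecimal digits do not change the value.
decoded = [html.unescape(ref) for ref in references]

print(decoded)
# The hexadecimal and decimal spellings name the same code point.
print(0x3A3 == 931, chr(931) == "\u03a3")
```

All five references decode to the same single character, Σ.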

Discussion

Markup languages are typically defined in terms of UCS or Unicode characters. That is, a document consists, at its most fundamental level of abstraction, of a sequence of characters, which are abstract units that exist independently of any encoding.

Ideally, when the characters of a marked-up document are encoded as a sequence of bits for storage or transmission over a network, the encoding used can represent every character in the document, if not the whole of Unicode, directly as a particular bit sequence.

Sometimes, though, for reasons of convenience or due to technical limitations, documents are encoded with an encoding that cannot represent some characters directly. For example, the widely used encodings based on ISO 8859 can only represent, at most, 256 unique characters as one 8-bit byte each.

In practice, documents are rarely allowed to use more than one encoding internally, so the onus is usually on the markup language to provide a means for document authors to express unencodable characters in terms of encodable ones. This is generally done through some kind of "escaping" mechanism.

The SGML-based markup languages allow document authors to use special sequences of characters from the ASCII range (the first 128 code points of Unicode) to represent, or reference, any Unicode character, regardless of whether the character being represented is directly available in the document's encoding. These special sequences are character references.
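The escaping idea can be sketched in Python, whose standard codecs include an `xmlcharrefreplace` error handler that substitutes a decimal numeric character reference for each character the target encoding cannot represent. The sample string and variable names here are illustrative assumptions.

```python
text = "Σ"  # U+03A3, which ISO 8859-1 cannot represent

# Direct encoding fails because ISO 8859-1 has no byte for this character.
try:
    text.encode("iso-8859-1")
    direct_ok = True
except UnicodeEncodeError:
    direct_ok = False

# The "xmlcharrefreplace" handler writes an ASCII-only reference instead.
escaped = "Sum: Σ".encode("iso-8859-1", errors="xmlcharrefreplace")

print(direct_ok)
print(escaped)
```

The escaped bytes contain only characters from the ASCII range, so they survive in any encoding that is a superset of ASCII.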

Character references that are based on the referenced character's UCS or Unicode "code point" are called numeric character references. In HTML 4 and in all versions of XHTML and XML, the code point can be expressed either as a decimal (base 10) number or as a hexadecimal (base 16) number. The syntax is as follows:

Character U+0026 (ampersand), followed by character U+0023 (number sign), followed by one of the following choices:

one or more decimal digits zero (U+0030) through nine (U+0039); or

character U+0078 ("x") followed by one or more hexadecimal digits, which are zero (U+0030) through nine (U+0039), Latin capital letter A (U+0041) through F (U+0046), and Latin small letter a (U+0061) through f (U+0066);

all followed by character U+003B (semicolon). Older versions of HTML disallowed the hexadecimal syntax.
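The syntax described above can be expressed as a regular expression. The following is a minimal sketch, not a conforming parser: `NCR_PATTERN` and `decode_ncrs` are hypothetical names, and the function applies none of the restrictions that a particular markup language places on which code points may be referenced.

```python
import re

# "&#", then decimal digits or "x" plus hexadecimal digits, then ";".
NCR_PATTERN = re.compile(r"&#(?:([0-9]+)|x([0-9A-Fa-f]+));")


def decode_ncrs(text: str) -> str:
    """Replace each syntactically valid NCR with the character it names."""

    def replace(match: re.Match) -> str:
        decimal, hexadecimal = match.groups()
        if decimal is not None:
            code_point = int(decimal)
        else:
            code_point = int(hexadecimal, 16)
        # Leave references beyond the Unicode range (U+10FFFF) untouched.
        if code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)

    return NCR_PATTERN.sub(replace, text)


print(decode_ncrs("&#931; &#x3a3; &#x03A3;"))
print(decode_ncrs("&#931 &#xZZ;"))
```

Sequences that lack the closing semicolon or contain invalid digits do not match the pattern and are left as they are.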

The characters that make up a numeric character reference can be represented in every character encoding used in computing and telecommunications today, so there is no risk of the reference itself being unencodable.

There is another kind of character reference called a character entity reference, which allows a character to be referred to by a name instead of a number. (Naming a character creates a character entity.) HTML defines some character entities, but not many; all other characters can only be included by direct encoding or using NCRs.
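The relationship between the two kinds of reference can be illustrated with Python's standard `html` module, whose `html.entities.name2codepoint` table maps HTML 4 entity names to code points. This is a small sketch using the entity name `Sigma` as an example.

```python
import html
from html.entities import name2codepoint

# A named reference and a numeric reference to the same character.
named = html.unescape("&Sigma;")
numeric = html.unescape("&#931;")

print(named == numeric)
# The entity name resolves to the same code point the NCR states directly.
print(name2codepoint["Sigma"])
```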

Restrictions

The Universal Character Set defined by ISO 10646 is the "document character set" of SGML and HTML 4, so by default, any character in such a document, and any character referenced in such a document, must be in the UCS.

While the syntax of SGML does not prohibit references to unassigned code points, SGML-derived markup languages such as HTML and XML can, and often do, restrict numeric character references to only those code points that are assigned to characters or that have not been permanently left unassigned.

Restrictions may also apply for other reasons. For example, in HTML 4, &#12;, which is a reference to a non-printing "form feed" control character, is allowed because a form feed character is allowed. But in XML, the form feed character cannot be used, not even by reference. As another example, &#128;, which is a reference to another control character, is not allowed to be used or referenced in either HTML or XML, but when used in HTML, it is usually not flagged as an error by web browsers, some of which attempt to interpret it as a reference to the character represented by code value 128 in the Windows-1252 encoding: "€", which should actually be represented as &#8364;. As a further example, prior to the publication of XML 1.0 Second Edition on October 6, 2000, XML 1.0 was based on an older version of ISO 10646 and prohibited using characters above U+FFFD, except in character data, thus making a reference like &#x10000; (U+10000) illegal. In XML 1.1 and newer editions of XML 1.0, such a reference is allowed, because the available character repertoire was explicitly extended.
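Some of these restrictions can be observed with Python's standard library, offered here only as one illustration of how real parsers behave. `html.unescape` follows HTML5 parsing rules, which map the reference `&#128;` to the euro sign much as browsers do, while `xml.etree.ElementTree` uses an XML 1.0 parser that rejects the form feed reference `&#12;` and accepts `&#x10000;`. The element name `a` and the variable names are arbitrary.

```python
import html
import xml.etree.ElementTree as ET

# HTML5-style decoding treats &#128; as Windows-1252 byte 128, the euro sign.
euro = html.unescape("&#128;")

# An XML 1.0 parser refuses a reference to the form feed character.
try:
    ET.fromstring("<a>&#12;</a>")
    form_feed_accepted = True
except ET.ParseError:
    form_feed_accepted = False

# A reference to a code point above U+FFFD is accepted by current parsers.
supplementary = ET.fromstring("<a>&#x10000;</a>").text

print(euro == "\u20ac")
print(form_feed_accepted)
print(supplementary == "\U00010000")
```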

Markup languages also place restrictions on where character references can occur.

See also

Character entity reference

List of XML and HTML character entity references
