Digraphs and trigraphs

In computer programming, digraphs and trigraphs are sequences of two and three characters respectively, appearing in source code, which a programming language specification requires an implementation of that language to treat as if they were one other character.

Various reasons exist for using digraphs and trigraphs: keyboards may not have keys to cover the entire character set of the language, input of special characters may be difficult, text editors may reserve some characters for special use and so on. Trigraphs might also be used for some EBCDIC code pages that lack characters such as { and }.

History

The basic character set of the C programming language is a subset of the ASCII character set that includes nine characters which lie outside the ISO 646 invariant character set. This can pose a problem for writing source code when the keyboard being used does not support any of these nine characters. The ANSI C committee invented trigraphs as a way of entering source code using keyboards that support any version of the ISO 646 character set.

Implementations

Trigraphs are not commonly encountered outside compiler test suites.[1] Some compilers support an option to turn recognition of trigraphs off, or disable trigraphs by default and require an option to turn them on. Some can issue warnings when they encounter trigraphs in source files. Borland supplied a separate program, the trigraph preprocessor, to be used only when trigraph processing is desired (the rationale was to maximise speed of compilation).

Language support

Different systems define different sets of digraphs and trigraphs:

Pascal

The Pascal programming language supports the digraphs (. and .) for [ and ], and (* and *) for { and }. Unlike all other cases mentioned here, (* and *) were in wide use.

Vim

The Vim text editor supports digraphs for entering text characters, following RFC 1345.

GNU Screen

GNU Screen has a digraph command, bound to ^A ^V by default.

J

The J programming language uses dot and colon characters to extend the meaning of the basic characters available. These however are not digraphs, because the resulting sequences do not have a single-character equivalent.

C

The C programming language supports digraphs when compiling in ISO C94 mode (that is, with the 1994 amendment to the standard) or later.

The C preprocessor replaces all occurrences of the nine trigraph sequences listed in the table below with their single-character equivalents before any other processing.

A programmer may want to place two question marks together yet not have the compiler treat them as introducing a trigraph. The C grammar does not permit two consecutive ? tokens, so the only places in a C file where two question marks in a row may be used are in multi-character constants, string literals, and comments. To safely place two consecutive question marks within a string literal, the programmer can use string concatenation "...?""?..." or an escape sequence "...?\?...".
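The two workarounds can be sketched in C as follows. The function names by_concatenation and by_escape are hypothetical and chosen only for illustration. Both are meant to yield the same seven-character text, whose last three characters would otherwise be read as a trigraph when trigraph processing is enabled.

```c
/* Adjacent string literals are joined only after trigraph replacement
   has already happened, so splitting the literal between the two
   question marks keeps them from ever being adjacent in the source. */
const char *by_concatenation(void)
{
    return "what?" "?!";
}

/* The escape sequence \? stands for a plain question mark, so the
   source again contains no two consecutive question marks. */
const char *by_escape(void)
{
    return "what?\?!";
}
```

Either spelling should compile to the same string whether or not the compiler has trigraph recognition turned on, which is the point of the technique.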

Trigraph   Equivalent
??=        #
??/        \
??'        ^
??(        [
??)        ]
??!        |
??<        {
??>        }
??-        ~

??? is not itself a trigraph sequence.
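As a rough illustration of the replacement step, the following C sketch scans a buffer and substitutes each of the nine trigraphs. It is not taken from any particular compiler: the names trigraph_equivalent and replace_trigraphs are hypothetical, and the caller is assumed to supply an output buffer at least as large as the input, including the terminator. The test strings in the usage note use the \? escape so that the example itself contains no trigraphs.

```c
#include <stddef.h>

/* Maps the third character of a candidate trigraph to its replacement,
   or returns 0 when the character does not complete a trigraph. */
static char trigraph_equivalent(char third)
{
    switch (third) {
    case '=':  return '#';
    case '/':  return '\\';
    case '\'': return '^';
    case '(':  return '[';
    case ')':  return ']';
    case '!':  return '|';
    case '<':  return '{';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Copies src to dst, replacing every trigraph along the way.
   dst must have room for strlen(src) + 1 bytes. Returns dst. */
char *replace_trigraphs(const char *src, char *dst)
{
    char *out = dst;

    while (*src != '\0') {
        if (src[0] == '?' && src[1] == '?') {
            char replacement = trigraph_equivalent(src[2]);
            if (replacement != 0) {
                *out++ = replacement;
                src += 3;
                continue;
            }
        }
        /* Not the start of a trigraph: copy one character and rescan
           from the next one, so three question marks in a row are
           examined again starting at the second question mark. */
        *out++ = *src++;
    }
    *out = '\0';
    return dst;
}
```

With a 64-byte buffer buf, replace_trigraphs("?\?=define A(x) x?\?(0?\?)", buf) would produce "#define A(x) x[0]". Because three question marks are not a trigraph, an input of three question marks followed by ! keeps the first question mark and turns the remaining three characters into |.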

The ??/ trigraph can be used to introduce an escaped newline for line splicing; this must be taken into account for correct and efficient handling of trigraphs within the preprocessor. It can also cause surprises, particularly within comments. For example:

 // Will the next line be executed????????????????/
 a++;

which is a single logical comment line (// comments are available in C++ and C99), so a++; becomes part of the comment, and

 /??/
 * A comment *??/
 /

which is a correctly formed block comment.
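To make the two surprises above concrete, here is a minimal C sketch that imitates only the relevant part of the early translation phases: it treats the backslash trigraph as a backslash and then deletes any backslash that is immediately followed by a newline. The function name splice_lines is hypothetical, the other eight trigraphs are deliberately ignored, and the output buffer is assumed to be at least as large as the input.

```c
#include <stddef.h>

/* Copies src to dst while joining physical lines into logical lines.
   A backslash, whether typed directly or spelled as the three-character
   trigraph, is dropped together with the newline that follows it.
   dst must have room for strlen(src) + 1 bytes. Returns dst. */
char *splice_lines(const char *src, char *dst)
{
    char *out = dst;

    while (*src != '\0') {
        char current = *src;
        size_t consumed = 1;

        /* Recognise the trigraph spelling of the backslash. */
        if (src[0] == '?' && src[1] == '?' && src[2] == '/') {
            current = '\\';
            consumed = 3;
        }

        /* Backslash followed by newline: remove both. */
        if (current == '\\' && src[consumed] == '\n') {
            src += consumed + 1;
            continue;
        }

        *out++ = current;
        src += consumed;
    }
    *out = '\0';
    return dst;
}
```

Applied to the second example, the three physical lines collapse into the single logical line /* A comment */. Applied to a // comment that ends in the backslash trigraph, the following line is appended to the comment text, which is why a++; is never executed in the first example.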

In 1994 a normative amendment to the C standard, included in C99, supplied digraphs as more readable alternatives to five of the trigraphs. They are:

Digraph Equivalent
<: [
:> ]
<% {
%> }
%: #

Unlike trigraphs, digraphs are handled during tokenization, and any digraph must always represent a full token by itself, or compose the token %:%: replacing the preprocessor concatenation token ##. If a digraph sequence occurs inside another token, for example a quoted string, or a character constant, it will not be replaced.
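The following C fragment is a small sketch of these rules, written with digraphs in place of the brackets, braces, and number signs. All identifiers (JOIN, answer_value, sum_with_digraphs, pasted_constant, digraph_inside_string) are hypothetical, and the fragment assumes a compiler running in C99 mode or later with digraph recognition enabled.

```c
%:include <stddef.h>

/* %:%: is the digraph spelling of the token-pasting operator. */
%:define JOIN(a, b) a %:%: b

/* After macro expansion this declares a constant named answer_value. */
static const int JOIN(answer, _value) = 42;

/* Sums the first n elements; every bracket and brace is a digraph. */
int sum_with_digraphs(const int values<::>, size_t n)
<%
    int total = 0;
    for (size_t i = 0; i < n; i++) <%
        total += values<:i:>;
    %>
    return total;
%>

/* Reads the constant whose name was built by token pasting. */
int pasted_constant(void)
<%
    return answer_value;
%>

/* Inside a string literal the two characters are not a token of their
   own, so they stay as a less-than sign followed by a colon. */
const char *digraph_inside_string(void)
<%
    return "<:";
%>
```

Summing the array 1, 2, 3 with sum_with_digraphs gives 6, and the string returned by digraph_inside_string is still two characters long, which reflects the rule that digraphs are recognized only as whole tokens.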

Notes

  1. ^ "The New C Standard: An Economic and Cultural Commentary" by Derek M. Jones, sentence 117

References


Wikimedia Foundation. 2010.
