Nucleic acid notation

The nucleic acid notation currently in use was first formalized by the International Union of Pure and Applied Chemistry (IUPAC) in 1970.[1] This universally accepted notation uses the Roman characters G, C, A, and T to represent the four nucleotides commonly found in deoxyribonucleic acids (DNA). Given the rapidly expanding role of genetic sequencing, synthesis, and analysis in biology, researchers have been compelled to develop alternative notations to further support the analysis and manipulation of genetic data. These notations generally exploit size, shape, and symmetry to accomplish these objectives.

IUPAC notation

Symbol[2]  Description                      Bases represented
A          adenosine                        A
C          cytidine                         C
G          guanosine                        G
T          thymidine                        T
U          uridine                          U
W          weak                             A T
S          strong                           C G
M          amino                            A C
K          keto                             G T
R          purine                           A G
Y          pyrimidine                       C T
B          not A (B comes after A)          C G T
D          not C (D comes after C)          A G T
H          not G (H comes after G)          A C T
V          not T (V comes after T and U)    A C G
N          any base (not a gap)             A C G T

Degenerate base symbols in biochemistry are an IUPAC[2] representation for a position in a DNA sequence that can have multiple possible alternatives. They should not be confused with non-canonical bases, because each particular sequence will in fact have one of the regular bases at that position. Degenerate symbols are used to encode the consensus sequence of a population of aligned sequences, for example in phylogenetic analysis, where they summarise multiple sequences as one, and in BLAST searches, even though IUPAC degenerate symbols are masked (as they are not coded).
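The consensus encoding described above can be sketched in a few lines of Python. This is only an illustration: the function name consensus and the assumption that the input sequences are already aligned, gap-free, and limited to A, C, G, and T are not part of the IUPAC recommendations.

```python
# IUPAC symbol for each set of bases observed at one position.
IUPAC_BY_SET = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AT"): "W", frozenset("CG"): "S",
    frozenset("AC"): "M", frozenset("GT"): "K",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus(aligned):
    """Collapse equal-length, gap-free DNA sequences into one IUPAC string.

    Hypothetical helper: assumes upper-case input containing only A, C, G, T.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # zip(*aligned) yields one column of the alignment at a time.
    return "".join(IUPAC_BY_SET[frozenset(col)] for col in zip(*aligned))
```

For instance, the three aligned sequences ACGT, ACGA, and TCGC collapse to WCGH: the first column holds A or T (W), and the last holds A, C, or T (H).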

Under the commonly used IUPAC system, nucleobases are represented by the first letters of their chemical names: [G]uanine, [C]ytosine, [A]denine, and [T]hymine.[1] This shorthand also includes eleven "ambiguity" characters, one for every possible combination of two or more of the four DNA bases.[3] The ambiguity characters were designed to encode positional variations found among families of related genes. The IUPAC notation, including ambiguity characters and suggested mnemonics, is shown in Table 1.
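The count of eleven follows from simple combinatorics: four bases give six pairs, four triples, and one set of all four. The Python sketch below checks that count and shows one way an ambiguity-coded pattern might be compared with a concrete sequence. The names IUPAC and matches are illustrative, not part of any standard library.

```python
from itertools import combinations

BASES = "ACGT"
# Every subset of two or more bases needs its own ambiguity character.
ambiguous_sets = [set(c) for k in (2, 3, 4) for c in combinations(BASES, k)]
assert len(ambiguous_sets) == 11  # 6 pairs + 4 triples + 1 quadruple

# Symbol -> bases it stands for, as listed in Table 1 (DNA symbols only).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(pattern, sequence):
    """True if every base of `sequence` is allowed by the symbol at that position."""
    return len(pattern) == len(sequence) and all(
        base in IUPAC[symbol] for symbol, base in zip(pattern, sequence)
    )
```

Under these assumptions, matches("GAWTC", "GAATC") holds because W admits A or T, while matches("GAWTC", "GACTC") does not.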

Despite its broad and nearly universal acceptance, the IUPAC system has a number of limitations, which stem from its reliance on the Roman alphabet. The poor legibility of upper-case Roman characters, which are generally used when displaying genetic data, may be chief among these limitations. The value of external projections in distinguishing letters has been well documented.[4] However, these projections are absent from upper-case letters, which in some cases are distinguishable only by subtle internal cues. Take, for example, the upper-case C and G used to represent cytosine and guanine. These characters generally comprise half the characters in a genetic sequence but are differentiated only by a small internal tick (depending on the typeface).

Another shortcoming of the IUPAC notation arises from the fact that its eleven ambiguity characters have been selected from the remaining characters of the Roman alphabet. The authors of the notation endeavored to select ambiguity characters with logical mnemonics. For example, S is used to represent the possibility of finding cytosine or guanine at genetic loci, both of which form [S]trong cross-strand binding interactions. Conversely, the [W]eaker interactions of thymine and adenine are represented by a W. However, convenient mnemonics are not as readily available for the other ambiguity characters displayed in Table 1. This has made ambiguity characters difficult to use and may account for their limited application.

Visually enhanced notations

Legibility issues associated with IUPAC-encoded genetic data have led biologists to consider alternate strategies for displaying genetic data. These creative approaches to visualizing DNA sequences have generally relied on the use of spatially distributed symbols and/or visually distinct shapes to encode lengthy nucleic acid sequences. Several of these approaches are summarized below.

Stave projection

Figure 1. The Stave Projection uses spatially distributed dots to enhance the legibility of DNA sequences.

In 1986, Cowin et al. described a novel method for visualizing DNA sequences known as the Stave Projection.[5] Their strategy was to encode nucleotides as circles on a series of horizontal bars, akin to notes on a musical stave. As illustrated in Figure 1, each gap on the five-line staff corresponded to one of the four DNA bases. The spatial distribution of the circles made it far easier to distinguish individual bases and compare genetic sequences than with IUPAC-encoded data.
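The idea of placing each base in its own gap can be approximated in plain text. The Python sketch below is a rough illustration only: the row order in ROW_ORDER is an assumption made for this example, since the text does not say which gap Cowin et al. assigned to which base, and the function name stave is hypothetical.

```python
# ASSUMPTION: this top-to-bottom row order is illustrative only; the
# assignment of bases to gaps used by Cowin et al. is not given here.
ROW_ORDER = "GCAT"


def stave(sequence, row_order=ROW_ORDER):
    """Return one text row per base, with 'o' where that base occurs."""
    return [
        "".join("o" if base == row_base else "-" for base in sequence)
        for row_base in row_order
    ]


# Print the four rows for a short example sequence.
for row in stave("GATTACA"):
    print(row)
```

Each column contains exactly one circle, so a base is identified by its vertical position rather than by the shape of a letter.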

Geometric symbols

Zimmerman et al. took a different approach to visualizing genetic data.[6] Rather than relying on spatially distributed circles to highlight genetic features, they exploited four geometrically diverse symbols found in a standard computer font to distinguish the four bases. The authors developed a simple WordPerfect macro to translate IUPAC characters into the more visually distinct symbols.

DNA skyline

With the growing availability of font editors, Jarvius and Landegren devised a novel set of genetic symbols, known as the DNA Skyline font, which uses progressively taller blocks to represent the different DNA bases.[7] While reminiscent of Cowin et al.'s spatially distributed Stave Projection, the DNA Skyline font is easy to download and permits translation to and from the IUPAC notation by simply changing the font in most standard word processing applications.

Functional ambigraphic notations

Additional functionality can be found in nucleic acid notations that use ambigrams to mirror structural symmetries found in the DNA double helix. As defined by Douglas Hofstadter, ambigrams are words or symbols that convey the same or different meaning when viewed in a different orientation.[8] It turns out that by assigning ambigraphic characters to complementary bases (i.e. guanine = b, cytosine = q, adenine = n, and thymine = u), it is possible to complement entire DNA sequences by simply rotating the text 180 degrees.[9] An ambigraphic nucleic acid notation also makes it easy to identify genetic palindromes, such as endonuclease restriction sites, as sections of text that can be rotated 180 degrees without changing the sequence.
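The rotation trick can be checked with a short Python sketch. The letter assignments come from the paragraph above; the function names are illustrative. Turning a line of text upside down both reverses its order and flips each glyph, so what is read from left to right after rotation is the complementary strand in its own reading direction (the reverse complement).

```python
# Base -> ambigraphic letter, as given in the text (G=b, C=q, A=n, T=u).
TO_AMBI = {"G": "b", "C": "q", "A": "n", "T": "u"}
FROM_AMBI = {letter: base for base, letter in TO_AMBI.items()}
# What each letter looks like after a 180-degree turn: b<->q and n<->u.
ROTATED = {"b": "q", "q": "b", "n": "u", "u": "n"}


def rotate_180(text):
    """Rotating a line of text reverses it and flips every glyph."""
    return "".join(ROTATED[ch] for ch in reversed(text))


def complement_by_rotation(dna):
    """Encode as ambigrams, rotate, and decode back to G/C/A/T."""
    ambi = "".join(TO_AMBI[base] for base in dna)
    return "".join(FROM_AMBI[ch] for ch in rotate_180(ambi))


def is_palindrome(dna):
    """A genetic palindrome reads the same after rotation."""
    return complement_by_rotation(dna) == dna
```

With this sketch, GAATTC (the recognition site of the restriction enzyme EcoRI) is reported as a palindrome, whereas GATTACA is not.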

AmbiScript

AmbiScript uses ambigrams to reflect DNA symmetries and support the manipulation and analysis of genetic data.

The latest in a series of rationally designed nucleic acid notations, AmbiScript combines many of the visual and functional features of its predecessors.[10] As its name implies, AmbiScript is an ambigraphic nucleic acid notation that permits rapid complementation of genetic sequences and identification of biologically significant palindromes. The notation also uses spatially offset characters to facilitate the visual review and analysis of genetic data. One novel feature that AmbiScript brings to the world of genetic notations is its use of compound symbols to convey the possibility of finding two or more different bases at a given position. This strategy appears to offer a far less cumbersome alternative to the ambiguity characters first proposed by the IUPAC.[3] As with Jarvius and Landegren's DNA Skyline fonts, AmbiScript fonts are easily downloaded and applied to IUPAC-encoded sequence data.

References

  1. ^ a b IUPAC-IUB Commission on Biochemical Nomenclature (1970). "Abbreviations and symbols for nucleic acids, polynucleotides, and their constituents". Biochemistry 9: 4022–4027. doi:10.1021/bi00822a023.
  2. ^ a b Nomenclature Committee of the International Union of Biochemistry (NC-IUB). "Nomenclature for Incompletely Specified Bases in Nucleic Acid Sequences". http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html. Retrieved 2008-02-04. 
  3. ^ a b Nomenclature Committee of the International Union of Biochemistry (NC-IUB). 1986. Nomenclature for incompletely specified bases in nucleic acid sequences. Recommendations 1984. Proc. Natl. Acad. Sci. USA 83:4-8.
  4. ^ Tinker, M. A. 1963. Legibility of Print. Iowa State University Press, Ames IA.
  5. ^ Cowin, J. E.; Jellis, C. H.; Rickwood, D. (1986). "A new method of representing DNA sequences which combines ease of visual analysis with machine readability". Nucleic Acids Research 14 (1): 509. doi:10.1093/nar/14.1.509. PMC 339435. PMID 3003680. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=339435.
  6. ^ Zimmerman, P. A., M. L. Spell, J. Rawls, and T. R. Unnasch. 1991. Transformation of DNA sequence data into geometric symbols. BioTechniques 11:22-27
  7. ^ Jarvius, J.; Landegren, U. (2006). "DNA Skyline: fonts to facilitate visual inspection of nucleic acid sequences". BioTechniques 40 (6): 740. doi:10.2144/000112180. PMID 16774117.
  8. ^ Hofstadter, D. R. 1985. Metamagical Themas: Questing for the Essence of Mind and Pattern. Basic Books, NY.
  9. ^ Rozak, D. A. 2006. The practical and pedagogical advantages of an ambigraphic nucleic acid notation. Nucleosides Nucleotides and Nucleic Acids 25:807-813.
  10. ^ Rozak, D. A.; Rozak, A. J. (2008). "Simplicity, function, and legibility in an enhanced ambigraphic nucleic acid notation". BioTechniques 44 (6): 811–813. doi:10.2144/000112727.

Wikimedia Foundation. 2010.
