 Coding theory approaches to nucleic acid design

DNA code construction refers to the application of coding theory to the design of nucleic acid systems for DNA-based computation.
Introduction
In living cells, DNA appears as a double helix in which one strand is hybridized to its complementary strand through a series of hydrogen bonds. This entry focuses only on oligonucleotides. DNA computing allows synthetic oligonucleotide strands to hybridize in such a way as to perform computation, which requires that the self-assembly of the strands proceed in a manner compatible with the goals of the computation.
The field of DNA computing was established by Leonard M. Adleman's seminal paper.^{[1]} His work is significant for a number of reasons:
 It shows how the highly parallel nature of computation performed by DNA can be used to solve problems that are difficult or almost impossible to solve using traditional methods.
 It is an example of computation at the molecular level, along the lines of nanocomputing. This is potentially a major advantage in terms of information density on storage media, a density that is beyond the reach of the semiconductor industry.
 It demonstrates unique aspects of DNA as a data structure.
This capability for massively parallel computation can be exploited to solve many computational problems on an enormously large scale, for example in cell-based computational systems for cancer diagnostics and treatment and in ultrahigh-density storage media.^{[2]}
The selection of codewords (sequences of DNA oligonucleotides) is a major hurdle in itself because of secondary structure formation, also known as self-hybridization, in which DNA strands fold onto themselves during hybridization and thereby become useless for further computation. The Nussinov-Jacobson^{[3]} algorithm is used to predict secondary structures and to identify design criteria that reduce the possibility of secondary structure formation in a codeword. In essence, this algorithm shows how the presence of a cyclic structure in a DNA code reduces the complexity of testing the codewords for secondary structures.
Novel constructions of such codes include the use of cyclic reversible extended Goppa codes, generalized Hadamard matrices, and a binary approach. Before turning to these constructions, we revisit some fundamental genetic terminology. The theorems presented in this article are motivated by their agreement with the Nussinov-Jacobson algorithm: a cyclic structure reduces the complexity of testing for secondary structure and thus helps to avoid it. That is, these constructions satisfy some or all of the design requirements for DNA oligonucleotides at the time of hybridization (the core of the DNA computing process) and hence do not suffer from self-hybridization.
Definitions
A DNA code is simply a set of sequences over the alphabet Q = {A, T, C, G}.
Each purine base is the Watson-Crick complement of a unique pyrimidine base (and vice versa): adenine and thymine form a complementary pair, as do guanine and cytosine. This pairing can be written as \bar{A} = T, \bar{T} = A, \bar{C} = G, \bar{G} = C.
Such pairing is chemically very stable and strong. However, pairing of mismatching bases does occur at times due to biological mutations.
Most of the focus on DNA coding has been on constructing large sets of DNA codewords with prescribed minimum distance properties. For this purpose let us lay down the required groundwork to proceed further.
Let q = q_{1}q_{2}...q_{n} be a word of length n over the alphabet Q = {A, T, C, G}. For 1 ≤ i ≤ j ≤ n, we will use the notation q_{[i,j]} to denote the subsequence q_{i}q_{i + 1}...q_{j}. Furthermore, the sequence obtained by reversing q will be denoted as q^{R}. The Watson-Crick complement, or the reverse-complement of q, is defined to be q^{RC} = \bar{q}_{n}\bar{q}_{n − 1}...\bar{q}_{1}, where \bar{q}_{i} denotes the Watson-Crick complement base of q_{i}.
For any pair of length-n words p and q over Q, the Hamming distance d_{H}(p,q) is the number of positions i at which p_{i} ≠ q_{i}. Further, define the reverse-Hamming distance as d_{H}^{R}(p,q) = d_{H}(p,q^{R}). Similarly, the reverse-complement Hamming distance is d_{H}^{RC}(p,q) = d_{H}(p,q^{RC}), where RC stands for reverse complement.
Another important code design consideration linked to the process of oligonucleotide hybridization pertains to the GC-content of sequences in a DNA code. The GC-content, w_{GC}(q), of a DNA sequence q = q_{1}q_{2}...q_{n} is defined to be the number of indices i such that q_{i} ∈ {G, C}. A DNA code in which all codewords have the same GC-content, w, is called a constant GC-content code.
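The distance measures and the GC-content defined above translate directly into a few short functions. The following is a minimal sketch in Python; the function names are illustrative choices and do not come from the cited literature.

```python
# Watson-Crick complement of each base.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse(q):
    """Return q^R, the word q read backwards."""
    return q[::-1]


def reverse_complement(q):
    """Return q^RC: reverse q and complement every base."""
    return "".join(COMPLEMENT[base] for base in reversed(q))


def hamming(p, q):
    """Number of positions at which two equal-length words differ."""
    if len(p) != len(q):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(p, q))


def reverse_hamming(p, q):
    """d_H^R(p, q) = d_H(p, q^R)."""
    return hamming(p, reverse(q))


def rc_hamming(p, q):
    """d_H^RC(p, q) = d_H(p, q^RC)."""
    return hamming(p, reverse_complement(q))


def gc_content(q):
    """Number of positions of q that hold G or C."""
    return sum(base in "GC" for base in q)
```

A code has constant GC-content when `gc_content` returns the same value for every codeword, and its minimum distances are obtained by minimizing these functions over all pairs of codewords.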
A generalized Hadamard matrix H ≡ H(m,n) is an n × n square matrix with entries taken from the set of mth roots of unity, C_{m} = {e^{− 2πil / m}, l = 0, ..., m − 1}, that satisfies HH^{*} = nI. Here I denotes the identity matrix of order n, while * stands for complex conjugate transposition. We will only concern ourselves with the case m = p for some prime p. A necessary condition for the existence of generalized Hadamard matrices H(p,n) is that p | n. The exponent matrix, E(p,n), of H(p,n) is the n × n matrix with entries in Z_{p} = {0,1,2,...,p − 1}, obtained by replacing each entry e^{− 2πil / p} in H(p,n) by the exponent l.
The elements of the Hadamard exponent matrix lie in the Galois field GF(p), and its row vectors constitute the codewords of what shall be called a generalized Hadamard code.
By definition, a generalized Hadamard matrix H in its standard form has only 1s in its first row and column. The (n − 1) × (n − 1) square matrix formed by the remaining entries of H is called the core of H, and the corresponding submatrix of the exponent matrix E is called the core of construction. Thus, by omitting the all-zero first column, cyclic generalized Hadamard codes become possible, whose codewords are the row vectors of the punctured matrix.
Also, the rows of such an exponent matrix satisfy the following two properties: (i) in each of the nonzero rows of the exponent matrix, each element of Z_{p} appears a constant number, n / p, of times; and (ii) the Hamming distance between any two rows is n(p − 1) / p.^{[4]}
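As a small numerical illustration of these definitions, the sketch below builds the exponent matrix with entries E[i][j] = ij mod p, which corresponds to the p × p Fourier matrix. This particular matrix is an assumption chosen for simplicity (it is the case n = p) and is not one of the constructions discussed in this article. The code checks HH^{*} = nI and the two row properties numerically.

```python
import cmath


def fourier_exponent_matrix(p):
    """Exponent matrix with E[i][j] = i*j mod p (illustrative choice, n = p)."""
    return [[(i * j) % p for j in range(p)] for i in range(p)]


def hadamard_from_exponents(E, p):
    """Replace each exponent l by the root of unity exp(-2*pi*i*l/p)."""
    return [[cmath.exp(-2j * cmath.pi * l / p) for l in row] for row in E]


def is_generalized_hadamard(H, tol=1e-9):
    """Check H H^* = n I, where ^* is the conjugate transpose."""
    n = len(H)
    for r in range(n):
        for s in range(n):
            inner = sum(H[r][k] * H[s][k].conjugate() for k in range(n))
            if abs(inner - (n if r == s else 0)) > tol:
                return False
    return True


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


p = 5
E = fourier_exponent_matrix(p)
H = hadamard_from_exponents(E, p)

# Property (i): every nonzero row contains each element of Z_p n/p = 1 time.
rows_are_uniform = all(sorted(row) == list(range(p)) for row in E[1:])

# Property (ii): any two rows are at Hamming distance n(p - 1)/p = p - 1.
row_distances = {hamming(E[r], E[s]) for r in range(p) for s in range(r + 1, p)}
```

The first row and first column of this E are all zero, so H is in standard form and the remaining (p − 1) × (p − 1) block is its core.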
Property U
Let C_{p} = {1, x, x^{2}, ..., x^{p − 1}} be the cyclic group generated by x, where x = exp(2πj / p) is a complex primitive pth root of unity, and p > 2 is a fixed prime. Further, let A = (a_{i}), B = (b_{i}) denote arbitrary vectors over C_{p} of length N = pt, where t is a positive integer. Define Q as the collection of mod-p differences between the exponents of corresponding entries of A and B, and let n_{q} be the multiplicity with which element q of GF(p) appears in Q.^{[4]}
Vector Q is said to satisfy Property U if and only if each element q of GF(p) appears in Q exactly t times (n_{q} = t, q = 0,1,...,p − 1).
The following lemma is of fundamental importance in constructing generalized Hadamard codes.
Lemma (orthogonality of vectors over C_{p}). For a fixed prime p, arbitrary vectors A, B of length N = pt, whose elements are from C_{p}, are orthogonal if the vector Q satisfies Property U, where Q is the collection of mod-p differences between the Hadamard exponents associated with A and B.
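The lemma can be checked numerically on a small case. The sketch below represents each vector by its list of exponents; the names `alpha` and `beta` and the sample values are illustrative assumptions, not taken from the cited paper.

```python
import cmath
from collections import Counter


def satisfies_property_u(Q, p):
    """True if every residue 0..p-1 occurs in Q exactly len(Q)/p times."""
    if len(Q) % p != 0:
        return False
    t = len(Q) // p
    counts = Counter(q % p for q in Q)
    return all(counts[r] == t for r in range(p))


def inner_product(alpha, beta, p):
    """Hermitian inner product of the vectors (x^alpha_i) and (x^beta_i)."""
    x = cmath.exp(2j * cmath.pi / p)
    return sum((x ** a) * (x ** b).conjugate() for a, b in zip(alpha, beta))


p = 3
alpha = [0, 1, 2, 0, 1, 2]   # exponents of A, length N = p*t with t = 2
beta = [0, 0, 0, 1, 1, 1]    # exponents of B

# Collection of mod-p exponent differences between A and B.
Q = [(a - b) % p for a, b in zip(alpha, beta)]
```

Here Q contains each residue twice, so Property U holds, and the inner product reduces to t(1 + x + x^{2}) = 0, which is why the two vectors are orthogonal.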
M sequences
Let V be an arbitrary vector of length N whose elements are in the finite field GF(p), where p is a prime. Let the elements of vector V constitute the first period of an infinite sequence a(V) which is periodic of period N. If N is also the least period of a(V), the sequence is called an M-sequence, or a sequence of maximal least period obtained by cycling N elements. If, whenever the elements of the ordered set V are permuted arbitrarily to yield V^{*}, the sequence a(V^{*}) is an M-sequence, the sequence a(V) is called M-invariant. The theorems that follow present conditions that ensure invariance in an M-sequence. In conjunction with a certain uniformity property of polynomial coefficients, these conditions yield a simple method by which complex Hadamard matrices with cyclic core can be constructed.
The goal, as outlined at the head of this article, is to find a cyclic matrix E = E_{c} whose elements are in the Galois field GF(p) and whose dimension is N = p^{n} − 1. The rows of E will be the nonzero codewords of a linear cyclic code K if and only if there is a polynomial g(x) with coefficients in GF(p) which is a proper divisor of x^{N} − 1 and which generates K. In order to have N nonzero codewords, g(x) must be of degree N − n. Further, in order to generate a cyclic Hadamard core, the vector of coefficients of g(x), when operated upon with the cyclic shift operation, must be of period N, and the vector difference of two arbitrary rows of E (augmented with zero) must satisfy the uniformity condition of Butson,^{[5]} previously referred to as Property U. One necessary condition for N-periodicity is that x^{N} − 1 = g(x)h(x), where h(x) is monic irreducible over GF(p).^{[6]} The approach here is to replace the last requirement with the condition that the coefficients of the vector [0,g(x)] be uniformly distributed over GF(p), that is, each residue 0,1,...,p − 1 appears the same number of times (Property U). This heuristic approach has succeeded for all cases tried, and a proof that it always produces a cyclic core is given below.
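The notions of least period and M-invariance can be tested by brute force for short vectors. The following sketch is only meant to make the definitions concrete; the function names are illustrative, and the exhaustive permutation check is practical only for small N.

```python
from itertools import permutations


def least_period(v):
    """Least period of the infinite sequence obtained by repeating v."""
    n = len(v)
    for d in range(1, n + 1):
        # The least period of a sequence with period n always divides n.
        if n % d == 0 and all(v[i] == v[(i + d) % n] for i in range(n)):
            return d
    return n


def is_m_sequence(v):
    """True if repeating v gives a sequence whose least period is len(v)."""
    return least_period(v) == len(v)


def is_m_invariant(v):
    """True if every rearrangement of v is still an M-sequence (brute force)."""
    return all(is_m_sequence(perm) for perm in set(permutations(v)))
```

For instance, the vector (0, 1, 0, 1) has least period 2 and is therefore not an M-sequence, whereas any vector in which some symbol occurs a number of times coprime to the length cannot have a shorter period under any rearrangement.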
Examples of code construction
1. Code construction using complex Hadamard matrices
Construction algorithm
Consider all monic irreducible polynomials h(x) over GF(p) which are of degree n, and which permit a suitable companion g(x) of degree N − n such that g(x)h(x) = x^{N} − 1, where the vector [0,g(x)] also satisfies Property U. Finding such a companion requires only a simple computer algorithm for long division over GF(p). Since g(x) divides x^{N} − 1, the ideal generated by g(x) in the ring of polynomials over GF(p) modulo x^{N} − 1 will be a cyclic code K. Moreover, Property U guarantees that the nonzero codewords form a cyclic matrix, each row being of period N under cyclic permutation, which serves as a cyclic core for the Hadamard matrix H(p,p^{n}). As an example, a cyclic core for H(3,9) results from the companions h(x) = x^{2} + x + 2 and g(x) = x^{6} + 2x^{5} + 2x^{4} + 2x^{2} + x + 1. The coefficients of g indicate that {0,1,6} is the relative difference set.
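The H(3,9) example can be verified mechanically. The sketch below, with illustrative helper names, multiplies g(x) and h(x) over GF(3), checks Property U for the augmented coefficient vector, and builds the cyclic core from the cyclic shifts of the coefficient vector of g(x) padded with zeros to length N.

```python
from collections import Counter


def poly_mul_mod(a, b, p):
    """Multiply two polynomials over GF(p); coefficients are listed from x^0 up."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % p
    return out


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


p, n = 3, 2
N = p ** n - 1                      # N = 8

h = [2, 1, 1]                       # h(x) = x^2 + x + 2
g = [1, 1, 2, 0, 2, 2, 1]           # g(x) = x^6 + 2x^5 + 2x^4 + 2x^2 + x + 1

product = poly_mul_mod(g, h, p)
x_N_minus_1 = [(-1) % p] + [0] * (N - 1) + [1]   # x^8 - 1 over GF(3)

# Coefficient vector of g padded to length N, then augmented with a leading 0.
padded = g + [0] * (N - len(g))
augmented = [0] + padded
counts = Counter(augmented)         # Property U: each residue occurs 3 times

# Cyclic core: all N cyclic shifts of the padded coefficient vector.
core = [padded[-s:] + padded[:-s] for s in range(N)]
core_distances = {hamming(core[i], core[j])
                  for i in range(N) for j in range(i + 1, N)}
```

In this example every pair of rows of the core is at Hamming distance 6, which agrees with the value n(p − 1) / p for the 9 × 9 exponent matrix.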
Theorem
Let p be a prime and N + 1 = p^{n}, and let g(x) be a monic polynomial of degree N − n whose extended vector of coefficients C = [c_{0},c_{1},...,c_{N − 1}] consists of elements of GF(p). The two conditions
(1) the vector C = [c_{0},c_{1},...,c_{N − 1}] satisfies Property U explained above, and
(2) g(x)h(x) = x^{N} − 1, where h(x) is a monic irreducible polynomial of degree n,
guarantee the existence of a p-ary linear cyclic code K of blocksize N, such that the augmented code is the Hadamard exponent for the Hadamard matrix H(p,p^{n}) = x^{K}, with x = e^{2πi / p}, where the core of H is a cyclic matrix.
Proof:
First, we note that g(x) is monic, divides x^{N} − 1 by condition (2), and has degree N − n. We now need to show that the matrix E_{c}, whose rows are the nonzero codewords, constitutes a cyclic core for some complex Hadamard matrix H.
We are given that C satisfies Property U. Hence, all of the nonzero residues of GF(p) lie in C. By cycling through C, we obtain the desired exponent matrix E_{c}, in which every codeword is a cyclic shift of the first codeword. (This is because the sequence obtained by cycling through C is an M-invariant sequence.)
We also see that augmenting each codeword of E_{c} with a leading zero element produces a vector which satisfies Property U. Also, since the code is linear, the vector difference of two arbitrary codewords is also a codeword and thus satisfies Property U. Therefore, the row vectors of the augmented code K form a Hadamard exponent. Thus, x^{K} is the standard form of some complex Hadamard matrix H.
Thus, from the above property, we see that the core of E is a circulant matrix consisting of all the N = p^{k} − 1 cyclic shifts of its first row. Such a core is called a cyclic core, in which each element of Z_{p} appears in each row of E exactly (N + 1) / p = p^{k − 1} times, and the Hamming distance between any two rows is exactly (N + 1)(p − 1) / p = (p − 1)p^{k − 1}. The N rows of the core of E form a constant-composition code, consisting of the N cyclic shifts of some word of length N over the set Z_{p}. The Hamming distance between any two codewords in this code is (p − 1)p^{k − 1}.
The following can be inferred from the theorem as explained above. (For more detailed reading, the reader is referred to the paper by Heng and Cooke.^{[4]}) Let N = p^{k} − 1 for p prime and k a positive integer. Let g(x) = c_{0} + c_{1}x + c_{2}x^{2} + ... + c_{N − k}x^{N − k} be a monic polynomial over Z_{p} of degree N − k such that g(x)h(x) = x^{N} − 1 over Z_{p}, for some monic irreducible polynomial h(x) over Z_{p}. Suppose that the vector (c_{0},c_{1},...,c_{N − k},c_{N − k + 1},...,c_{N − 1}), with c_{i} = 0 for (N − k) < i < N, has the property that it contains each element of Z_{p} the same number of times. Then, the N cyclic shifts of the vector g = (c_{0},c_{1},...,c_{N − 1}) form the core of the exponent matrix of some Hadamard matrix H(p,p^{k}).
DNA codes with constant GC-content can be constructed from constant-composition codes over Z_{p} by mapping the symbols of Z_{p} to the symbols of the DNA alphabet, Q = {A, T, C, G}. (A constant-composition code over a k-ary alphabet has the property that the number of occurrences of each of the k symbols within a codeword is the same for every codeword.) For example, using the cyclic constant-composition code of length 3^{k} − 1 over Z_{3} guaranteed by the theorem proved above and the resulting property, and using the mapping that takes 0 to A, 1 to T and 2 to G, we obtain a DNA code D with 3^{k} − 1 codewords and a GC-content of 3^{k − 1}. Clearly d_{H}(D) = 2·3^{k − 1}, and in fact, since \bar{G} = C and no codeword in D contains the symbol C, we also have d_{H}^{RC}(D) ≥ 3^{k − 1}. This is summarized in the following corollary.^{[4]}
Corollary
For any positive integer k, there exist DNA codes D with 3^{k} − 1 codewords of length 3^{k} − 1, constant GC-content 3^{k − 1}, and in which every codeword is a cyclic shift of a fixed generator codeword g.
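For k = 2 the corollary can be illustrated with the H(3,9) example given earlier, whose generator polynomial g(x) = x^{6} + 2x^{5} + 2x^{4} + 2x^{2} + x + 1 has the padded coefficient vector (1,1,2,0,2,2,1,0). The sketch below is an illustration under that assumption; it maps the 8 cyclic shifts to DNA words with 0 → A, 1 → T, 2 → G and measures the resulting parameters. The helper names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
SYMBOL_TO_BASE = {0: "A", 1: "T", 2: "G"}   # mapping used in the text


def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))


def reverse_complement(q):
    return "".join(COMPLEMENT[base] for base in reversed(q))


def gc_content(q):
    return sum(base in "GC" for base in q)


# Coefficients c_0..c_7 of g(x) for p = 3, k = 2, padded with a zero to length 8.
generator = [1, 1, 2, 0, 2, 2, 1, 0]
N = len(generator)

# The DNA code: all N cyclic shifts of the generator, mapped to bases.
codewords = ["".join(SYMBOL_TO_BASE[c] for c in generator[-s:] + generator[:-s])
             for s in range(N)]

gc_values = {gc_content(w) for w in codewords}
min_hamming = min(hamming(u, v)
                  for i, u in enumerate(codewords) for v in codewords[i + 1:])
# The reverse-complement distance is taken over all ordered pairs,
# including a codeword paired with itself.
min_rc = min(hamming(u, reverse_complement(v))
             for u in codewords for v in codewords)
```

Every codeword here has GC-content 3 = 3^{k − 1} and the minimum Hamming distance is 6 = 2·3^{k − 1}. Because no codeword contains C while every reverse-complement contains three of them, the reverse-complement distance is at least 3.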
Each of the following vectors generates a cyclic core of a Hadamard matrix H(p,p^{n}) (where N + 1 = p^{n}, and n = 3 in this example)^{[4]} :
g^{(1)} = (22201221202001110211210200);
g^{(2)} = (20212210222001012112011100).
Here, g(x) = a_{0} + a_{1}x + ... + a_{n}x^{n}.
Thus, DNA codes can be obtained from such generators by mapping 0, 1, 2 onto A, T, G. All such mappings yield codes with essentially the same parameters. However, the actual choice of mapping has a strong influence on the secondary structure of the codewords. For example, one codeword can be obtained from g^{(1)} via the mapping 0 → A, 1 → T, 2 → G, while a different codeword is obtained from the same generator g^{(1)} via the mapping 0 → G, 1 → T, 2 → A.
2. Code construction via a Binary Mapping
Perhaps a simpler approach to designing DNA codewords is to treat the design problem as one of constructing binary codes, that is, to map the DNA alphabet onto the set of 2-bit binary words as follows: A → 00, T → 01, C → 10, G → 11.
As we can see, the first bit of a binary image determines which complementary pair the base belongs to.
Let q be a DNA sequence. The sequence b(q) obtained by applying the mapping given above to q, is called the binary image of q.
Write b(q) = b_{0}b_{1}b_{2}...b_{2n − 1}.
The subsequence e(q) = b_{0}b_{2}...b_{2n − 2} is called the even subsequence of b(q), and o(q) = b_{1}b_{3}b_{5}...b_{2n − 1} is called the odd subsequence of b(q).
For example, for q = ACGTCC we have b(q) = 001011011010.
Then e(q) = 011011 and o(q) = 001100.
For a DNA code D, let us define its even component as E(D) = {e(q) : q ∈ D}, and its odd component as O(D) = {o(q) : q ∈ D}.
With this choice of binary mapping, the GC-content of a DNA sequence q equals the Hamming weight of e(q).
Hence, a DNA code is a constant GC-content code if and only if its even component is a constant-weight code.
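The binary image and its even and odd subsequences are straightforward to compute. The following minimal sketch uses the mapping A → 00, T → 01, C → 10, G → 11 given above; the function names are illustrative.

```python
BASE_TO_BITS = {"A": "00", "T": "01", "C": "10", "G": "11"}


def binary_image(q):
    """b(q): concatenate the 2-bit images of the bases of q."""
    return "".join(BASE_TO_BITS[base] for base in q)


def even_subsequence(q):
    """e(q) = b_0 b_2 ... b_{2n-2}, the first bit of every base."""
    return binary_image(q)[0::2]


def odd_subsequence(q):
    """o(q) = b_1 b_3 ... b_{2n-1}, the second bit of every base."""
    return binary_image(q)[1::2]


def gc_content(q):
    return sum(base in "GC" for base in q)


q = "ACGTCC"
b, e, o = binary_image(q), even_subsequence(q), odd_subsequence(q)
```

Since C and G are exactly the bases whose first bit is 1, the number of ones in `e` equals the GC-content of `q`.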
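Let B be a binary code consisting of M codewords of length n and minimum distance d_{min}, such that c ∈ B implies that \bar{c} ∈ B, where \bar{c} denotes the bitwise complement of c.
For w > 0, consider the constant-weight subcode B_{w} = {u ∈ B : w_{H}(u) = w}, where w_{H}(.) denotes Hamming weight. Choose w > 0 such that n ≥ 2w + ⌈d_{min} / 2⌉, and consider a DNA code, D_{w}, with the following choice for its even and odd components:
E(D_{w}) = {a\bar{b} : a, b ∈ B_{w}}, O(D_{w}) = {ab^{RC} : a, b ∈ B, a <_{lex} b}.
Here <_{lex} denotes lexicographic ordering, and b^{RC} denotes the reversed bitwise complement of b. The condition a <_{lex} b in the definition of O(D_{w}) ensures that if ab^{RC} ∈ O(D_{w}), then ba^{RC} ∉ O(D_{w}), so that distinct codewords in O(D_{w}) cannot be reverse-complements of each other.
The code E(D_{w}) has |B_{w}|^{2} codewords of length 2n and constant weight n.
Furthermore, d_{H}(E(D_{w})) ≥ d_{min} (this is because B_{w} is a subset of the codewords in B).
Also, d_{H}^{R}(a\bar{b}, c\bar{d}) = d_{H}(a\bar{b}, d^{RC}c^{R}) = d_{H}(a, d^{RC}) + d_{H}(\bar{b}, c^{R}).
Note that b and d both have weight w. This implies that b^{RC} and d^{RC} have weight n − w.
Because of the weight constraint on w, we must have, for all a, b, c, d ∈ B_{w}, d_{H}^{R}(a\bar{b}, c\bar{d}) ≥ 2(n − 2w) ≥ d_{min}.
The code O(D_{w}) has M(M − 1) / 2 codewords of length 2n.
We also see that d_{H}(O(D_{w})) ≥ d_{min} (because the component codewords of O(D_{w}) are taken from B).
Similarly, d_{H}^{RC}(O(D_{w})) ≥ d_{min}.
Therefore, the DNA code D_{w}, consisting of all sequences q with e(q) ∈ E(D_{w}) and o(q) ∈ O(D_{w}), has |B_{w}|^{2} M(M − 1) / 2 codewords of length 2n, and satisfies d_{H}(D_{w}) ≥ d_{min} and d_{H}^{RC}(D_{w}) ≥ d_{min}.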
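The component codes of this construction can be explored numerically on a small case. The sketch below rests on several assumptions that are not stated explicitly in the text above: the even component is taken to be all words a followed by the bitwise complement of b, with a and b of weight w in B; the odd component is taken to be all words a followed by the reversed complement of b, with a lexicographically smaller than b in B; and B is chosen, purely for illustration, as the even-weight code of length 6, which is closed under complementation and has minimum distance 2.

```python
from itertools import combinations, product


def complement(u):
    """Bitwise complement of a binary string."""
    return "".join("1" if c == "0" else "0" for c in u)


def rc(u):
    """Reversed bitwise complement of a binary string."""
    return complement(u[::-1])


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


n, w = 6, 2
# Assumed base code B: all even-weight words of length 6, in sorted order.
B = sorted("".join(bits) for bits in product("01", repeat=n)
           if bits.count("1") % 2 == 0)
d_min = min(hamming(u, v) for u, v in combinations(B, 2))
assert n >= 2 * w + (d_min + 1) // 2        # weight constraint on w

B_w = [u for u in B if u.count("1") == w]   # constant-weight subcode

# Assumed even component: a followed by the complement of b.
even_comp = [a + complement(b) for a in B_w for b in B_w]
# Assumed odd component: a followed by b^RC, with a lexicographically below b.
odd_comp = [a + rc(b) for a, b in combinations(B, 2)]

# Ordinary minimum distances of the two components.
min_dist_even = min(hamming(u, v) for u, v in combinations(even_comp, 2))
min_dist_odd = min(hamming(u, v) for u, v in combinations(odd_comp, 2))

# The DNA complement flips only the second bit of a base, so reverse-complementing
# a DNA word reverses its even subsequence and reverse-complements its odd one.
reversed_even = [v[::-1] for v in even_comp]
min_rev_even = min(hamming(u, v) for u in even_comp for v in reversed_even)
rc_odd = [rc(v) for v in odd_comp]
min_rc_odd = min(hamming(u, v) for u in odd_comp for v in rc_odd)
```

Under these assumptions all four minima are at least d_{min}, and every word of the even component has weight n. Combining each even-component word with each odd-component word would give the DNA code itself, which is not enumerated here because of its size.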
Given the examples above, one may ask what the future potential of DNA-based computers could be.
Despite its enormous potential, this method is highly unlikely to be implemented in home or office computers, because flexibility, speed and cost all favor the silicon-chip-based devices used in today's computers.^{[2]}
However, such a method could be used in situations where no other method is available and where the accuracy associated with the DNA hybridization mechanism is required, that is, in applications whose operations must be performed with a high degree of reliability.
Currently, there are several software packages, such as the Vienna package,^{[7]} which can predict secondary structure formation in single-stranded DNA (i.e. oligonucleotides) or RNA sequences.
References
 [1] Adleman, L. M. (1994). "Molecular computation of solutions to combinatorial problems". Science 266: 1021–1024. http://www.usc.edu/dept/molecularscience/papers/fpsci94.pdf.
 [2] M. Mansuripur, P.K. Khulbe, S.M. Kuebler, J.W. Perry, M.S. Giridhar and N. Peyghambarian. "Information storage and retrieval using macromolecules as storage media". http://www.opticsinfobase.org/view_article.cfm?gotourl=http%3A%2F%2Fwww.opticsinfobase.org%2FDirectPDFAccess%2F17ADF6D8BDB9137EC485E432FB3BB804_113204.pdf%3Fda%3D1%26id%3D113204%26seq%3D0&org=State%20University%20of%20New%20York%20at%20Buffalo%20Lockwood%20Library. University of Arizona Technical Report, 2003.
 [3] Olgica Milenkovic and Navin Kashyap. "On the design of codes for DNA computing". http://www.springerlink.com/content/d81776p5222nm638/. Coding and Cryptography (International Workshop, WCC 2005, Bergen, Norway, March 14–18, 2005).
 [4] Cooke, C. (1999). "Polynomial construction of complex Hadamard matrices with cyclic core". Applied Mathematics Letters 12: 87–93. doi:10.1016/S0893-9659(98)00131-1. http://www.springerlink.com/content/d81776p5222nm638/.
 [5] J. Adamek (1991). Foundations of Coding. New York: John Wiley.
 [6] Zierler, N. (1959). "Linear recurring sequences". J. Soc. Indust. Appl. Math. 7: 31–48. doi:10.1137/0107003.
 [7] The Vienna RNA secondary structure package.
Wikimedia Foundation. 2010.