- Secondary structure prediction
Secondary structure prediction is a set of techniques in bioinformatics that aim to predict the local secondary structures of proteins and RNA sequences based only on knowledge of their primary structure, that is, their amino acid or nucleotide sequence, respectively. For proteins, a prediction consists of assigning regions of the amino acid sequence as likely alpha helices, beta strands (often noted as "extended" conformations), or turns. The success of a prediction is determined by comparing it to the results of the DSSP algorithm applied to the crystal structure of the protein; for nucleic acids, it may be determined from the hydrogen bonding pattern. Specialized algorithms have been developed for the detection of specific well-defined patterns such as transmembrane helices and coiled coils in proteins, or canonical microRNA structures in RNA. [Mount DM (2004). Bioinformatics: Sequence and Genome Analysis, 2nd ed., Cold Spring Harbor Laboratory Press. ISBN 0879697121.]
The best modern methods of secondary structure prediction in proteins reach about 80% accuracy; this high accuracy allows the use of the predictions in fold recognition and "ab initio" protein structure prediction, classification of structural motifs, and refinement of sequence alignments. The accuracy of current protein secondary structure prediction methods is assessed in weekly benchmarks such as LiveBench and EVA.
The problems of predicting RNA secondary structure are broadly related but depend mainly on base pairing and base stacking interactions; many RNA molecules have several possible three-dimensional structures, so predicting these structures remains out of reach unless obvious sequence and functional similarity to a known class of RNA molecules, such as transfer RNA or microRNA, is observed. Many RNA secondary structure prediction methods rely on variations of dynamic programming and therefore are unable to efficiently identify pseudoknots.
Protein structure
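Accuracy figures for protein predictors are usually reported as the fraction of residues whose predicted state matches a reference assignment such as one derived from DSSP, a per-residue score commonly called Q3. The sketch below is illustrative only: the function names are hypothetical, and the reduction of DSSP's eight state codes to three classes is one common convention among several, assumed here rather than taken from the text.

```python
# Assumed (common, but not universal) reduction of DSSP codes to three classes.
DSSP_TO_THREE = {
    "H": "H", "G": "H", "I": "H",  # alpha, 3-10 and pi helices -> helix
    "E": "E", "B": "E",            # extended strand and isolated bridge -> strand
}


def to_three_state(dssp: str) -> str:
    """Map a DSSP string to H/E/C; any code not listed above becomes coil (C)."""
    return "".join(DSSP_TO_THREE.get(code, "C") for code in dssp)


def q3(predicted: str, reference: str) -> float:
    """Fraction of residues whose predicted class equals the reference class."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Hypothetical 16-residue example: a made-up DSSP string and a made-up prediction.
reference = to_three_state("-HHHHGGT-EEEEB-S")
predicted = "CHHHHHHCCEEECCCC"
score = q3(predicted, reference)
print(reference, predicted, score)
```

In this toy case the prediction misses the last two strand residues, so 14 of 16 positions agree.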
Early methods of secondary structure prediction, introduced in the 1960s and early 1970s, [Guzzo AV (1965). Influence of Amino-Acid Sequence on Protein Structure. Biophysical Journal 5: 809–822; Prothero JW (1966). Correlation between Distribution of Amino Acids and Alpha Helices. Biophysical Journal 6: 367–370; Schiffer M, Edmundson AB (1967). Use of Helical Wheels to Represent Structures of Proteins and to Identify Segments with Helical Potential. Biophysical Journal 7: 121–?; Kotelchuck D, Scheraga HA (1969). The Influence of Short-Range Interactions on Protein Conformation, II. A Model for Predicting the α-Helical Regions of Proteins. Proceedings of the National Academy of Sciences USA 62: 14–21. doi 10.1073/pnas.62.1.14. PMID 5253650; Lewis PN, Gō N, Gō M, Kotelchuck D, Scheraga HA (1970). Helix Probability Profiles of Denatured Proteins and Their Correlation with Native Structures. Proceedings of the National Academy of Sciences USA 65: 810–815. doi 10.1073/pnas.65.4.810. PMID 5266152] focused on identifying likely alpha helices and were based mainly on helix-coil transition models. [Froimowitz M, Fasman GD. (1974). Prediction of the secondary structure of proteins using the helix-coil transition theory. "Macromolecules" 7(5):583-9.] Significantly more accurate predictions that included beta sheets were introduced in the 1970s and relied on statistical assessments based on probability parameters derived from known solved structures. These methods, applied to a single sequence, are typically at most about 60-65% accurate, and often underpredict beta sheets.
The evolutionary conservation of secondary structures can be exploited by simultaneously assessing many homologous sequences in a multiple sequence alignment, by calculating the net secondary structure propensity of an aligned column of amino acids. In concert with larger databases of known protein structures and modern machine learning methods such as neural nets and support vector machines, these methods can achieve up to 80% overall accuracy in globular proteins. [Dor O, Zhou Y. (2006). Achieving 80% tenfold cross-validated accuracy for secondary structure prediction by large-scale training. "Proteins" Epub. PMID 17177203] The theoretical upper limit of accuracy is around 90%, partly due to idiosyncrasies in DSSP assignment near the ends of secondary structures, where local conformations vary under native conditions but may be forced to assume a single conformation in crystals due to packing constraints.
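The statistical idea described above, deriving per-residue propensities from solved structures and then pooling them over an alignment column, can be sketched roughly as follows. This is a toy illustration under stated assumptions, not any published parameterization: the propensity is taken as the frequency of an amino acid within a state divided by its overall frequency, column scores are simple averages over non-gap residues, and the training data, alignment and function names are all hypothetical.

```python
from collections import Counter

STATES = "HEC"  # helix, strand, coil


def propensities(training):
    """Build {state: {amino_acid: propensity}} from (sequence, states) pairs.

    Propensity here is P(aa | state) / P(aa), an assumed simple definition.
    """
    aa_counts, state_counts, joint = Counter(), Counter(), Counter()
    for seq, states in training:
        for aa, state in zip(seq, states):
            aa_counts[aa] += 1
            state_counts[state] += 1
            joint[(aa, state)] += 1
    total = sum(aa_counts.values())
    table = {state: {} for state in STATES}
    for state in STATES:
        if state_counts[state] == 0:
            continue  # state never observed in the training data
        for aa, count in aa_counts.items():
            frac_in_state = joint[(aa, state)] / state_counts[state]
            table[state][aa] = frac_in_state / (count / total)
    return table


def column_propensity(column, table, default=1.0):
    """Average each state's propensity over the non-gap residues of a column."""
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return {state: default for state in STATES}
    return {
        state: sum(table[state].get(aa, default) for aa in residues) / len(residues)
        for state in STATES
    }


def predict(alignment, table):
    """Assign each alignment column the state with the highest mean propensity."""
    result = []
    for column in zip(*alignment):
        scores = column_propensity(column, table)
        result.append(max(STATES, key=lambda state: scores[state]))
    return "".join(result)


# Hypothetical, deliberately tiny training set and alignment.
training = [
    ("AAEELLKK", "HHHHHHHH"),
    ("VVIIYYTT", "EEEEEEEE"),
    ("GGPPSSNN", "CCCCCCCC"),
]
table = propensities(training)
alignment = ["AEVIG", "AEVIP", "AKI-G"]
prediction = predict(alignment, table)
print(prediction)
```

Real methods smooth these scores over a window of neighboring positions and use far larger training sets; the per-column argmax is kept here only to make the pooling step visible.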
Limitations are also imposed by secondary structure prediction's inability to account for tertiary structure; for example, a sequence predicted as a likely helix may still be able to adopt a beta-strand conformation if it is located within a beta-sheet region of the protein and its side chains pack well with their neighbors. Dramatic conformational changes related to the protein's function or environment can also alter local secondary structure.
Chou-Fasman method
The Chou-Fasman method was among the first secondary structure prediction algorithms developed and relies predominantly on probability parameters determined from relative frequencies of each amino acid's appearance in each type of secondary structure. [Chou PY, Fasman GD. (1974). Prediction of protein conformation. Biochemistry. 13(2):222-45.] The original Chou-Fasman parameters, determined from the small sample of structures solved in the mid-1970s, produce poor results compared to modern methods, though the parameterization has been updated since it was first published. The Chou-Fasman method is roughly 50-60% accurate in predicting secondary structures.
GOR method
The GOR method, named for the three scientists who developed it ("G"arnier, "O"sguthorpe, and "R"obson), is an information theory-based method developed not long after Chou-Fasman that uses more powerful probabilistic techniques of Bayesian inference. [Garnier J, Osguthorpe DJ, Robson B. (1978). Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J Mol Biol 120:97-120.] The GOR method takes into account not only the probability of each amino acid having a particular secondary structure, but also the conditional probability of the amino acid assuming each structure given that its neighbors assume the same structure. This method is both more sensitive and more accurate because amino acid structural propensities are only strong for a small number of amino acids such as proline and glycine. The original GOR method is roughly 65% accurate and is dramatically more successful in predicting alpha helices than beta sheets, which it frequently mispredicts as loops or disorganized regions.
Machine learning
Neural network methods use training sets of solved structures to identify common sequence motifs associated with particular arrangements of secondary structures. These methods are over 70% accurate in their predictions, although beta strands are still often underpredicted due to the lack of three-dimensional structural information that would allow assessment of hydrogen bonding patterns that can promote formation of the extended conformation required for the presence of a complete beta sheet.
Support vector machines have proven particularly useful for predicting the locations of turns, which are difficult to identify with statistical methods. [Pham TH, Satou K, Ho TB. (2005). Support vector machines for prediction and analysis of beta and gamma-turns in proteins. "J Bioinform Comput Biol" 3(2):343-58. PMID 15852509] The requirement of relatively small training sets has also been cited as an advantage to avoid overfitting to existing structural data. [Zhang Q, Yoon S, Welsh WJ. (2005). Improved method for predicting beta-turn using support vector machine. "Bioinformatics" 21(10):2370-4. PMID 15797917]
Extensions of machine learning techniques attempt to predict more fine-grained local properties of proteins, such as backbone dihedral angles in unassigned regions. Both SVMs [Zimmermann O, Hansmann UH. (2006). Support vector machines for prediction of dihedral angle regions. "Bioinformatics" 22(24):3009-15. PMID 17005536] and neural networks [Kuang R, Leslie CS, Yang AS. (2004). Protein backbone angle prediction with machine learning approaches. "Bioinformatics" 20(10):1612-21. PMID 14988121] have been applied to this problem.
RNA structure
Dynamic programming algorithms are commonly used to detect base pairing patterns that are "well-nested", that is, whose base pairs do not cross one another in sequence position: any two pairs are either nested one inside the other or entirely separate. Secondary structures that fall into this category include double helices, stem-loops, and variants of the "cloverleaf" pattern found in transfer RNA molecules. These methods rely on precalculated parameters estimating the free energy associated with particular types of base-pairing interactions, including Watson-Crick and Hoogsteen base pairs. Depending on the complexity of the method, single base pairs may be considered, or short two- or three-base segments to incorporate the effects of base stacking. This method cannot identify pseudoknots, which are not well nested, without substantial algorithmic modifications that are extremely computationally expensive. [Rivas E, Eddy S. (1999). A dynamic programming algorithm for RNA structure prediction including pseudoknots, J Mol Biol, 285(5): 2053-2068.]
Sequence covariation methods rely on the existence of a data set composed of multiple homologous RNA sequences with related but dissimilar sequences. These methods analyze the covariation of individual base sites in evolution; maintenance at two widely separated sites of a pair of base-pairing nucleotides indicates the presence of a structurally required hydrogen bond between those positions. The general problem of pseudoknot prediction has been shown to be NP-complete. [Lyngsø RB, Pedersen CN. (2000). RNA pseudoknot prediction in energy-based models. J Comput Biol 7(3-4): 409-427.]
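The dynamic programming approach to well-nested structures described in this section can be illustrated with a Nussinov-style recursion. Note the substitution: this sketch maximizes the number of base pairs instead of minimizing a free energy from precalculated parameters, so it shows the shape of the recursion and not an energy-based predictor. The allowed pairs (Watson-Crick plus G-U wobble), the minimum loop length of 3, and all function names are assumptions made for the example.

```python
# Assumed pairing rules: Watson-Crick pairs plus G-U wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def fill(seq, min_loop=3):
    """best[i][j] = maximum number of nested pairs within seq[i..j] (inclusive)."""
    n = len(seq)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j is left unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with some k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best


def traceback(seq, best, min_loop=3):
    """Recover one optimal set of (i, j) pairs from the filled table."""
    pairs = []
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # interval too short to contain a pair
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = best[i][k - 1] if k > i else 0
                if left + 1 + best[k + 1][j - 1] == best[i][j]:
                    pairs.append((k, j))
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return sorted(pairs)


def dot_bracket(length, pairs):
    """Render pairs in dot-bracket notation."""
    out = ["."] * length
    for i, j in pairs:
        out[i], out[j] = "(", ")"
    return "".join(out)


seq = "GGGAAACCC"  # hypothetical hairpin-forming sequence
best = fill(seq)
pairs = traceback(seq, best)
structure = dot_bracket(len(seq), pairs)
print(structure)
```

Because each interval is split only into an inner part and a part to its left, the recursion can never produce crossing pairs, which is why pseudoknots are out of reach for this family of algorithms without the costly extensions mentioned above.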
See also
* Protein structure prediction software
* Molecular modeling software
External links
* [http://www.predictprotein.org/ PredictProtein]
* [http://bioweb.pasteur.fr/seqanal/interfaces/mfold-simple.html Mfold] RNA structure prediction
* [http://scratch.proteomics.ics.uci.edu/ SCRATCH] Protein structure prediction suite that includes SSpro
Wikimedia Foundation. 2010.