- Pseudo amino acid composition
**Pse**udo **a**mino **a**cid composition, or PseAA composition, was originally introduced by [*http://gordonlifescience.org/members/kcchou/index.html Professor Kuo-Chen Chou*] [1] in 2001 to represent protein samples for statistical prediction. The conventional amino acid (AA) composition contains 20 components, each reflecting the occurrence frequency of one of the 20 native amino acids in a protein. In contrast, the PseAA composition contains a set of more than 20 discrete factors, where the first 20 represent the components of the conventional AA composition while the additional factors incorporate some sequence-order information via various modes. Typically, these additional factors are a series of rank-different correlation factors along a protein chain, but they can also be any combination of other factors so long as they reflect some sort of sequence-order effect. The essence of PseAA composition is therefore that it covers the AA composition while also containing information beyond it, and hence can better reflect the features of a protein sequence through a discrete model. Since the concept was introduced, it has been widely used to predict various protein attributes, such as protein subcellular localization, membrane protein type, enzyme functional class, GPCR type, protease type, protein structural class, and protein secondary structural content, among many others (see the references cited in [2] and [3]). Meanwhile, various modes of formulating the PseAA composition have also been developed [2].

**Background**

In the history of developing methods for predicting subcellular localization of proteins and their other attributes, two kinds of models were generally used to represent protein samples: (1) the sequential model, and (2) the non-sequential or discrete model. The most typical sequential representation of a protein sample is its entire amino acid sequence, which contains the most complete information. This is an obvious advantage of the sequential model. To get the desired results, sequence-similarity-search-based tools are usually used to conduct the prediction. However, this kind of approach fails when a query protein has no significant homology to proteins whose attributes are known. Thus, various discrete models were proposed.

The simplest discrete model uses the AA (amino acid) composition to represent protein samples, formulated as follows. Given a protein sequence **P** with $L$ amino acid residues, i.e.,

$\mathbf{P} = \mbox{R}_1 \mbox{R}_2 \mbox{R}_3 \mbox{R}_4 \mbox{R}_5 \mbox{R}_6 \mbox{R}_7 \cdots \mbox{R}_L \qquad \mbox{(1)}$

where $\mbox{R}_1$ represents the 1st residue of the protein **P**, $\mbox{R}_2$ the 2nd residue, and so forth. According to the AA composition model, the protein **P** of Eq.1 can be expressed by

$\mathbf{P} = \begin{bmatrix} f_1 & f_2 & \cdots & f_{20} \end{bmatrix}^{\mathbf{T}} \qquad \mbox{(2)}$
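As a rough illustration of Eq.2, the frequency vector can be computed in a few lines of Python. This is only a sketch: the function name `aa_composition`, the alphabetical one-letter ordering of the 20 components, and the choice to skip non-native residue codes are assumptions made here, not part of the original formulation.

```python
from collections import Counter

# The 20 native amino acids in one-letter code. The ordering is an assumption
# of this sketch; the text does not fix an order for f_1 ... f_20.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence):
    """Return [f_1, ..., f_20], the normalized occurrence frequencies of Eq.2."""
    counts = Counter(sequence.upper())
    # Residue codes outside the 20 native amino acids are skipped (assumption).
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no native amino acid residues")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Two different sequences with the same residue counts give the same vector,
# so the order of residues is not captured by this representation.
print(aa_composition("ACCA") == aa_composition("CAAC"))  # prints True
```

The final line shows the limitation of this representation: any reordering of the same residues yields an identical vector.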

In Eq.2, $f_u \, (u = 1, 2, \cdots, 20)$ are the normalized occurrence frequencies of the 20 native amino acids in **P**, and **T** is the transposing operator. Owing to its simplicity, the AA composition model was widely used in many earlier statistical methods for predicting protein attributes. Its main shortcoming is that all sequence-order information is lost when a protein is represented by its AA composition alone. To avoid completely losing the sequence-order information, the concept of PseAA (**pse**udo **a**mino **a**cid) composition was proposed by Professor Kuo-Chen Chou [1]. According to the PseAA composition model, the protein **P** of Eq.1 can be formulated as

$\mathbf{P} = \begin{bmatrix} p_1, \, p_2, \, \cdots, \, p_{20}, \, p_{20+1}, \, \cdots, \, p_{20+\lambda} \end{bmatrix}^{\mathbf{T}}, \quad (\lambda < L) \qquad \mbox{(3)}$

where the $20+\lambda$ components are given by

$p_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i \, + \, w \sum_{k=1}^{\lambda} \tau_k}, & (1 \le u \le 20) \\ \dfrac{w \tau_{u-20}}{\sum_{i=1}^{20} f_i \, + \, w \sum_{k=1}^{\lambda} \tau_k}, & (20+1 \le u \le 20+\lambda) \end{cases} \qquad \mbox{(4)}$

where $w$ is the weight factor, and $\tau_k$ the $k$-th tier correlation factor that reflects the sequence-order correlation between all the $k$-th most contiguous residues (Fig.1), as formulated by

$\tau_k = \frac{1}{L-k} \sum_{i=1}^{L-k} \mbox{J}_{i, i+k}, \quad (k < L) \qquad \mbox{(5)}$

with

$\mbox{J}_{i, i+k} = \frac{1}{\Gamma} \sum_{\xi=1}^{\Gamma} \left[ \Phi_{\xi}\left(\mbox{R}_{i+k}\right) - \Phi_{\xi}\left(\mbox{R}_{i}\right) \right]^2 \qquad \mbox{(6)}$
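Eqs.3 to 6 can be sketched together in Python as follows. This is a minimal illustration under stated assumptions, not the implementation behind any published tool: the function names, the alphabetical residue ordering, and the representation of each property function $\Phi_{\xi}$ as a dictionary from residue letter to number are all choices made here. The table `TOY_PROPERTY` is a placeholder, not a real physicochemical scale, and the weight `w=0.1` in the usage lines is arbitrary. The sequence is assumed to contain only the 20 native one-letter codes.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 native amino acids

# Placeholder property table: NOT a real physicochemical scale. It maps each
# residue to its index in AMINO_ACIDS purely so that the example is runnable.
TOY_PROPERTY = {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)}


def coupling(r_i, r_j, properties):
    """J of Eq.6: squared property difference averaged over the Gamma functions."""
    return sum((phi[r_j] - phi[r_i]) ** 2 for phi in properties) / len(properties)


def tier_correlation(sequence, k, properties):
    """tau_k of Eq.5: mean coupling between residues k positions apart."""
    length = len(sequence)
    if not 0 < k < length:
        raise ValueError("k must satisfy 0 < k < L")
    pairs = length - k
    return sum(coupling(sequence[i], sequence[i + k], properties)
               for i in range(pairs)) / pairs


def pseaa_composition(sequence, lam, properties, w):
    """Return the 20 + lam components p_u of Eq.4 as a list."""
    length = len(sequence)
    if not 0 <= lam < length:
        raise ValueError("lambda must satisfy 0 <= lambda < L")
    # Conventional AA composition f_1 ... f_20 (Eq.2).
    freqs = [sequence.count(aa) / length for aa in AMINO_ACIDS]
    # Correlation factors tau_1 ... tau_lambda (Eq.5).
    taus = [tier_correlation(sequence, k, properties) for k in range(1, lam + 1)]
    # Common denominator of both branches of Eq.4.
    denom = sum(freqs) + w * sum(taus)
    return [f / denom for f in freqs] + [w * t / denom for t in taus]


vector = pseaa_composition("ACDACD", lam=2, properties=[TOY_PROPERTY], w=0.1)
print(len(vector))            # 22
print(round(sum(vector), 6))  # 1.0
```

Because both branches of Eq.4 share the same denominator, the $20+\lambda$ components always sum to 1. Unlike the plain AA composition, two sequences with the same residue counts but different orderings, such as `"ACDACD"` and `"AACCDD"`, generally produce different vectors.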

In Eq.6, $\Phi_{\xi}\left(\mbox{R}_{i}\right)$ is the $\xi$-th function of the amino acid $\mbox{R}_i$, and $\Gamma$ the total number of functions considered. For example, in the original paper by [*http://gordonlifescience.org/members/kcchou/index.html Professor Kuo-Chen Chou*] [1], $\Phi_{1}\left(\mbox{R}_{i}\right)$, $\Phi_{2}\left(\mbox{R}_{i}\right)$ and $\Phi_{3}\left(\mbox{R}_{i}\right)$ are respectively the hydrophobicity value, hydrophilicity value, and side chain mass of amino acid $\mbox{R}_i$, while $\Phi_{1}\left(\mbox{R}_{i+1}\right)$, $\Phi_{2}\left(\mbox{R}_{i+1}\right)$ and $\Phi_{3}\left(\mbox{R}_{i+1}\right)$ are the corresponding values for the amino acid $\mbox{R}_{i+1}$. Therefore, the total number of functions considered there is $\Gamma = 3$. It can be seen from Eq.3 that the first 20 components, i.e. $p_1, \, p_2, \, \cdots, \, p_{20}$, are associated with the conventional AA composition of the protein, while the remaining components $p_{20+1}, \, \cdots, \, p_{20+\lambda}$ are the correlation factors that reflect the 1st-tier, 2nd-tier, …, and the $\lambda$-th tier sequence-order correlation patterns (Fig.1). It is through these additional $\lambda$ factors that some important sequence-order effects are incorporated.

**Web server**

Note that $\lambda$ in Eq.3 is an integer parameter and that choosing a different integer for $\lambda$ leads to a PseAA composition of a different dimension [2]. Also note that Eq.6 is just one of the modes for deriving the correlation factors or PseAA components. Others, such as the physicochemical distance mode [4] and the amphiphilic pattern mode [5], can also be used to derive different types of PseAA composition. In 2008, a free web server called PseAAC [6] was made available at http://chou.med.harvard.edu/bioinf/PseAAC/. Using this server, users can generate the PseAA composition for any given protein sequence by selecting the desired mode.

Figure 1. A schematic drawing to show (a) the 1st-tier, (b) the 2nd-tier, and (c) the 3rd-tier sequence-order-correlation mode along a protein sequence, where $\mbox{R}_1$ represents the amino acid residue at sequence position 1, $\mbox{R}_2$ at position 2, and so forth, and the coupling factors $\mbox{J}_{i,j}$ are given by Eq.6. Panel (a) reflects the correlation mode between all the most contiguous residues, panel (b) that between all the 2nd most contiguous residues, and panel (c) that between all the 3rd most contiguous residues. Adapted from [1] with permission.

**References**

[1] Kuo-Chen Chou, [*http://gordonlifescience.org/members/kcchou/paper/Proteins_43.pdf Prediction of protein cellular attributes using pseudo amino acid composition*], PROTEINS: Structure, Function, and Genetics 43 (2001) 246-255 (Erratum: ibid., 2001, Vol.44, 60).

[2] Kuo-Chen Chou, Hong-Bin Shen, [*http://gordonlifescience.org/members/kcchou/paper/AB_review_2007.pdf Review: Recent progresses in protein subcellular location prediction*], Analytical Biochemistry 370 (2007) 1-16.

[3] Kuo-Chen Chou, Hong-Bin Shen, Cell-PLoc: [*http://chou.med.harvard.edu/bioinf/Cell-PLoc/ A package of web-servers for predicting subcellular localization of proteins in various organisms*], [*http://gordonlifescience.org/members/kcchou/paper/Nature-Protocols_2008.pdf Nature Protocols 3 (2008) 153-162*].

[4] Kuo-Chen Chou, [*http://gordonlifescience.org/members/kcchou/paper/BBRC_quasi.pdf Prediction of protein subcellular locations by incorporating quasi-sequence-order effect*], Biochemical & Biophysical Research Communications 278 (2000) 477-483.

[5] Kuo-Chen Chou, [*http://gordonlifescience.org/members/kcchou/paper/Bioinf_21.pdf Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes*], Bioinformatics 21 (2005) 10-19.

[6] Hong-Bin Shen, Kuo-Chen Chou, PseAAC: [*http://gordonlifescience.org/members/kcchou/paper/AB_2008_PseAAC.pdf a flexible web-server for generating various kinds of protein pseudo amino acid composition*], Analytical Biochemistry 373 (2008) 386-388.

*Wikimedia Foundation.
2010.*