Sulston score

The Sulston Score is an equation used in DNA mapping to numerically assess the likelihood that a given "fingerprint" similarity between two DNA clones is merely a result of chance. Used as such, it is a test of statistical significance: low values imply that the similarity is significant, suggesting that the two DNA clones overlap one another and that the similarity is not just a chance event. The name is an eponym referring to John Sulston, the lead author of the paper that first proposed the equation's use [J. Sulston, F. Mallett, R. Staden, R. Durbin, T. Horsnell, and A. Coulson (1988) [http://www.ncbi.nlm.nih.gov/pubmed/2838135?ordinalpos=2&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum "Software for genome mapping by fingerprinting techniques"], Computer Applications in the Biosciences 4(1), 125-132].

The Overlap Problem in Mapping

Each clone in a DNA mapping project has a "fingerprint", i.e. a set of DNA fragment lengths inferred from (1) enzymatically digesting the clone, (2) separating these fragments on a gel, and (3) estimating their lengths based on gel location. For each pairwise clone comparison, one can establish how many lengths from each set match up. Cases having at least one match indicate that the clones might overlap, because matches may represent the same DNA. However, the underlying sequences for each match are not known. Consequently, two fragments whose lengths match may still represent different sequences. In other words, matches do not conclusively indicate overlaps. The problem is instead one of using matches to probabilistically classify overlap status.
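The match-counting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `count_matches`, the representation of a fingerprint as a plain list of band positions, and the rule that a band on the second clone counts as matched when any band on the first clone lies within the tolerance are all choices made here, not details taken from a particular mapping program.

```python
from bisect import bisect_left


def count_matches(bands_a, bands_b, tolerance):
    """Count bands of clone B that have at least one band of clone A
    within +/- tolerance (hypothetical matching rule, see lead-in)."""
    a_sorted = sorted(bands_a)
    matches = 0
    for b in bands_b:
        # Index of the first band in A that is >= b - tolerance.
        k = bisect_left(a_sorted, b - tolerance)
        # That band matches if it also lies at or below b + tolerance.
        if k < len(a_sorted) and a_sorted[k] <= b + tolerance:
            matches += 1
    return matches


# Made-up band positions, in arbitrary gel units.
clone_a = [100, 250, 400, 900]
clone_b = [101, 398, 600]
print(count_matches(clone_a, clone_b, tolerance=2))
```

Sorting one fingerprint and using binary search keeps each comparison at roughly O(n log m) rather than O(nm), which matters little for a single pair but adds up over all pairwise comparisons in a project.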

Mathematical Scores in Overlap Assessment

Biologists have used a variety of means (often in combination) to discern clone overlaps in DNA mapping projects. While many are biological, i.e. looking for shared markers, others are basically mathematical, usually adopting probabilistic or statistical approaches.

Sulston Score Exposition

The Sulston Score is rooted in the concepts of Bernoulli and binomial processes, as follows. Consider two clones, $\alpha$ and $\beta$, having $m$ and $n$ measured fragment lengths, respectively, where $m \ge n$. That is, clone $\alpha$ has at least as many fragments as clone $\beta$, but usually more. The Sulston score is the probability that at least $h$ fragment lengths on clone $\beta$ will be matched by any combination of lengths on $\alpha$. Intuitively, there can be at most $n$ matches. Thus, for a given comparison between two clones, one can measure the statistical significance of a match of $h$ fragments, i.e. how likely it is that this match occurred simply as a result of random chance. Very low values indicate a significant match that is highly unlikely to have arisen by pure chance, while higher values suggest that the given match could be just a coincidence.

One of the basic assumptions is that fragments are uniformly distributed on a gel, i.e. a fragment has an equal likelihood of appearing anywhere on the gel. Since gel position is an indicator of fragment length, this assumption is equivalent to presuming that the fragment lengths are uniformly distributed. The measured location of any fragment $x$ has an associated error tolerance of $\pm t$, so that its true location is only known to lie within the segment $x \pm t$.

Derivation

In what follows, let us refer to individual fragment lengths simply as "lengths". Consider a specific length $i$ on clone $\alpha$ and a specific length $j$ on clone $\beta$. These two lengths are arbitrarily selected from their respective sets $i \in \{1, 2, \dots, m\}$ and $j \in \{1, 2, \dots, n\}$. We assume that the gel location of fragment $j$ has been determined, and we want the probability of the event $E_{i,j}$ that the location of fragment $i$ will match that of $j$. Geometrically, $i$ will be declared to match $j$ if it falls inside the window of size $2t$ around $j$. Since fragment $i$ could occur anywhere in the gel of length $G$, we have $P\langle E_{i,j} \rangle = 2t/G$. The probability that $i$ does not match $j$ is simply the complement, i.e. $P\langle E_{i,j}^C \rangle = 1 - 2t/G$, since it must either match or not match.

Now, let us expand this to compute the probability that no length on clone $\alpha$ matches the single particular length $j$ on clone $\beta$. This is simply the intersection of all individual trials $i \in \{1, 2, \dots, m\}$ where the event $E_{i,j}^C$ occurs, i.e. $P\langle E_{1,j}^C \cap E_{2,j}^C \cap \cdots \cap E_{m,j}^C \rangle$. This can be restated verbally as: length 1 on clone $\alpha$ does not match length $j$ on clone $\beta$, and length 2 does not match length $j$, and length 3 does not match, etc. Since each of these trials is assumed to be independent, the probability is simply

$$P\langle E_{1,j}^C \rangle \times P\langle E_{2,j}^C \rangle \times \cdots \times P\langle E_{m,j}^C \rangle = \left(1 - 2t/G\right)^m.$$

Of course, the actual event of interest is the complement, i.e. that it is not the case that there are no matches. In other words, the probability of one or more matches is $p = 1 - \left(1 - 2t/G\right)^m$. Formally, $p$ is the probability that at least one band on clone $\alpha$ matches band $j$ on clone $\beta$.
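As a minimal numerical sketch of this step, the per-band match probability can be evaluated directly from the number of bands on the larger clone, the tolerance, and the gel length. The function name `band_match_probability` and its argument names are chosen here for illustration only, and the input check is an added safeguard rather than part of the derivation.

```python
def band_match_probability(m, t, gel_length):
    """Probability that at least one of m uniformly placed bands falls
    within +/- t of a fixed band on a gel of the given length."""
    single = 2.0 * t / gel_length  # chance that one given band matches
    if not 0.0 <= single <= 1.0:
        raise ValueError("the window 2*t must lie between 0 and the gel length")
    # Complement of "none of the m independent bands matches".
    return 1.0 - (1.0 - single) ** m


# Illustrative values only: 2 bands, tolerance 1, gel length 100.
print(band_match_probability(m=2, t=1, gel_length=100))
```

With these made-up values the single-band probability is 0.02, so the result is 1 - 0.98^2, about 0.0396.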

This event is taken as a Bernoulli trial having a "success" (matching) probability of $p$ for band $j$. However, we want to describe the process over all the bands on clone $\beta$. Since $p$ is constant, the number of matches is distributed binomially. Given $h$ observed matches, the Sulston score $S$ is simply the probability of obtaining at least $h$ matches by chance according to

$$S = \sum_{j=h}^{n} C_{n,j} \, p^j (1-p)^{n-j},$$

where $C_{n,j}$ are binomial coefficients.
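The score can be evaluated directly as the upper tail of a binomial distribution. The following Python sketch assumes the uniform-gel model of the derivation; the function name `sulston_score` and its parameter names are illustrative and do not correspond to the interface of any particular mapping program.

```python
from math import comb


def sulston_score(h, m, n, t, gel_length):
    """Probability of at least h chance matches among the n bands of the
    smaller clone, given m bands on the larger clone, tolerance t, and a
    gel of the given length (traditional binomial form)."""
    # Per-band probability that some band of the larger clone matches.
    p = 1.0 - (1.0 - 2.0 * t / gel_length) ** m
    # Upper binomial tail: sum over j = h, ..., n.
    return sum(
        comb(n, j) * p**j * (1.0 - p) ** (n - j)
        for j in range(h, n + 1)
    )


# Illustrative values only: 1 band versus 2 bands, tolerance 1, gel length 100.
print(sulston_score(h=1, m=1, n=2, t=1, gel_length=100))
```

With these made-up values $p = 0.02$, so the score for $h = 1$ is $1 - 0.98^2$, about 0.0396, and the score for $h = 2$ is $0.02^2 = 0.0004$. Plain floating-point summation is used for brevity; for large clones and very small scores a log-space or arbitrary-precision evaluation may be preferable.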

Mathematical Refinement

In a 2005 paper [M.C. Wendl (2005) [http://www.ncbi.nlm.nih.gov/pubmed/15857243?dopt=Abstract "Probabilistic assessment of clone overlaps in DNA fingerprint mapping via a priori models"], Journal of Computational Biology 12(3), 283-297], Michael Wendl gave an example showing that the assumption of independent trials is not valid. So, although the traditional Sulston score does indeed represent a probability distribution, it is not the distribution characteristic of the fingerprint problem. Wendl went on to give the general solution for this problem in terms of the Bell polynomials, showing that the traditional score overpredicts P-values by orders of magnitude. (P-values are very small in this problem: for example, probabilities on the order of $10^{-14}$ versus $10^{-12}$, the latter Sulston value being 2 orders of magnitude too high.) This solution provides a basis for determining when a problem has sufficient information content to be treated by the probabilistic approach, and it is also a general solution to the birthday problem of two types.

A disadvantage of the exact solution is that its evaluation is computationally intensive and, in fact, is not feasible for comparing large clones. Some fast approximations for this problem have been proposed [M.C. Wendl (2007) [http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&Cmd=ShowDetailView&TermToSearch=17442113&ordinalpos=1&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_RVDocSum "Algebraic correction methods for computational assessment of clone overlaps in DNA fingerprint mapping"], BMC Bioinformatics 8, article 127].

See also

* [http://www.agcol.arizona.edu/software/fpc/ FPC] : a widely-used fingerprint mapping program that utilizes the Sulston Score


Wikimedia Foundation. 2010.
