N50 statistic
In computational biology, the N50 statistic is a measure of the typical length of a set of sequences, with greater weight given to longer sequences; it is a length-weighted median rather than an arithmetic mean. It is widely used in genome assembly, especially in reference to contig lengths within a draft assembly. Given a set of sequences of varying lengths, the N50 length is defined as the length N for which 50% of all bases in the sequences are in sequences of length L < N, with the remaining 50% in sequences of length L ≥ N.
N50 may also be defined as the contig length such that contigs of equal or greater length together contain half the bases of the assembly. The N50 size is computed by sorting all contigs from largest to smallest and determining the minimum set of contigs whose sizes total at least 50% of the entire assembly; the N50 is the length of the smallest contig in that set. For example, for a genome of 600 Mb, if the assembled sequences add up to 500 Mb, the N50 would be calculated by sorting the contigs from largest to smallest and finding the length of the contig at which the cumulative size reaches 250 Mb. Thus, N50 is calculated in the context of the assembly size rather than the genome size. The NG50 statistic is the same as the N50 except that the genome size is used rather than the assembly size (300 Mb would be the threshold in this example).
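The sort-and-accumulate procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `n50`, the `genome_size` parameter, and the contig lengths are made up for the example, and the sketch assumes the threshold is "at least half" of the total, a detail on which definitions differ.

```python
def n50(lengths, genome_size=None):
    """Return the N50 of the given contig lengths.

    If genome_size is given, the threshold is half the genome size
    instead of half the assembly size, which yields the NG50.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    # N50 uses the assembly size (sum of contigs); NG50 uses the genome size.
    total = sum(lengths) if genome_size is None else genome_size
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig where the cumulative size reaches
        # half of the total (assumed convention: "at least half").
        if 2 * running >= total:
            return length
    # The contigs cover less than half of genome_size: NG50 is undefined.
    return None


contigs = [80, 70, 50, 40, 30, 20]    # hypothetical lengths, total 290
print(n50(contigs))                   # N50  -> 70
print(n50(contigs, genome_size=400))  # NG50 -> 50
```

With these lengths the cumulative sizes are 80, 150, 200, and so on. Half the assembly (145) is first reached at the contig of length 70, while half of the assumed 400-base genome (200) is first reached at the contig of length 50.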
N50 can be found mathematically as follows: Take a list L of positive integers. Create another list L', which is identical to L, except that every element n in L has been replaced with n copies of itself. Then the median of L' is the N50 of L. For example: If L = {2, 2, 2, 3, 3, 4, 8, 8}, then L' consists of six 2's, six 3's, four 4's, and sixteen 8's (i.e. each of the three 2's in L is replaced with 2, 2, so there are six 2's in L'). L' has 32 elements, so the N50 of L is the median of L', which is the average of the 16th element (4) and the 17th element (8): (4+8)/2 = 6.
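The expansion-and-median construction can be checked directly for the example list. The following is a small sketch only: the name `n50_by_expansion` is hypothetical, and building L' explicitly is practical only for toy inputs, since its size equals the total number of bases.

```python
from statistics import median


def n50_by_expansion(lengths):
    """N50 as the median of L', where each n in L is repeated n times."""
    # Build L' explicitly: every element n contributes n copies of itself.
    expanded = [n for n in lengths for _ in range(n)]
    return median(expanded)


L = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50_by_expansion(L))  # 6.0, the average of the 16th and 17th elements
```

For comparison, accumulating the same lengths from largest to smallest reaches half of the total (16 of 32 bases) at a contig of length 8, so the median construction and the sort-and-accumulate procedure need not give the same value.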
Contradictory definitions
Some contradictions in the definition(s) of the N50 value have been identified, as discussed in a thread on the SEQanswers forum.
References
- Broad Institute wiki, "N50". http://www.broad.harvard.edu/crd/wiki/index.php/N50
- Miller JR, Koren S, Sutton G, "Assembly algorithms for next-generation sequencing data". http://www.ncbi.nlm.nih.gov/pubmed/20211242
Wikimedia Foundation. 2010.