Given the importance of these scores in both FASTQ and MAQ (for me, specifically using alignment quality scores from Illumina sequencing runs to monitor run and sample quality), I was a bit surprised not to find a complete work-up of the meanings, the scores, the glyphs assigned to the scores, and the encoding interpretations of these scores in one location. The two (three) tables shown here hopefully provide a meaningful summary.

I should qualify that much of the background for this page was taken from four key places. First is the Wikipedia entry for FASTQ. Second is the Wikipedia entry for Phred quality score. Third is the Rosetta Stone of Phred score interpretation in the form of the open access article: P. J. A. Cock, C. J. Fields, N. Goto, M. L. Heuer and P. M. Rice, "The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants." Nucleic Acids Research, 2010, Vol. 38, No. 6, 1767-1771, doi:10.1093/nar/gkp1137. Fourth is seqanswers.com in various forms.

## (Sanger) Phred Quality Scores

I refer you to the two Wikipedia articles on FASTQ and Phred quality scores for historical content (and for a brief discussion of the processing of chromatogram data for the production of quality scores). Table 1 shows the Q[Phred] values (**Phred Q**) obtained from the P[Phred] values (**Probability (P) Of Wrong Base**). It then adds the ASCII codes (**Sanger "Q + 33" Shift**) and characters (**Sanger "Q + 33" ASCII GLYPH**) for the original Phred scores, followed by the Illumina 1.3+ codes (**Illumina 1.3+ "Q + 64" Shift**) and their corresponding ASCII glyphs (**Illumina 1.3+ "Q + 64" ASCII GLYPH**). In the Sanger method, Phred scores 0-to-93 use ASCII characters 33-to-126, which keeps every single-character score a printable glyph. Illumina 1.3+ uses ASCII glyphs 64-to-126 to cover scores 0-to-62 on the same Phred scale. This is all likely completely self-explanatory (or hopefully will be by the bottom of the post). For review, the relationship between the Phred quality score **Q[Sanger]** and the base-calling error probability **P** is

$$Q_{\text{Sanger}} = -10 \log_{10} P$$

or, re-written for the logarithmically challenged…

$$P = 10^{-Q_{\text{Sanger}}/10}$$
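As a minimal sketch of the two relationships above and of the "Q + 33" and "Q + 64" shifts, the following Python converts between Q and P and between Q and its ASCII glyph. The function and constant names are my own illustrative choices, not part of any FASTQ library.

```python
import math

# Offsets as described in the text (names are illustrative assumptions).
SANGER_OFFSET = 33        # Sanger: Q 0..93 maps to ASCII 33..126
ILLUMINA_13_OFFSET = 64   # Illumina 1.3+: Q 0..62 maps to ASCII 64..126


def phred_to_prob(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def prob_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def encode(q, offset):
    """ASCII glyph for an integer quality score under the given shift."""
    return chr(q + offset)


def decode(glyph, offset):
    """Integer quality score for an ASCII glyph under the given shift."""
    return ord(glyph) - offset


# Print a few rows in the layout of Table 1.
for q in (0, 10, 20, 30, 40):
    print(q, phred_to_prob(q),
          q + SANGER_OFFSET, encode(q, SANGER_OFFSET),
          q + ILLUMINA_13_OFFSET, encode(q, ILLUMINA_13_OFFSET))
```

Decoding a whole quality string is then a matter of applying `decode` to each character, provided you already know which shift the file uses.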

Table 1. Phred Quality Scores (Q), Wrong Base Probabilities, And Sanger And Illumina 1.3+ ASCII Glyphs.

| Phred Q | Probability (P) Of Wrong Base | Sanger "Q + 33" Shift | Sanger "Q + 33" ASCII GLYPH | Illumina 1.3+ "Q + 64" Shift | Illumina 1.3+ "Q + 64" ASCII GLYPH |
|---|---|---|---|---|---|

My assumption going in, when I was producing plots from the Q[Sanger] and Q[Solexa] data, was that "P" was the same value in both and that the Solexa system simply opted to use the odds (P/(1-P)) as its metric. A proper two-second consideration of the shapes of P and P/(1-P) would have led to the immediate conclusion that something was afoot. The table columns to the left of the black bar in Table 2 (2A) are the Q[Solexa] values computed from the Q[Sanger] probabilities. They are here simply to show that the two are, in fact, not the same. If you've spent any time wondering why you can't adequately… manipulate Excel's rounding tools to reproduce the Q[Solexa] integer values, this is why.

The probabilities obtained for Q[Solexa] were, in fact, worked backwards from the integer values of Q[Solexa] (having found no table online that gives a number-by-number summary of the probability or odds). For background, the Q[Solexa] values are obtained from:

$$Q_{\text{Solexa}} = -10 \log_{10}\left(\frac{P}{1-P}\right)$$
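Working backwards from an integer Q[Solexa] to a probability, as described above, amounts to inverting the odds formula. A minimal sketch follows (the function names are hypothetical, chosen only for illustration):

```python
import math


def prob_to_solexa(p):
    """Solexa score from an error probability: Q = -10 * log10(P / (1 - P))."""
    return -10 * math.log10(p / (1 - p))


def solexa_to_prob(q):
    """Invert the Solexa formula: odds = 10^(-Q/10), then P = odds / (1 + odds)."""
    odds = 10 ** (-q / 10)
    return odds / (1 + odds)


# Print Solexa Q, probability, and odds for a few integer scores.
for q in (-5, 0, 10, 20, 40):
    p = solexa_to_prob(q)
    print(q, p, p / (1 - p))
```

Note that Q[Solexa] = 0 corresponds to P = 0.5 (even odds), and that negative scores correspond to P > 0.5, which the Phred definition can only express as a Q between 0 and about 3.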

Table 2A (first three columns): Q[Solexa] from P[Sanger]. Table 2B (last five columns): Q[Solexa] and associated odds (P/(1-P)).

| Probability (P) Of Wrong Base | Associated Sanger Odds [P/(1-P)] | Q[Solexa] Based On Phred Probability | Solexa Q [-5 to 62] | Solexa Probability (P) Of Wrong Base | Solexa Odds [P/(1-P)] | Solexa "Q + 64" Q Shift | Solexa "Q + 64" ASCII GLYPH |
|---|---|---|---|---|---|---|---|

With all three data sets, I reproduce below a plot familiar to the FASTQ community, showing the asymptotic behavior of the Q[Solexa] and Q[Sanger] values at high Q (which represents the lowest read errors; the curves approach one another because the numbers are simply too damn small on the plot). Also obvious from the plot is that the curves show poor agreement with each other in the range where the error probability is highest (so the entire analysis goes to pot as the data quality goes to pot [ed. note for the international reader: "pot" refers to the device found in the water-closet]). The grey line is a good plot of the wrong data (that in Table 2A).
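The convergence at high Q and the disagreement at low Q can also be checked numerically. Eliminating P between the two definitions of Q gives Q[Sanger] = 10 * log10(10^(Q[Solexa]/10) + 1). The sketch below (with a hypothetical function name) prints the gap between the two scales for the same underlying error probability:

```python
import math


def solexa_to_phred(q_solexa):
    """Phred score with the same error probability as the given Solexa score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


# The gap (Phred minus Solexa) shrinks toward zero as Q grows.
for q in (-5, 0, 5, 10, 20, 30, 40):
    q_phred = solexa_to_phred(q)
    print(f"{q:>4} {q_phred:9.4f} {q_phred - q:9.4f}")
```

At Q[Solexa] = 0 the gap is about 3 units; by Q[Solexa] = 40 it is well under a thousandth of a unit, which is why the curves are indistinguishable on the plot at high Q.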

The presentation of this data is likely complete overkill, but I have found it useful in discussion. Hopefully having the tables in front of someone during an explanation will help clarify that explanation.
