Confidence Scoring of Consensus DNA Base Calls
Description
Daniel H. Wagner Associates has developed algorithms to judge
the accuracy of DNA consensus base calls. The algorithms assign
a confidence score to each consensus call. The score corresponds
to the probability that the consensus call is incorrect. In fact,
the algorithms produce three scores corresponding to substitution,
insertion, and deletion error probabilities.
The consensus confidence scores are based on the set of primary
sequences that are assembled to create the consensus call. The
algorithms consider the number, consistency, and directions of
the primary sequences, as well as the quality of the trace data
associated with each, as estimated by the ATQA system. The consensus
confidence scoring algorithms employ sophisticated pattern recognition
methods like those used in ATQA and in our work on mixture detection.
Training Data
To develop a consensus confidence scoring algorithm, we require
a large training data set of consensus calls that are accurately
identified as correct calls or errors. We generate such training
data by subsampling from large sequence data sets.
We begin with a set of primary sequences and the consensus
sequence that is derived from them. At each consensus call, we
select random subsets of the assembled primary basecalls. For
each such random subset, we form a subsampled consensus call as
the majority call in the subset. In the (common) case that the
subsampled consensus call agrees with the original consensus call,
we label the subsampled call as correct. In the (rare)
case that the subsampled consensus call disagrees with the original
consensus call, we label the subsampled call as incorrect.
The following table illustrates the subsampling process. The
first block shows a fragment of the original sequence data with
8-fold coverage. The next two blocks show subsamples of depth
3. The first subsample yields consensus calls that agree with
the original. These subsampled consensus calls would be labelled
as correct. The second subsample yields two incorrect consensus
calls, a deletion and a substitution error, respectively.
Original Sequence Data:
1    A C C T G A C T
2    A - C T T A C T
3    A C C T G A C T
4    A - C T G A C T
5    A C C C T A C T
6    A C C N G A C T
7    A C C T G A C C
8    A C C C G A C N
Con  A C C T G A C T

First Subsample:
1    A C C T G A C T
2    A - C T T A C T
3    A C C T G A C T
Con  A C C T G A C T
Here, the subsampled consensus calls agree with the original consensus.

Second Subsample:
2    A - C T T A C T
4    A - C T G A C T
5    A C C C T A C T
Con  A - C T T A C T
Here, the 2nd and 5th subsampled consensus calls disagree with the
original consensus.
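The example in the table can be reproduced with a short sketch. It is a minimal illustration, not the implementation used for the work described here: the names `READS`, `majority_consensus`, `label_subsample`, and `random_subsample` are hypothetical, ties are broken in favor of the first call seen (an assumption), and one subset is drawn for the whole fragment rather than a fresh subset at each consensus call.

```python
import random
from collections import Counter

# Reads copied from the table, keyed by read number ("-" is a gap, "N" a no-call).
READS = {
    1: "ACCTGACT",
    2: "A-CTTACT",
    3: "ACCTGACT",
    4: "A-CTGACT",
    5: "ACCCTACT",
    6: "ACCNGACT",
    7: "ACCTGACC",
    8: "ACCCGACN",
}
ORIGINAL_CONSENSUS = "ACCTGACT"


def majority_consensus(reads):
    """Return the most common call in each column of equal-length reads."""
    consensus = []
    for column in zip(*reads):
        # Ties go to the call seen first in the column (assumed convention).
        consensus.append(Counter(column).most_common(1)[0][0])
    return "".join(consensus)


def label_subsample(read_ids):
    """Build the subsampled consensus and label each call correct/incorrect."""
    sub = majority_consensus([READS[i] for i in read_ids])
    labels = [s == o for s, o in zip(sub, ORIGINAL_CONSENSUS)]
    return sub, labels


def random_subsample(depth, rng):
    """Pick `depth` distinct read numbers at random."""
    return sorted(rng.sample(sorted(READS), depth))


# The second subsample from the table: reads 2, 4, and 5.
sub, labels = label_subsample([2, 4, 5])
wrong_positions = [i + 1 for i, ok in enumerate(labels) if not ok]
```

For reads 2, 4, and 5, `wrong_positions` holds the 1-based positions of the calls that would be labelled incorrect; `random_subsample(3, random.Random())` draws a fresh depth-3 subset to label in the same way.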
The results described below are based on a training data set
of ~11 million subsampled consensus calls of which 4400 were incorrect.
This training data was generated from 1100 primary sequences.
Methodology
We developed a set of numerical features that were designed
to distinguish correct consensus calls from incorrect consensus
calls. Most of these features were based on our ATQA
primary basecall confidence scores. For example, one feature was
the maximum ATQA score of the primary basecalls that agreed with
the consensus call.
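As a rough sketch of the example feature just described, the function below takes the primary calls in one consensus column, their confidence scores, and the consensus call. The function name, the input layout, the score values, and the choice to return 0.0 when no primary call agrees are all assumptions made for illustration.

```python
def max_agreeing_score(calls, scores, consensus_call):
    """Maximum confidence score among primary calls that match the consensus.

    Returns 0.0 when no primary call agrees (an assumed convention).
    """
    agreeing = [s for c, s in zip(calls, scores) if c == consensus_call]
    return max(agreeing, default=0.0)


# Hypothetical column: three reads call "G", one calls "T".
calls = ["G", "T", "G", "G"]
scores = [31.0, 12.0, 45.0, 27.0]  # made-up scores, not real ATQA output
feature = max_agreeing_score(calls, scores, "G")
```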
We used statistical classification algorithms to partition
the feature space into subsets in which the fraction of incorrect
consensus calls (from the training data) was relatively constant.
We assigned each subset a confidence score derived from its observed
error rate:
Score = -10 * log10(probability of error)
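The score formula and its inverse are easy to express directly. The sketch below uses hypothetical function names and covers only the conversion step; it does not attempt the classification that partitions the feature space. A subset with no observed errors has no finite score under this formula, so the sketch rejects it rather than guessing how such subsets were handled.

```python
import math


def score_from_error_rate(p_error):
    """Score = -10 * log10(probability of error)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_rate_from_score(score):
    """Inverse of the score formula: probability of error = 10 ** (-score / 10)."""
    return 10.0 ** (-score / 10.0)


def score_for_subset(n_incorrect, n_total):
    """Score for one feature-space subset from its observed error fraction."""
    return score_from_error_rate(n_incorrect / n_total)


# Whole training set taken as a single subset: 4400 errors in roughly
# 11 million calls (the count is approximate in the text).
overall_score = score_for_subset(4400, 11_000_000)
```

Treating the whole training set as one subset gives an error rate of about 0.0004, so `overall_score` comes out near 34.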
Results
Of the 11 million consensus calls in our training data set,
about two-thirds were used in building the scoring model. The
remaining one-third were used to test the model performance. The
following table presents the consensus confidence scores predicted
by the model, the corresponding predicted consensus error rate
for that score, and the observed error rate in the test data among
calls with that score. Apart from the score-0 row, the predicted and
observed error rates agree to within about a factor of two.
| Confidence Score | Predicted Error Rate | Observed Error Rate |
| 0                | 1.0                  | 0.099               |
| 28               | 0.0016               | 0.0011              |
| 32               | 0.00063              | 0.00050             |
| 38               | 0.00016              | 0.00022             |
| 48               | 0.000016             | 0.000017            |
| 63               | 0.00000050           | 0.00000027          |
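The comparison in the table can be checked with a few lines. This is a sketch only: the observed rates are copied from the table, the predicted rates are recomputed from the score formula, and the names `RESULTS`, `predicted_error_rate`, and `ratios` are hypothetical.

```python
# (confidence score, observed error rate) pairs copied from the table.
RESULTS = [
    (0, 0.099),
    (28, 0.0011),
    (32, 0.00050),
    (38, 0.00022),
    (48, 0.000017),
    (63, 0.00000027),
]


def predicted_error_rate(score):
    """Error probability implied by a score: 10 ** (-score / 10)."""
    return 10.0 ** (-score / 10.0)


def ratios(rows):
    """Predicted / observed error rate for each confidence score."""
    return {score: predicted_error_rate(score) / observed
            for score, observed in rows}


for score, ratio in ratios(RESULTS).items():
    print(f"score {score:2d}: predicted/observed = {ratio:.2f}")
```

A ratio near 1 means the model's predicted error rate matches what was seen in the test data for that score.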
Contact Us
We are actively seeking clients, commercialization partners,
and collaborators for our work on consensus basecall confidence
scoring.
Please contact atqa@pa.wagner.com for further information.