DANIEL H. WAGNER ASSOCIATES, INC

A Leader in Applying Mathematics and Computer Science to Industry

 

Operations Research - Mathematics - Software Development

Confidence Scoring of Consensus DNA Base Calls

Description

Daniel H. Wagner Associates has developed algorithms to judge the accuracy of DNA consensus base calls. The algorithms assign a confidence score to each consensus call, and each score corresponds to the probability that the call is incorrect. More precisely, the algorithms produce three scores per call, one each for the substitution, insertion, and deletion error probabilities.

The consensus confidence scores are based on the set of primary sequences that are assembled to create the consensus call. The algorithms consider the number, consistency, and directions of the primary sequences, as well as the quality of the trace data associated with each, as estimated by the ATQA system. The consensus confidence scoring algorithms employ sophisticated pattern recognition methods like those used in ATQA and in our work on mixture detection.

Training Data

To develop a consensus confidence scoring algorithm, we require a large training data set of consensus calls that are accurately identified as correct calls or errors. We generate such training data by subsampling from large sequence data sets.

We begin with a set of primary sequences and the consensus sequence that is derived from them. At each consensus call, we select random subsets of the assembled primary basecalls. For each such random subset, we form a subsampled consensus call as the majority call in the subset. In the (common) case that the subsampled consensus call agrees with the original consensus call, we label the subsampled call as correct. In the (rare) case that the subsampled consensus call disagrees with the original consensus call, we label the subsampled call as incorrect.

The following table illustrates the subsampling process. The first block shows a fragment of the original sequence data with 8-fold coverage. The next two blocks show subsamples of depth 3. The first subsample yields consensus calls that all agree with the original, so these subsampled consensus calls would be labelled as correct. The second subsample yields two incorrect consensus calls: a deletion error at the 2nd position and a substitution error at the 5th position.

Original Sequence Data

1     A C C T G A C T
2     A - C T T A C T
3     A C C T G A C T
4     A - C T G A C T
5     A C C C T A C T
6     A C C N G A C T
7     A C C T G A C C
8     A C C C G A C N

Con   A C C T G A C T

First Subsample

1     A C C T G A C T
2     A - C T T A C T
3     A C C T G A C T

Con   A C C T G A C T

Here, the subsampled consensus calls agree with the original consensus.

Second Subsample

2     A - C T T A C T
4     A - C T G A C T
5     A C C C T A C T

Con   A - C T T A C T

Here, the 2nd and 5th subsampled consensus calls disagree with the original consensus.
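
The subsampling and labelling steps illustrated in the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the production procedure: the function names are hypothetical, ties in the majority vote go to the symbol seen first, gaps and N are treated as ordinary symbols, and one subset is applied to the whole fragment (as in the table) rather than being redrawn at each consensus call.

```python
import random
from collections import Counter

# The eight primary sequences from the table, keyed by row number.
READS = {
    1: "ACCTGACT",
    2: "A-CTTACT",
    3: "ACCTGACT",
    4: "A-CTGACT",
    5: "ACCCTACT",
    6: "ACCNGACT",
    7: "ACCTGACC",
    8: "ACCCGACN",
}

def majority_consensus(reads):
    """Column-wise majority call. Assumption: ties go to the symbol seen first."""
    reads = list(reads)
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

# Consensus of the full 8-fold coverage, used as the reference labelling.
ORIGINAL = majority_consensus(READS.values())

def label_subsample(read_ids, reference=ORIGINAL):
    """Return the subsampled consensus and a per-position correct/incorrect label."""
    sub = majority_consensus(READS[i] for i in read_ids)
    return sub, [s == r for s, r in zip(sub, reference)]

def random_subsample(depth, rng=random):
    """Pick `depth` distinct row numbers at random (hypothetical helper)."""
    return sorted(rng.sample(sorted(READS), depth))

# Reproduce the second subsample from the table (rows 2, 4, and 5).
sub, labels = label_subsample([2, 4, 5])
wrong_positions = [i + 1 for i, ok in enumerate(labels) if not ok]
```

With rows 2, 4, and 5 the sketch returns the consensus A-CTTACT and flags the 2nd and 5th positions as incorrect, matching the second subsample in the table.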

The results described below are based on a training data set of ~11 million subsampled consensus calls of which 4400 were incorrect. This training data was generated from 1100 primary sequences.

Methodology

We developed a set of numerical features that were designed to distinguish correct consensus calls from incorrect consensus calls. Most of these features were based on our ATQA primary basecall confidence scores. For example, one feature was the maximum ATQA score of the primary basecalls that agreed with the consensus call.
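
As a rough illustration of what such a feature might look like, the sketch below computes the example feature named above (the maximum score among primary basecalls that agree with the consensus call) for a single consensus position. The function name, the extra count and disagreement features, and the score values in the usage line are hypothetical and are not taken from ATQA.

```python
def consensus_features(calls, scores, consensus):
    """Features for one consensus position.

    calls:     primary basecalls covering this position, e.g. ["G", "T", "G"]
    scores:    one primary confidence score per call (illustrative values)
    consensus: the consensus call at this position
    """
    agree = [s for c, s in zip(calls, scores) if c == consensus]
    disagree = [s for c, s in zip(calls, scores) if c != consensus]
    return {
        # The feature mentioned in the text.
        "max_agree_score": max(agree, default=0),
        # Hypothetical companions, included only to show the pattern.
        "n_agree": len(agree),
        "n_disagree": len(disagree),
        "max_disagree_score": max(disagree, default=0),
    }

# Three primary calls, two of which agree with the consensus call "G".
features = consensus_features(["G", "T", "G"], [30, 12, 45], "G")
```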

We used statistical classification algorithms to partition the feature space into subsets in which the fraction of incorrect consensus calls (from the training data) was relatively constant. We then assigned each subset a confidence score derived from its observed error rate:

Score = -10 * log10(Prob. of error)
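
The scoring formula and its inverse can be sketched as follows. The per-subset helper is a simplified stand-in for the partitioning step: it assumes every call has already been assigned to a subset, and it skips subsets with no observed errors, which is a choice made only for this sketch. All function names are hypothetical.

```python
import math
from collections import defaultdict

def confidence_score(p_error):
    """Score = -10 * log10(probability of error)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)

def error_probability(score):
    """Inverse of confidence_score."""
    return 10.0 ** (-score / 10.0)

def score_subsets(subset_ids, is_error):
    """Map each subset id to the score of its observed error rate.

    Assumption: subsets with zero observed errors are skipped, because
    their error-rate estimate of 0 has no finite score.
    """
    total = defaultdict(int)
    wrong = defaultdict(int)
    for subset, err in zip(subset_ids, is_error):
        total[subset] += 1
        wrong[subset] += bool(err)
    return {s: confidence_score(wrong[s] / total[s]) for s in total if wrong[s]}

# Overall error rate of the training set: 4400 errors in about 11 million calls.
base_rate_score = confidence_score(4400 / 11_000_000)
```

Under this formula an error probability of 0.0016 corresponds to a score of about 28, and the overall training-set error rate of roughly 0.0004 corresponds to a score of about 34.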

Results

Of the roughly 11 million consensus calls in our training data set, about two-thirds were used to build the scoring model. The remaining one-third was used to test model performance. The following table lists each consensus confidence score predicted by the model, the predicted consensus error rate corresponding to that score, and the error rate observed in the test data among calls with that score. For scores of 28 and above, the observed error rates are within a factor of two of the predictions. The exception is the score of 0, where the observed error rate is about one-tenth of the predicted rate.

Confidence Score   Predicted Error Rate   Observed Error Rate
0                  1.0                    0.099
28                 0.0016                 0.0011
32                 0.00063                0.00050
38                 0.00016                0.00022
48                 0.000016               0.000017
63                 0.00000050             0.00000027
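
One simple way to read the table is to compare each observed error rate with the rate implied by the score itself. The sketch below does this for the rows above; the variable names and the choice of the observed-to-predicted ratio as the summary statistic are assumptions made for illustration.

```python
# (confidence score, observed error rate) pairs copied from the table.
ROWS = [
    (0, 0.099),
    (28, 0.0011),
    (32, 0.00050),
    (38, 0.00022),
    (48, 0.000017),
    (63, 0.00000027),
]

def predicted_rate(score):
    """Error rate implied by a score, from Score = -10 * log10(Prob. of error)."""
    return 10.0 ** (-score / 10.0)

# Ratio of observed to predicted error rate for each score.
# A ratio of 1.0 means the score is perfectly calibrated on the test data.
ratios = {score: observed / predicted_rate(score) for score, observed in ROWS}

# Flag the scores whose observed rate is within a factor of two of the prediction.
well_calibrated = [score for score, r in ratios.items() if 0.5 <= r <= 2.0]
```

A ratio below 1 means the model was conservative for that score (fewer errors were observed than predicted), and a ratio above 1 means it was optimistic.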

Contact Us

We are actively seeking clients, commercialization partners, and collaborators for our work on consensus basecall confidence scoring.

Please contact atqa@pa.wagner.com for further information.


 


© 2005 Daniel H. Wagner Associates, Inc.  - All rights reserved.