Anchored Assembly: Results from Validation Testing

Validation of Results Obtained with the Anchored Assembly Method
July 2014
© Spiral Genetics, Inc. 2014
Table of Contents

Introduction
The Anchored Assembly Method
    Step 1: A* (A-Star) Error Correction
    Step 2: Read Overlap Assembly
    Step 3: Anchoring
    Step 4: Variant Validation
Data Sets Used to Generate Results
Results
    Variants Simulated on Human Chromosome 22
    Sensitivity Benchmark: Pindel vs. Anchored Assembly
    False Positive Benchmark: Pindel vs. Anchored Assembly
    Concordance with Genome in a Bottle
        SNP Concordance with GIAB
        Indel Concordance with GIAB
    Concordance with the Gold Standard SV set from Mills et al. 2011 (Suppl. Mat)
    Concordance with Deletions in 1000 Genomes NA12878
Conclusion
Further Information

Introduction

With short read sequencing data, SNPs are relatively easy to detect using alignment if variation at a unique location is sufficiently low. However, in regions with numerous SNPs and indels, or when detecting structural variants, alignment-based detection methods have been susceptible to false positives and false negatives. The Anchored Assembly method, which uses whole read de novo assembly, can accurately detect variants across a range of sizes.

Here, we validate against NA12878, a well-characterized human genome. This genome has a range of truth data sets based on Sanger, PCR, and ArrayCGH validated structural variants. In most cases, we compare the results of Anchored Assembly to those of a popular variant caller, Pindel (with BWA). For SNPs, we compare against the aligner BWA followed by the variant caller GATK. Broadly, BWA and GATK are expected to have a detection profile similar to that of other SNP callers such as samtools.

The Anchored Assembly Method

Spiral's Anchored Assembly method uses direct de novo read overlap assembly to more accurately detect and characterize SNPs, indels, and structural variants. By using unmapped reads for de novo assembly, we accurately detect variants across a whole human genome with a low false positive rate. The tool is suitable for use with Illumina X, HiSeq, or MiSeq reads with at least 20x coverage per chromosome set (i.e., 40x coverage for a diploid genome).

Step 1: A* (A-Star) Error Correction

We k-merize the reads, then compute a score for each k-mer based on frequency and read quality. Low-count k-mers are discarded as errors. We construct a de Bruijn graph from the remaining k-mers and correct read errors using an A* search algorithm.

Step 2: Read Overlap Assembly

Corrected reads are assembled into a read overlap graph to determine their relative layout.

Step 3: Anchoring

Reads are aligned to the reference. Reads that match the reference exactly are discarded. Reads with near-perfect overlap are labeled as "anchors" denoting the beginning and end of each variation. Starting from each read, variations are discovered by walking the read overlap graph in both directions until an anchor is found on each side. Anchored Assembly identifies SNPs, indels, and structural variations against a reference sequence spanning the entire region between two anchors.

Step 4: Variant Validation

The coverage depth is computed for each variation.
Variations that do not meet a minimum depth are discarded. Where one side of a variant can be anchored to multiple locations in the genome, the number of alternative locations is reported.

Data Sets Used to Generate Results

We ran Anchored Assembly on the whole human genome data set for NA12878 (ERP001229), a 50x coverage, 100 bp read data set generated on an Illumina HiSeq. We also ran Anchored Assembly on the PCR-free 200x NA12878 data set that is part of the Illumina Platinum Genomes Project (ERP001775). For comparison, we ran BWA 0.7.5 and the GATK 2.7.3 Unified Genotyper on the same ERP001229 data set. For Pindel, we used the results listed on the 1000 Genomes Project FTP site [1], which were also generated from a 50x data set.

Indels are defined as insertions and deletions up to and including 50 base pairs. Structural variants are all insertions, deletions, tandem repeats, and inversions over 50 base pairs.

Results

Variants Simulated on Human Chromosome 22

Simulated variants are rarely as complex as real variants. However, they provide a baseline for sensitivity and false discovery rates across different types and sizes of variants.

To test our performance on simulated human read data, we populated the human chromosome 22 sequence (GRCh37) with sets of randomly placed artificial variants. These were sets of insertions, deletions, inversions, and tandem duplications from 10 bp to 100 kbp in size, and sets of homozygous SNPs. Locations were randomly chosen with a minimum of 400 base pairs between variations. Counts and sizes were chosen to keep the overall variation rate to ~2%. We simulated 50x coverage paired-end read data from the modified chromosome sequence using ART Illumina.

Broadly, Anchored Assembly had greater sensitivity to longer insertions. Although Pindel was able to detect insertions up to 50 bp, Anchored Assembly was able to detect insertions up to 10,000 bp.
Additionally, the false discovery rate (FDR) for Anchored Assembly was dramatically lower. Detailed comparison charts are provided in the two sections that follow.

[1] CEU.wgs.pindel.20130710.indels_sv.high_coverage.genotypes.vcf.gz, found on the 1000 Genomes FTP site: ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/working/20130610_ceu_hc_trio/Pindel/
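The random placement described for the chromosome 22 simulation can be illustrated with a short sketch. This is not the procedure used to build the data set; it is a minimal rejection-sampling example that assumes each variant can be treated as an interval on the reference and that the 400 bp minimum applies to the gap between neighboring intervals. The function name `place_variants`, its parameters, and the example lengths are hypothetical.

```python
import random


def place_variants(chrom_length, variant_lengths, min_gap=400, seed=0, max_tries=10000):
    """Pick random, non-overlapping intervals that stay at least min_gap bp apart.

    Hypothetical helper: each variant is modeled as a half-open reference
    interval [start, end), which is a simplification for insertions.
    """
    rng = random.Random(seed)
    placed = []
    for length in variant_lengths:
        for _ in range(max_tries):
            start = rng.randrange(0, chrom_length - length + 1)
            end = start + length
            # Accept only if the candidate is min_gap away from every placed interval.
            if all(start >= e + min_gap or end + min_gap <= s for s, e in placed):
                placed.append((start, end))
                break
        else:
            raise RuntimeError("could not place a variant of length %d" % length)
    return sorted(placed)


# Illustrative lengths only; the real simulation used sizes from 10 bp to 100 kbp.
spans = place_variants(1_000_000, [10, 100, 1000, 50], seed=1)
for start, end in spans:
    print(start, end)
```

Rejection sampling is the simplest way to honor a spacing constraint at a low variant density such as the ~2% rate quoted above; at higher densities a different placement strategy would be needed.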
Sensitivity Benchmark: Pindel vs. Anchored Assembly

[Figure: two panels, "BWA+Pindel Sensitivity" and "Anchored Assembly Sensitivity". Each plots % detected (0 to 100) against variant size in base pairs for insertions, deletions, inversions, and tandem repeats. The size axis runs from 10 to 50,000 bp for BWA+Pindel and from 10 to 100,000 bp for Anchored Assembly.]
False Positive Benchmark: Pindel vs. Anchored Assembly

[Figure: two panels, "BWA+Pindel False Discovery Rate" and "Anchored Assembly False Discovery Rate". Each plots false discovery rate (%) (0 to 100) against variant size in base pairs for insertions, deletions, inversions, and tandem repeats. The size axis runs from 10 to 50,000 bp for BWA+Pindel and from 10 to 100,000 bp for Anchored Assembly.]
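For readers who want to reproduce the two metrics charted in the benchmark sections, the sketch below uses the standard definitions: sensitivity is the share of simulated variants that were called, and the false discovery rate is the share of calls that match no simulated variant. The rule used to decide whether a call matches a simulated variant is not specified here, so the counts are taken as given. The class and function names are hypothetical, and the example counts are invented for illustration; they are not results from this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkCounts:
    true_positives: int   # simulated variants that were called
    false_negatives: int  # simulated variants that were missed
    false_positives: int  # calls that match no simulated variant


def sensitivity_pct(counts):
    """Percentage of simulated variants detected: TP / (TP + FN)."""
    truth = counts.true_positives + counts.false_negatives
    return 100.0 * counts.true_positives / truth if truth else 0.0


def fdr_pct(counts):
    """Percentage of calls that are false: FP / (TP + FP)."""
    calls = counts.true_positives + counts.false_positives
    return 100.0 * counts.false_positives / calls if calls else 0.0


# Invented counts for one variant type and size bin.
example = BenchmarkCounts(true_positives=90, false_negatives=10, false_positives=10)
print(sensitivity_pct(example))  # 90.0
print(fdr_pct(example))          # 10.0
```

Computing both values per variant type and per size bin gives one point on each curve in the charts.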
Concordance with Genome in a Bottle

The Genome in a Bottle (GIAB) Consortium, hosted by the National Institute of Standards and Technology (NIST), has compiled a set of SNP and indel variants that can be used for validation. To reduce spurious false positives and false negatives, the Consortium recommends that SNP and indel calls be validated within a specific region where there is high confidence in the calls.

SNP Concordance with GIAB

Broadly, BWA+GATK and Anchored Assembly are comparable at detecting SNPs. While BWA+GATK has a higher sensitivity, Anchored Assembly has a lower false positive rate, as shown in the following diagrams.

Indel Concordance with GIAB

For the indels in these high confidence regions, GATK 2.7.3 has low sensitivity. BWA+Pindel has a higher sensitivity, but almost a third of all its calls are false positives. Anchored Assembly has a comparable sensitivity with a much lower false positive rate.

Concordance with the Gold Standard SV set from Mills et al. 2011 (Suppl. Mat)

Mills et al. [2] catalogued a set of structural variants in the NA12878 sample. These were all validated with a range of methods including ArrayCGH and Sanger sequencing. In our concordance analysis, we count a match if the start position of the structural variant is within 100 bp of the position indicated in Mills et al.

Of the 30 confirmed insertions (Table 1 below), Anchored Assembly with 200x coverage data detected the most variants (83%), followed by Anchored Assembly on 50x coverage data (66%), and finally Pindel (50%). Broadly, for each of these three, half of the variants were detected within 10 base pairs of the location indicated in Mills et al.

Table 1. Concordance with the Gold Standard SV Set from Mills

Table columns: Chr., Mills POS, Pindel at 50x, Anchored at 50x, Anchored at 200x.
[The per-caller detection marks are not recoverable in this copy; only the variant positions are listed.]

Chr.  Mills POS
1     247579917
2     2576951
2     78558069
2     187143096
2     191002548
3     43972635
3     100737223
3     100868475
3     195823764
5     78035993
7     1528948
7     2089876
8     22717662
9     97387403
9     137361862
12    103954170
13    76345722
13    113760939
13    114103496
15    26060663
15    92686723
17    39240782
17    77134774
18    74794821
18    76182038
19    1278240
19    2247173
20    55992535
21    39080014
X     94894756

Legend: dark mark = detected within 10 bp; light mark = detected between 10 bp and 100 bp.

[2] Mills et al. Mapping copy number variation by population-scale genome sequencing. Nature. 2011 Feb 3;470(7332):59-65. doi: 10.1038/nature09708. PubMed PMID: 21293372; PubMed Central PMCID: PMC3077050.

Concordance with Deletions in 1000 Genomes NA12878

The 1000 Genomes Project includes a set of validated structural variants (Mills et al. [2]). Here we compare the detection rates of BWA+Pindel and Anchored Assembly on two subsets.

The first subset consists of deletions validated by local assembly with the targeted iterative graph routing assembler (TIGRA). We considered a variant detected if there was at least a 90% reciprocal overlap between the deletion in the 1000 Genomes set and the called variant. As with insertions, Anchored Assembly detected a greater number of deletions than BWA+Pindel.

The second subset consists of deletions validated using array coverage (Mills et al. [2]). In the original analysis, a deletion was considered valid if the deletion and the largest gap in the array coverage had a 50% reciprocal overlap and the sum of the discrepancies was less than 5 kb. Of these array-validated SVs, 51 were also PCR validated. Similarly, we considered an SV detected if there was at least a 50% reciprocal overlap with the called deletion and the sum of the discrepancies was less than 5 kb. Overall, the detection rate is low, likely due to the type and size of the variants in this set. Nevertheless, Anchored Assembly was able to detect almost three times as many variants as BWA+Pindel.
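The two deletion-matching rules above can be written down compactly. The sketch below is an illustration and not the comparison code used for this report. It assumes half-open intervals `[start, end)` on the same chromosome, and it assumes that "the sum of the discrepancies" means the start-coordinate difference plus the end-coordinate difference; that reading is an interpretation. All function names are hypothetical.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Smaller of the two overlap fractions between intervals A and B."""
    shared = min(a_end, b_end) - max(a_start, b_start)
    if shared <= 0:
        return 0.0
    return min(shared / (a_end - a_start), shared / (b_end - b_start))


def boundary_discrepancy(a_start, a_end, b_start, b_end):
    """Assumed meaning of 'sum of the discrepancies': start offset plus end offset."""
    return abs(a_start - b_start) + abs(a_end - b_end)


def matches_tigra_rule(truth, call, min_overlap=0.90):
    """Rule for the TIGRA-validated subset: at least 90% reciprocal overlap."""
    return reciprocal_overlap(*truth, *call) >= min_overlap


def matches_array_rule(truth, call, min_overlap=0.50, max_discrepancy=5000):
    """Rule for the array-validated subset: 50% overlap and discrepancy under 5 kb."""
    return (reciprocal_overlap(*truth, *call) >= min_overlap
            and boundary_discrepancy(*truth, *call) < max_discrepancy)


# Invented coordinates for illustration.
truth = (1000, 2000)
print(matches_tigra_rule(truth, (1050, 2000)))  # True: overlap fractions are 0.95 and 1.0
print(matches_tigra_rule(truth, (1000, 1600)))  # False: the truth interval is only 60% covered
print(matches_array_rule(truth, (1000, 1600)))  # True: 60% overlap, 400 bp discrepancy
```

Taking the minimum of the two fractions is what makes the overlap reciprocal: a small call inside a large deletion, or a large call around a small deletion, both score low.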
Conclusion

Here, we present the results from comparing Anchored Assembly against two variant callers, GATK and Pindel. These two were selected because they are commonly used and often form the basis of other, more complex pipelines. The results suggest that whole read de novo assembly can improve sensitivity and lower the false positive rate. Consequently, Anchored Assembly is an alternative to traditional pipelines that captures greater variation across type and size using whole genome Illumina short reads. In principle, this method allows the greatest number of variants to be detected accurately. Further improvements should lead us closer to the upper bound of what can be detected using short read technology.

Further Information

Upon request, we can supply the VCF files generated by Anchored Assembly that were used in these comparisons. Please contact us if you would like to trial Anchored Assembly. If there are further comparisons and analyses you would like to see, please let us know.