Dynamic mappers of NGS reads Karel Břinda ( L I G M Un i ve rsité Pa r is- Est ) Valentina Boeva ( I n st i t ut Cu r i e) Gregory Kucherov ( L I G M U n i ve rsité Pa r is - Est ) Introduction Read mapping is a bottleneck in NGS data processing (e.g., for variant calling) A lot of effort constantly invested into the development of new mappers None of them supports dynamic updates of the reference during the mapping Idea: update reference during the mapping Only few papers on this topic exist ◦ J. Pritt. Efficiently Improving the Reference Genome for DNA Read Alignment. Seminar work, Harvard University, 2013. ◦ A. Ghanayim and D. Geiger. Iterative referencing for improving the interpretation of DNA sequence data. Technical report, Technion, Israel, 2013. ◦ C. S. Iliopoulos et al. An algorithm for mapping short reads to a dynamically changing genomic sequence. Journal of Discrete Algorithms 10, 2012. Mapping – from static to dynamic 1. Static mapping ◦ Classical mappers, no updates 2. Iterative referencing ◦ Usage of a standard mappers, mapping is followed by calling variants in many iterations 3. Dynamic mapping ◦ Mapper is dynamically updating its index accordingly to already mapped reads 1) Static mapping (standard mappers) READS OUTPUT MAPPER Reference (index) 1 iter. 1 2 n Static mapper Read mapping SAM/BAM file 2) Iterative referencing (Ghanayim&Geiger, 2013) READS MAPPER OUTPUT Statistics 1 iter. 1 2 n 1 iter. 1 2 n . . . 1 iter. 1 2 Update of the reference Pileup, consensus Reference (index) n Static mapper Read mapping SAM/BAM file 3) Dynamic mapping (no existing mapper until now) READS MAPPER 1 iter. 1 Statistics 1 iter. 2 . . . 1 iter. n OUTPUT Update of the reference Reference (index) Dynamic mapper Read mapping SAM/BAM file Estimating the usefulness Iterative referencing Dynamic mapping Static mapping Memory requirements Speed Quality of alignment + -+ -+ ++ ++ + - Dynamic mappers Difficulties – dynamic data structures Two basic types of mappers: ◦ FM-index based (e.g., BWA-ALN, BWA-SW, BWA-MEM, GEM, etc.) ◦ Hash-table based (e.g., SHRiMP 2, SToRM, etc.) Data structures must be dynamic ◦ Difficult to make dynamic versions ◦ More memory needed ◦ Worse cache-optimization (=> significant decrease of speed) Dynamic FM-index – already studied: ◦ M. Salson, T. Lecroq, M. Léonard, and L. Mouchard. A four-stage algorithm for updating a Burrows– Wheeler transform. Theoretical Computer Science 410(43), 2009. ◦ M. Salson, T. Lecroq, M. Léonard, and L. Mouchard. Dynamic extended suffix arrays. Journal of Discrete Algorithms 8(2), 2010. ◦ Implementation: http://dfmi.sourceforge.net/ Difficulties – statistics and reference To make updates, it is necessary to keep simplified pileups (nucleotide counts in an alignment column). It is difficult to deal with insertions. Example (memory needed for statistics for a single nucleotide) ‘A’ ‘C’ ‘G’ ‘T’ DEL counter counter counter counter counter 3 bits The coordinates of already mapped reads can change during the mapping. ◦ Possible solution: padded reference, many initial place holders (‘*’ character), final small post-processing corrections of the SAM file. 3 bits 3 bits 3 bits 3 bits Sum 15 bits Example (padded reference, an insertion at pos. 14) 1 3 5 7 9 11 13 15 17 19 C * * A * * G * * C * * G C * C * * A * … Difficulties – remapping, unmapping When reference sequence changes too much, some of the already mapped reads should be remapped or unmapped Possible solution: ◦ Ignore it ◦ Iterate over the set of reads more times and take only the last reported alignments for each read ...AAAAATATATATATCGATCTGC... Reference: CC _ Reads: 1: 2: 3: 4: ATCTATATATCG CCGATCTGC CCCGATCTG ATCCCGATC Simulating dynamic mapping Dynamic mapping READS MAPPER 1 iter. 1 Statistics 1 iter. 2 . . . 1 iter. n OUTPUT Update of the reference Reference (index) Dynamic mapper Read mapping SAM/BAM file Simulation (ideal approach) READS MAPPER OUTPUT Statistics 1 1 iter. 1 1 iter. 1 iter. 1 2 Update of the reference Pileup, consensus 2 . . . Reference (index) n Static mapper Read mapping SAM/BAM file Simulation (feasible approach: READS 1 𝑑 iterations) MAPPER OUTPUT Statistics 1 iter. 1 iter. 1 iter. d reads Update of the reference d reads d reads d reads Pileup, consensus d reads . . . Reference (index) d reads Static mapper Read mapping SAM/BAM file Our pipeline Goals: ◦ Simulating dynamic mapper using existing static mappers ◦ Estimating usefulness of dynamic mapping ◦ Making general statements about its benefit Implementation: ◦ Set of several scripts (BASH, Python) and programs (C++) ◦ It uses standard bioinformatics software (SAMtools suit, etc.) and mappers (any mapper can be incorporated) ◦ Updates are made by own simple variant caller (simulating real capabilities of mapper) ◦ Currently only SNP updates (no indels) and single-end reads supported Comparing mappers and alignments Comparison of mappers Typical approach: 1. 2. 3. 4. 5. Taking several mappers as black-boxes. Simulating reads. Mapping by the selected mappers. Applying the same threshold on mapping qualities for all reads. Comparing. …it is not very useful. Comparison of mappers/alignments Typical approach: 1. 2. 3. 4. 5. Taking several mappers as black-boxes. Simulating reads. Mapping by the selected mappers. Applying the same threshold on mapping qualities for all reads. Comparing. …it is not very useful. Threshold 20 (on mapping qualities) Source: Heng Li: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, arXiv:1303.3997 Comparison of mappers/alignments Typical approach: 1. 2. 3. 4. 5. Taking several mappers as black-boxes. Simulating reads. Mapping by the selected mappers. Applying the same threshold on mapping qualities for all reads. Comparing. …it is not very useful. It is important to consider all thresholds on mapping qualities! Source: Heng Li: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, arXiv:1303.3997 LAVEnder A new evaluation software for comparing alignments (C++, Python) It creates interactive HTML reports for a set of BAM files Support of: ◦ DWGsim read simulator (will be extended) ◦ Single-end reads Availability ◦ Currently a private repository on GitHub ◦ In case of interest, don’t hesitate to contact me at [email protected] Fraction of wrongly mapped reads in mapped reads Example of a comparison • Human chromosome 21 • Sequencing error rate: 0.04 • Mutation rate: 0.10 • Single-end reads • Simulated by DWGsim • Aligned by BWA-MEM Part of all reads in % EXPERIMENTS Setup Mappers: BWA-ALN, BWA-MEM Reference genomes: a bacteria (Borrelia crocidurae), human chromosome 21 Mutation rates: 0.01 – 0.05 for BWA-ALN, 0.15 for BWA-MEM Sequencing error rate: 0.01 Read length: 100 Read simulator: DWGSim Evaluator: LAVEnder BWA-ALN Borrelia crocidurae Rate of mutations: 0.01, Rate of seq. errors: 0.01, Read length: 100 Average coverage: 10 MAPPING OF ALL READS WITHOUT ANY UPDATES Borrelia BWA-ALN 0.01 mut. rate BWA-ALN Borrelia crocidurae Rate of mutations: 0.01, Rate of seq. errors: 0.01, Read length: 100 Average coverage: 10 DYNAMIC MAPPING ITERATIVE REFERENCING BWA-ALN Human chromosome 21 Rate of mutations: 0.01, Rate of seq. errors: 0.01, Read length: 100 Average coverage: 10 MAPPING OF ALL READS WITHOUT ANY UPDATES Human Chr. 21 BWA-ALN 0.01 mut. rate BWA-ALN Human chromosome 21 Rate of mutations: 0.01, Rate of seq. errors: 0.01, Read length: 100 Average coverage: 10 DYNAMIC MAPPING ITERATIVE REFERENCING BWA-ALN Borrelia crocidurae Rate of mutations: 0.03, Rate of seq. errors: 0.01, Read length: 100 Average coverage: 10 MAPPING OF ALL READS WITHOUT ANY UPDATES Borrelia BWA-ALN 0.03 mut. rate BWA-ALN Borrelia crocidurae Rate of mutations: 0.03, Rate of seq. errors: 0.01, Read length: 100 Average coverage: 10 DYNAMIC MAPPING ITERATIVE REFERENCING BWA-ALN Human chromosome 21 Rate of mutations: 0.03, Rate of seq. errors: 0.01, Read length: 100 Average coverage: 10 MAPPING OF ALL READS WITHOUT ANY UPDATES Human Chr. 21 BWA-ALN 0.03 mut. rate BWA-ALN Human chromosome 21 Rate of mutations: 0.03, Rate of seq. errors: 0.01, Read length: 100 Average coverage: 10 DYNAMIC MAPPING ITERATIVE REFERENCING BWA-ALN Borrelia crocidurae Rate of mutations: 0.05, Rate of seq. errors: 0.01, Read length: 100 Average coverage: 10 MAPPING OF ALL READS WITHOUT ANY UPDATES Borrelia BWA-ALN 0.05 mut. rate BWA-ALN Borrelia crocidurae Rate of mutations: 0.05, Rate of seq. errors: 0.01, Read length: 100 Average coverage: 10 DYNAMIC MAPPING ITERATIVE REFERENCING BWA-ALN Human chromosome 21 Rate of mutations: 0.05, Rate of seq. errors: 0.01, Read length: 100 Average coverage: 10 MAPPING OF ALL READS WITHOUT ANY UPDATES Human Chr. 21 BWA-ALN 0.05 mut. rate BWA-ALN Human chromosome 21 Rate of mutations: 0.05, Rate of seq. errors: 0.01, Read length: 100 Average coverage: 10 DYNAMIC MAPPING ITERATIVE REFERENCING BWA-MEM Borrelia crocidurae Rate of mutations: 0.15, Rate of seq. errors: 0.01, Read length: 100 Average coverage: 10 MAPPING OF ALL READS WITHOUT ANY UPDATES Borrelia BWA-MEM 0.15 mut. rate BWA-MEM Borrelia crocidurae Rate of mutations: 0.15, Rate of seq. errors: 0.01, Read length: 100 Average coverage: 10 DYNAMIC MAPPING ITERATIVE REFERENCING BWA-MEM Human chromosome 21 Rate of mutations: 0.15, Rate of seq. errors: 0.01, Read length: 100 Average coverage: 10 MAPPING OF ALL READS WITHOUT ANY UPDATES Human Chr. 21 BWA-MEM 0.15 mut. rate BWA-MEM Human chromosome 21 Rate of mutations: 0.15, Rate of seq. errors: 0.01, Read length: 100 Average coverage: 10 DYNAMIC MAPPING ITERATIVE REFERENCING Conclusion We have shown: For cases with small number of mutations between genomes, static mapping suffices (e.g., 1%+1%, BWA-ALN) For cases with high amount of mutations, mapping is much improved when dynamic mapping is employed (e.g., 15%+1%, BWA-MEM) Real situations: regions with low rates of mutations as well as highly mutated regions (e.g., hot spot regions) If we are interested also in these regions, dynamic mapping/iterative referencing would provide great improvement (especially for, e.g., variant calling) Side products of our work: LAVEnder – a new evaluator of alignments Thank you for your attention! Valentina Boeva Gregory Kucherov
© Copyright 2024 ExpyDoc