Dynamic mappers of NGS reads

Dynamic mappers of
NGS reads
Karel Břinda ( L I G M Un i ve rsité Pa r is- Est )
Valentina Boeva ( I n st i t ut Cu r i e)
Gregory Kucherov ( L I G M U n i ve rsité Pa r is - Est )
Introduction
Read mapping is a bottleneck
in NGS data processing (e.g., for
variant calling)
A lot of effort constantly
invested into the development
of new mappers
None of them supports
dynamic updates of the
reference during the mapping
Idea: update reference during the mapping
Only few papers on this topic exist
◦ J. Pritt. Efficiently Improving the Reference
Genome for DNA Read Alignment.
Seminar work, Harvard University, 2013.
◦ A. Ghanayim and D. Geiger. Iterative
referencing for improving the
interpretation of DNA sequence data.
Technical report, Technion, Israel, 2013.
◦ C. S. Iliopoulos et al. An algorithm for
mapping short reads to a dynamically
changing genomic sequence. Journal of
Discrete Algorithms 10, 2012.
Mapping – from static to dynamic
1. Static mapping
◦ Classical mappers, no updates
2. Iterative referencing
◦ Usage of a standard mappers, mapping is followed by calling variants in many iterations
3. Dynamic mapping
◦ Mapper is dynamically updating its index accordingly to already mapped reads
1) Static mapping (standard mappers)
READS
OUTPUT
MAPPER
Reference (index)
1 iter.
1
2
n
Static mapper
Read mapping
SAM/BAM
file
2) Iterative referencing (Ghanayim&Geiger, 2013)
READS
MAPPER
OUTPUT
Statistics
1 iter.
1
2
n
1 iter.
1
2
n
.
.
.
1 iter.
1
2
Update of the
reference
Pileup,
consensus
Reference (index)
n
Static mapper
Read mapping
SAM/BAM
file
3) Dynamic mapping (no existing mapper until now)
READS
MAPPER
1 iter.
1
Statistics
1 iter.
2
.
.
.
1 iter.
n
OUTPUT
Update of the
reference
Reference (index)
Dynamic mapper
Read mapping
SAM/BAM
file
Estimating the usefulness
Iterative referencing
Dynamic mapping
Static mapping
Memory requirements
Speed
Quality of alignment
+
-+
-+
++
++
+
-
Dynamic mappers
Difficulties – dynamic data structures
Two basic types of mappers:
◦ FM-index based (e.g., BWA-ALN, BWA-SW, BWA-MEM, GEM, etc.)
◦ Hash-table based (e.g., SHRiMP 2, SToRM, etc.)
Data structures must be dynamic
◦ Difficult to make dynamic versions
◦ More memory needed
◦ Worse cache-optimization (=> significant decrease of speed)
Dynamic FM-index – already studied:
◦ M. Salson, T. Lecroq, M. Léonard, and L. Mouchard. A four-stage algorithm for updating a Burrows–
Wheeler transform. Theoretical Computer Science 410(43), 2009.
◦ M. Salson, T. Lecroq, M. Léonard, and L. Mouchard. Dynamic extended suffix arrays. Journal of Discrete
Algorithms 8(2), 2010.
◦ Implementation: http://dfmi.sourceforge.net/
Difficulties – statistics and reference
To make updates, it is necessary to
keep simplified pileups (nucleotide
counts in an alignment column).
It is difficult to deal with insertions.
Example (memory needed for
statistics for a single nucleotide)
‘A’
‘C’
‘G’
‘T’
DEL
counter counter counter counter counter
3 bits
The coordinates of already mapped
reads can change during the mapping.
◦ Possible solution: padded reference, many
initial place holders (‘*’ character), final
small post-processing corrections of the
SAM file.
3 bits
3 bits
3 bits
3 bits
Sum
15 bits
Example (padded reference, an
insertion at pos. 14)
1
3
5
7
9 11 13 15 17 19
C * * A * * G * * C * * G C * C * * A * …
Difficulties – remapping, unmapping
When reference sequence changes too
much, some of the already mapped
reads should be remapped or
unmapped
Possible solution:
◦ Ignore it
◦ Iterate over the set of reads more times
and take only the last reported alignments
for each read
...AAAAATATATATATCGATCTGC...
Reference:
CC
_
Reads:
1:
2:
3:
4:
ATCTATATATCG
CCGATCTGC
CCCGATCTG
ATCCCGATC
Simulating dynamic mapping
Dynamic mapping
READS
MAPPER
1 iter.
1
Statistics
1 iter.
2
.
.
.
1 iter.
n
OUTPUT
Update of the
reference
Reference (index)
Dynamic mapper
Read mapping
SAM/BAM
file
Simulation
(ideal approach)
READS
MAPPER
OUTPUT
Statistics
1
1 iter.
1
1 iter.
1 iter.
1
2
Update of the
reference
Pileup,
consensus
2
.
.
.
Reference (index)
n
Static mapper
Read mapping
SAM/BAM
file
Simulation
(feasible approach:
READS
1
𝑑
iterations)
MAPPER
OUTPUT
Statistics
1 iter.
1 iter.
1 iter.
d reads
Update of the
reference
d reads
d reads
d reads
Pileup,
consensus
d reads
.
.
.
Reference (index)
d reads
Static mapper
Read mapping
SAM/BAM
file
Our pipeline
Goals:
◦ Simulating dynamic mapper using existing static mappers
◦ Estimating usefulness of dynamic mapping
◦ Making general statements about its benefit
Implementation:
◦ Set of several scripts (BASH, Python) and programs (C++)
◦ It uses standard bioinformatics software (SAMtools suit, etc.) and mappers (any mapper can be
incorporated)
◦ Updates are made by own simple variant caller (simulating real capabilities of mapper)
◦ Currently only SNP updates (no indels) and single-end reads supported
Comparing mappers
and alignments
Comparison of mappers
Typical approach:
1.
2.
3.
4.
5.
Taking several mappers as black-boxes.
Simulating reads.
Mapping by the selected mappers.
Applying the same threshold on mapping
qualities for all reads.
Comparing.
…it is not very useful.
Comparison of mappers/alignments
Typical approach:
1.
2.
3.
4.
5.
Taking several mappers as black-boxes.
Simulating reads.
Mapping by the selected mappers.
Applying the same threshold on mapping
qualities for all reads.
Comparing.
…it is not very useful.
Threshold 20
(on mapping qualities)
Source: Heng Li: Aligning sequence reads, clone sequences and
assembly contigs with BWA-MEM, arXiv:1303.3997
Comparison of mappers/alignments
Typical approach:
1.
2.
3.
4.
5.
Taking several mappers as black-boxes.
Simulating reads.
Mapping by the selected mappers.
Applying the same threshold on mapping
qualities for all reads.
Comparing.
…it is not very useful.
It is important to consider
all thresholds on mapping qualities!
Source: Heng Li: Aligning sequence reads, clone sequences and
assembly contigs with BWA-MEM, arXiv:1303.3997
LAVEnder
A new evaluation software for comparing
alignments (C++, Python)
It creates interactive HTML reports for a set of
BAM files
Support of:
◦ DWGsim read simulator (will be extended)
◦ Single-end reads
Availability
◦ Currently a private repository on GitHub
◦ In case of interest, don’t hesitate to contact me
at [email protected]
Fraction of wrongly mapped
reads in mapped reads
Example of a
comparison
•
Human chromosome 21
•
Sequencing error rate: 0.04
•
Mutation rate: 0.10
•
Single-end reads
•
Simulated by DWGsim
•
Aligned by BWA-MEM
Part of all
reads in %
EXPERIMENTS
Setup
Mappers: BWA-ALN, BWA-MEM
Reference genomes: a bacteria (Borrelia crocidurae), human chromosome 21
Mutation rates: 0.01 – 0.05 for BWA-ALN, 0.15 for BWA-MEM
Sequencing error rate: 0.01
Read length: 100
Read simulator: DWGSim
Evaluator: LAVEnder
BWA-ALN
Borrelia crocidurae
Rate of mutations: 0.01, Rate of seq. errors: 0.01, Read length: 100
Average coverage: 10
MAPPING OF ALL READS
WITHOUT ANY UPDATES
Borrelia
BWA-ALN
0.01 mut. rate
BWA-ALN
Borrelia crocidurae
Rate of mutations: 0.01, Rate of seq. errors: 0.01, Read length: 100
Average coverage: 10
DYNAMIC MAPPING
ITERATIVE REFERENCING
BWA-ALN
Human chromosome 21
Rate of mutations: 0.01, Rate of seq. errors: 0.01, Read length: 100
Average coverage: 10
MAPPING OF ALL READS
WITHOUT ANY UPDATES
Human Chr. 21
BWA-ALN
0.01 mut. rate
BWA-ALN
Human chromosome 21
Rate of mutations: 0.01, Rate of seq. errors: 0.01, Read length: 100
Average coverage: 10
DYNAMIC MAPPING
ITERATIVE REFERENCING
BWA-ALN
Borrelia crocidurae
Rate of mutations: 0.03, Rate of seq. errors: 0.01, Read length: 100
Average coverage: 10
MAPPING OF ALL READS
WITHOUT ANY UPDATES
Borrelia
BWA-ALN
0.03 mut. rate
BWA-ALN
Borrelia crocidurae
Rate of mutations: 0.03, Rate of seq. errors: 0.01, Read length: 100
Average coverage: 10
DYNAMIC MAPPING
ITERATIVE REFERENCING
BWA-ALN
Human chromosome 21
Rate of mutations: 0.03, Rate of seq. errors: 0.01, Read length: 100
Average coverage: 10
MAPPING OF ALL READS
WITHOUT ANY UPDATES
Human Chr. 21
BWA-ALN
0.03 mut. rate
BWA-ALN
Human chromosome 21
Rate of mutations: 0.03, Rate of seq. errors: 0.01, Read length: 100
Average coverage: 10
DYNAMIC MAPPING
ITERATIVE REFERENCING
BWA-ALN
Borrelia crocidurae
Rate of mutations: 0.05, Rate of seq. errors: 0.01, Read length: 100
Average coverage: 10
MAPPING OF ALL READS
WITHOUT ANY UPDATES
Borrelia
BWA-ALN
0.05 mut. rate
BWA-ALN
Borrelia crocidurae
Rate of mutations: 0.05, Rate of seq. errors: 0.01, Read length: 100
Average coverage: 10
DYNAMIC MAPPING
ITERATIVE REFERENCING
BWA-ALN
Human chromosome 21
Rate of mutations: 0.05, Rate of seq. errors: 0.01, Read length: 100
Average coverage: 10
MAPPING OF ALL READS
WITHOUT ANY UPDATES
Human Chr. 21
BWA-ALN
0.05 mut. rate
BWA-ALN
Human chromosome 21
Rate of mutations: 0.05, Rate of seq. errors: 0.01, Read length: 100
Average coverage: 10
DYNAMIC MAPPING
ITERATIVE REFERENCING
BWA-MEM
Borrelia crocidurae
Rate of mutations: 0.15, Rate of seq. errors: 0.01, Read length: 100
Average coverage: 10
MAPPING OF ALL READS
WITHOUT ANY UPDATES
Borrelia
BWA-MEM
0.15 mut. rate
BWA-MEM
Borrelia crocidurae
Rate of mutations: 0.15, Rate of seq. errors: 0.01, Read length: 100
Average coverage: 10
DYNAMIC MAPPING
ITERATIVE REFERENCING
BWA-MEM
Human chromosome 21
Rate of mutations: 0.15, Rate of seq. errors: 0.01, Read length: 100
Average coverage: 10
MAPPING OF ALL READS
WITHOUT ANY UPDATES
Human Chr. 21
BWA-MEM
0.15 mut. rate
BWA-MEM
Human chromosome 21
Rate of mutations: 0.15, Rate of seq. errors: 0.01, Read length: 100
Average coverage: 10
DYNAMIC MAPPING
ITERATIVE REFERENCING
Conclusion
We have shown:
 For cases with small number of mutations between genomes, static mapping suffices (e.g., 1%+1%,
BWA-ALN)
 For cases with high amount of mutations, mapping is much improved when dynamic mapping is
employed (e.g., 15%+1%, BWA-MEM)
Real situations: regions with low rates of mutations as well as highly mutated regions (e.g., hot
spot regions)
 If we are interested also in these regions, dynamic mapping/iterative referencing would provide great
improvement (especially for, e.g., variant calling)
Side products of our work:
 LAVEnder – a new evaluator of alignments
Thank you for your attention!
Valentina Boeva
Gregory Kucherov