Sequencing read mapping

Technologie w skali genomowej 2/
Algorytmiczne i statystyczne aspekty
sekwencjonowania DNA
Sequencing read mapping
Norbert Dojer
Instytut Informatyki
Uniwersytet Warszawski
Read mapping data
I
sequencing data
I
reference genome – long sequence over alphabet {A, C , G , T }
Read mapping
Problem
For each read find a corresponding genome fragment.
Kind of approximate string matching problem
I models
I
I
I
Hamming distance
edit distance
alignment score with non-linear gap cost
I
match should be unique (or have assigned quality)
I
fixed length of patterns (25-250bp)
I
huge amount of data
I
repetitions and unknown fagments in a reference text
Technical issues with indexing
ambiguous bases
7% of human genome sequence is unknown
Possible approach
I
skip long ambiguous fragments
I
replace short ambiguous fragments with random nucleotides
Technical issues with indexing
Repetitive elements
Alu repetitive element has ∼ 1 million occurences in human
genome (∼ 10%)
Technical issues with indexing
low complexity regions
number of occurences of 20bp-long substrings in human genome
> 3Mbp (∼ 0.1% of human genome) consists of long sequences of
the form
I
aaaa. . . (tttt. . . )
I
caca. . . (gtgt. . . )
Technical issues with indexing
repetitions
Most popular substrings of length 20 in human genome GRCh37
Substring
TTTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAAAAAAAA
GTGTGTGTGTGTGTGTGTGT
ACACACACACACACACACAC
TGTGTGTGTGTGTGTGTGTG
CACACACACACACACACACA
CTCCCAAAGTGCTGGGATTA
TAATCCCAGCACTTTGGGAG
CCTCCCAAAGTGCTGGGATT
AATCCCAGCACTTTGGGAGG
Possible approach
I
Mask and skip repetitive fragments
Occurrences
451296
447468
246066
243148
241608
238826
170026
169758
166855
166726
Mapping tools
BFAST, Bowtie, BWA, ELAND, Exonerate, GenomeMapper,
GMAP, gnumap, MAQ, MOSAIK, MrFAST, MUMmer, Novocraft,
PASS, RMAP, SeqMap, SHRiMP, Slider, SOAP, SSAHA, SOCS,
SWIFT, SXOligoSearch, Vmatch, Zoom . . .
General schema
1. Build an index of one dataset (genome or reads) allowing
effective substring searching.
2. Process the other dataset against the index to find potential
mappings.
3. Verify potential mappings.
Main differences between mapping tools
Index type
I hash table
I
I
I
q-gram index (all q-mers in a dataset)
q-sample index (selected q-mers in a dataset)
suffix index
Indexed dataset
I
reference genome
I
sequenced reads
Filtering strategy
Mapping with Q-gram index
Mosaik, BFAST, PASS use BLAST-like technique
I
build a q-gram index of a genome
I
find seeds with index
I
extend (sequences of) seeds to full alignment
I
spaced seeds are often used
Performance
− huge memory required
− relatively slow for low error rates
+ easy to handle higher error rates
Mapping with Q-gram index
Q-gram filtration
(Rasmussen et al. ’06)
SHRIMP and RazerS use q-gram filtration with spaced seeds
Q-sample index
. . . is a hash table with positions of all non-overlapping text
substrings of length q.
Partitioning into exact search
Let A = A1 . . . Ak+s and B be two strings such that d (A, B) ≤ k,
where d is Hamming/edit distance. Then at least s substrings
Ai1 . . . Ais appear in B without errors. Moreover:
I
when d is Hamming distance, their positions in B are the same
as in A,
I
when d is edit distance, their relative distances in B cannot
differ from those in A by more than k.
Read mapping with q-sample indexes
Eland, MAQ, SeqMap, RMAP:
I
Each read is split into 4 substrings of length q.
I
For each of 6 pairs of substring positions a separate index is
built.
I
Genome is scanned against each index – hits are potential
mappings with ≤ 2 errors.
I
Hits are verified.
ZOOM uses spaced q-samples.
Performance
+ Read dataset may be split into subsets to fit into available
memory.
− Aligning with gaps decreases efficiency.
Suffix indexes
Suffix tree suffixes = paths from root to leaves
I index size: ≥ 10 · |genome| bytes
I exact mapping time: O(|read | + |occurences|)
Suffix array lexicographic order on suffixes
I index size: ≥ 4 · |genome| bytes
I exact mapping time:
O(|read | · log |genome| + |occurences|)
FM index self-index based on Burrows-Wheeler transform
I index size: < 1 · |genome| bytes (including
sequence!)
I exact mapping time: 2-1000× slower than suffix
arrays
Operations in Ferragina-Manzini index
Find(Q) → R searches for all occurrences of sequence Q and
returns an opaque result R that can be used with
other operations.
FindSuffixes(Q1..m ) → R1..m works just like Find, but returns
results for each suffix of Q so that Ri is the result of
searching for Qi..m .
FindContinue(Q1..m , Rold , f ) → Rnew just like Find searches for all
occurrences of Q1..m , but takes advantage of an
earlier result Rold , assumed to be obtained by
searching for Qf ..m , and returns a new result Rnew .
Count(R) → k returns the number of occurrences k represented by
R.
Locate(R) → l1..k returns locations of occurrences represented by
R.
Extract(b, l ) → S retrieves a subsequence of the reference
sequence T : S = T [b..b + l − 1].
Read mapping with FM-index
Neighborhood generation
All words matching a pattern with 0, 1, 2, . . . errors are generated
and searched in the index.
Backtracking
Partial results of backward searching are recycled in searching words
sharing suffixes.
Unfortunately. . .
. . . the complexity of neighborhood generation is exponential with
respect to the number of allowed errors, even with backtracking.
Bowtie (Langmead et al. ’09)
Seed – high-quality part of the read (default: first 28bp)
Policy
Search for read occurences in the genome with
I
limited number of errors in the seed (default: first 28bp),
I
limited sum of quality values of mismatched positions in the
whole read.
Algorithm
I
Genome index is searched with k-neighborhood of the seed of
a read.
I
Located occurences are extended to whole read mappings and
the quality criterion is checked.
Bowtie – avoiding excessive backtracking
I
k ≤3
I
Double indexing:
FM-index is build for a
genome sequence
(forward index)
and for a reverse
sequence (mirror index).
BWA (Li and Durbin ’09)
Algorithm
FM-index of a genome is searched with k-neighbourhood of a read.
Avoiding excessive backtracking
I
backward searching until d (V , W [i..|W |]) + D(i) > k, where
I
I
I
d (V , W [i..|W |]) is the distance between currently searched
word V and corresponding suffix of a read W ,
D(i) is the lower bound of the number of errors in the genome
occurence of the remaining part of W .
D(i) is calculated with the FM-index of a reverse genome.
Read mapping with FM-index
Performance
+ Mapping is extremely fast and memory efficient for low error
rate.
− Index construction is complex and memory demanding.
Assigning mapping quality
– model from BWA and MAQ (Li et al. ’08)
Assigning mapping quality – model from BWA and MAQ
Assigning mapping quality – model from BWA and MAQ
Assigning mapping quality
Other possible error sources:
I
differences between reference and sequenced genomes
(SNPs, small and large indels)
I
sequence contamination
I
imperfect filtering