Technologie w skali genomowej 2/ Algorytmiczne i statystyczne aspekty sekwencjonowania DNA Sequencing read mapping Norbert Dojer Instytut Informatyki Uniwersytet Warszawski Read mapping data I sequencing data I reference genome – long sequence over alphabet {A, C , G , T } Read mapping Problem For each read find a corresponding genome fragment. Kind of approximate string matching problem I models I I I Hamming distance edit distance alignment score with non-linear gap cost I match should be unique (or have assigned quality) I fixed length of patterns (25-250bp) I huge amount of data I repetitions and unknown fagments in a reference text Technical issues with indexing ambiguous bases 7% of human genome sequence is unknown Possible approach I skip long ambiguous fragments I replace short ambiguous fragments with random nucleotides Technical issues with indexing Repetitive elements Alu repetitive element has ∼ 1 million occurences in human genome (∼ 10%) Technical issues with indexing low complexity regions number of occurences of 20bp-long substrings in human genome > 3Mbp (∼ 0.1% of human genome) consists of long sequences of the form I aaaa. . . (tttt. . . ) I caca. . . (gtgt. . . ) Technical issues with indexing repetitions Most popular substrings of length 20 in human genome GRCh37 Substring TTTTTTTTTTTTTTTTTTTT AAAAAAAAAAAAAAAAAAAA GTGTGTGTGTGTGTGTGTGT ACACACACACACACACACAC TGTGTGTGTGTGTGTGTGTG CACACACACACACACACACA CTCCCAAAGTGCTGGGATTA TAATCCCAGCACTTTGGGAG CCTCCCAAAGTGCTGGGATT AATCCCAGCACTTTGGGAGG Possible approach I Mask and skip repetitive fragments Occurrences 451296 447468 246066 243148 241608 238826 170026 169758 166855 166726 Mapping tools BFAST, Bowtie, BWA, ELAND, Exonerate, GenomeMapper, GMAP, gnumap, MAQ, MOSAIK, MrFAST, MUMmer, Novocraft, PASS, RMAP, SeqMap, SHRiMP, Slider, SOAP, SSAHA, SOCS, SWIFT, SXOligoSearch, Vmatch, Zoom . . . General schema 1. Build an index of one dataset (genome or reads) allowing effective substring searching. 2. Process the other dataset against the index to find potential mappings. 3. Verify potential mappings. Main differences between mapping tools Index type I hash table I I I q-gram index (all q-mers in a dataset) q-sample index (selected q-mers in a dataset) suffix index Indexed dataset I reference genome I sequenced reads Filtering strategy Mapping with Q-gram index Mosaik, BFAST, PASS use BLAST-like technique I build a q-gram index of a genome I find seeds with index I extend (sequences of) seeds to full alignment I spaced seeds are often used Performance − huge memory required − relatively slow for low error rates + easy to handle higher error rates Mapping with Q-gram index Q-gram filtration (Rasmussen et al. ’06) SHRIMP and RazerS use q-gram filtration with spaced seeds Q-sample index . . . is a hash table with positions of all non-overlapping text substrings of length q. Partitioning into exact search Let A = A1 . . . Ak+s and B be two strings such that d (A, B) ≤ k, where d is Hamming/edit distance. Then at least s substrings Ai1 . . . Ais appear in B without errors. Moreover: I when d is Hamming distance, their positions in B are the same as in A, I when d is edit distance, their relative distances in B cannot differ from those in A by more than k. Read mapping with q-sample indexes Eland, MAQ, SeqMap, RMAP: I Each read is split into 4 substrings of length q. I For each of 6 pairs of substring positions a separate index is built. I Genome is scanned against each index – hits are potential mappings with ≤ 2 errors. I Hits are verified. ZOOM uses spaced q-samples. Performance + Read dataset may be split into subsets to fit into available memory. − Aligning with gaps decreases efficiency. Suffix indexes Suffix tree suffixes = paths from root to leaves I index size: ≥ 10 · |genome| bytes I exact mapping time: O(|read | + |occurences|) Suffix array lexicographic order on suffixes I index size: ≥ 4 · |genome| bytes I exact mapping time: O(|read | · log |genome| + |occurences|) FM index self-index based on Burrows-Wheeler transform I index size: < 1 · |genome| bytes (including sequence!) I exact mapping time: 2-1000× slower than suffix arrays Operations in Ferragina-Manzini index Find(Q) → R searches for all occurrences of sequence Q and returns an opaque result R that can be used with other operations. FindSuffixes(Q1..m ) → R1..m works just like Find, but returns results for each suffix of Q so that Ri is the result of searching for Qi..m . FindContinue(Q1..m , Rold , f ) → Rnew just like Find searches for all occurrences of Q1..m , but takes advantage of an earlier result Rold , assumed to be obtained by searching for Qf ..m , and returns a new result Rnew . Count(R) → k returns the number of occurrences k represented by R. Locate(R) → l1..k returns locations of occurrences represented by R. Extract(b, l ) → S retrieves a subsequence of the reference sequence T : S = T [b..b + l − 1]. Read mapping with FM-index Neighborhood generation All words matching a pattern with 0, 1, 2, . . . errors are generated and searched in the index. Backtracking Partial results of backward searching are recycled in searching words sharing suffixes. Unfortunately. . . . . . the complexity of neighborhood generation is exponential with respect to the number of allowed errors, even with backtracking. Bowtie (Langmead et al. ’09) Seed – high-quality part of the read (default: first 28bp) Policy Search for read occurences in the genome with I limited number of errors in the seed (default: first 28bp), I limited sum of quality values of mismatched positions in the whole read. Algorithm I Genome index is searched with k-neighborhood of the seed of a read. I Located occurences are extended to whole read mappings and the quality criterion is checked. Bowtie – avoiding excessive backtracking I k ≤3 I Double indexing: FM-index is build for a genome sequence (forward index) and for a reverse sequence (mirror index). BWA (Li and Durbin ’09) Algorithm FM-index of a genome is searched with k-neighbourhood of a read. Avoiding excessive backtracking I backward searching until d (V , W [i..|W |]) + D(i) > k, where I I I d (V , W [i..|W |]) is the distance between currently searched word V and corresponding suffix of a read W , D(i) is the lower bound of the number of errors in the genome occurence of the remaining part of W . D(i) is calculated with the FM-index of a reverse genome. Read mapping with FM-index Performance + Mapping is extremely fast and memory efficient for low error rate. − Index construction is complex and memory demanding. Assigning mapping quality – model from BWA and MAQ (Li et al. ’08) Assigning mapping quality – model from BWA and MAQ Assigning mapping quality – model from BWA and MAQ Assigning mapping quality Other possible error sources: I differences between reference and sequenced genomes (SNPs, small and large indels) I sequence contamination I imperfect filtering