Mapping and aligning NGS data

Next generation sequencing data and analysis
Mapping reads to a reference
09/09/2014 www.qub.ac.uk/igfs

Mapping of reads to a reference sequence
•  The raw data
   –  Quality assessment of the raw data
   –  Trimming the raw data
•  Quick overview of some alignment tools
   –  MAQ
   –  BWA
      •  BWA-backtrack, BWA-MEM, BWA-SW
   –  BOWTIE
   –  BOWTIE2
•  SAM/BAM output
   –  Alignment information
      •  MAPQ, CIGAR, TLEN
   –  Optional tags
      •  NM, MD, etc.

The raw data: fastq file
Reads.fastq files are very large: typically ~10-50 million reads in one fastq file.
Each read has 4 rows:
•  Sequence header
•  Sequence
•  Quality header
•  Qualities

@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTTCTGACAACTTATAGTT
+
@=?DCDAGBBAA=??<<8CC?EBBEEEEBBBB???<<<<@@BEBEE=89B8BBBEEE;;=6
@HWI-EAS209_0006_FC706VJ:5:58:4694:21321#ATCACG/1
CTATGGCGTAGTAAATAAATCTCCTAATAGCTTAGATATTACCTTCAATAGCTTAGTC
+
BAA=??<<=??<<8CC?EBBEEEDCDAGBBAA=??<<8CCEBBBB??F<<<<@@BEBEE8
@HWI-EAS209_0006_FC706VJ:5:58:3455:21453#ATCACG/1
ATAGCTTGTAGTAAATAAATCTCCATAGCTTTTAGATATTACCTTCAATAGCTTAGTC
+

Qualities in fastq file: Phred quality scores

Phred quality score   Probability of incorrect call   Base call accuracy
10                    1 in 10                         90%
20                    1 in 100                        99%
30                    1 in 1000                       99.9%
40                    1 in 10000                      99.99%

Encoding (Illumina 1.8+ format):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
Each quality character encodes a Phred score as its ASCII code minus 33, from '!' (score 0) through 'I' (score 40, 99.99% accuracy) to 'J' (score 41).

Quality Control: FastQC
•  Quality assessment of next-generation sequencing data
•  Generation of multiple QC plots
•  Gives a quick impression of data quality
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
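
The four-row FASTQ layout and the Phred+33 encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not part of any of the tools named in these slides: the function names (read_fastq, phred_scores, error_probability) and the tiny in-memory record are made up for the example, and real files may need more careful handling (compressed input, malformed records).

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, qualities) from a FASTQ stream with 4 rows per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:              # end of file
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()           # quality header row ('+'), ignored here
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@' of the header

def phred_scores(qual, offset=33):
    """Convert a quality string to Phred scores (Illumina 1.8+ uses offset 33)."""
    return [ord(ch) - offset for ch in qual]

def error_probability(score):
    """Probability of an incorrect base call for a given Phred score."""
    return 10 ** (-score / 10)

# Hypothetical single-read example: 'I' = 40, '5' = 20, '+' = 10, '!' = 0
example = io.StringIO("@read1\nACGTN\n+\nII5+!\n")
for name, seq, qual in read_fastq(example):
    scores = phred_scores(qual)
    print(name, seq, scores, [error_probability(s) for s in scores])
```

A score of 20 corresponds to an error probability of 0.01, matching the "1 in 100" row of the Phred table.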

FastQC for good Illumina run
[Figure: FastQC report for a good Illumina run]

FastQC for bad Illumina run
[Figure: FastQC report for a bad Illumina run]

Processing of reads with the fastx-toolkit
•  FASTQ Clipper
   –  Removes sequencing adapters / linkers from reads
•  FASTQ Quality Filter
   –  Filters (removes) sequences based on quality
•  FASTQ Quality Trimmer
   –  Trims (cuts) sequences based on quality
•  etc.
See http://hannonlab.cshl.edu/fastx_toolkit/commandline.html

Trimming of paired reads
•  Fastq-mcf
   –  Detects and removes sequencing adapters and primers
   –  Detects poor quality at the ends of reads and clips it
   –  Detects Ns and removes them from the ends
   –  Discards sequences that are too short after all of the above
   –  Keeps multiple mate-reads in sync while doing all of the above
   –  etc.
   –  See https://code.google.com/p/ea-utils/wiki/FastqMcf

Post-QC analysis
•  De novo assembly (no reference available)
•  Mapping reads to a reference

Alignment methods
[Figure: alignment tasks (whole genome alignment, short read mapping, short pairwise alignment, database search) with the tools BWA-MEM, BWA-SW, BWA-backtrack, Bowtie, Bowtie2 and Maq placed among them. Adapted from: Chaisson and Tesler, BMC Bioinformatics 2012, 13:238]

NGS data: Mapping millions of short reads to a reference sequence (genome/transcriptome)
•  Examples of mapping tools
   –  Maq (2007, ungapped alignment)
   –  BWA-backtrack (2009, gapped alignment)
   –  Bowtie (2009, ungapped alignment)
   –  BWA-SW (2010, gapped alignment)
   –  BWA-MEM (2013, gapped alignment)
   –  Bowtie2 (2012, gapped alignment)

Running mapping (see each tool's documentation)
BWA:
    bwa index ref.fa
    bwa mem ref.fa read1.fq read2.fq > file.sam
Maq:
    maq.pl easyrun -p ref.fa read1.fq read2.fq
    maq2sam-long all.map > all.sam
Bowtie2:
    bowtie2-build [options] ref.fa bt2_base
    bowtie2 [options] -x bt2_base -1 1.fastq -2 2.fastq -S file.sam
Bowtie:
    bowtie-build [options] ref.fa bt_base
    bowtie [options] -S -x bt_base -1 1.fastq -2 2.fastq > file.sam

Mapping billions of short reads to a genome
[Figure: the two indexing strategies, spaced-seed indexing and the Burrows-Wheeler transform. From Trapnell & Salzberg, Nat Biotechnol. 2009 May;27(5):455-7]

Mapping process
•  Create an index of the reference sequence
•  Map the reads to the reference sequence
   –  Inexact match
   –  Quality awareness
   –  Global end-to-end or local alignment
   –  Ungapped or gapped alignment
•  Output usually in the form of a sam/bam file
   –  Records the alignment of each read
   –  Mapping qualities
   –  Matches and mismatches

BWA-backtrack, BWA-MEM, BWA-SW
•  BWA, which algorithm should I choose?
   –  BWA-backtrack
      •  Better for shorter sequences
      •  Gapped alignment
      •  Designed for sequencing error rates <2%
   –  BWA-MEM
      •  For >70bp Illumina, 454, Ion Proton
      •  Gapped alignment
      •  Tolerates more errors
   –  BWA-SW
      •  For longer sequences when alignment gaps are likely to be frequent
      •  Gapped alignment
      •  Tolerates more errors
See http://bio-bwa.sourceforge.net/

Bowtie, Bowtie2
•  Bowtie
   –  For short reads <50bp Bowtie may be faster and more sensitive; it performs ungapped alignment
•  Bowtie2
   –  For reads >50bp Bowtie2 is usually faster and more accurate; it performs gapped alignment and allows multiple gaps
See http://bowtie-bio.sourceforge.net/index.shtml

Problems the aligner has to deal with
•  A read will not be exactly the same as the reference sequence, so the aligner has to allow inexact mapping and find the best match
   –  Sequencing errors
   –  Real differences
      •  SNPs, indels, structural rearrangements
      •  Diploid, tetraploid etc.
      •  Homozygous, heterozygous
•  Longer reads are more likely to have more mismatches
These issues affect the choice of mapping approach and the choice of downstream analysis.

The Result: The Alignment file
•  The results obtained from an alignment tool
   –  Alignment file
   –  Usually in the form of SAM/BAM format
•  Can be viewed in a genome viewer
   –  Integrative Genomics Viewer (IGV)

[Figure: SAM file, here shown without headers]
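
To make the "create an index of the reference, then map the reads" steps of the mapping process concrete, here is a toy Python sketch of Burrows-Wheeler indexing with backward search. It is only an illustration under strong simplifying assumptions: the suffix array is built by naive sorting, only exact matches are found, and the function names (build_index, occurrences, find_read) are invented for this example. It is not how BWA or Bowtie are implemented; those tools use compressed indexes and allow mismatches and gaps.

```python
def build_index(reference):
    """Build a toy BWT index: suffix array, BWT string and cumulative counts."""
    text = reference + "$"                      # '$' sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT = character preceding each sorted suffix (text[-1] for suffix 0)
    bwt = "".join(text[i - 1] for i in suffix_array)
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    smaller, total = {}, 0                      # smaller[c] = chars in text < c
    for ch in sorted(counts):
        smaller[ch] = total
        total += counts[ch]
    return suffix_array, bwt, smaller

def occurrences(bwt, ch, end):
    """Number of times ch occurs in bwt[:end] (naive; real tools precompute this)."""
    return bwt[:end].count(ch)

def find_read(index, read):
    """Return sorted 1-based reference positions where read matches exactly."""
    suffix_array, bwt, smaller = index
    if not read:
        return []
    low, high = 0, len(bwt)                     # current suffix-array interval
    for ch in reversed(read):                   # backward search, last base first
        if ch not in smaller:
            return []
        low = smaller[ch] + occurrences(bwt, ch, low)
        high = smaller[ch] + occurrences(bwt, ch, high)
        if low >= high:
            return []
    return sorted(suffix_array[i] + 1 for i in range(low, high))

index = build_index("ACGTACGTTAGC")             # hypothetical mini reference
print(find_read(index, "ACGT"))                 # both copies of ACGT
print(find_read(index, "TAG"))
```

The index is built once per reference and reused for every read, which is why the mapping commands have a separate indexing step (bwa index, bowtie2-build, bowtie-build).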

SAM/BAM files viewed in IGV
[Figure: IGV screenshots of the alignments; panels: MAQ, Bowtie, BWA-MEM, Bowtie2]

SAM file
Each record consists of:
•  Read name
•  Alignment info
•  Sequence of read
•  Quality of each base
•  Optional tags
Read name, sequence and qualities for each read are the same as in the fastq file.
Alignment info and optional tags describe how the read aligns to the reference.

Read name, sequence of the read, quality of each base
HWI-ST863:143:D157WACXX:7:1101:15088:3801
CGGCTTTTCGGCTGGCTGCTGGAGGAGCTTGGCGCAGATGGCC
JIJJJJIIJJJFIJJIGGGGGCCEHBEFDACCCDBBD:ACCCDCCCDC;@@

SAM file (Bowtie2)
[Figure: a Bowtie2 SAM record labelled with read name, alignment info, sequence of read, quality of each base and optional tags]

1.4 The alignment section: mandatory fields
In the SAM format, each alignment line typically represents the linear alignment of a segment. Each
line has 11 mandatory fields. These fields always appear in the same order and must be present, but
their values can be ‘0’ or ‘*’ (depending on the field) if the corresponding information is unavailable.
The following table gives an overview of the mandatory fields in the SAM format:
Col  Field  Type    Regexp/Range              Brief description
1    QNAME  String  [!-?A-~]{1,255}           Query template NAME
2    FLAG   Int     [0, 2^16-1]               bitwise FLAG
3    RNAME  String  \*|[!-()+-<>-~][!-~]*     Reference sequence NAME
4    POS    Int     [0, 2^31-1]               1-based leftmost mapping POSition
5    MAPQ   Int     [0, 2^8-1]                MAPping Quality
6    CIGAR  String  \*|([0-9]+[MIDNSHPX=])+   CIGAR string
7    RNEXT  String  \*|=|[!-()+-<>-~][!-~]*   Ref. name of the mate/next read
8    PNEXT  Int     [0, 2^31-1]               Position of the mate/next read
9    TLEN   Int     [-2^31+1, 2^31-1]         observed Template LENgth
10   SEQ    String  \*|[A-Za-z=.]+            segment SEQuence
11   QUAL   String  [!-~]+                    ASCII of Phred-scaled base QUALity+33
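
As a rough sketch of how the 11 mandatory fields can be read from one alignment line, the Python below splits a tab-separated record and converts the integer columns. The SamRecord type, the parse_sam_line function and the example line (read name, sequence and qualities) are hypothetical and only serve the illustration; header lines starting with '@' are not handled.

```python
from typing import NamedTuple, Tuple

class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: Tuple[str, ...]   # optional fields, left unparsed here

def parse_sam_line(line):
    """Split one SAM alignment line into its 11 mandatory fields plus optional tags."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        qname=cols[0], flag=int(cols[1]), rname=cols[2], pos=int(cols[3]),
        mapq=int(cols[4]), cigar=cols[5], rnext=cols[6], pnext=int(cols[7]),
        tlen=int(cols[8]), seq=cols[9], qual=cols[10], tags=tuple(cols[11:]),
    )

# Hypothetical record with a short 8-base read
line = "read1\t99\tisotig11038\t1187\t42\t8M\t=\t1189\t10\tACGTACGT\tIIIIIIII\tNM:i:0"
record = parse_sam_line(line)
print(record.rname, record.pos, record.mapq, record.cigar)
```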
1. QNAME: Query template NAME. Reads/segments having identical QNAME are regarded to
come from the same template. A QNAME ‘*’ indicates the information is unavailable. In a
SAM file, a read may occupy multiple alignment lines, when its alignment is chimeric or when
multiple mappings are given.
2. FLAG: bitwise FLAG. Each bit is explained in the following table:
Bit     Description
0x1     template having multiple segments in sequencing
0x2     each segment properly aligned according to the aligner
0x4     segment unmapped
0x8     next segment in the template unmapped
0x10    SEQ being reverse complemented
0x20    SEQ of the next segment in the template being reversed
0x40    the first segment in the template
0x80    the last segment in the template
0x100   secondary alignment
0x200   not passing quality controls
0x400   PCR or optical duplicate
0x800   supplementary alignment
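
The bit table above translates directly into a small decoder. The sketch below is illustrative only: the decode_flag function is an invented helper, and the short labels are paraphrases of the descriptions in the table rather than official names.

```python
# (bit value, short label) pairs following the FLAG table above
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]

def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]

for value in (99, 147):
    print(value, format(value, "b"), decode_flag(value))
```

Because each property has its own bit, a FLAG is simply the sum of the bit values that apply, for example 99 = 64 + 32 + 2 + 1.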
Source of the excerpt above: SAM format specification (see REFERENCES).

SAM file alignment info

Aligner  Flag  RefName      Pos   MapQ  Cigar      Rnext  Pnext  Tlen
BWA      99    isotig11038  1187  60    80M        =      1189   82
BWA      147   isotig11038  1189  60    80M        =      1187   -82
BWA      99    isotig11038  1602  9     53S20M7S   =      1602   20
BWA      147   isotig11038  1602  9     22S20M38S  =      1602   -20
BWA      83    isotig11038  909   60    80M        =      782    -207
BWA      163   isotig11038  782   60    80M        =      909    207
Bowtie2  99    isotig11038  1187  42    80M        =      1189   82
Bowtie2  147   isotig11038  1189  42    80M        =      1187   -82
Bowtie2  83    isotig11038  909   42    80M        =      782    -207
Bowtie2  163   isotig11038  782   42    80M        =      909    207
Bowtie   99    isotig11038  1187  255   80M        =      1189   82
Bowtie   147   isotig11038  1189  255   80M        =      1187   -82
Bowtie   163   isotig11038  782   255   80M        =      909    207
Bowtie   83    isotig11038  909   255   80M        =      782    -207

MAPQ
•  MAPQ: MAPping Quality
   –  It equals -10 log10 Pr{mapping position is wrong}, rounded to the nearest integer.
   –  A value of 255 indicates that the mapping quality is not available.

Probability that the mapping position is wrong      MAPQ
0.1        (10%,      1 in 10)                      -10*log10(0.1)      = 10
0.01       (1%,       1 in 100)                     -10*log10(0.01)     = 20
0.001      (0.1%,     1 in 1,000)                   -10*log10(0.001)    = 30
0.0001     (0.01%,    1 in 10,000)                  -10*log10(0.0001)   = 40
0.00001    (0.001%,   1 in 100,000)                 -10*log10(0.00001)  = 50
0.000001   (0.0001%,  1 in 1,000,000)               -10*log10(0.000001) = 60
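
The MAPQ definition above is easy to check numerically. The following Python sketch converts in both directions; the function names are made up for this example, and it assumes the MAPQ really follows the stated formula (how faithfully each aligner follows it is not covered here).

```python
import math

def mapq_from_error_probability(p_wrong):
    """MAPQ = -10 * log10(probability that the mapping position is wrong), rounded."""
    return round(-10 * math.log10(p_wrong))

def error_probability_from_mapq(mapq):
    """Invert the formula; 255 means the mapping quality is not available."""
    if mapq == 255:
        return None
    return 10 ** (-mapq / 10)

for p in (0.1, 0.01, 0.001, 0.000001):
    print(p, mapq_from_error_probability(p))
for q in (9, 42, 60, 255):
    print(q, error_probability_from_mapq(q))
```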
Useful link: utility that explains SAM flags in plain English
http://picard.sourceforge.net/explain-flags.html

SAM file: bit-wise flags
•  Explanation:
      1     read paired
      2     read mapped in proper pair
      4     read unmapped
      8     mate unmapped
      16    read reverse strand
      32    mate reverse strand
      64    first in pair
      128   second in pair
      256   not primary alignment
      512   read fails platform quality checks
      1024  read is PCR or optical duplicate
•  Bit-wise flag: each bit is 1 = true or 0 = false, and the resulting binary number is converted into a decimal
•  Flags can be used for filtering

Examples:
•  1100011 = 64+32+2+1 = 99
   –  read paired
   –  read mapped in proper pair
   –  mate reverse strand
   –  first in pair
•  10010011 = 128+16+2+1 = 147
   –  read paired
   –  read mapped in proper pair
   –  read reverse strand
   –  second in pair

What do these flags mean: 83? 163?
•  1010011 = 64+16+2+1 = 83
   –  read paired
   –  read mapped in proper pair
   –  read reverse strand
   –  first in pair
•  10100011 = 128+32+2+1 = 163
   –  read paired
   –  read mapped in proper pair
   –  mate reverse strand
   –  second in pair
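
Since flags can be used for filtering, one common pattern is to state which bits must be set and which bits must not be set. The Python sketch below shows that idea with plain bit masks; the function name passes_filter and the two extra flag values 77 and 1123 are hypothetical examples, and the approach is similar in spirit to the -f and -F options of samtools view.

```python
PROPER_PAIR = 0x2
UNMAPPED = 0x4
NOT_PRIMARY = 0x100
DUPLICATE = 0x400

def passes_filter(flag, require=0, exclude=0):
    """True if all 'require' bits are set and none of the 'exclude' bits are set."""
    return (flag & require) == require and (flag & exclude) == 0

# 99, 147, 83, 163 come from the slides; 77 (unmapped pair member) and
# 1123 (99 plus the duplicate bit) are made-up values for illustration.
flags = [99, 147, 83, 163, 77, 1123]
kept = [
    f for f in flags
    if passes_filter(f, require=PROPER_PAIR,
                     exclude=UNMAPPED | NOT_PRIMARY | DUPLICATE)
]
print(kept)
```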

SAM file: CIGAR string
M = alignment match (sequence match or mismatch)
I = insertion to the reference
D = deletion from the reference
S = clipped alignment (soft clipped)
H = clipped alignment (hard clipped)
N = long skip on the reference sequence
P = silent deletion from padded reference

The CIGAR string
What do these mean?
•  25M
•  18M1I7M
•  3S8M1D6M4S
•  9M14N8M

25M
REF:  TGCATTCATGTGAATGTGAATGTAATATGGTGATCGCAC
Read:        ATGCGAATGTGATTGTAATATGGTG

18M1I7M
REF:  TGCATTCATGTGAATGTGAATGTAA*TATGGTGATCGCAC
Read:        ATGCGAATGTGATTGTAAATATGGTG

3S8M1D6M4S
REF:  AGCTAGCATCGTGTCGCCCGTCTAGCATACGCATGATCGAC
Read:        gggGTGTAGCC-GACTAGgggg

9M14N8M
REF:  TCGTGTCGCCCGTCTAGCATACGCATGATCGACTGTCAGCTA
Read:   GTGTAACCC..............TGAGCGCC

Read name, alignment info, sequence, qualities, tags
HWI-ST863:143:D157WACXX:7:1101:15088:3801 99 isotig11038 1187 42 80M = 1189 82
CGGCTTTTCGGCTGGCTGCTGGAGGAGCTTGGCGCAGATGG
JIJJJJIIJJJFIJJIGGGGGCCEHBEFDACCCDBBD:ACCCDCCCDC;
AS:i:-6 NM:i:1 MD:Z:10C69 YS:i:-5 YT:Z:CP
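
The CIGAR operations listed above can be parsed with a short regular expression. The Python below is a sketch with invented helper names (parse_cigar, read_length, reference_span); it relies on the usual convention that M, I, S, = and X consume read bases while M, D, N, = and X consume reference bases, and that H and P consume neither.

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_READ = set("MIS=X")        # operations that use up bases of the read
CONSUMES_REFERENCE = set("MDN=X")   # operations that use up reference positions

def parse_cigar(cigar):
    """Turn '3S8M1D6M4S' into [(3, 'S'), (8, 'M'), (1, 'D'), (6, 'M'), (4, 'S')]."""
    if cigar == "*":                # CIGAR unavailable
        return []
    ops = [(int(length), op) for length, op in CIGAR_TOKEN.findall(cigar)]
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops

def read_length(cigar):
    """Number of read bases described by the CIGAR (length of SEQ)."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)

def reference_span(cigar):
    """Number of reference positions covered, starting at POS."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)

for cigar in ("25M", "18M1I7M", "3S8M1D6M4S", "9M14N8M"):
    print(cigar, parse_cigar(cigar), read_length(cigar), reference_span(cigar))
```

For 18M1I7M, for example, the read is 26 bases long but covers only 25 reference positions, because the inserted base has no counterpart in the reference.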

SAM file: optional tags reported by each aligner

BWA
NM:i:1  MD:Z:10C69    AS:i:75  XS:i:0
NM:i:1  MD:Z:8C71     AS:i:75  XS:i:0
NM:i:0  MD:Z:20       AS:i:20  XS:i:19  XA:Z:isotig08401,-365,2S19M59S,0;
NM:i:0  MD:Z:20       AS:i:20  XS:i:19  XA:Z:isotig08401,+365,33S19M28S,0;
NM:i:2  MD:Z:8A33G37  AS:i:70  XS:i:0
NM:i:0  MD:Z:80       AS:i:80  XS:i:0

Bowtie2
AS:i:-6   XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:10C69    YS:i:-5   YT:Z:CP
AS:i:-5   XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:8C71     YS:i:-6   YT:Z:CP
AS:i:-11  XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:8A33G37  YS:i:0    YT:Z:CP
AS:i:0    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:80       YS:i:-11  YT:Z:CP

Bowtie
XA:i:1  MD:Z:10C69    NM:i:1
XA:i:0  MD:Z:8C71     NM:i:1
XA:i:0  MD:Z:80       NM:i:0
XA:i:0  MD:Z:8A33G37  NM:i:2

SAM file optional tags

TAG  TYPE       Description
NM   i=integer  edit distance to reference
MD   Z=string   string for mismatching positions
AS   i=integer  score generated by aligner
YS   i=integer  score of other mate
YT   Z=string   string representing alignment type
X?   ?          reserved fields for end-users

SAM file tags
•  Optional fields are in the format: <TAG>:<TYPE>:<VALUE>
•  Examples:
   –  NM:i:1
      •  NM tag means edit distance to reference
      •  i stands for integer
      •  1 = an edit distance of one (here, one mismatch)
   –  MD:Z:12A12
      •  MD tag means string for mismatching positions
      •  Z stands for printable string
      •  12A12 shows 12 matching bases, a mismatch where the reference base is A, then 12 more matching bases

NM and MD tags: BWA-MEM alignment file

CIGAR  NM-tag  MD-tag
25M    NM:i:1  MD:Z:12A12
REF:  TGCATTCATGTGAATGTGAATGTAATATGGTGATCGCAC
Read:        ATGCGAATGTGATTGTAATATGGTG

What do these mean?

CIGAR  NM-tag  MD-tag
80M    NM:i:1  MD:Z:10C69
80M    NM:i:1  MD:Z:8C71
80M    NM:i:2  MD:Z:8A33G37
80M    NM:i:0  MD:Z:80
80M    NM:i:2  MD:Z:24T44A10
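
The <TAG>:<TYPE>:<VALUE> layout and the MD string can be taken apart with a few lines of Python. This is a sketch under simple assumptions: the helper names (parse_tag, parse_md, summarize_md) are invented for the example, only the integer type 'i' is converted, and MD strings are assumed to be well formed.

```python
import re

def parse_tag(field):
    """Split an optional field such as 'NM:i:1' into (tag, type, value)."""
    tag, type_code, value = field.split(":", 2)   # the value may itself contain ':'
    if type_code == "i":
        value = int(value)
    return tag, type_code, value

def parse_md(md):
    """Split an MD value into match counts, mismatched reference bases and
    deletions (written as '^' followed by the deleted reference bases)."""
    tokens = re.findall(r"\d+|\^[A-Za-z]+|[A-Za-z]", md)
    return [int(tok) if tok.isdigit() else tok for tok in tokens]

def summarize_md(md):
    """Return (matching bases, mismatches, deleted reference bases)."""
    parts = parse_md(md)
    matches = sum(p for p in parts if isinstance(p, int))
    mismatches = sum(1 for p in parts if isinstance(p, str) and not p.startswith("^"))
    deleted = sum(len(p) - 1 for p in parts if isinstance(p, str) and p.startswith("^"))
    return matches, mismatches, deleted

print(parse_tag("NM:i:1"), parse_tag("MD:Z:10C69"))
for md in ("10C69", "8A33G37", "80", "24T44A10"):
    print(md, parse_md(md), summarize_md(md))
```

For an 80M alignment, the matching and mismatching bases in MD add up to 80, and the number of mismatches agrees with the NM tag when there are no indels.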

In summary, each SAM record provides:
•  MAPQ (mapping quality, unique mapping)
•  CIGAR (match, insertion, deletion)
•  Optional fields, tags (various)
•  Sequence string (of the read)
•  Base quality (of each base in the read)
Process with downstream tools to get the desired output.

The SAM/BAM alignment file
•  The sam/bam file may be processed to identify
   –  Counts per gene / read depth (RNA-seq, ChIP-seq)
      •  Gene expression analysis
      •  Transcription factor binding sites
   –  Polymorphisms (SNPs, indels, etc.)
   –  Annotation file (gtf/gff) for annotation of results

Processing tools for sam/bam alignment file
•  Samtools/Picard/GATK: provide sets of tools for working with next-generation sequencing data in bam format, e.g. deduplication, re-alignment around indels
•  Bedtools: comparing genomic features in bed format
•  Perl/Python scripts etc. for more custom-made exploration of the data
•  Downstream statistical analysis in R etc.

REFERENCES
Mapping short reads onto genomes
•  Trapnell C, Salzberg SL. How to map billions of short reads onto genomes. Nat Biotechnol. 2009 May;27(5):455-7.
Bowtie and Bowtie2
•  Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25.
•  Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012 Mar 4;9(4):357-9.
BWA
•  Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009 Jul 15;25(14):1754-60.
•  Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010 Mar 1;26(5):589-95.
•  Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013. arXiv:1303.3997v1 [q-bio.GN]
SAM alignment format
•  SAM alignment format specifications: http://samtools.sourceforge.net/SAMv1.pdf