DNA Sequence Compression Simon Zheng DNA Sequence Compression ● ● ● DNA: deoxyribonucleic acid Sequence: a continuous or connected an extremely long macromolecule thatseries is the main component of Compression: press into less space. chromosomes to and is thetogether; materialforce that transfers genetic characteristics in all life forms, constructed of two nucleotide strands coiled around each other in a ladderlike arrangement with the sidepieces composed of alternating phosphate and deoxyribose units and the rungs composed of the purine and pyrimidine bases adenine, guanine, cytosine, and thymine: the genetic information of DNA is encoded in the sequence of the bases and is transcribed as the strands unwind and replicate. Source: Shaury Nash, Flickr (Creative Commons) Why Compress DNA Sequences? High-throughput sequencing becoming faster and cheaper More people interested in their + genomic data and predispositions Rapidly growing amounts of sequence data!! Source: “On the future of genomic data” (Kahn, 2011) Why Compress DNA Sequences? ● ● Is growth of sequence data good? With ASCII files, 10k genomes = 30TB ○ ● ● 1 genome ~ 3.1 GB Storage and transfer is a huge problem Cost of sequencing is decreasing much faster than the cost for storing sequences (i.e. buying hard disks) Source: “On the future of genomic data” (Kahn, 2011) Lossless vs. Lossy Compression ● Lossless data compression algorithms allows the original data to be perfectly reconstructed from the compressed data. ○ ● Examples: ZIP functions (gzip, bzip) Lossy data compression algorithms lose information from the original data ○ Examples: image compression functions (e.g. the JPEG standard) ● We will only discuss lossless DNA sequence compression today. ● ● Why would we prefer lossless for DNA sequence compression? Why would we use lossy for image compression, such as the JPEG standard? Outline ● Four Types of Compression Techniques ○ ○ ○ ○ (20 minutes) Naive Bit Manipulation Algorithms Dictionary-based Algorithms Statistical Algorithms Referential Algorithms ● Compressing genomes versus. sequence reads (5 minutes) ● “Human genomes as email attachments” (15 minutes) ● “HapZipper: sharing HapMap populations just got easier” (15 minutes) Four Types of Compression Techniques 1. 2. 3. 4. Naive Bit Manipulation Algorithms Dictionary-based Algorithms Statistical Algorithms Referential Algorithms Naive Bit Manipulation Algorithms ● ● ● 1 byte = 8 bits; 1 bit can represent {0,1} ASCII is inefficient (1 byte per character is wasteful) ~4x compression rate ● Further improvements with run-length encoding Source: “Trends in Genome Compression” (Wandelt, et al., 2013)) Dictionary-based Algorithms ● ● ● ● Main idea: Replace DNA subsequences with references to a dictionary Dictionary can be reconstructed; no need to store Lempel-Ziv-based compression algorithms: LZ77, LZ78 4-6x compression rate Source: “Trends in Genome Compression” (Wandelt, et al., 2013)) Lempel-Ziv-based Algorithms ● ● ● ● Widely used Exploits repeated patterns in input data Dictionary initialized with {A, C, T, G} As dictionary grows, codes grow in bit width to accommodate new entries. ● Main idea: ○ ○ Scan through input string for increasingly longer substrings S until one is not in dict. When match is found, the code for the string S without the last character (call it S-1: the longest substring that is in the dictionary) is retrieved ■ S-1 is sent to output ■ S is added to the dictionary with the next available code. Lempel-Ziv-Welch Compression Example Input: Dictionary ACCACCACCACC # A exists in the dictionary, so we extend it to AC, which does not exist in the dictionary. Output 1 for A. Add AC to the dictionary. Output: 1 # 0 A 1 T 2 G 3 C 4 AC 5 Lempel-Ziv-Welch Compression Example Input: Dictionary ACCACCACCACC # C exists in the dictionary, so we extend it to CC, which does not exist in the dictionary. Output 4 for C. Add CC to the dictionary. Output: 1 4 # 0 A 1 T 2 G 3 C 4 AC 5 CC 6 Lempel-Ziv-Welch Compression Example Input: Dictionary ACCACCACCACC # C exists in the dictionary, so we extend it to CA, which does not exist in the dictionary. Output 4 for C. Add CA to the dictionary. Output: 1 4 4 # 0 A 1 T 2 G 3 C 4 AC 5 CC 6 CA 7 Lempel-Ziv-Welch Compression Example Input: Dictionary ACCACCACCACC # AC exists in the dictionary, so we extend it to ACC, which does not exist in the dictionary. Output 5 for AC. Add ACC to the dictionary. Output: 1 4 4 5 # 0 A 1 T 2 G 3 C 4 AC 5 CC 6 CA 7 ACC 8 Lempel-Ziv-Welch Compression Example Input: Dictionary ACCACCACCACC # CA exists in the dictionary, so we extend it to CAC, which does not exist in the dictionary. Output 7 for CA. Add CAC to the dictionary. Output: 1 4 4 5 7 A 1 T 2 G 3 C 4 AC 5 CC 6 CA 7 ACC 8 CAC 9 Lempel-Ziv-Welch Compression Example Input: Dictionary ACCACCACCACC # CA exists in the dictionary, so we extend it to CAC, which does not exist in the dictionary. Output 6 for CC. Add CCA to the dictionary. Output: 1 4 4 5 7 6 T 2 G 3 C 4 AC 5 CC 6 CA 7 ACC 8 CAC 9 CCA 10 Lempel-Ziv-Welch Compression Example Input: Dictionary ACCACCACCACC # ACC exists in the dictionary, so we extend it to CAC, which does not exist in the dictionary. Output 8 for ACC. Add ... to the dictionary. Output: 1 4 4 5 7 6 8 T 2 G 3 C 4 AC 5 CC 6 CA 7 ACC 8 CAC 9 CCA 10 Statistical Algorithms ● Main idea: Create statistical model of DNA subsequences ○ ● ● ● Higher frequency subsequences represented by shorter codes 4-8x compression rate Huffman Encoding Example: Huffman Encoding with the following probabilities ○ ○ ○ ○ A = 66% G = 18% C = 8% T = 8% Huffman Encoding Example A = 66% G = 18% C = 8% T = 8% Huffman Encoding Example A = 66% G = 18% (CT) = 16% 8% T 8% C Huffman Encoding Example A = 66% (CGT) = 34% 16% 18% G 8% T 8% C Huffman Encoding Example (ACGT) = 100% 34% 66% A 16% 18% G 8% T 8% C Huffman Encoding Example ● Assign shortest codes from top to bottom 1: 34% 0: 66% A 11: 16% 10: 18% G 110: 8% T 111: 8% C Huffman Encoding Example Source: “Trends in Genome Compression” (Wandelt, et al., 2013)) Referential Algorithms ● ● ● ● Main idea: Only track differences from a reference genome of same species up to 400x compression rate Reference genomes: HG19 for humans Example of mapping to a reference genome: Source: “Trends in Genome Compression” (Wandelt, et al., 2013)) Variation Mapping (main idea) Continuous Genome Sequence Source: Medgener123, Wikimedia (Creative Commons) Compressing Genomes vs. Reads ● Main differences: ○ ○ ○ ● Genomes are much longer Reads may map to anywhere on a genome (is by definition, not reconstructed) DNA sequencing is prone to errors → need quality scores Quality Scores ○ ○ ○ Important for methods like SNP detection (mismatches b/c of seq. error or novel SNP?) Typically, Phred-like scores (94 different values compared to 4 for genomes) Harder to compress reads because the quality scores have a larger alphabet to represent both reads and quality scores (ASCII) ■ 30-40x data increase with quality scores Compressing Genomes vs. Reads Source: “Trends in Genome Compression” (Wandelt, et al., 2013)) Two Referential Compression Papers ● ● “Human genomes as email attachments” (Christley, et al. 2009) “HapZipper: sharing HapMap populations just got easier” (Chanda, et al., 2012) Human genomes as email attachments (Christley, et al., 2009) Overview of “Human Genomes as email attachments” ● ● ● Describes a series of techniques for compressing James Watson’s genome (3 billion base pairs; 3 billion bytes; 3 GB) into 4 MB, small enough to be an email attachment Roughly 750x compression from a naive 1-bp-1-byte representation Four special-purpose DNA compression steps: a. b. c. d. Variable integers for positions (VINT) Delta positions (DELTA) SNP Mapping (DBSNP) K-mer partitioning (KMER) Variable integers for positions (VINT) ● ● ● ● Problem: Chromosomes are up to 247 MB long, we would need 4 bytes for full-range of position values Main idea: don’t need all four bytes for smaller values A variable integer (VINT) uses single bit within a byte as an end-of-integer flag (0 = continue, 1 = end) Exercise: How many bytes for 128 using VINT? How do we represent? # Bytes # Representation Bits Possible values 1 byte 7 bits 0-127 2 bytes 14 bits 128-16383 3 bytes 21 bits 16384 - 2097152 Variable integers for positions (VINT) Source: “HapZipper: Sharing HapMap populations just got easier” Delta Positions (DELTA) ● ● Problem: Positions stored as absolute values; positions require increasing storage towards end of chromosome Main Idea: Store position value as relative position (DELTA) from position of previous variation Position SNP DELTA SNP 1,000 G 1,000 G 5,000 T 4,000 T 10,000 A 5,000 A SNP Mapping (DBSNP) ● ● ● Among humans, much of the genome is shared and much of the variations are also shared. Most SNPs are biallele. Therefore, DBSNP transforms the reference SNP map into an ordered list of bits or bitmap where each bit has a value of 1 if that common SNP is in our genome’s variation data and otherwise, has a value of 0. K-mer partitioning (KMER) ● ● ● Observation: human genome sequence is highly nonrandom; has repeat sequences with low complexity. Main idea: Partition insertion sequence data into k-mers, then use Huffman coding to encode. Researchers found that k=4 gave optimal compression rate k-mer Huffman Encoding Example ATCG = 12% AATT = 5% AAAA = 5% TAAA = 4% … … … … 0.1% 0.2% GGCG = 0.2% TTTT = 0.1% TTTT GGCG Compression Results Source: “Human Genomes as email attachments” HapZipper: sharing HapMap populations just got easier (Chanda, et al., 2012) Overview of HapZipper ● ● ● ● ● Describes an effective way to compress population haplotype data > 20x fold compression; 2- to 4- fold better than leading general-purpose compression utilities Uses phased haplotypes Uses dbSNP as reference Three Steps: a. b. c. Encode HapMap data to bit sequence using dbSNP as reference Split encoded dataset to high- and low-minor allele frequencies Compress parts separately with bzip2 Genotypes vs. Haplotypes Source: http://www.biomedcentral. com/content/figures/1471-2105-8-484-1-l.jpg Step 1: Encode HapMap data to bit sequence using dbSNP as reference ● Each HapMap SNP is classified as a bit-sequence in reference to dbSNP, according to one of four categories: bp position allele SNP ref # Exact match SNP (EX-SNPs) match match match Strand flipped SNPs (SF-SNPs) match match in the opposite strand match Reference number mismatch SNPs (RMSNPs) match match no match Novel allele SNPs (Nallele-SNPs) match different allele from those in dbSNP new, so no match Completely Novel SNPs (Nnew-SNPs) new, so no match SNP does not exist in dbSNP new, so no match Step 1: Encode HapMap data to bit sequence using dbSNP as reference Source: “HapZipper: Sharing HapMap populations just got easier” VINT Source: “HapZipper: Sharing HapMap populations just got easier” Step 2: Split encoded dataset to high- and lowminor allele frequencies Source: “HapZipper: Sharing HapMap populations just got easier Step 3: Compress parts separately with bzip2 Source: “HapZipper: Sharing HapMap populations just got easier Step 3: Compress parts separately with bzip2 Source: “HapZipper: Sharing HapMap populations just got easier Comparison of two techniques Compression input Steps “Human genomes as email attachments” (Christley, et al. 2009) “HapZipper: sharing HapMap populations just got easier” (Chanda, et al., 2012) Genotypic data Population haplotype data 1. 2. 3. 4. Variable integers for positions (VINT) Delta positions (DELTA) SNP Mapping (DBSNP) K-mer partitioning (KMER) 1. 2. 3. Compression Rate 3.1GB → 4MB for one genome Encode HapMap data to bit sequence using dbSNP as reference (includes VINT, DELTA, DBSNP) Split encoded dataset to high- and lowminor allele frequencies Compress parts separately with bzip2 > 20x compression rate for all population haplotype files
© Copyright 2024 ExpyDoc