DNA Sequence Compression

DNA Sequence
Compression
Simon Zheng
DNA Sequence Compression
●
●
●
DNA: deoxyribonucleic acid
Sequence:
a continuous
or connected
an extremely
long macromolecule
thatseries
is the main component of
Compression:
press
into less space.
chromosomes to
and
is thetogether;
materialforce
that transfers
genetic characteristics in all
life forms, constructed of two nucleotide strands coiled around each other in
a ladderlike arrangement with the sidepieces composed of alternating
phosphate and deoxyribose units and the rungs composed of the purine and
pyrimidine bases adenine, guanine, cytosine, and thymine: the genetic
information of DNA is encoded in the sequence of the bases and is
transcribed as the strands unwind and replicate.
Source: Shaury Nash, Flickr
(Creative Commons)
Why Compress DNA Sequences?
High-throughput sequencing becoming
faster and cheaper
More people interested in their
+ genomic data and predispositions
Rapidly growing amounts
of sequence data!!
Source: “On the future of genomic data” (Kahn, 2011)
Why Compress DNA Sequences?
●
●
Is growth of sequence data good?
With ASCII files, 10k genomes = 30TB
○
●
●
1 genome ~ 3.1 GB
Storage and transfer is a huge problem
Cost of sequencing is decreasing much
faster than the cost for storing sequences
(i.e. buying hard disks)
Source: “On the future of genomic data” (Kahn, 2011)
Lossless vs. Lossy Compression
●
Lossless data compression algorithms allows the original data to be
perfectly reconstructed from the compressed data.
○
●
Examples: ZIP functions (gzip, bzip)
Lossy data compression algorithms lose information from the original data
○
Examples: image compression functions (e.g. the JPEG standard)
●
We will only discuss lossless DNA sequence compression today.
●
●
Why would we prefer lossless for DNA sequence compression?
Why would we use lossy for image compression, such as the JPEG standard?
Outline
●
Four Types of Compression Techniques
○
○
○
○
(20 minutes)
Naive Bit Manipulation Algorithms
Dictionary-based Algorithms
Statistical Algorithms
Referential Algorithms
●
Compressing genomes versus. sequence reads
(5 minutes)
●
“Human genomes as email attachments”
(15 minutes)
●
“HapZipper: sharing HapMap populations just got easier”
(15 minutes)
Four Types of Compression Techniques
1.
2.
3.
4.
Naive Bit Manipulation Algorithms
Dictionary-based Algorithms
Statistical Algorithms
Referential Algorithms
Naive Bit Manipulation Algorithms
●
●
●
1 byte = 8 bits; 1 bit can represent {0,1}
ASCII is inefficient (1 byte per character is wasteful)
~4x compression rate
●
Further improvements with run-length encoding
Source: “Trends in
Genome Compression”
(Wandelt, et al., 2013))
Dictionary-based Algorithms
●
●
●
●
Main idea: Replace DNA subsequences with references to a dictionary
Dictionary can be reconstructed; no need to store
Lempel-Ziv-based compression algorithms: LZ77, LZ78
4-6x compression rate
Source: “Trends in
Genome Compression”
(Wandelt, et al., 2013))
Lempel-Ziv-based Algorithms
●
●
●
●
Widely used
Exploits repeated patterns in input data
Dictionary initialized with {A, C, T, G}
As dictionary grows, codes grow in bit width to accommodate new entries.
●
Main idea:
○
○
Scan through input string for increasingly longer substrings S until one is not in dict.
When match is found, the code for the string S without the last character (call it S-1: the
longest substring that is in the dictionary) is retrieved
■ S-1 is sent to output
■ S is added to the dictionary with the next available code.
Lempel-Ziv-Welch Compression Example
Input:
Dictionary
ACCACCACCACC #
A exists in the dictionary, so we extend it to AC, which
does not exist in the dictionary.
Output
1
for A.
Add
AC
to the dictionary.
Output: 1
#
0
A
1
T
2
G
3
C
4
AC
5
Lempel-Ziv-Welch Compression Example
Input:
Dictionary
ACCACCACCACC #
C exists in the dictionary, so we extend it to CC, which
does not exist in the dictionary.
Output
4
for C.
Add
CC
to the dictionary.
Output: 1 4
#
0
A
1
T
2
G
3
C
4
AC
5
CC
6
Lempel-Ziv-Welch Compression Example
Input:
Dictionary
ACCACCACCACC #
C exists in the dictionary, so we extend it to CA, which
does not exist in the dictionary.
Output
4
for C.
Add
CA
to the dictionary.
Output: 1 4 4
#
0
A
1
T
2
G
3
C
4
AC
5
CC
6
CA
7
Lempel-Ziv-Welch Compression Example
Input:
Dictionary
ACCACCACCACC #
AC exists in the dictionary, so we extend it to ACC,
which does not exist in the dictionary.
Output
5
for AC.
Add
ACC to the dictionary.
Output: 1 4 4 5
#
0
A
1
T
2
G
3
C
4
AC
5
CC
6
CA
7
ACC
8
Lempel-Ziv-Welch Compression Example
Input:
Dictionary
ACCACCACCACC #
CA exists in the dictionary, so we extend it to CAC,
which does not exist in the dictionary.
Output
7
for CA.
Add
CAC to the dictionary.
Output: 1 4 4 5 7
A
1
T
2
G
3
C
4
AC
5
CC
6
CA
7
ACC
8
CAC
9
Lempel-Ziv-Welch Compression Example
Input:
Dictionary
ACCACCACCACC #
CA exists in the dictionary, so we extend it to CAC,
which does not exist in the dictionary.
Output
6
for CC.
Add
CCA to the dictionary.
Output: 1 4 4 5 7 6
T
2
G
3
C
4
AC
5
CC
6
CA
7
ACC
8
CAC
9
CCA
10
Lempel-Ziv-Welch Compression Example
Input:
Dictionary
ACCACCACCACC #
ACC exists in the dictionary, so we extend it to CAC,
which does not exist in the dictionary.
Output
8
for ACC.
Add
...
to the dictionary.
Output: 1 4 4 5 7 6 8
T
2
G
3
C
4
AC
5
CC
6
CA
7
ACC
8
CAC
9
CCA
10
Statistical Algorithms
●
Main idea: Create statistical model of DNA subsequences
○
●
●
●
Higher frequency subsequences represented by shorter codes
4-8x compression rate
Huffman Encoding
Example: Huffman Encoding with the following probabilities
○
○
○
○
A = 66%
G = 18%
C = 8%
T = 8%
Huffman Encoding Example
A = 66%
G = 18%
C = 8%
T = 8%
Huffman Encoding Example
A = 66%
G = 18%
(CT) = 16%
8%
T
8%
C
Huffman Encoding Example
A = 66%
(CGT) = 34%
16%
18%
G
8%
T
8%
C
Huffman Encoding Example
(ACGT) = 100%
34%
66%
A
16%
18%
G
8%
T
8%
C
Huffman Encoding Example
●
Assign shortest codes from
top to bottom
1: 34%
0: 66%
A
11: 16%
10: 18%
G
110: 8%
T
111: 8%
C
Huffman Encoding Example
Source: “Trends in
Genome
Compression”
(Wandelt, et al.,
2013))
Referential Algorithms
●
●
●
●
Main idea: Only track differences from a reference genome of same species
up to 400x compression rate
Reference genomes: HG19 for humans
Example of mapping to a reference genome:
Source: “Trends in
Genome Compression”
(Wandelt, et al., 2013))
Variation Mapping (main idea)
Continuous Genome Sequence
Source: Medgener123, Wikimedia (Creative
Commons)
Compressing Genomes vs. Reads
●
Main differences:
○
○
○
●
Genomes are much longer
Reads may map to anywhere on a genome (is by definition, not reconstructed)
DNA sequencing is prone to errors → need quality scores
Quality Scores
○
○
○
Important for methods like SNP detection (mismatches b/c of seq. error or novel SNP?)
Typically, Phred-like scores (94 different values compared to 4 for genomes)
Harder to compress reads because the quality scores have a larger alphabet to represent
both reads and quality scores (ASCII)
■ 30-40x data increase with quality scores
Compressing Genomes vs. Reads
Source: “Trends in
Genome Compression”
(Wandelt, et al., 2013))
Two Referential Compression Papers
●
●
“Human genomes as email attachments” (Christley, et al. 2009)
“HapZipper: sharing HapMap populations just got easier” (Chanda, et al.,
2012)
Human genomes as email
attachments
(Christley, et al., 2009)
Overview of “Human Genomes as email
attachments”
●
●
●
Describes a series of techniques for compressing James Watson’s genome (3
billion base pairs; 3 billion bytes; 3 GB) into 4 MB, small enough to be an
email attachment
Roughly 750x compression from a naive 1-bp-1-byte representation
Four special-purpose DNA compression steps:
a.
b.
c.
d.
Variable integers for positions (VINT)
Delta positions (DELTA)
SNP Mapping (DBSNP)
K-mer partitioning (KMER)
Variable integers for positions (VINT)
●
●
●
●
Problem: Chromosomes are up to 247 MB long, we would need 4 bytes for
full-range of position values
Main idea: don’t need all four bytes for smaller values
A variable integer (VINT) uses single bit within a byte as an end-of-integer
flag (0 = continue, 1 = end)
Exercise: How many bytes for 128 using VINT? How do we represent?
# Bytes
# Representation Bits
Possible values
1 byte
7 bits
0-127
2 bytes
14 bits
128-16383
3 bytes
21 bits
16384 - 2097152
Variable integers for positions (VINT)
Source: “HapZipper: Sharing HapMap populations
just got easier”
Delta Positions (DELTA)
●
●
Problem: Positions stored as absolute values; positions require increasing
storage towards end of chromosome
Main Idea: Store position value as relative position (DELTA) from position of
previous variation
Position
SNP
DELTA
SNP
1,000
G
1,000
G
5,000
T
4,000
T
10,000
A
5,000
A
SNP Mapping (DBSNP)
●
●
●
Among humans, much of the genome is shared and much of the variations
are also shared.
Most SNPs are biallele.
Therefore, DBSNP transforms the reference SNP map into an ordered list of
bits or bitmap where each bit has a value of 1 if that common SNP is in our
genome’s variation data and otherwise, has a value of 0.
K-mer partitioning (KMER)
●
●
●
Observation: human genome sequence is highly nonrandom; has repeat
sequences with low complexity.
Main idea: Partition insertion sequence data into k-mers, then use Huffman
coding to encode.
Researchers found that k=4 gave optimal compression rate
k-mer Huffman Encoding Example
ATCG = 12%
AATT = 5%
AAAA = 5%
TAAA = 4%
…
…
…
…
0.1%
0.2%
GGCG = 0.2%
TTTT = 0.1%
TTTT
GGCG
Compression Results
Source: “Human Genomes as email attachments”
HapZipper: sharing
HapMap populations
just got easier
(Chanda, et al., 2012)
Overview of HapZipper
●
●
●
●
●
Describes an effective way to compress population haplotype data
> 20x fold compression; 2- to 4- fold better than leading general-purpose
compression utilities
Uses phased haplotypes
Uses dbSNP as reference
Three Steps:
a.
b.
c.
Encode HapMap data to bit sequence using dbSNP as reference
Split encoded dataset to high- and low-minor allele frequencies
Compress parts separately with bzip2
Genotypes vs. Haplotypes
Source: http://www.biomedcentral.
com/content/figures/1471-2105-8-484-1-l.jpg
Step 1: Encode HapMap data to bit sequence
using dbSNP as reference
●
Each HapMap SNP is classified as a bit-sequence in reference to dbSNP,
according to one of four categories:
bp position
allele
SNP ref #
Exact match SNP (EX-SNPs)
match
match
match
Strand flipped SNPs (SF-SNPs)
match
match in the opposite strand
match
Reference number mismatch SNPs (RMSNPs)
match
match
no match
Novel allele SNPs (Nallele-SNPs)
match
different allele from those in dbSNP
new, so no
match
Completely Novel SNPs (Nnew-SNPs)
new, so no
match
SNP does not exist in dbSNP
new, so no
match
Step 1: Encode HapMap data to bit sequence
using dbSNP as reference
Source: “HapZipper: Sharing HapMap populations just got easier”
VINT
Source: “HapZipper: Sharing HapMap populations just got easier”
Step 2: Split encoded dataset to high- and lowminor allele frequencies
Source: “HapZipper: Sharing HapMap populations just got easier
Step 3: Compress parts separately with bzip2
Source: “HapZipper: Sharing HapMap populations just got easier
Step 3: Compress parts separately with bzip2
Source: “HapZipper: Sharing HapMap populations just got easier
Comparison of two techniques
Compression
input
Steps
“Human genomes as email
attachments” (Christley, et al. 2009)
“HapZipper: sharing HapMap populations just
got easier” (Chanda, et al., 2012)
Genotypic data
Population haplotype data
1.
2.
3.
4.
Variable integers for positions
(VINT)
Delta positions (DELTA)
SNP Mapping (DBSNP)
K-mer partitioning (KMER)
1.
2.
3.
Compression
Rate
3.1GB → 4MB for one genome
Encode HapMap data to bit sequence
using dbSNP as reference (includes
VINT, DELTA, DBSNP)
Split encoded dataset to high- and lowminor allele frequencies
Compress parts separately with bzip2
> 20x compression rate for all population
haplotype files