Evaluation of methods for identifying exposure-related differentially methylated regions in human blood DNA
Matthew Suderman
Postdoctoral Research Assistant
MRC Integrative Epidemiology Unit
University of Bristol, UK
CpG site dependence and distance
[Figure: CpG site correlation (R) against CpG site distance (bp)]
Repression and DNA methylation
[Figure; axis labels: upstream, downstream]
Medvedeva et al. BMC Genomics 2014 15:119
Classical differentially methylated region
Jaffe A E et al. Int. J. Epidemiol. 2012;41:200-209
Questions
•  How to calculate statistical significance?
•  Predefine regions?
•  Sliding window?
•  450K data vs whole genome bisulfite sequencing (~20M)
Predefining regions
gap
bumphunter::clusterMaker(maxgap)
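The idea of predefining regions by a maximum gap can be sketched in a few lines of Python. This is an illustration of gap-based clustering in general, not the bumphunter implementation; the function name `cluster_by_gap` and its arguments are hypothetical.

```python
def cluster_by_gap(chromosomes, positions, maxgap):
    """Assign a cluster id to each CpG site; input order is preserved.

    Two neighbouring sites share a cluster when they lie on the same
    chromosome and are at most `maxgap` base pairs apart.
    """
    # Visit sites in genomic order without reordering the caller's data.
    order = sorted(range(len(positions)),
                   key=lambda i: (chromosomes[i], positions[i]))
    clusters = [0] * len(positions)
    cluster_id = 0
    previous = None
    for i in order:
        current = (chromosomes[i], positions[i])
        if previous is not None:
            new_chromosome = current[0] != previous[0]
            too_far = current[1] - previous[1] > maxgap
            if new_chromosome or too_far:
                cluster_id += 1
        clusters[i] = cluster_id
        previous = current
    return clusters


chromosomes = ["chr1", "chr1", "chr1", "chr1", "chr2"]
positions = [100, 300, 1000, 1200, 1250]
print(cluster_by_gap(chromosomes, positions, maxgap=500))  # [0, 0, 1, 1, 2]
```

A larger `maxgap` gives fewer, longer regions; a smaller one gives many short regions, some containing a single site.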
Different behaviour across genes
Madeleine et al. Nature Biotechnology, 2009
Bumphunting
Step 1: linear regression for each CpG site
Step 2: smooth regression coefficients across the genome
Step 3: identify candidate DMRs; “area” is the statistic
Step 4: construct null distribution from permutations
Jaffe A E et al. Int. J. Epidemiol. 2012;41:200-209
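The four steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the bumphunter package: a running mean stands in for the smoother, sites are assumed to be ordered along one region, candidate regions are runs of same-sign coefficients beyond a cutoff, and the p-value compares each observed area with the largest area in each permutation. All function names are hypothetical.

```python
import random
from statistics import mean


def slope(x, y):
    """Step 1: OLS slope of y on x (one covariate plus intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx


def smooth(values, half_width=1):
    """Step 2: running mean over neighbouring sites (assumed smoother)."""
    return [mean(values[max(0, i - half_width): i + half_width + 1])
            for i in range(len(values))]


def bump_areas(coefs, cutoff):
    """Step 3: sum |coefficient| over each run of same-sign sites beyond the cutoff."""
    areas, current, sign = [], 0.0, 0
    for c in coefs:
        s = (c > cutoff) - (c < -cutoff)  # +1, -1 or 0
        if s != 0 and s == sign:
            current += abs(c)
        else:
            if sign != 0:
                areas.append(current)
            current, sign = (abs(c), s) if s != 0 else (0.0, 0)
    if sign != 0:
        areas.append(current)
    return areas


def bumphunt(exposure, meth, cutoff, n_perm=200, seed=1):
    """meth[j] holds the methylation levels of site j across samples."""
    def areas_for(x):
        return bump_areas(smooth([slope(x, site) for site in meth]), cutoff)

    observed = areas_for(exposure)
    rng = random.Random(seed)
    null_max = []
    for _ in range(n_perm):
        # Step 4: break the exposure-methylation link by shuffling exposure.
        shuffled = exposure[:]
        rng.shuffle(shuffled)
        null_max.append(max(areas_for(shuffled), default=0.0))
    # Permutation p-value per candidate region (assumed max-area null).
    return [(a, (1 + sum(m >= a for m in null_max)) / (1 + n_perm))
            for a in observed]
```

The cutoff and the smoothing window are tuning choices; the slides do not specify either.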
Combining site statistics
[Figure: CpG sites CG1, CG2, CG3, ..., CGn shown relative to the TSS]
•  Fisher’s method for combining p-values
$x = -2 \sum_{i=1}^{n} \log(p_i)$
$x$ has a $\chi^2_{2n}$ distribution
•  Stouffer’s method for combining z-scores
$z = \sum_{i=1}^{n} z_i / \sqrt{n}$
$z = \sum_{i=1}^{n} w_i z_i / \sqrt{\sum_{i=1}^{n} w_i^2}$
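Both combination rules are short enough to write out directly. The sketch below uses only the Python standard library; the function names are hypothetical, both rules assume the site statistics are independent, and the Stouffer p-value is taken as two-sided (an assumption, since the slides do not say).

```python
import math
from statistics import NormalDist


def fisher_combine(pvalues):
    """Fisher's method: x = -2 * sum(log p_i), chi-squared with 2n df."""
    n = len(pvalues)
    x = -2.0 * sum(math.log(p) for p in pvalues)
    # For even degrees of freedom 2n the chi-squared upper tail has the
    # closed form exp(-x/2) * sum_{k<n} (x/2)^k / k!.
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return x, math.exp(-half) * total


def stouffer_combine(zscores, weights=None):
    """Stouffer's method: weighted sum of z-scores, rescaled to unit variance."""
    if weights is None:
        weights = [1.0] * len(zscores)  # unweighted form: sum(z) / sqrt(n)
    z = (sum(w * zi for w, zi in zip(weights, zscores))
         / math.sqrt(sum(w * w for w in weights)))
    p = 2.0 * NormalDist().cdf(-abs(z))  # two-sided p-value
    return z, p


x, p = fisher_combine([0.1, 0.1])
z, pz = stouffer_combine([1.0, 1.0, 1.0, 1.0])
print(round(p, 4), z, round(pz, 4))  # 0.0561 2.0 0.0455
```

Four sites with a modest z-score of 1 each combine to z = 2, which shows how a region of weak but consistent signals can reach significance when no single site does.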
Stouffer: choosing weights
$z = \sum_{i=1}^{n} w_i z_i / \sqrt{\sum_{i=1}^{n} w_i^2}$
Weights can be chosen to emphasize:
•  independence
$w_i$ = 1 / (average correlation between site $i$ and the other sites)
•  agreement
$w_i$ = average correlation between site $i$ and the other sites
•  variability
$w_i$ = variance of site $i$ methylation
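The three weighting schemes can be computed from a methylation matrix as sketched below. The function names and the `scheme` argument are hypothetical. The average correlation is used exactly as written on the slide, so the independence weight is undefined when that average is zero and changes sign when it is negative; how to handle those cases is left open here.

```python
from statistics import mean, variance


def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


def average_correlation(meth, i):
    """Mean correlation between site i and every other site in the region."""
    return mean(pearson(meth[i], meth[j])
                for j in range(len(meth)) if j != i)


def stouffer_weights(meth, scheme):
    """meth[j] holds the methylation levels of site j across samples."""
    if scheme == "independence":
        return [1.0 / average_correlation(meth, i) for i in range(len(meth))]
    if scheme == "agreement":
        return [average_correlation(meth, i) for i in range(len(meth))]
    if scheme == "variability":
        return [variance(site) for site in meth]
    raise ValueError("unknown scheme: " + scheme)


meth = [[1, 2, 3, 4], [2, 4, 6, 8], [1, 3, 2, 4]]
print([round(w, 3) for w in stouffer_weights(meth, "agreement")])  # [0.9, 0.9, 0.8]
```

The independence and agreement schemes pull in opposite directions: the first up-weights sites that carry information not shared with their neighbours, the second up-weights sites that move together with the rest of the region.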
Stouffer: non-independence
1.  $z = \sum_{i=1}^{n} w_i z_i / \sqrt{\sum_{i=1}^{n} w_i^2 + 2 \sum_{i<j} w_i w_j r_{ij}}$
Estimate $r_{ij}$ (the correlation between $z_i$ and $z_j$) by bootstrapping.
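A minimal sketch of the correlation-adjusted statistic with a bootstrap estimate of $r_{ij}$ follows. The site statistic used here (a Fisher-transformed exposure-methylation correlation) is an assumption chosen to keep the example self-contained; the slides do not state which per-site z-score was bootstrapped. All function names are hypothetical.

```python
import math
import random
from statistics import mean


def pearson(a, b):
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def site_z(exposure, levels):
    """Assumed site statistic: Fisher-transformed correlation as a z-score."""
    return math.atanh(pearson(exposure, levels)) * math.sqrt(len(exposure) - 3)


def bootstrap_z_correlation(exposure, meth, n_boot=200, seed=1):
    """Estimate r_ij by resampling individuals with replacement."""
    rng = random.Random(seed)
    n = len(exposure)
    replicates = [[] for _ in meth]
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        x = [exposure[k] for k in idx]
        for j, levels in enumerate(meth):
            replicates[j].append(site_z(x, [levels[k] for k in idx]))
    m = len(meth)
    return [[pearson(replicates[i], replicates[j]) for j in range(m)]
            for i in range(m)]


def stouffer_correlated(z, w, r):
    """Weighted Stouffer statistic with the variance corrected for r_ij."""
    num = sum(wi * zi for wi, zi in zip(w, z))
    var = sum(wi * wi for wi in w)
    var += 2.0 * sum(w[i] * w[j] * r[i][j]
                     for i in range(len(z)) for j in range(i + 1, len(z)))
    return num / math.sqrt(var)


# Two perfectly correlated sites count as one: z = 4 / sqrt(4) = 2.
print(stouffer_correlated([2.0, 2.0], [1.0, 1.0], [[1.0, 1.0], [1.0, 1.0]]))  # 2.0
```

With $r_{ij} = 0$ the same inputs would give $4/\sqrt{2} \approx 2.83$, so ignoring positive correlation between neighbouring sites inflates the combined statistic.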
Comb-p: transform to independent z-scores
2.  a.  Estimate correlation between CpG sites at given distances
    b.  Construct corresponding correlation matrix
    c.  Compute Cholesky factor
    d.  Use factor to generate independent z-scores
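Steps a to d can be sketched with a hand-written Cholesky factorisation. This illustrates the listed steps only and is not the Comb-p software; `corr_at_distance` is a hypothetical user-supplied function returning the estimated correlation for a given distance in base pairs, and the resulting matrix is assumed to be positive definite.

```python
import math


def correlation_matrix(positions, corr_at_distance):
    """Steps a-b: fill the matrix from a distance-to-correlation function."""
    return [[1.0 if i == j else corr_at_distance(abs(pi - pj))
             for j, pj in enumerate(positions)]
            for i, pi in enumerate(positions)]


def cholesky(a):
    """Step c: lower-triangular L with L * L^T = a."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(a[i][i] - s)
            else:
                lower[i][j] = (a[i][j] - s) / lower[j][j]
    return lower


def decorrelate(z, corr):
    """Step d: solve L y = z by forward substitution.

    If z has correlation matrix corr, then y = L^{-1} z has identity
    covariance, so its entries can be combined as independent z-scores.
    """
    lower = cholesky(corr)
    y = []
    for i in range(len(z)):
        partial = sum(lower[i][k] * y[k] for k in range(i))
        y.append((z[i] - partial) / lower[i][i])
    return y


corr = correlation_matrix([100, 200], lambda distance: 0.5)
print([round(v, 4) for v in decorrelate([2.0, 2.0], corr)])  # [2.0, 1.1547]
```

The second z-score shrinks from 2 to about 1.15 because half of its signal was already accounted for by the first site.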
Linear regression
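•  OLS: $y = X\beta + \epsilon$, minimizing $\lVert X\beta - y \rVert_2^2$
y = phenotype/exposure variable
X = methylation levels
(rows = samples, columns = CpG sites)
•  Lasso: minimizing $\lVert X\beta - y \rVert_2^2 + \lambda \lVert \beta \rVert_1$
•  Ridge: minimizing $\lVert X\beta - y \rVert_2^2 + \lambda \lVert \beta \rVert_2^2$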
The globaltest::gt() function is designed for this.
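To make the three objectives concrete, here is a small standard-library sketch. It is not the globaltest or ChAMP implementation: ridge is solved in closed form via the normal equations, lasso by plain coordinate descent, there is no intercept, and columns of X are assumed not to be all zero. Function names are hypothetical.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ridge(X, y, lam):
    """Minimise |X b - y|^2 + lam * |b|^2; lam = 0 gives OLS."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


def lasso(X, y, lam, n_iter=100):
    """Minimise |X b - y|^2 + lam * |b|_1 by coordinate descent."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Residual with site j's current contribution left out.
            resid = [yi - sum(row[k] * beta[k] for k in range(p) if k != j)
                     for row, yi in zip(X, y)]
            rho = sum(row[j] * r for row, r in zip(X, resid))
            norm = sum(row[j] ** 2 for row in X)
            # Soft-threshold: small coefficients are set exactly to zero.
            shrunk = math.copysign(max(abs(rho) - lam / 2.0, 0.0), rho)
            beta[j] = shrunk / norm
    return beta


X = [[1.0, 0.0], [0.0, 1.0]]
y = [2.0, 4.0]
print(ridge(X, y, 1.0), lasso(X, y, 6.0))  # [1.0, 2.0] [0.0, 1.0]
```

Ridge shrinks every coefficient towards zero, whereas lasso removes the weaker site entirely, which is why lasso can double as a site selection step within a region.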
The Avon Longitudinal Study of Parents
and Children (Children of the 90s)
[Figure: ALSPAC data collection timeline: antenatal, <1y, 1y through 13y, 16+]
•  Child’s health (e.g. medical history, anthropometry)
•  Child’s demographics (e.g. ethnicity, social background)
•  Environmental health (e.g. pollutant exposure)
•  Child’s lifestyle (e.g. physical activity, diet)
•  Child’s school & education
•  Child’s behaviour and psychology
•  Child’s development (e.g. cognitive, motor skills, puberty)
•  Parent’s lifestyle (e.g. smoking & drinking)
•  Parent’s psychological well-being
•  Biological samples (blood, plasma, serum, cells, tissue, hair, urine)
Accessible Resource for Integrated Epigenomic Studies (ARIES)
cord blood
Number of associations
exposures/phenotypes     time point  single-site  bumphunter  globaltest  stouffer (cor)  stouffer (ind)  lasso (ChAMP)
Recreational drug        cord        0            0           0           0               0               0
Depression               cord        0            0           0           0               0               0
Miscarriages             cord        0            0           0           0               0               1
Blood metal              cord        0            1           0           0               0               1
Pain medication (early)  cord        0            1           0           0               0               1
SEP (income)             cord        0            0           0           2               0               0
Sensory phenotype        cord        0            0           0           0               0               2
SEP (home)               cord        0            0           0           1               1               1
Pain medication (late)   cord        0            1           1           3               4               0
Diet                     cord        0            0           1           13              13              0
Home air quality         cord        1            1           0           1               1               0
Blood metal              cord        1            0           5           2               2               1
SEP (house)              cord        3            0           0           0               0               3
Air pollution            cord        3            0           4           3               2               0
Tobacco                  cord        32           1           26          27              27              1
Birth characteristic     cord        221          1           66          46              50              5
Pregnancy                cord        2013         1           606         435             398             17
DMR methods contribute
Stouffer (cor) DMRs
SEP (income) DMR
[Figure: methylation levels and linear model coefficients; labels: variable, ~35kb upstream]
Globaltest DMRs
Blood metal levels DMR
[Figure: model fit (predicted against measurements) and linear model coefficients; label: ~65kb from binding gene]
New tobacco exposure DMRs?
New tobacco exposure DMRs
[Figure: Venn diagram comparing single-site, globaltest, stouffer (ind) and stouffer (cor) results; label: 11/32 sites; region counts as extracted: 1, 13, 12, 12, 3, 3]
DMRs and replication
•  Randomly split 1000 samples into sets of 500 samples
•  Identify associated sites/DMRs in each set
•  Compare the associated sites/DMRs
•  Repeat 10x
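The split-and-compare procedure can be sketched as below. The callback `find_associations` is hypothetical: it stands for whichever method (single-site, globaltest, Stouffer) is being evaluated and returns the site or DMR identifiers it finds in the given samples. Counting a finding as replicated when it appears in both halves is an assumption about what "compare" means here.

```python
import random


def split_replication(n_samples, find_associations, n_repeats=10, seed=1):
    """Return (number found, number replicated, % replicated) per split."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        half = n_samples // 2
        first = set(find_associations(indices[:half]))
        second = set(find_associations(indices[half:]))
        replicated = first & second  # found in both halves
        percent = 100.0 * len(replicated) / len(first) if first else 0.0
        results.append((len(first), len(replicated), percent))
    return results


# Toy finder: "cg1" is found in any sample set, "cg2" only with sample 0.
def toy_finder(sample_indices):
    found = {"cg1"}
    if 0 in sample_indices:
        found.add("cg2")
    return found


for found, replicated, percent in split_replication(1000, toy_finder):
    print(found, replicated, percent)
```

Reporting the number found, the number replicated and the percentage together matters because a method can reach a high replication rate simply by reporting very few findings.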
Tobacco exposure replication
[Figure: three panels (number of sites/DMRs, number replicated, % replicated) for single site, global test, stouffer (cor) and stouffer (ind)]
Pregnancy (lots of associations)
[Figure: three panels (number of sites/DMRs, number replicated, % replicated) for single site, global test, stouffer (cor) and stouffer (ind)]
Blood metal levels replication
[Figure: three panels (number of sites/DMRs, number replicated, % replicated) for single site, global test, stouffer (cor) and stouffer (ind)]
Two-stage analysis
•  Randomly split 1000 samples into sets of 500 samples; repeat 10x
•  Stage 1: identify sites/DMRs with p < threshold
•  Stage 2: test only those that pass the threshold
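A minimal sketch of the two-stage logic for one split follows. The p-value callbacks are hypothetical stand-ins for a method applied to each half of the data, and the Bonferroni correction over the stage 1 survivors is an assumption; the slides do not state which multiple-testing adjustment was used in stage 2.

```python
def two_stage(site_ids, pvalue_stage1, pvalue_stage2,
              threshold=0.05, alpha=0.05):
    """Return the sites/DMRs that pass stage 1 and are confirmed in stage 2."""
    # Stage 1: lenient screen in the first half of the samples.
    candidates = [s for s in site_ids if pvalue_stage1(s) < threshold]
    if not candidates:
        return []
    # Stage 2: only the survivors are tested in the second half, so the
    # multiple-testing burden is the number of candidates, not all sites.
    cutoff = alpha / len(candidates)
    return [s for s in candidates if pvalue_stage2(s) < cutoff]


stage1 = {"a": 0.001, "b": 0.2, "c": 0.01}
stage2 = {"a": 0.01, "b": 1e-9, "c": 0.5}
print(two_stage(["a", "b", "c"], stage1.get, stage2.get))  # ['a']
```

Site "b" is never tested in stage 2 despite its small second-half p-value, which is the price paid for the reduced testing burden.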
Two-stage analysis sensitivity
                                     Full dataset analysis                                                               Two-stage analysis (mean n=10 splits)
exposures/phenotypes     time point  single-site  bumphunter  globaltest  stouffer (cor)  stouffer (ind)  lasso (ChAMP)  single-site  globaltest  stouffer (cor)  stouffer (ind)
Recreational drug        cord        0            0           0           0               0               0              0            0.2         0.3             0.3
Depression               cord        0            0           0           0               0               0              0.1          0.1         0.7             0.5
Miscarriages             cord        0            0           0           0               0               1              0            0.4         0.5             0.3
Blood metal              cord        0            1           0           0               0               1              0            0.5         0.3             0.3
Pain medication (early)  cord        0            1           0           0               0               1              0            0.4         0.2             0.2
SEP (income)             cord        0            0           0           2               0               0              0            0.6         0.3             0.4
Sensory phenotype        cord        0            0           0           0               0               2              0            0           0.2             0.3
SEP (home)               cord        0            0           0           1               1               1              0            0.1         0.5             0.5
Pain medication (late)   cord        0            1           1           3               4               0              0            0.3         0.4             0.5
Diet                     cord        0            0           1           13              13              0              0            0.3         0.4             0.6
Home air quality         cord        1            1           0           1               1               0              0            0           0.5             0.4
Blood metal              cord        1            0           5           2               2               1              0.9          1.5         0.6             0.7
SEP (house)              cord        3            0           0           0               0               3              0.4          0.2         0.3             0.2
Air pollution            cord        3            0           4           3               2               0              1            1.4         0.6             0.4
Tobacco                  cord        32           1           26          27              27              1              2.4          3.7         4.3             4.4
Birth characteristic     cord        221          1           66          46              50              5              4.9          6.7         4.9             5.4
Pregnancy                cord        2013         1           606         435             398             17             37.2         24.3        19.1            18.5
Exposure/phenotype distributions
[Figure: distributions of the exposures/phenotypes, panels 1 to 17]
Future directions
•  Stouffer weighting schemes
•  Stouffer without bootstrapping (e.g. Comb-p)
•  Bumphunter
•  Blockfinder
•  Lasso (ChAMP)
•  DMRs and exposure prediction
•  Grand theory about association numbers
Acknowledgements
Caroline Relton
George Davey Smith
Phenotypes/exposures
Jean Golding
Kate Northstone
Rebecca Richmond
Stouffer’s method
Andrew Simpkins
ALSPAC participants
ARIES
Sue Ring
Wendy McArdle
Tom Gaunt
Geoff Woodward
Oliver Lyttleton