π - Lirmm

OMICS days, feb 2014
Transcriptome-based population genomics
in non-model animals
Nicolas Galtier
Institut des Sciences de l'Evolution
CNRS – Université Montpellier 2 - France
The genotype-phenotype link
development
natural selection
The genotype-phenotype link
recombination
mutation
development
natural selection
genetic drift
The genotype-phenotype link
recombination
mutation
polymorphism
substitution rate
longevity
development
dN/dS
GC-content
mating system
natural selection
genome size
generation
time
body size
abundance
genetic drift
Can genome variation patterns be predicted from species biology ?
Testing the population genetic theory
Lynch 2007 Sinauer Ass.
Can genome variation patterns be predicted from species biology ?
Testing the population genetic theory
the PopPhyl project
Can genome variation patterns be predicted from species biology ?
The PopPhyl project
Injecting species biology/ecology into comparative genomics
Exploring the molecular diversity of nonmodel taxa
Testing predictions of the population genetic theory genome-wide
What (if any) determines the genetic polymorphism of a metazoan species ?
The PopPhyl project
Injecting species biology/ecology into comparative genomics
Exploring the molecular diversity of nonmodel taxa
Testing predictions of the population genetic theory genome-wide
What (if any) determines the genetic polymorphism of a metazoan species ?
π ~ Ne . µ
The PopPhyl project
Injecting species biology/ecology into comparative genomics
Exploring the molecular diversity of nonmodel taxa
Testing predictions of the population genetic theory genome-wide
What (if any) determines the genetic polymorphism of a metazoan species ?
π ~ Ne . µ
1010
106
102
Leffler et al 2012 PLoS Biol
neutral
genetic diversity
πS
species (grouped by phylum)
"diversity levels [...] vary surprisingly little among species, and
mostly in ways that we still do not understand."
Experimental design
- Next-Generation Sequencing technology
- Target = transcriptome:
coding sequences
expression data
~35 groups of closely related species
- Sampling scheme:
1-4 species per group
2-10 individuals per species
Sponges
Demosponges
Cnidarians
Ctenophors
Rotifers
Acanthocephales
Entoprocts
Nemertians
Platyhelminthes
Annelids
Mollusks
Ectoprocts
Brachiopods
Chaetognathes
Tardigrades
Onychophores
Arthropods
Loricifers
Kinorhynchans
Priapulids
Nematodes
Hemichordates
Echinoderms
Cephalochordates
Urochordates
Vertebrates
collect in the field
extract RNA
build cDNA libraries
sequence cDNA libraries
Ind1: ACAATACTGATTTGCGT
Ind1: ACGATACTGATTTGCGT
Ind2: ACAATACTGATCTGCGT
Ind2: ACAATACTGATCTGCGT
Ind3: ACAATACTGATTTGCGT
Ind3: ACAATACTGATTTGCGT
Ind4: ACAATACTGATTTGCGT
Ind4: ACAATACTGATCTGCGT
* *
Ind1: TACTGATTTGCGTAAAGACTTA
Ind1: TACTGATTTGCGTAAAGGCTTA
Ind2: TACTGATTTGCGTAAGGACTTA
Ind2: TACTCATTTGCGTAAAGACTTA
Ind3: TACTGATTTGCGTAAAGGCTTA
Ind3: TACTGATTTGCGTAAAGGCTTA
* * *
assemble contigs
map reads
call SNPs and genotypes
predict ORFs
Bioinformatic developments
Data analysis pipeline
mapping
Illumina
assembling
aligned
reads
predicted
cDNAs
ORF
prediction
454
πN, πS
allele frequencies
SNP calling
SNPs and
genotypes
What we grabbed from existing resources and tuned:
- assemblers : Abyss, Cap3
(Cahais et al 2012 Mol Ecol Res)
- mapper : BWA
- ORF predictor : trinity_ORF
- coding sequence aligner : MACSe
What we developped by ourselves:
- SNP caller : reads2snp
(Tsagkogeorga et al 2012 GBE)
- paralog cleaner : paraclean
(Gayral et al 2013 PloS Gen)
- population genetic calculations : dNdSpiNpiS
The main problem we faced:
- genomic redundancy
Calling SNPs and genotypes from transcriptome reads
>contig1
pos ind1
1 6/0/9/0
2 0/4/0/0
3 1/0/0/57
…
>contig2
pos ind1
1 0/0/0/14
2 14/1/9/0
…
ind2
0/0/8/0
0/7/0/0
0/0/0/26
ind3
10/0/0/0
0/17/0/0
0/0/0/32
ind2
ind3
0/0/0/8
0/2/0/28
8/0/15/0
12/0/0/0
read counts
Calling SNPs and genotypes from transcriptome reads
>contig1
pos ind1
1 6/0/9/0
2 0/4/0/0
3 1/0/0/57
…
>contig2
pos ind1
1 0/0/0/14
2 14/1/9/0
…
ind2
ind3
AG 0/0/8/0 GG 6/0/0/0 AA
CC 0/7/0/0 CC 0/17/0/0 CC
TT 0/0/0/26 TT 0/0/0/32 TT
ind2
ind3
TT 0/0/0/8 TT 0/2/0/28 ?
AG 8/0/15/0 AG 12/0/0/0 AA
read counts → genotypes
Calling SNPs and genotypes from transcriptome reads
Principle:
calculate P1 = probability of data | heterozygote (multinomial)
calculate P2 = probability of data | homozygote (error rate)
posterior probability of heterozygote: P1/(P1+P2)
Variants:
measured (Qphred) vs. estimated error rate
single (variant calling) vs. multiple samples
prior on allele frequency, theoretical (e.g. Wright-Fisher) or empirical
prior on genotype frequency (e.g. Hardy-Weinberg)
reads2snp: SNP and genotype calling in the ML framework
- estimated, contig-specific error rate
- Hardy-Weinberg equilibrium assumed
- allele frequencies taken from the data (all individuals)
- allelic expression bias accomodated
1. estimate the error rate ε
2. calculate posterior probabilities of genotypes knowing ε
3. validate genotype if posterior probability > 0.95
Calling SNPs and genotypes from transcriptome reads
Major problem:
In case of incomplete reference, two paralogous genes might map
to the same contig.
>contig1
pos ind1
1 3/0/20/0
…
ind2
5/0/50/0
ind3
4/0/45/0
ind4
3/0/25/0
Calling SNPs and genotypes from transcriptome reads
Major problem:
In case of incomplete reference, two paralogous genes might map
to the same contig.
>contig1
pos ind1
1 3/0/20/0
…
ind2
5/0/50/0
ind3
4/0/45/0
ind4
3/0/25/0
Solution:
Discard such spurious SNPs using a SNP-specific likelihood-ratio test
involving explicit modelling of the paralogy case.
Quality control
Robustness of estimates to methodological options
Ciona intestinalis
Lepus granatensis
0,2
0,15
πN (%)
0,1
0,05
0
0
0,5
1
1,5
2
πS (%)
de novo
samtools
reference
no paralogue cleaning
No influence of coverage on heterozygosity
20
hare
turtle
ciona
oyster
termite
15
Coverage
10
5
0.000
0.002
0.004
Heterozygosity
0.006
Isolation-by-distance patterns consistent with the phylogeographic literature
1.0
4
0.9
3
0.8
0.7
2
0.6
1
r2=0.04
r2=0.54
0
500
1000
1500
2000
0
200
400
600
Turtle
Hare
Emys orbicularis
Lepus granatensis
Homozygous mitochondrial "genotypes"
0.02
GA03M|2
GA03N|2
ref|Trachemys_scripta
GA03M|1
GA03N|1
ref|Chrysemys_picta
GA03C|1
GA03C|2
GA03H|1
GA03L|1
GA03H|2
GA03L|2
GA03E|1
GA03E|2
GA03F|1
GA03F|2
GA03I|1
GA03J|1
GA03I|2
GA03J|2
GA03G|1
GA03G|2
GA03D|2
GA03K|1
GA03K|2
GA03D|1
ref|Emys_orbicularis
Trachemys scripta
Emys orbicularis
Plausible site frequency spectrum
Wright-Fisher
0
samtools
-0.14
S
Lepus granatensis
-0.32
NS
reads2snps
-0.37
-1.18
S
NS
Comparative population genomics in animals
The data set:
- 81 metazoan species (67 new)
- 2 to 10 individuals each
- 1000 to 20,000 predicted ORFs of length >200bp
- 300 to 230,000 predicted coding SNPs
25
30
25
20
20
15
15
Median: 3176
10
10
5
5
0
0
2
4
6
8
10
# individuals per species
0
5000
10000 15000 20000
# ORFs per species
πS in animals spans two orders of magnitude across species
Median: -1.96
Median: 0.0108
35
12
30
10
25
8
20
6
15
4
10
2
5
0
0
0.00
0.02
0.04
πS
0.06
0.08
0.10
-3.0
-2.5
-2.0
-1.5
log(πS)
-1.0
No influence of sample size on the estimated πS
0.10 0.08 0.06 0.04 0.02 0.00
0.10 0.08 0.06 0.04 0.02 0.00
NS
NS
πS
πS
2
4
6
8
10
# individuals per species
0
5000
10000
15000
# ORFs per species
20000
Non-synonymous vs. synonymous diversity
πN
πS
More efficient purifying selection in large populations
πN/πS=0.12
πN
πS
Determinants of genetic diversity, part 1:
geographic range and population history
No effect of the geographic range
πS
πS
latitude
var(latitude)
No effect of continental vs. marine habitat
-3
-4
log(πS)
-5
-6
continental
marine
No effect of the conservation status
threat
LC
NT
VU
EN
CR
A strong phylogenetic (family) effect
πN
πS
Artemia franciscana
Artemia sinica
Artemia tibetania
A strong phylogenetic (family) effect
πN
πS
Determinants of genetic diversity, part 2:
species biology
Body size matters – with exceptions
R2=0.26
πS
Body size
Body size matters – with exceptions
πS
Body size
Body size matters – with exceptions
πS
A strong effect of parental investment
R2=0.53
propagule/adult
size ratio
K
πS
r
Body size
A strong effect of parental investment
R2=0.78
+ longevity
+ fecundity
+ speed
propagule/adult
size ratio
K
πS
r
Body size
π S and life-history traits: a summary
Low genetic diversity
K-strategy
Large body size
High parental investment (brooders,
Long-lived
Low fecundity
High-mobility species
High genetic diversity
eusocials)
r-strategy
Small body size
Low parental investment (external fertilization)
Short-lived
High fecundity
Fixed and slow species
How comes that LHT determine diversity ?
Hypothesis 1:
LHT influence species response to environmental perturbations
Why the r/K contrast?
Hypothesis 2: stronger fluctuations in time of Ne in r-strategists.
Modelling population dynamics as an auto-regressive process
logN(t+1) = (1-φ) μ + φ logN(t) +
arithmetic mean
N (0,σ)
environmental noise
autocorrelation
N = 10 6
φ=0
σ = log10 (4)
N = 10
φ=0
σ = log10 (64)
6
N = 10
φ = 0.9
σ = log10 (4)
6
1010
106
102
1010
106
102
1010
106
102
0
200
400
600
800
1000
K-strategists
0.9
3.103
1010
0.75
φ
0.5
0.25
0.1
102
2
109
4
8
16
σ
32
64
r-strategists
Minimal N required to avoid extinction in 105 generations
K-strategists
0.9
0.75
φ
102
0.5
106
0.25
0.1
2
4
8
16
σ
32
64
r-strategists
Minimal harmonic mean N conditional on non-extinction
Summary - Conclusions
- transcriptome-based population genomics in non-model organisms:
yes, we can.
- weak effects of population history and contingency
- life-history traits explain >75% of the variance of πS across species
- parental investment effect: a higher Ne in r- than in K-strategists
- LHT influence species response to environmental perturbations
Thanks to:
- the PopPhyl team:
P. Gayral
V. Cahais
Y. Chiari
N. Faivre
E. Loire
J. Lourenço
M. Ballenghien
G. Tsagkogeorga
L. Weinert
A. Bernard
C. Roux
J. Romiguier
- collaborators:
S. Glémin
J. Melo-Fereira
N. Bierne
M. Carneiro
- MBB Platform:
K. Belkhir
R. Dernat
- dozens of sample providers
http://kimura.univ­montp2.fr/PopPhyl
http://mbb.univ­montp2.fr/MBB/
F. Delsuc
V. Ranwez
Molluscs (12 species)
πN
πS
Insects (13 species)
πS
Vertebrates (25 species)
πΝ
πS