OMICS days, feb 2014 Transcriptome-based population genomics in non-model animals Nicolas Galtier Institut des Sciences de l'Evolution CNRS – Université Montpellier 2 - France The genotype-phenotype link development natural selection The genotype-phenotype link recombination mutation development natural selection genetic drift The genotype-phenotype link recombination mutation polymorphism substitution rate longevity development dN/dS GC-content mating system natural selection genome size generation time body size abundance genetic drift Can genome variation patterns be predicted from species biology ? Testing the population genetic theory Lynch 2007 Sinauer Ass. Can genome variation patterns be predicted from species biology ? Testing the population genetic theory the PopPhyl project Can genome variation patterns be predicted from species biology ? The PopPhyl project Injecting species biology/ecology into comparative genomics Exploring the molecular diversity of nonmodel taxa Testing predictions of the population genetic theory genome-wide What (if any) determines the genetic polymorphism of a metazoan species ? The PopPhyl project Injecting species biology/ecology into comparative genomics Exploring the molecular diversity of nonmodel taxa Testing predictions of the population genetic theory genome-wide What (if any) determines the genetic polymorphism of a metazoan species ? π ~ Ne . µ The PopPhyl project Injecting species biology/ecology into comparative genomics Exploring the molecular diversity of nonmodel taxa Testing predictions of the population genetic theory genome-wide What (if any) determines the genetic polymorphism of a metazoan species ? π ~ Ne . µ 1010 106 102 Leffler et al 2012 PLoS Biol neutral genetic diversity πS species (grouped by phylum) "diversity levels [...] vary surprisingly little among species, and mostly in ways that we still do not understand." Experimental design - Next-Generation Sequencing technology - Target = transcriptome: coding sequences expression data ~35 groups of closely related species - Sampling scheme: 1-4 species per group 2-10 individuals per species Sponges Demosponges Cnidarians Ctenophors Rotifers Acanthocephales Entoprocts Nemertians Platyhelminthes Annelids Mollusks Ectoprocts Brachiopods Chaetognathes Tardigrades Onychophores Arthropods Loricifers Kinorhynchans Priapulids Nematodes Hemichordates Echinoderms Cephalochordates Urochordates Vertebrates collect in the field extract RNA build cDNA libraries sequence cDNA libraries Ind1: ACAATACTGATTTGCGT Ind1: ACGATACTGATTTGCGT Ind2: ACAATACTGATCTGCGT Ind2: ACAATACTGATCTGCGT Ind3: ACAATACTGATTTGCGT Ind3: ACAATACTGATTTGCGT Ind4: ACAATACTGATTTGCGT Ind4: ACAATACTGATCTGCGT * * Ind1: TACTGATTTGCGTAAAGACTTA Ind1: TACTGATTTGCGTAAAGGCTTA Ind2: TACTGATTTGCGTAAGGACTTA Ind2: TACTCATTTGCGTAAAGACTTA Ind3: TACTGATTTGCGTAAAGGCTTA Ind3: TACTGATTTGCGTAAAGGCTTA * * * assemble contigs map reads call SNPs and genotypes predict ORFs Bioinformatic developments Data analysis pipeline mapping Illumina assembling aligned reads predicted cDNAs ORF prediction 454 πN, πS allele frequencies SNP calling SNPs and genotypes What we grabbed from existing resources and tuned: - assemblers : Abyss, Cap3 (Cahais et al 2012 Mol Ecol Res) - mapper : BWA - ORF predictor : trinity_ORF - coding sequence aligner : MACSe What we developped by ourselves: - SNP caller : reads2snp (Tsagkogeorga et al 2012 GBE) - paralog cleaner : paraclean (Gayral et al 2013 PloS Gen) - population genetic calculations : dNdSpiNpiS The main problem we faced: - genomic redundancy Calling SNPs and genotypes from transcriptome reads >contig1 pos ind1 1 6/0/9/0 2 0/4/0/0 3 1/0/0/57 … >contig2 pos ind1 1 0/0/0/14 2 14/1/9/0 … ind2 0/0/8/0 0/7/0/0 0/0/0/26 ind3 10/0/0/0 0/17/0/0 0/0/0/32 ind2 ind3 0/0/0/8 0/2/0/28 8/0/15/0 12/0/0/0 read counts Calling SNPs and genotypes from transcriptome reads >contig1 pos ind1 1 6/0/9/0 2 0/4/0/0 3 1/0/0/57 … >contig2 pos ind1 1 0/0/0/14 2 14/1/9/0 … ind2 ind3 AG 0/0/8/0 GG 6/0/0/0 AA CC 0/7/0/0 CC 0/17/0/0 CC TT 0/0/0/26 TT 0/0/0/32 TT ind2 ind3 TT 0/0/0/8 TT 0/2/0/28 ? AG 8/0/15/0 AG 12/0/0/0 AA read counts → genotypes Calling SNPs and genotypes from transcriptome reads Principle: calculate P1 = probability of data | heterozygote (multinomial) calculate P2 = probability of data | homozygote (error rate) posterior probability of heterozygote: P1/(P1+P2) Variants: measured (Qphred) vs. estimated error rate single (variant calling) vs. multiple samples prior on allele frequency, theoretical (e.g. Wright-Fisher) or empirical prior on genotype frequency (e.g. Hardy-Weinberg) reads2snp: SNP and genotype calling in the ML framework - estimated, contig-specific error rate - Hardy-Weinberg equilibrium assumed - allele frequencies taken from the data (all individuals) - allelic expression bias accomodated 1. estimate the error rate ε 2. calculate posterior probabilities of genotypes knowing ε 3. validate genotype if posterior probability > 0.95 Calling SNPs and genotypes from transcriptome reads Major problem: In case of incomplete reference, two paralogous genes might map to the same contig. >contig1 pos ind1 1 3/0/20/0 … ind2 5/0/50/0 ind3 4/0/45/0 ind4 3/0/25/0 Calling SNPs and genotypes from transcriptome reads Major problem: In case of incomplete reference, two paralogous genes might map to the same contig. >contig1 pos ind1 1 3/0/20/0 … ind2 5/0/50/0 ind3 4/0/45/0 ind4 3/0/25/0 Solution: Discard such spurious SNPs using a SNP-specific likelihood-ratio test involving explicit modelling of the paralogy case. Quality control Robustness of estimates to methodological options Ciona intestinalis Lepus granatensis 0,2 0,15 πN (%) 0,1 0,05 0 0 0,5 1 1,5 2 πS (%) de novo samtools reference no paralogue cleaning No influence of coverage on heterozygosity 20 hare turtle ciona oyster termite 15 Coverage 10 5 0.000 0.002 0.004 Heterozygosity 0.006 Isolation-by-distance patterns consistent with the phylogeographic literature 1.0 4 0.9 3 0.8 0.7 2 0.6 1 r2=0.04 r2=0.54 0 500 1000 1500 2000 0 200 400 600 Turtle Hare Emys orbicularis Lepus granatensis Homozygous mitochondrial "genotypes" 0.02 GA03M|2 GA03N|2 ref|Trachemys_scripta GA03M|1 GA03N|1 ref|Chrysemys_picta GA03C|1 GA03C|2 GA03H|1 GA03L|1 GA03H|2 GA03L|2 GA03E|1 GA03E|2 GA03F|1 GA03F|2 GA03I|1 GA03J|1 GA03I|2 GA03J|2 GA03G|1 GA03G|2 GA03D|2 GA03K|1 GA03K|2 GA03D|1 ref|Emys_orbicularis Trachemys scripta Emys orbicularis Plausible site frequency spectrum Wright-Fisher 0 samtools -0.14 S Lepus granatensis -0.32 NS reads2snps -0.37 -1.18 S NS Comparative population genomics in animals The data set: - 81 metazoan species (67 new) - 2 to 10 individuals each - 1000 to 20,000 predicted ORFs of length >200bp - 300 to 230,000 predicted coding SNPs 25 30 25 20 20 15 15 Median: 3176 10 10 5 5 0 0 2 4 6 8 10 # individuals per species 0 5000 10000 15000 20000 # ORFs per species πS in animals spans two orders of magnitude across species Median: -1.96 Median: 0.0108 35 12 30 10 25 8 20 6 15 4 10 2 5 0 0 0.00 0.02 0.04 πS 0.06 0.08 0.10 -3.0 -2.5 -2.0 -1.5 log(πS) -1.0 No influence of sample size on the estimated πS 0.10 0.08 0.06 0.04 0.02 0.00 0.10 0.08 0.06 0.04 0.02 0.00 NS NS πS πS 2 4 6 8 10 # individuals per species 0 5000 10000 15000 # ORFs per species 20000 Non-synonymous vs. synonymous diversity πN πS More efficient purifying selection in large populations πN/πS=0.12 πN πS Determinants of genetic diversity, part 1: geographic range and population history No effect of the geographic range πS πS latitude var(latitude) No effect of continental vs. marine habitat -3 -4 log(πS) -5 -6 continental marine No effect of the conservation status threat LC NT VU EN CR A strong phylogenetic (family) effect πN πS Artemia franciscana Artemia sinica Artemia tibetania A strong phylogenetic (family) effect πN πS Determinants of genetic diversity, part 2: species biology Body size matters – with exceptions R2=0.26 πS Body size Body size matters – with exceptions πS Body size Body size matters – with exceptions πS A strong effect of parental investment R2=0.53 propagule/adult size ratio K πS r Body size A strong effect of parental investment R2=0.78 + longevity + fecundity + speed propagule/adult size ratio K πS r Body size π S and life-history traits: a summary Low genetic diversity K-strategy Large body size High parental investment (brooders, Long-lived Low fecundity High-mobility species High genetic diversity eusocials) r-strategy Small body size Low parental investment (external fertilization) Short-lived High fecundity Fixed and slow species How comes that LHT determine diversity ? Hypothesis 1: LHT influence species response to environmental perturbations Why the r/K contrast? Hypothesis 2: stronger fluctuations in time of Ne in r-strategists. Modelling population dynamics as an auto-regressive process logN(t+1) = (1-φ) μ + φ logN(t) + arithmetic mean N (0,σ) environmental noise autocorrelation N = 10 6 φ=0 σ = log10 (4) N = 10 φ=0 σ = log10 (64) 6 N = 10 φ = 0.9 σ = log10 (4) 6 1010 106 102 1010 106 102 1010 106 102 0 200 400 600 800 1000 K-strategists 0.9 3.103 1010 0.75 φ 0.5 0.25 0.1 102 2 109 4 8 16 σ 32 64 r-strategists Minimal N required to avoid extinction in 105 generations K-strategists 0.9 0.75 φ 102 0.5 106 0.25 0.1 2 4 8 16 σ 32 64 r-strategists Minimal harmonic mean N conditional on non-extinction Summary - Conclusions - transcriptome-based population genomics in non-model organisms: yes, we can. - weak effects of population history and contingency - life-history traits explain >75% of the variance of πS across species - parental investment effect: a higher Ne in r- than in K-strategists - LHT influence species response to environmental perturbations Thanks to: - the PopPhyl team: P. Gayral V. Cahais Y. Chiari N. Faivre E. Loire J. Lourenço M. Ballenghien G. Tsagkogeorga L. Weinert A. Bernard C. Roux J. Romiguier - collaborators: S. Glémin J. Melo-Fereira N. Bierne M. Carneiro - MBB Platform: K. Belkhir R. Dernat - dozens of sample providers http://kimura.univmontp2.fr/PopPhyl http://mbb.univmontp2.fr/MBB/ F. Delsuc V. Ranwez Molluscs (12 species) πN πS Insects (13 species) πS Vertebrates (25 species) πΝ πS
© Copyright 2024 ExpyDoc