Genetic Reconstruction of Large Pedigrees Nuala A. Sheehan1, Mark Bartlett2 & James Cussens2 1 Departments of Health Sciences and Genetics, University of Leicester 2 Centre for Complex Systems, University of York Medical Research Council Project Grant G1002312 MRC Conference on Biostatistics March 2014 Finding Relatives Identification of relatives relevant for applications in many areas. In Human Genetics: • linkage analysis • remove or adjust for cryptic relatedness in association analyses • search for rare variants −→ large pedigrees ‘ideal’ – efficient – protect against deleterious effects of genetic heterogeneity – design for high throughput sequencing (Wijsman 2012) Will view finding relatives as a problem of pedigree reconstruction 2 Likelihood-based Pedigree Reconstruction In theory: simple to estimate the pedigree connecting a given set of individuals from genetic marker data — consider all possible pedigrees and compute the likelihoods (Thompson 1975) In practice: enormous number of possibilities −→ brute force enumeration only feasible for small numbers. (Sheehan & Egeland 2007) ‘Sequential’ (i.e. greedy) algorithms exist which efficiently produce a single high likelihood reconstruction but not necessarily a maximum likelihood (ML) pedigree. (Thompson 1976, Almudevar 2003) We propose: – search for maximum likelihood pedigrees using integer linear programming (ILP) and state-of-the-art ILP optimisation solvers. 3 Likelihood-based Pedigree Reconstruction As for other current approaches, we will assume: • a complete sample i.e. all observed individuals have complete marker data and unobserved parents are unrelated to other individuals in the sample — pedigree founders • genotype marker data at unlinked loci • Hardy-Weinberg proportions for founder genotypes • Mendelian segregation of genes from parents to offspring. 4 Pedigree (Log) Likelihood Under the above assumptions: – the pedigree likelihood decomposes into a product of conditional probabilities – each a function of an individual node and it’s parent nodes (e.g. Mendelian 0, 41 , 21 or 1 for an individual with two observed parents). Likelihood: L(G) = Y τ (v, Pa(v, G)). v∈V Log-likelihood: l(G) = log L(G) = X log τ (v, Pa(v, G)). v∈V 5 Integer Linear Programming (ILP An integer linear program (ILP) is defined by: 1. a set of variables X, representing unknown quantities, some of which are restricted to be integer values; P 2. an objective function of the form x∈X cxx where the coefficients cx are fixed constants, and 3. linear equations and inequalities putting joint constraints on the values the variables can take. ILP optimisation problem: find an assignment of values to X which maximises the objective function while respecting all constraints. (NP-hard when some variables are constrained to be integers.) 6 Encoding Pedigree Learning as an ILP Problem This can be done for the above assumptions. Create binary indicator variables (X) to represent all possible parentages of an individual: 1 if v has parents W in G I(W → v)(G) = 0 otherwise, where W ⊆ V \ {v} and |W | ≤ 2. Constraints required so that: P – each individual has exactly one parent set: ∀v : W I(W → v) = 1 – nobody can be his own ancestor or descendant – sex can be consistently assigned i.e. parents are of opposite sex. 7 Encoding Pedigree Learning as an ILP Problem Pedigree log-likelihood log L(G) = X log τ (v, Pa(v, G)). v∈V can now be expressed in desired form, Optimisation Problem: X Maximise log τ (v, W )I(W → v)(G) subject to all constraints. v,W NOTE 1: Returns a guaranteed maximum likelihood pedigree. NOTE 2: Can repeatedly solve the above adding an additional constraint each time to prevent a previously returned solution from appearing again −→ find k most likely pedigrees. 8 Maximum Likelihood Versus True Pedigree There is no guarantee that the most probable i.e. maximum likelihood pedigree is the true pedigree. Marker data on unrelated individuals may indicate that they are actually related and the observed data could assign a higher probability to an incorrect relationship than the true one (e.g. siblings could look like parent-child). Whether the maximum likelihood pedigree ressembles the true one, or not, has no bearing on the efficiency of a method for finding it. We would perhaps expect the true pedigree to be among the top most likely pedigrees. 9 Simulations For different pedigree structures • Simulated complete data on the pedigree from a allele frequencies for a typical forensic set of autosomal short tandem repeat markers – Caucasian allele frequencies for 13 CODIS loci plus two others (Butler 2003) – Number of alleles at each marker ranges from 7 to 15. • Assigned genetic profile to founders at each marker independently assuming random union of gametes and dropped genes from parents to offspring assuming Mendelian transmission. 10 A Fairly Large Pedigree 1614 Polar Eskimos 11 How Good is a Reconstruction? Evaluation depends on what you want it for! Demographic/ecological research: focus on general features? e.g. correct distribution of sibship sizes, numbers of marriages and overall levels of relatedness. Forensic research: want particular relationships to be correct. BN learning error rate: number of incorrect parent-offspring links – not necessarily appropriate measure of success for maximum likelihood approach where goal is to find high likelihood pedigrees quickly – and may not be best criterion for any given situation. 12 Evaluating an Eskimo Reconstruction Unreasonable to expect all 2778 parent-child graph edges to be correct! Local Features: • family (sibship) sizes • number of marriages for a given individual • particular relationships between certain individuals Global Features: • • • • number number number number of of of of pedigree components (if graph not connected) founders (pedigree width) generations (pedigree depth) pedigree descendants for certain individuals 13 Evaluating an Eskimo Reconstruction • Solving times: Average of 10 minutes and ranging from 3 to 42 minutes over 100 simulated datasets • Seems ‘good’ at getting local features right e.g. specific relationships, number of children of a single individual. • Not so good at global features — will try to create as many relatives as possible to optimise likelihood −→ too few founders, too many generations, too many descendants. • Edges?: proportion of all real edges (directed) found and proportion of reconstructed edges that are true — both ∼ 95% – not sure what this is telling us though! 14 Complications Simple decomposition of likelihood breaks down for – incomplete data – linked markers — siblings no longer independent given parental genotypes. Need to add extra variables for missing data and unknown phase – will return most probable combination of pedigree structure and haplotypes in latter case. (ILP has been used for haplotype inference.) More intensive likelihood computations −→ approximations. Prior information can — and should — be incorporated – as additional constraints (‘hard’ prior information) – as additional log prior terms in the objective function. 15 Conclusions • Can now investigate the properties of maximum likelihood pedigrees • Performance of approximate approaches can now be evaluated • Only ML approach that can get the first n most likely pedigrees and so can address model uncertainty. • ML appears to be performing well with only marker data but will need prior information to guide search for accurate reconstruction on big problems • 15 microsatellites sufficent for complete data: will need more markers for missing / latent data • ILP approach seems to be more amenable to generalising to more complex problems. GOBNILP software feely available at: http://www.cs.york.ac.uk/aig/sw/gobnilp 16
© Copyright 2024 ExpyDoc