Haplotyping and Estimation of Haplotype Frequencies for Closely Linked Biallelic Multilocus Genetic Phenotypes Including Nuclear Family Information. Human Mutation 17: 289-295 (2001)
K. Rohde and R. Fuerst
Introduction: Large samples of biallelic multilocus genetic phenotypes are available to estimate the population haplotype frequencies, to find the most likely haplotype pair for each individual in the sample, or both.
Application: case-control or transmission disequilibrium test (TDT) studies that look for associations between a given genetic trait and some haplotype (Zhao et al., 2000; Fallin and Schork, 2001)
Introduction (continued): The EM algorithm gives maximum-likelihood estimates (MLE) of the population haplotype frequencies under the assumptions of Hardy-Weinberg equilibrium (HWE) and random mating (Weir, 1990; Xie and Ott, 1993; etc. and Fallin and Schork, 2000)
If x is the complete data and y is the incomplete (observed) data:
1. Start with an initial value θ0 for the unknown parameter θ
2. E-Step: compute Q(θ) = Q(θ | θ0) = E[ log L(θ; x) | y, θ0 ]
3. M-Step: maximize Q(θ) to get the updated value θ1
4. Go back to the E-Step with θ1 and repeat the E-Step and M-Step until convergence
Main purpose: To examine the association of (complex) genetic traits with particular haplotypes. The analysis only counts occurrences of haplotypes or, in nuclear families, haplotypes transmitted or nontransmitted. So if the haplotype pairs in the sample are described correctly, deviations from HWE and random mating, or even imprecise haplotype frequency estimates, are tolerable.
Slide5: Haplotype frequencies and haplotype pairs are estimated via EM algorithm, including nuclear family information.
Parents: treated as an independent sample from the population
Children’s Genotypes: used to reduce the number of potential haplotype pairs for both parents
Method: Likelihood of the data given the phenotypes
mi - number of individuals in the sample with phenotype i
m - number of individuals in the sample
gi - phenotype of individual i (k distinct phenotypes)
P(gi) - population frequency of gi
Slide7: Under the assumptions of random mating and HWE, we obtain the equation:
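The equations on these slides did not survive in this text. As a hedged reminder only, the standard form of this likelihood for unphased multilocus data under HWE and random mating is sketched below; it is not necessarily the authors' exact notation. It assumes that g_1, ..., g_k index the k distinct phenotypes, and the symbols H(g_i) (the set of unordered haplotype pairs compatible with g_i), p_a (the population frequency of haplotype a) and c_ab are introduced here for illustration.

```latex
% Multinomial likelihood of the observed phenotype counts m_1, ..., m_k
L = \frac{m!}{m_1! \cdots m_k!} \prod_{i=1}^{k} P(g_i)^{m_i}

% Phenotype frequency under HWE and random mating:
% sum over unordered haplotype pairs compatible with g_i
P(g_i) = \sum_{(a,b) \in H(g_i)} c_{ab} \, p_a \, p_b ,
\qquad
c_{ab} = \begin{cases} 1 & a = b \\ 2 & a \neq b \end{cases}
```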
Solving the equations:
Expectation-maximization recursion:
Slide10: Starts:
Assign an equal prior probability to every haplotype (equal starting frequencies)
In the first iteration step only, take just the most likely haplotype pair of each sum into account, and obtain the next estimate by counting over all most likely haplotype pairs in the sample
Stops:
When the improvement in the likelihood or the change in the haplotype frequencies is smaller than a pre-set limit
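The start and stop rules above can be illustrated with a minimal sketch of the textbook EM recursion for unphased biallelic genotypes. It is not the authors' program. It assumes genotypes coded as tuples holding 0, 1 or 2 copies of allele 1 per locus. The function names, the `tol` and `max_iter` parameters, and the frequency-change stopping rule are choices made for this illustration, and the special first iteration that uses only the most likely haplotype pair is not implemented.

```python
from collections import Counter
from itertools import product


def compatible_pairs(genotype):
    """Return all unordered haplotype pairs consistent with an unphased genotype.

    genotype: tuple with 0, 1 or 2 copies of allele 1 at each locus.
    """
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]  # homozygous loci fixed; het loci filled below
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for pos, b in zip(het, bits):
            h1[pos], h2[pos] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, tol=1e-8, max_iter=1000):
    """Estimate haplotype frequencies under HWE with a plain EM recursion."""
    counts = Counter(genotypes)
    pairs = {g: compatible_pairs(g) for g in counts}
    haps = sorted({h for ps in pairs.values() for pair in ps for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # start: equal frequencies
    n_chrom = 2 * len(genotypes)
    for _ in range(max_iter):
        expected = dict.fromkeys(haps, 0.0)
        for g, m_g in counts.items():
            # E-step: posterior weight of each compatible pair under HWE
            w = [(1 if a == b else 2) * freq[a] * freq[b] for a, b in pairs[g]]
            total = sum(w)
            for (a, b), w_ab in zip(pairs[g], w):
                post = w_ab / total
                expected[a] += m_g * post
                expected[b] += m_g * post
        # M-step: new frequencies are expected haplotype counts per chromosome
        new = {h: expected[h] / n_chrom for h in haps}
        change = max(abs(new[h] - freq[h]) for h in haps)
        freq = new
        if change < tol:  # stop: change below the pre-set limit
            break
    return freq


sample = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
estimates = em_haplotype_frequencies(sample)
```

In this toy sample the double heterozygotes are resolved almost entirely as the pair (0,0)/(1,1), because those two haplotypes are also supported by the homozygous individuals.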
Includes nuclear family information: For the E-M step, use only the genetic phenotypes of the parents
For each family, construct a list of all couples of compatible haplotype pairs from the lists of compatible haplotype pairs of each parent (to be summed over all families in the sample)
The genetic phenotypes of the children are used to remove from the family's list all couples of haplotype pairs that contradict those phenotypes.
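The list construction and pruning described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. It assumes no recombination between the closely linked loci and no mutation or genotyping error, so each child carries one intact haplotype from each parent. The genotype coding (0, 1 or 2 copies of allele 1 per locus) and the function names are hypothetical.

```python
from itertools import product


def compatible_pairs(genotype):
    """Return all unordered haplotype pairs consistent with an unphased genotype."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for pos, b in zip(het, bits):
            h1[pos], h2[pos] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def genotype_of(h1, h2):
    """Unphased genotype produced by two haplotypes."""
    return tuple(a + b for a, b in zip(h1, h2))


def family_couples(father, mother, children):
    """Couples of parental haplotype pairs that no child contradicts."""
    couples = []
    for f_pair, m_pair in product(compatible_pairs(father), compatible_pairs(mother)):
        # Genotypes a child could have when it inherits one intact
        # haplotype from each parent (no-recombination assumption).
        possible = {genotype_of(hf, hm) for hf in f_pair for hm in m_pair}
        if all(child in possible for child in children):
            couples.append((f_pair, m_pair))
    return couples


# Father is a double heterozygote, mother is homozygous 00/00.
without_child = family_couples((1, 1), (0, 0), [])
with_child = family_couples((1, 1), (0, 0), [(1, 1)])
```

Here the child's genotype (1, 1) is only possible if the father carries the pair 00/11, so the alternative paternal phase 01/10 is removed from the family's list.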
Updated E-M Step:
Results of simulation: Simulate samples of 25 families with one or two children. Three sets of haplotypes are chosen, with low to high numbers of heterozygous positions.
Table 1. For 10 different equally likely haplotypes with 10 loci
Table 2. For 4 different equally likely haplotypes with 4 loci
P(het) and P_het: $P_{het} = 1 - \sum_{i=1}^{n} p_i^2$ is the frequency of heterozygotes in the sample, where $p_i$ is the frequency of haplotype i
P(het) is the frequency of heterozygous positions in the sample: the sum over all heterozygotes (i, j) of their frequency $p_i p_j$ multiplied by the fraction of heterozygous positions in genotype (i, j)
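A small sketch of the two measures, assuming haplotypes are coded as tuples of 0/1 alleles and frequencies are given as a dictionary. The function names are hypothetical. The sum for P(het) is taken over ordered pairs i ≠ j, an assumption made here so that the pair weights add up to P_het.

```python
def p_het(freqs):
    """Frequency of heterozygotes: 1 minus the sum of squared haplotype frequencies."""
    return 1.0 - sum(p * p for p in freqs.values())


def p_het_positions(freqs):
    """Frequency of heterozygous positions.

    Sums p_i * p_j over ordered pairs of distinct haplotypes, each weighted
    by the fraction of loci at which the two haplotypes differ.
    """
    total = 0.0
    for hi, pi in freqs.items():
        for hj, pj in freqs.items():
            if hi != hj:
                frac = sum(a != b for a, b in zip(hi, hj)) / len(hi)
                total += pi * pj * frac
    return total


far_apart = {(0, 0): 0.5, (1, 1): 0.5}   # haplotypes differ at both loci
close = {(0, 0): 0.5, (0, 1): 0.5}       # haplotypes differ at one locus
```

Both toy populations have the same P_het, but the second has half the P(het) because its heterozygotes are heterozygous at only one of the two positions.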
Discussion of Simulation: EM(only parents), EM(nuclear families) and GENEHUNTER all perform better when the number of heterozygous positions in the sample is lower;
Both EM approaches perform better than GENEHUNTER;
Both family-based approaches, EM(nuclear families) and GENEHUNTER, perform better with two offspring than with one.
Slide18: For low to moderate P(het), both EM approaches are comparable and GENEHUNTER follows closely;
For a high P(het), EM(nuclear families) performs better than EM(only parents) and GENEHUNTER;
Both EM approaches perform better with the set of 10-locus haplotypes than with the set of 4-locus haplotypes.
Slide19: Comparing EM(only parents) on 50 families with EM(nuclear families) on 25 families with two children each shows that doubling the sample size in EM(only parents) gives no definite improvement in haplotyping.
Conclusion: EM(nuclear families) is advisable, especially if a subsequent TDT of linkage and association to a genetic trait is planned.
The computational limit depends mainly on the number of heterozygous positions in the genotypes, and much less on the number of loci per haplotype (up to 30).