Efficient Haplotype Inference on Pedigrees and Applications: Efficient Haplotype Inference on Pedigrees and Applications Tao Jiang
Dept of Computer Science
University of California – Riverside
(joint work with Jing Li, CWRU)
Outline: Outline Background
The MRHC problem and complexity
An exact algorithm for 0-recombinant data
A heuristic algorithm (block-extension)
Integer linear programming formulation and solution for MRHC with missing alleles
Experimental results and application in disease gene association mapping
Inference of haplotypes on population data
Terms: Terms Diploid
Polymorphisms, marker, allele, and SNP
Genotype, homozygous & heterozygous
Haplotype, paternal & maternal haplotypes A G C G 1 1 6 3 1 2 3 2 Haplotype Paternal Maternal Maker locus Genotype Multiallelic Biallelic
Mendelian Law of Inheritance and Recombination: Mendelian Law of Inheritance and Recombination B A Father C D Mother A C A D B C D B B
D A
C Father A
C B
D A
D B
C Child:
Slide5: Pedigree Pedigree, nuclear family, founder
Slide6: Pedigree Pedigree, nuclear family, founder
Father Mother Children ID no. Genotypes Founders Nuclear
family Family
trio Loop
Slide7: Haplotyping from Genotypes: The Problem & Methods Problem:
Input: genotype data (possibly with missing alleles).
Output: haplotypes.
Input data:
Data with pedigree (dependent individuals).
Data without pedigree info (independent individuals).
Statistical methods
Find the most likely haplotypes based on genotype data.
Adv: solid theoretical bases
Disadv: computation intensive
Rule-based (i.e. combinatorial) methods
Define rules/objective functions based on some plausible assumptions and find haplotypes consistent with the rules or optimizing the obj. fun.
Adv: usually simple thus very fast
Disadv: no numerical assessment of the reliability of the results
Motivations: Motivations Haplotype is more biologically meaningful than genotype since haplotypes are directly inherited from parents. Haplotype data is more informative in the studies of association between diseases and genes, and human history.
The human genome project gives us the consensus genotype sequence of humans, but in order to understand the genetic effects on many complex diseases such as cancers, diabetes, osteoporoses, genetic variations are more important, which is best refecledt in haplotypes.
Current experimental techniques collect genotype data. Computational methods deriving haplotypes from genotypes are highly demanded.
The ongoing international HapMap project.
Motivations (cont’d): Motivations (cont’d) It is generally believed that with parents/pedigree information, we could get more accurate haplotype and frequency estimations than from data without such information (i.e. population data).
Family-based association studies have been widely used. We would expect more family-based gene mapping methods that assume accurate haplotype information.
Not only computation intensive, model-based statistical methods may use assumptions that may not hold in real datasets.
MRHC Problem: MRHC Problem Find a minimum recombinant haplotype configuration from a given pedigree with genotype data.
Assumptions:
Mendelian law (no mutations)
Recombination events are rare Input (phase unknown)
The MRHC Problem: The MRHC Problem PS: parental source of the two alleles at the locus
(i.e. phase)
Haplotype configuration = assignment of PS values at each locus of every individual. Output (phase known) PS = 1 (because the allele with the smaller index is maternal)
Previous Results: Previous Results Genotype elimination (O’Connell & Weeks’99).
Can only find haplotype configurations requiring no recombinant in the pedigree, exhaustive elimination.
Genetic algorithm (Tapadar et al.’00).
Still time consuming, needs many iterations before convergence.
MRH (Qian & Beckmann’02).
Six step rule-based algorithm.
Locus by locus at every step, extremely slow for biallelic (e.g. SNP) markers.
Thm. MRHC is NP-hard.: Thm. MRHC is NP-hard. Idea: Reduction from a variant of set cover.
First complexity result concerning the problem.
Remains hard when there are only two loci.
Remains hard when no loops in a pedigree.
An Exact Algorithm for 0-Recombinant Data Based on Resolution of Constraints: An Exact Algorithm for 0-Recombinant Data Based on Resolution of Constraints Assumptions:
Zero recombinants.
No missing alleles, no errors.
Idea: finding all feasible (i.e. 0-recombinant) haplotype configurations is equivalent to reducing the degree of freedom in PS assignment.
Steps:
formulate all the constraints, as linear equations over GF(2)
solve the equations by Gaussian elimination
enumerate all feasible haplotype configurations
Four Levels of Constraints: Four Levels of Constraints Based on Mendelian law
(for single locus) :
Level 1: GS (grantparental source) constraint
Level 2: PS constraint
Based on 0-recombinant assumption
(for a pair of loci):
Level 3: Haplotype constraint
Level 4: Grouping constraint
Level 3 and Level 4 Constraints: 1 2
1 2 Level 3 and Level 4 Constraints 1 2
1 2 1 2
1 2 1 2
1 2 1 1
1 1 1 2
1 2 1 2 3 4 5 6 1 2
1 2 1 2
1 2 1 2
1 2 4 5 6 1 2
1 2 1 2
1 2 2 1
2 1 4 5 6 1 2
2 1 1 2
2 1 1 2
2 1 4 5 6
Slide17: Level 3 and level 4 Constraints Note: The variables represent PS values and the
equations are over GF(2) (in fact, addition mod 2).
Constraint-Based Algorithm: Constraint-Based Algorithm Thm. Every solution consistent with the constraint equations is a feasible solution and vice versa.
We can adapt the classical Gaussian elimination algorithm
to find all consistent solutions in O(n3m3) time.
Previously, only an exponential time algorithm is known
due to O’Connell and Weeks (1999).
The algorithm is useful for solving 0-recombinant data and
may serve as a subroutine in a general haplotyping algorithm.
Block-Extension Algorithm: Block-Extension Algorithm Iterative, heuristic, five steps. Rules are derived
from Mendelian law, MR principle, block concept
and some greedy ideas based on the following
observations:
Block structures are common in haplotypes.
Double recombination events are rare.
Common haplotype blocks shared in siblings.
…
Steps in the BE algorithm: Steps in the BE algorithm Missing allele imputation by the Mendelian Law of inheritance and allele frequency
PS assignment by Mendelian Law
Locus by locus, member by member, in a top-down scan
Greedy assignment of PS
Bottom-up, infer PS value from PS of adjacent loci.
Block-Extension
Iteratively extend the longest block to the same region of other members.
Finishing the gaps between blocks by enumeration.
Analysis of the BE Algorithm: Analysis of the BE Algorithm Advantage:
Simple and efficient.
Accurate when the number of recombination events is small.
Disadvantage:
Potential errors in steps 3 and 4. Accuracy could decrease with the increase of the number of recombination events.
More Exact Algorithms: More Exact Algorithms Locus-based dynamic programming algorithm
Linear time in the number of the members
Applicable to only tree pedigrees
Member-based dynamic programming algorithm
Linear time in the number of the loci
Applicable to general pedigrees with small size
Integer linear programming (ILP) with branch-and-bound
Combines missing data imputation and haplotype inference together.
It also implicitly checks Mendelian consistency for pedigree genotype data with missing alleles, which is also an NPC problem.
Effective for practical size problems, regardless of the pedigree structure
ILP for MRHC with Missing Data: ILP for MRHC with Missing Data Alleles are represented as binary variables.
Genotype info and the Mendelian law of inheritance are enforced by linear constraints.
The objective of minimizing the total number of recombinants is encoded as a linear function of the variables.
Effective preprocessing of constraints by taking advantage of special properties in our ILP formulation to reduce the number of variables.
Branch-and-bound strategy to find solutions. The branch step guided by a partial order relationship (and some other special relationships) identified during the preprocessing step.
Non-trivial bounds are estimated to prune the search tree.
A maximum likelihood approach is used to select the best haplotype configuration from multiple optimal solutions.
Formulation: variables: Formulation: variables Possible alleles (totally tj) at marker locus j: Define 2tj (f and m) vars and 2 g vars for each paternal allele and maternal allele at locus j for individual i Var fk (or mk)=1 iff the allele is mk. Var g1 = 0 (or 1) iff paternal allele is copied from father’s paternal (or maternal) allele. Var g2 defined similarly. Define r vars:
Formulation: Formulation Subject to
Genotype constraints: (0 means missing allele) Objective function:
Formulation: Formulation Mendelian law of inheritance constraints (for a child i and its father f ): Constraints for r vars:
A Partial Order Relationship: A Partial Order Relationship Denote: Inequalities with 2 variables:
Forced Variables: Forced Variables Rule 1:
Rule 2:
Rule 3:
Lower and Upper Bounds: Lower and Upper Bounds Lower bounds
Linear relaxation.
Sum of minimum number of recombinants in each nuclear family.
Effective for data with a large number of recombinants.
Upper bound
Obtained by the Block-Extension algorithm.
Effective for data with a small number of recombinants.
ILP: ILP Practical in terms of time efficiency
Could find all possible optimal solutions
Very effective in terms of missing allele imputation.
Simulation Studies: Simulation Studies The algorithms have been implemented in a program called PedPhase in C++.
Simulated data were generated to compare our algorithms, and with MRH (Qian&Beckmann’02)
Three different pedigree structures.
Multiallelic and biallelic data.
Numbers of loci: 10, 25 and 50.
Number of recombinants: 0-4.
100 runs per data set.
Pedigree Structures: Pedigree Structures
Accuracy Results of Algrotihm Block-Extension: Accuracy Results of Algrotihm Block-Extension
Efficiency Results: Efficiency Results
More Results on ILP: More Results on ILP
Real Data Analysis: Real Data Analysis Data set (Gabriel et al.’02)
93 members, 12 pedigrees (each with 7-8 members);
chromosome 3, 4 regions, each region 1-4 blocks.
Reconstruction of Common Haplotypes and Estimation of Their Frequencies: Reconstruction of Common Haplotypes and Estimation of Their Frequencies
Results from ILP on the Whole Dataset: Results from ILP on the Whole Dataset
Application of Haplotype Inference in Gene Association Mapping: Application of Haplotype Inference in Gene Association Mapping We have developed a new haplotype association mapping method based on density-based clustering for case-control data.
The method regards haplotype segments as data points in a high dimensional space, and defines a new pairwise haplotype distance measure.
Clusters are then identified by a density-based clustering algorithm.
Z-scores based on the number of cases and controls in a cluster can be used as an indicator of the degree of association between a cluster and the disease under study.
Results are very promising.
But it needs haplotypes as input.
An Application of Haplotype Inference: An Application of Haplotype Inference Haplotypes are inferred by computational methods that we mentioned earlier.
For example: a real data set that we analyzed consists of 385 nuclear families of size 4 (2 parents with 2 affected children).
We do haplotype inference first using our ILP algorithm. The haplotypes transmitted to (affected) children are treated as cases and un-transmitted haplotypes as controls. The haplotype association method was applied then.
Inference of Haplotypes from Population Data: The Perfect Phylogeny Model : 00000 1 2 4 3 5 10100 10000 01011 00010 01010 12345 Loci Ancestral haplotype Extant haplotypes at the leaves Locus mutations on edges
Inference of Haplotypes from Population Data: The Perfect Phylogeny Model Each locus suffers from
at most one mutation.
No recombination!
Slide42: Perfect Phylogeny Haplotype (PPH) Given a set/poplation of genotypes S, find an explanatory set of haplotypes that fits a perfect phylogeny. Loci The genotype coding:
(11): 0 homozygous
(22): 1 homozygous
(12): 2 heterozygous
A haplotype pair explains a genotype if the merge of the haplotypes creates the genotype. E.g., merging haplotypes
001 and 100 results in genotype 202. Genotype matrix S
Slide43: 1 c c a a b b 2 10 10 10 01 01 00 00 Perfect Phylogeny Haplotype (PPH) Given a set of genotypes S, find an explanatory set of haplotypes that fits a perfect phylogeny.
Slide44: Perfect Phylogeny Haplotype (PPH) Given a set of genotypes S, find an explanatory set of haplotypes that fits a perfect phylogeny.
Slide45: No perfect phylogeny exists for this
explanation An Alternative Haplotype Explanation
Efficient Solutions to the PPH Problem with n Individuals and m Loci: Efficient Solutions to the PPH Problem with n Individuals and m Loci Reduction to a graph realization problem (GPPH), based on Bixby-Wagner or Fushishige solution to graph realization (Gusfield’01).
Reduction to graph realization, based on Tutte’s graph realization method, in O(nm^2) time (Gusfield’02).
Direct combinatorial approach in O(nm^2) time.
Bafna et al.’03
Eskin, Halperin and Karp’03: Specialize the Tutte solution to the PPH problem, in O(nm^2) time.
Summary: Summary Li, J. and T. Jiang. Efficient Rule-Based Haplotyping Algorithm for Pedigree Data. RECOMB’03
NP-completeness proof for general pedigrees.
An efficient heuristic algorithm: block-extension.
An efficient exact algorithm for 0-recombinant data.
Doi, K., J. Li and T. Jiang. Minimum Recombinant Haplotype Configuration on Tree Pedigrees. WABI’03
NP-completeness proof for loopless (or tree) pedigrees.
Two dynamic programming algorithms
Li, J. and T. Jiang. An Exact Solution for Finding Minimum Recombinant Haplotype Configurations on Pedigrees with Missing Data by Integer Linear Programming. RECOMB’04.
Li, J. and T. Jiang. Haplotype Association Mapping by Density-Based Clustering in Case-Control Studies. RECOMB Satellite Workshop on Computational Methods for SNPs and Haplotypes, CMU, 2004.
Future Work: Future Work Incorporating mutations and errors into MRHC.
Incorporating the likelihood of recombination into the objective function of ILP.
Haplotype inference and missing allele imputation without pedigree information.
Approximation algorithms for MRHC, especially MRHC on tree pedigrees.
Efficient fixed-parameter (# of recombinants) algorithms.
Slide49: Acknowledgements Dr. Dajun Qian from City of Hope.
Whitehead/MIT Center for Genome Research
Drs. David Altshuler, Mark Daly, Stacey Gabriel,
and Stephen Schaffner, and their entire group.
Financial support from NSF and CMOST 973 project.