logging in or signing up Test aSGuest1509 Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT lite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 24 Category: Science & Tech.. License: All Rights Reserved Like it (0) Dislike it (0) Added: October 20, 2008 This Presentation is Public Favorites: 0 Presentation Description tEST Comments Posting comment... Premium member Presentation Transcript Data Mining and Bioinformatics: Some Challenges : http://www.cse.ust.hk/~qyang/ 1 Data Mining and Bioinformatics: Some Challenges Qiang Yang, Computer Science and Engineering HKUST Thanks: HKUST RPC Project Ben Niu, Can Yang Prof. Hannah Xue Prof. W. Yu State of Art: DM for Bio : http://www.cse.ust.hk/~qyang/ 2 State of Art: DM for Bio We know how to classify biological sequences SVM, Neural Nets, Decision Trees, Rules We know how to cluster biological entities Bi-clustering, K-means, hierarchical We know how to select features PCA, LDA, SVM-RFE Data Mining: Challenges in Bio : http://www.cse.ust.hk/~qyang/ 3 Data Mining: Challenges in Bio Non-traditional Feature Selection When the number of attributes >> number of samples? Highly imbalanced Explainable and Accurate Data Mining Methods NN, SVM Rules? Transfer Learning Can knowledge learned from one set of samples help data mining on another sample? Exploiting the network structure Individual i.i.d type of classification vs social networks? Challenge 1: Non-traditional Feature Selection: Question: which (few) genes lead to diseases? : http://www.cse.ust.hk/~qyang/ 4 Challenge 1: Non-traditional Feature Selection: Question: which (few) genes lead to diseases? # of attributes >> # samples # of positive << # negative KDD CUP 2002 task 2 Yeast Gene Regulation Prediction Traditional feature selection methods fail: overfitting, singularity of covariance matrix Non-traditional Feature Selection (2) : http://www.cse.ust.hk/~qyang/ 5 Non-traditional Feature Selection (2) Some potential solutions ‘Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems’, Journal of Machine Learning Research. Vol. 6, 2005. Singularity problem is solved by splitting the subspace into the regular and the irregular parts. Irregular part (null space) of the within-class scatter matrix is fully utilized to extract the discriminant info. ‘Two-Dimensional PCA: A New Approach to Appearance-Based Face Representation and Recognition’, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 26, 2004. High dimensional data in 2D arrays are projected directly onto the subspaces. Size of covariance matrix can be reduced significantly. Singularity is avoided. Non-traditional Feature Selection (3) : http://www.cse.ust.hk/~qyang/ 6 Non-traditional Feature Selection (3) Other approaches: Manifold learning Manifold learning methods, e.g., Isomap, LLE, maintain the local patterns of distribution during transform, Extract features suitable for k-NN classifiers Can be used to reduce the dimensionality of Bio. Data. Semi-supervised learning What if we have 10% labeled data, but the rest 90% are unlabelled? Build clusters around the labeled samples. Samples in the same cluster are labeled as from the same class, assuming they follow the normal distributions. Challenge 2: Explainable and Accurate Data Mining Methods : http://www.cse.ust.hk/~qyang/ 7 Challenge 2: Explainable and Accurate Data Mining Methods Current methods, such as SVMs, discriminant analysis, neural networks, are ‘black box’ models. The learned knowledge is hard to understand by biologists. Some potential solutions Logic based method, e.g., decision trees and variants may be better in giving the ‘IF-THEN’ like rules that explicitly define the epigenetic logics in cancer and stem cell development. DNA methylation rules can be learned by using SVM based recursive feature elimination and fuzzy logics. [Gene selection for cancer classification using support vector machines’, Machine Learning, 2002.] Epigenetic Analysis: A Case Study : http://www.cse.ust.hk/~qyang/ 8 Epigenetic Analysis: A Case Study Epigenetic events dominate the growth of cancer and embryonic stem cells These two type of cells are of great importance Genes can be turned on/ off through Cytosine methylation or Histone modifications The logics of DNA methylation underlie the cells’ behaviors Wish to Know: Methylation status of CpG sites CpG islands/ promoter regions in DNA sequence Cancer prediction Traditional methods, SVMs, ANNs are ‘black box’ models Knowledge are trained connection weights, or Support Vectors. Hard to understand for biologists Adaptive Cascade Sharing Trees (ACS4) Niu et al. 2007 (tmr’s talk) : http://www.cse.ust.hk/~qyang/ 9 Adaptive Cascade Sharing Trees (ACS4) Niu et al. 2007 (tmr’s talk) Objective: learn human understandable rules that define the epigenetic process in cancer and embryonic stem cells Idea: Adaptively partition the numeric attributes into a set of the linguistic domains, e.g., ‘high’, ‘very high’, ‘Medium’, ‘Low’, ‘Very Low’ Method: clustering Train a committee of trees to select the most salient features and predict by voting Method: tree learning ACS4 method (2) : http://www.cse.ust.hk/~qyang/ 10 ACS4 method (2) ACS4 method (3) : http://www.cse.ust.hk/~qyang/ 11 ACS4 method (3) Dataset: 37 hESC, 33 non-hESC, 24 cancer cell lines, 9 normal cell lines. 1,536 attributes Result Just 2 attributes are enough to separate the 3 cell types No need of 40 attributes by using fisher’s score in [1]. Wet lab cost can be reduced by testing on 2 attributes only, instead of 40. Accuracy is better, except when compared with SVM, but SVM cannot tell us ‘why’. Rules can be easily understood to biologist to conceive new biological experiments seeking in wet lab proof. 40 attributes: [1] ‘Human embryonic stem cells have a unique epigenetic signature‘, Genome Research, Vol. 16, 2006 Challenge 3: Transfer Learning : http://www.cse.ust.hk/~qyang/ 12 Challenge 3: Transfer Learning In real life, data are hard to obtain Biological experiments are expensive However, biological data are related Can we leverage the knowledge learned in one task/domain/data set for prediction of another? Humans often do this: having learned one language, find it easier to learn another In Web mining, having learned to classify one web site, use the abstract knowledge to help classify another web site Challenge: can we leverage the knowledge learned from one data set to classify/cluster/predict another? Transfer Learning (Examples) : http://www.cse.ust.hk/~qyang/ 13 Transfer Learning (Examples) Problem: how to Propagate the classification knowledge? Difficulty: old and new data may have different distributions Transfer Learning to Classify Web : http://www.cse.ust.hk/~qyang/ 14 Transfer Learning to Classify Web [Dai, et al, 2007] 20 newsgroups (20,000 documents, 20 data sets) SRAA (A. McCallum, 70,000 articles) comp.graphics (comp) comp.os.mis-windows.misc (comp) sci.crypt (sci) sci.electronics (sci) comp.sys.ibm (?) comp.windows.x (?) sci.med (?) sci.space (?) Old New Document-word co-occurrence : http://www.cse.ust.hk/~qyang/ 15 Document-word co-occurrence [Dai, et al. 2007] Transfer Learning: Related Works : http://www.cse.ust.hk/~qyang/ 16 Transfer Learning: Related Works Semi-supervised Learning [Zhu, Survey, Blum and Mitchell “co-training”, Nigam et al, “EM-based”, Zeng et al “clustering”, Joachims, “transductive”] Distributions of training and test data are usually assumed to be the same Multi-task Learning [Caruana, MLJ] multiple Dis exist Domain specific knowledge jointly learned to benefit each other. Focused on how multiple tasks helping each other Semi-supervised Clustering Same distribution assumption, but can be relaxed when must-links are few Transfer to Classify Web : http://www.cse.ust.hk/~qyang/ 17 Transfer to Classify Web Co-clustering is applied between words and out-of-domain documents (new tasks) Word clustering is constrained by the labels of in-domain (Old) documents The word clustering part in both domains serve as a bridge A Biological Transfer Problem : http://www.cse.ust.hk/~qyang/ 18 A Biological Transfer Problem ‘Promoter prediction analysis on the whole human genome’, Nature Biotechnology, Vol. 22, 2004. Most of the promoter prediction programs are effective on individual chromosomes, e.g., Chr21, Chr22, But inadequate to generalize to the whole genome scale only 65% of accuracy rate on average too low Can we build a unifying model for transferring the learned knowledge to other chromosomes to predict across the whole genome? to cluster other genes and protein arrays? to classify related sequences? Challenge 4: Exploiting the network structure : http://www.cse.ust.hk/~qyang/ 19 Challenge 4: Exploiting the network structure We are short of labeled data The matrix structures are very sparse if we only have several hundred samples and a huge number of attributes Classification accuracy cannot be improved much Gene expression data: tens or low hundreds of samples, but tens of thousands of attributes (?) Accuracy ~ less than 80% Challenge: can we leverage the network structure? Social Network Mining : http://www.cse.ust.hk/~qyang/ 20 Social Network Mining Very large scale computational analysis of gene and social networks. Social networks: a social structure made of nodes (individuals or organizations) tied by one or more specific types of relations. Relations: financial exchanges, friends, web links, disease transmission (epidemiology), or gene interactions. Small world phenomenon: chain of social acquaintances required to connect on arbitrary person to another arbitrary person anywhere in the world is generally short, five to seven separation steps are sufficient. Centrality Eigenvector: measure the importance of node in a network. Collective Classification Collective Recommendation Social Net Mining: Engineering meets Science : http://www.cse.ust.hk/~qyang/ 21 ‘Empirical Analysis of an Evolving Social Network’, Science, Vol. 311, 2006. A dynamic social network comprising 43,553 students, faculty, and staff at a large University. Interactions between individuals are inferred from time-stamped e-mail headers recorded over one academic year and are matched with affiliations and attributes. Findings: when two students are in the same class, they are on average 3 times more likely to interact if they also share an acquaintance Netflix Challenge and KDDCUP 2007 Blog Evolution (NEC Work) Can we deduce the actors’ roles/functions/attributes from the topology of the network? Social Net Mining: Engineering meets Science Using Network Structure in Biology : http://www.cse.ust.hk/~qyang/ 22 Using Network Structure in Biology ‘Adaptive Response of a Gene Network to Environmental Changes by Fitness-Induced Attractor Selection’, Plos One, 2006 The gene network is formulated as differential equations, given some initial state the network stabilized at some attractors, corresponding to the different cell types. The complex dynamics of the gene networks can explain the high diversity of the species. Given some perturbations, how will the state of the gene networks change to adjust the levels of gene expression to environment factors? The dynamics of a gene network are described by differential equations, e.g., a simplified network involving only two gene nodes is formulated as: where m1 and m2 are the gene expression levels. S(A) and D(A) are the rate coefficients of synthesis and degradation. They depend on A, which represents cellular activity. g1 and g2 represent the noises in gene expression. ‘Adaptive Response of a Gene Network to Environmental Changes by Fitness-Induced Attractor Selection’, PlosOne, 2006. : http://www.cse.ust.hk/~qyang/ 23 ‘Adaptive Response of a Gene Network to Environmental Changes by Fitness-Induced Attractor Selection’, PlosOne, 2006. Given the initial condition, the gene expression levels stabilize at the attractors determined by the coefficients of equations. Real world gene networks can be much more complex, involving thousands of genes, leading to the complex patterns of attractors and cell activities. Conclusions : http://www.cse.ust.hk/~qyang/ 24 Conclusions I have listed four challenges Non-traditional Feature Selection Explainable and Accurate Data Mining Methods Transfer Learning Exploiting the network structure There are more… Solving these challenges requires biologists and computer scientists to work hand in hand You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
Test aSGuest1509 Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT lite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 24 Category: Science & Tech.. License: All Rights Reserved Like it (0) Dislike it (0) Added: October 20, 2008 This Presentation is Public Favorites: 0 Presentation Description tEST Comments Posting comment... Premium member Presentation Transcript Data Mining and Bioinformatics: Some Challenges : http://www.cse.ust.hk/~qyang/ 1 Data Mining and Bioinformatics: Some Challenges Qiang Yang, Computer Science and Engineering HKUST Thanks: HKUST RPC Project Ben Niu, Can Yang Prof. Hannah Xue Prof. W. Yu State of Art: DM for Bio : http://www.cse.ust.hk/~qyang/ 2 State of Art: DM for Bio We know how to classify biological sequences SVM, Neural Nets, Decision Trees, Rules We know how to cluster biological entities Bi-clustering, K-means, hierarchical We know how to select features PCA, LDA, SVM-RFE Data Mining: Challenges in Bio : http://www.cse.ust.hk/~qyang/ 3 Data Mining: Challenges in Bio Non-traditional Feature Selection When the number of attributes >> number of samples? Highly imbalanced Explainable and Accurate Data Mining Methods NN, SVM Rules? Transfer Learning Can knowledge learned from one set of samples help data mining on another sample? Exploiting the network structure Individual i.i.d type of classification vs social networks? Challenge 1: Non-traditional Feature Selection: Question: which (few) genes lead to diseases? : http://www.cse.ust.hk/~qyang/ 4 Challenge 1: Non-traditional Feature Selection: Question: which (few) genes lead to diseases? # of attributes >> # samples # of positive << # negative KDD CUP 2002 task 2 Yeast Gene Regulation Prediction Traditional feature selection methods fail: overfitting, singularity of covariance matrix Non-traditional Feature Selection (2) : http://www.cse.ust.hk/~qyang/ 5 Non-traditional Feature Selection (2) Some potential solutions ‘Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems’, Journal of Machine Learning Research. Vol. 6, 2005. Singularity problem is solved by splitting the subspace into the regular and the irregular parts. Irregular part (null space) of the within-class scatter matrix is fully utilized to extract the discriminant info. ‘Two-Dimensional PCA: A New Approach to Appearance-Based Face Representation and Recognition’, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 26, 2004. High dimensional data in 2D arrays are projected directly onto the subspaces. Size of covariance matrix can be reduced significantly. Singularity is avoided. Non-traditional Feature Selection (3) : http://www.cse.ust.hk/~qyang/ 6 Non-traditional Feature Selection (3) Other approaches: Manifold learning Manifold learning methods, e.g., Isomap, LLE, maintain the local patterns of distribution during transform, Extract features suitable for k-NN classifiers Can be used to reduce the dimensionality of Bio. Data. Semi-supervised learning What if we have 10% labeled data, but the rest 90% are unlabelled? Build clusters around the labeled samples. Samples in the same cluster are labeled as from the same class, assuming they follow the normal distributions. Challenge 2: Explainable and Accurate Data Mining Methods : http://www.cse.ust.hk/~qyang/ 7 Challenge 2: Explainable and Accurate Data Mining Methods Current methods, such as SVMs, discriminant analysis, neural networks, are ‘black box’ models. The learned knowledge is hard to understand by biologists. Some potential solutions Logic based method, e.g., decision trees and variants may be better in giving the ‘IF-THEN’ like rules that explicitly define the epigenetic logics in cancer and stem cell development. DNA methylation rules can be learned by using SVM based recursive feature elimination and fuzzy logics. [Gene selection for cancer classification using support vector machines’, Machine Learning, 2002.] Epigenetic Analysis: A Case Study : http://www.cse.ust.hk/~qyang/ 8 Epigenetic Analysis: A Case Study Epigenetic events dominate the growth of cancer and embryonic stem cells These two type of cells are of great importance Genes can be turned on/ off through Cytosine methylation or Histone modifications The logics of DNA methylation underlie the cells’ behaviors Wish to Know: Methylation status of CpG sites CpG islands/ promoter regions in DNA sequence Cancer prediction Traditional methods, SVMs, ANNs are ‘black box’ models Knowledge are trained connection weights, or Support Vectors. Hard to understand for biologists Adaptive Cascade Sharing Trees (ACS4) Niu et al. 2007 (tmr’s talk) : http://www.cse.ust.hk/~qyang/ 9 Adaptive Cascade Sharing Trees (ACS4) Niu et al. 2007 (tmr’s talk) Objective: learn human understandable rules that define the epigenetic process in cancer and embryonic stem cells Idea: Adaptively partition the numeric attributes into a set of the linguistic domains, e.g., ‘high’, ‘very high’, ‘Medium’, ‘Low’, ‘Very Low’ Method: clustering Train a committee of trees to select the most salient features and predict by voting Method: tree learning ACS4 method (2) : http://www.cse.ust.hk/~qyang/ 10 ACS4 method (2) ACS4 method (3) : http://www.cse.ust.hk/~qyang/ 11 ACS4 method (3) Dataset: 37 hESC, 33 non-hESC, 24 cancer cell lines, 9 normal cell lines. 1,536 attributes Result Just 2 attributes are enough to separate the 3 cell types No need of 40 attributes by using fisher’s score in [1]. Wet lab cost can be reduced by testing on 2 attributes only, instead of 40. Accuracy is better, except when compared with SVM, but SVM cannot tell us ‘why’. Rules can be easily understood to biologist to conceive new biological experiments seeking in wet lab proof. 40 attributes: [1] ‘Human embryonic stem cells have a unique epigenetic signature‘, Genome Research, Vol. 16, 2006 Challenge 3: Transfer Learning : http://www.cse.ust.hk/~qyang/ 12 Challenge 3: Transfer Learning In real life, data are hard to obtain Biological experiments are expensive However, biological data are related Can we leverage the knowledge learned in one task/domain/data set for prediction of another? Humans often do this: having learned one language, find it easier to learn another In Web mining, having learned to classify one web site, use the abstract knowledge to help classify another web site Challenge: can we leverage the knowledge learned from one data set to classify/cluster/predict another? Transfer Learning (Examples) : http://www.cse.ust.hk/~qyang/ 13 Transfer Learning (Examples) Problem: how to Propagate the classification knowledge? Difficulty: old and new data may have different distributions Transfer Learning to Classify Web : http://www.cse.ust.hk/~qyang/ 14 Transfer Learning to Classify Web [Dai, et al, 2007] 20 newsgroups (20,000 documents, 20 data sets) SRAA (A. McCallum, 70,000 articles) comp.graphics (comp) comp.os.mis-windows.misc (comp) sci.crypt (sci) sci.electronics (sci) comp.sys.ibm (?) comp.windows.x (?) sci.med (?) sci.space (?) Old New Document-word co-occurrence : http://www.cse.ust.hk/~qyang/ 15 Document-word co-occurrence [Dai, et al. 2007] Transfer Learning: Related Works : http://www.cse.ust.hk/~qyang/ 16 Transfer Learning: Related Works Semi-supervised Learning [Zhu, Survey, Blum and Mitchell “co-training”, Nigam et al, “EM-based”, Zeng et al “clustering”, Joachims, “transductive”] Distributions of training and test data are usually assumed to be the same Multi-task Learning [Caruana, MLJ] multiple Dis exist Domain specific knowledge jointly learned to benefit each other. Focused on how multiple tasks helping each other Semi-supervised Clustering Same distribution assumption, but can be relaxed when must-links are few Transfer to Classify Web : http://www.cse.ust.hk/~qyang/ 17 Transfer to Classify Web Co-clustering is applied between words and out-of-domain documents (new tasks) Word clustering is constrained by the labels of in-domain (Old) documents The word clustering part in both domains serve as a bridge A Biological Transfer Problem : http://www.cse.ust.hk/~qyang/ 18 A Biological Transfer Problem ‘Promoter prediction analysis on the whole human genome’, Nature Biotechnology, Vol. 22, 2004. Most of the promoter prediction programs are effective on individual chromosomes, e.g., Chr21, Chr22, But inadequate to generalize to the whole genome scale only 65% of accuracy rate on average too low Can we build a unifying model for transferring the learned knowledge to other chromosomes to predict across the whole genome? to cluster other genes and protein arrays? to classify related sequences? Challenge 4: Exploiting the network structure : http://www.cse.ust.hk/~qyang/ 19 Challenge 4: Exploiting the network structure We are short of labeled data The matrix structures are very sparse if we only have several hundred samples and a huge number of attributes Classification accuracy cannot be improved much Gene expression data: tens or low hundreds of samples, but tens of thousands of attributes (?) Accuracy ~ less than 80% Challenge: can we leverage the network structure? Social Network Mining : http://www.cse.ust.hk/~qyang/ 20 Social Network Mining Very large scale computational analysis of gene and social networks. Social networks: a social structure made of nodes (individuals or organizations) tied by one or more specific types of relations. Relations: financial exchanges, friends, web links, disease transmission (epidemiology), or gene interactions. Small world phenomenon: chain of social acquaintances required to connect on arbitrary person to another arbitrary person anywhere in the world is generally short, five to seven separation steps are sufficient. Centrality Eigenvector: measure the importance of node in a network. Collective Classification Collective Recommendation Social Net Mining: Engineering meets Science : http://www.cse.ust.hk/~qyang/ 21 ‘Empirical Analysis of an Evolving Social Network’, Science, Vol. 311, 2006. A dynamic social network comprising 43,553 students, faculty, and staff at a large University. Interactions between individuals are inferred from time-stamped e-mail headers recorded over one academic year and are matched with affiliations and attributes. Findings: when two students are in the same class, they are on average 3 times more likely to interact if they also share an acquaintance Netflix Challenge and KDDCUP 2007 Blog Evolution (NEC Work) Can we deduce the actors’ roles/functions/attributes from the topology of the network? Social Net Mining: Engineering meets Science Using Network Structure in Biology : http://www.cse.ust.hk/~qyang/ 22 Using Network Structure in Biology ‘Adaptive Response of a Gene Network to Environmental Changes by Fitness-Induced Attractor Selection’, Plos One, 2006 The gene network is formulated as differential equations, given some initial state the network stabilized at some attractors, corresponding to the different cell types. The complex dynamics of the gene networks can explain the high diversity of the species. Given some perturbations, how will the state of the gene networks change to adjust the levels of gene expression to environment factors? The dynamics of a gene network are described by differential equations, e.g., a simplified network involving only two gene nodes is formulated as: where m1 and m2 are the gene expression levels. S(A) and D(A) are the rate coefficients of synthesis and degradation. They depend on A, which represents cellular activity. g1 and g2 represent the noises in gene expression. ‘Adaptive Response of a Gene Network to Environmental Changes by Fitness-Induced Attractor Selection’, PlosOne, 2006. : http://www.cse.ust.hk/~qyang/ 23 ‘Adaptive Response of a Gene Network to Environmental Changes by Fitness-Induced Attractor Selection’, PlosOne, 2006. Given the initial condition, the gene expression levels stabilize at the attractors determined by the coefficients of equations. Real world gene networks can be much more complex, involving thousands of genes, leading to the complex patterns of attractors and cell activities. Conclusions : http://www.cse.ust.hk/~qyang/ 24 Conclusions I have listed four challenges Non-traditional Feature Selection Explainable and Accurate Data Mining Methods Transfer Learning Exploiting the network structure There are more… Solving these challenges requires biologists and computer scientists to work hand in hand