NaTang Inverse


Presentation Description

No description available.


Presentation Transcript

Solving Inverse Problems via Machine Learning and Knowledge Discovery: 

Solving Inverse Problems via Machine Learning and Knowledge Discovery -- Na Tang Computer Science, UC Davis

Outline : 

Inverse Problem Background Bayesian Network Related work Our Knowledge Discovery Approach Architecture C1: Documents Collection and Classification C2: Text Analysis and Information Extraction C3: Missing Information Restoration Experiments Conclusion andamp; Future Work Outline

Inverse Problem : 

Inverse Problem Internal Structure (Model) … … Inputs Outputs Goal: To obtain the internal structure from indirect noisy observations (input/output data). A specific internal structure (model): Bayesian Network Observable data: Incomplete Our Inverse problem: To learn Bayesian net models from incomplete data


Bayesian Network: 

Bayesian Network Bayesian Net -- directed graph where the nodes are variables (inputs/outputs) and the edges indicate the dependence among variables. Calculate P(Out=oi | In1=in1, … Inn=inn) for each possible oi Predict Out=oi which has the largest probability value. Ink – attributes, Out – class variable. Different learning methods are available to find a good Bayesian Net from the data. N nodes, 2n possible networks Complete Data Bayesian Net Construction Predict Inputs Outputs

Bayesian Network Cont.: 

Bayesian Network Cont. Example: Heart disease data sets (UCI). 'Age', 'Sex'… 'Thal' -- attributes 'Outcome' -- class variable. Once the Bayesian net is built, calculate P(Outcome| Age, Sex, … Thal) to predict heart disease Fig. 1: 'Cleveland' data (one of the four heart disease data sets) from UCI repository.

Bayesian Network Learning Methods: 

Bayesian Network Learning Methods Naïve Bayes (NB) Simple, Strong assumption of independency between any attribute Ai and Aj; Assumption not hold in reality but good accuracy. P(Outcome| Age, Sex, … Thal) = P(Outcome)P(Age|Outcome)…P(Thal|Outcome) / P(Age, Sex, … Thal) P(Outcome)P(Age|Outcome)…P(Thal|Outcome) Outcome Sex Age ChestPain … Thal Fig. 2: A Naïve Bayes network model for Heart Disease

Bayesian Network Learning Methods: 

Bayesian Network Learning Methods Tree Augmented Naïve Bayes (TAN) More complex than NB but outperforms NB Search restricted space -- all possible trees in which the class variable C is the parent of all attribute and each Ai has at most one more parent Aj . Find the one that fits the data most. Outcome Sex Age ChestPain Thal RestBP Vessel BloodSugar MaxHeartRate OldPeak STSlope ECG Cholesterol Angina Fig. 3: A TAN network model for Heart Disease

Bayesian Network Learning Methods: 

Bayesian Network Learning Methods Other learning methods Expensive, Outperform NB and TAN Important assumption for all methods -- Complete Data How to deal with incomplete data? Incomplete attribute -- attribute containing missing values Fig. 4: 'Switzerland' data (one of the four heart disease data sets) from UCI repository. All '-999's stand for missing values.

Related Work: 

Related Work Statistic Methods to deal with incomplete data Fill in missing values using available data as a guide. EM (expectation-maximization) Algorithm Evolutionary Algorithm Most of these methods cannot work well when a large percentage of data is missing

Our Knowledge Discovery Approach: 

Our Knowledge Discovery Approach WWW Search Results -- Documents containing the keywords Incomplete Data Documents Collection andamp; Classification Missing Information Restoration Text Analysis andamp; Information Extraction C2 C3 C1 Probability Information Keywords Available data Missing Data for incomplete attributes (Positive) Documents containing probability information Other Documents Search Engines (Negative) Model Construction Fig. 5: Implementation architecture for generating missing information

C1: Documents Collection and Classification : 

C1: Documents Collection and Classification Document Collection Phase: Collect N documents from Google using keywords E.g. 'Cholesterol' is an incomplete attribute. Keywords input to search engine would be about 'Cholesterol' and other attributes: 'cholesterol age', 'cholesterol gender', 'cholesterol heart disease' … Document Classification Phase: Naïve Bayes text classifier (Rainbow) Bag-of-words representation. Outperforms KNN (k-nearest neighbor) in our case. Divide collected documents into two classes: positive class – documents containing information on heart disease causes and probability information. negative class – all other documents.

C1 Cont.: 

Training data, Parameter setting for the classifier Training data: 292 documents from professional web sites that specialize in heart disease, hand-labeled as positive and negative (78: positive class, 214: negative class). Parameter setting: Token option: skip-html, stop-list, no stemming Event model: word-based (multinomial) Smoothing method: no C1 Cont.

C2: Text Analysis and Information Extraction : 

C2: Text Analysis and Information Extraction Rules are generated to extract the probability information. Two categories of probability information: Point probabilities P(Ai=ai| Aj=aj) = c , where c stands for some constant Qualitative influence Positive influence from Ai to Aj: choosing a higher value for Ai makes the higher value for Aj more likely. P(Aiandgt;ai | Aj =aj1) andgt; P(Aiandgt;ai | Aj =aj2) given aj1andgt;aj2 Other forms of probability information (e.g. comparison, qualitative synergies)

An example of Point Probability information: 

An example of Point Probability information a) 'In general, total cholesterol is considered high when 240 or more, borderline when 200-239, and desirable when 200 or less.' Output: P(Outcome=1|Cholesterol andlt; 160) = optimal [degree1], P(Outcome=1|Cholesterol andlt;200) = desirable [degree2], P(Outcome=1|200andlt;Cholesterol andlt;239) = borderline [degree3], P(Outcome=1|Cholesterol andlt;200) = high [degree4]. a) Point Probabilities expressed in table b) Point Probabilities expressed in regular sentences

An example of Qualitative Influence information: 

An example of Qualitative Influence information 'As people get older, their cholesterol levels rise.' 'Cholesterol levels naturally rise as men and women age.' 'Old people have higher cholesterol levels than the youth.' Output: Positive Influence from 'Age' to 'Cholesterol' P(Cholesterol andgt; v | Age=a1) andgt; P(Cholesterol andgt; v | Age=a2), given a1andgt;a2

C3: Missing Information Restoration: 

C3: Missing Information Restoration Inputs: Probability outputs from C2, Available incomplete data Outputs: Missing Data Approach: 1) Calculate the probabilities from the available data: e.g. P(Age, Sex, Outcome) = v ; 2) Convert all the probability information (outputs from C2 and the probability constraints from 1)) into a linear system of equalities and inequalities, calculate bounds on the probabilities of interest and elicit the required probabilities -- P(Cholesterol| Age, Sex, Outcome) [DG '95]; 3) Fill the missing values in the data set based on these probabilities.

C3 cont.: 

C3 cont. An example to process 3) -- Suppose after 1)2) we got: P(Cholesterolandlt;200|Ageandlt;50, Sex=female, Heart Disease=1)= v1, P(200andlt;Cholesterolandlt;240|Ageandlt;50, Sex=female, Heart Disease=1)= v2, P(Cholesterolandgt;240|Ageandlt;50, Sex=female, Heart Disease=1)= v3 If Age(patient)andlt;50 andamp;andamp; Sex(patient)=female andamp;andamp; HeartDisease(patient)=1 Then Cholesterol(patient) = one of the values from the set {andlt;200, 200-240, andgt;240} respectively with probabilities v1/v, v2/v and v3/v, where v= v1+ v2+ v3.

Experiment Results: 

Experiment Results Table 1: A Comparison of prediction accuracies with complete and filled-in data Cleveland data (Complete, no missing values) 2/3 data – training, 1/3 data – testing Construct NB and TAN on the complete training data; Assume all the cholesterol values in training data are missing, fill the values using our approach and then construct NB and TAN on the filled-in training data.

Conclusion & Future Work: 

Conclusion andamp; Future Work Conclusion WWW is used as a source for filling missing information in relational tables. Bayesian Models constructed on the restored data retrieve good prediction accuracy. Future Work Consideration of reliability of the web information Different automatic extraction methods The case when the available data is little. Extend our method to directly estimate the Bayesian parameters from the web knowledge. Feasible only if we can retrieve enough probability information from the web.

authorStream Live Help