Solving Inverse Problems via Machine Learning and Knowledge Discovery:

Na Tang
Computer Science, UC Davis
natang@ucdavis.edu

Outline :

Inverse Problem
Background
Bayesian Network
Related work
Our Knowledge Discovery Approach
Architecture
C1: Documents Collection and Classification
C2: Text Analysis and Information Extraction
C3: Missing Information Restoration
Experiments
Conclusion & Future Work

Inverse Problem :

(Diagram: Inputs → Internal Structure (Model) → Outputs)
Goal: to obtain the internal structure from indirect, noisy observations (input/output data).
A specific internal structure (model): a Bayesian network
Observable data: incomplete
Our inverse problem: to learn Bayesian network models from incomplete data

Bayesian Network:

Bayesian net -- a directed graph whose nodes are variables (inputs/outputs) and whose edges indicate dependences among the variables.
Calculate P(Out=oi | In1=in1, … Inn=inn) for each possible oi.
Predict Out=oi with the largest probability value.
Ink -- attributes; Out -- class variable.
Different learning methods are available to find a good Bayesian net from the data.
With N nodes, the number of possible networks grows exponentially (on the order of 2^N or more).
(Diagram: Complete Data → Bayesian Net Construction → Predict: Inputs → Outputs)
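The prediction rule above can be sketched as a small argmax over class values; the prior, conditional probability tables, and evidence values below are toy numbers invented for illustration, not from the slides.

```python
# Sketch of the argmax prediction rule: pick the class value o maximizing
# P(Out=o) * prod_k P(In_k = in_k | Out=o). All numbers are made up.

def predict(prior, cpts, evidence):
    best, best_score = None, -1.0
    for o, p_o in prior.items():
        score = p_o
        for attr, value in evidence.items():
            score *= cpts[attr][o].get(value, 0.0)
        if score > best_score:
            best, best_score = o, score
    return best

# Toy model: one binary class 'Out' and one attribute 'In1'.
prior = {0: 0.6, 1: 0.4}                    # P(Out)
cpts = {"In1": {0: {"a": 0.9, "b": 0.1},    # P(In1 | Out=0)
                1: {"a": 0.2, "b": 0.8}}}   # P(In1 | Out=1)
print(predict(prior, cpts, {"In1": "b"}))   # → 1
```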

Bayesian Network Cont.:

Example:
Heart disease data sets (UCI).
'Age', 'Sex'… 'Thal' -- attributes
'Outcome' -- class variable.
Once the Bayesian net is built, calculate P(Outcome| Age, Sex, … Thal) to predict heart disease
Fig. 1: 'Cleveland' data (one of the four heart disease data sets) from UCI repository.

Bayesian Network Learning Methods:

Naïve Bayes (NB)
Simple; makes the strong assumption that any two attributes Ai and Aj are independent given the class.
The assumption rarely holds in reality, yet accuracy is often good.
P(Outcome | Age, Sex, …, Thal)
= P(Outcome)P(Age|Outcome)…P(Thal|Outcome) / P(Age, Sex, …, Thal)
∝ P(Outcome)P(Age|Outcome)…P(Thal|Outcome)
Fig. 2: A Naïve Bayes network model for heart disease.
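The NB factors P(Outcome) and P(Ai|Outcome) can be estimated by simple frequency counting; a minimal sketch follows, with records and attribute names invented for illustration.

```python
# Sketch of fitting the Naive Bayes factors by frequency counting.
# The records below are made-up toy data, not the UCI heart disease sets.
from collections import Counter, defaultdict

def fit_nb(records, class_var):
    n = len(records)
    class_counts = Counter(r[class_var] for r in records)
    cond = defaultdict(Counter)               # (attr, class) -> value counts
    for r in records:
        c = r[class_var]
        for a, v in r.items():
            if a != class_var:
                cond[(a, c)][v] += 1
    p_class = {c: k / n for c, k in class_counts.items()}
    p_cond = {key: {v: k / sum(cnt.values()) for v, k in cnt.items()}
              for key, cnt in cond.items()}
    return p_class, p_cond

records = [{"Age": "old", "Outcome": 1}, {"Age": "old", "Outcome": 1},
           {"Age": "young", "Outcome": 0}, {"Age": "young", "Outcome": 1}]
p_class, p_cond = fit_nb(records, "Outcome")
```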

Bayesian Network Learning Methods:

Tree Augmented Naïve Bayes (TAN)
More complex than NB, but outperforms it.
Searches a restricted space: all networks in which the class variable C is a parent of every attribute and each attribute Ai has at most one additional parent Aj. Finds the one that best fits the data.
Fig. 3: A TAN network model for heart disease.
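TAN typically scores candidate attribute-attribute edges by conditional mutual information I(Ai; Aj | C) and then keeps a maximum-weight tree over the attributes. A sketch of the edge-weight computation (the spanning-tree step is omitted):

```python
# Sketch of the edge weight used to pick attribute-attribute edges in TAN:
# conditional mutual information I(Ai; Aj | C), estimated from counts over
# parallel value lists xs (Ai), ys (Aj), cs (class).
import math
from collections import Counter

def cond_mutual_info(xs, ys, cs):
    n = len(xs)
    nxyc = Counter(zip(xs, ys, cs))
    nxc = Counter(zip(xs, cs))
    nyc = Counter(zip(ys, cs))
    nc = Counter(cs)
    total = 0.0
    for (x, y, c), k in nxyc.items():
        # (k/n) * log( p(x,y|c) / (p(x|c) p(y|c)) ), written in count form
        total += (k / n) * math.log(k * nc[c] / (nxc[(x, c)] * nyc[(y, c)]))
    return total
```

Identical attributes yield a positive weight; conditionally independent ones yield a weight near zero.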

Bayesian Network Learning Methods:

Other learning methods
More expensive; can outperform NB and TAN.
Important assumption behind all these methods -- complete data.
How to deal with incomplete data?
Incomplete attribute -- an attribute containing missing values.
Fig. 4: 'Switzerland' data (one of the four heart disease data sets) from UCI repository. All '-999's stand for missing values.

Related Work:

Statistical methods for dealing with incomplete data:
Fill in missing values using the available data as a guide.
EM (expectation-maximization) algorithm
Evolutionary algorithms
Most of these methods do not work well when a large percentage of the data is missing.
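For contrast with the web-based approach, here is a crude single-attribute sketch of EM-style imputation: the E-step spreads each missing cell over the attribute's values according to the current class-conditional distribution, and the M-step re-estimates that distribution from the soft counts. The data are made up.

```python
# Crude EM-style imputation sketch for one discrete attribute.
# values: list with None for missing cells; classes: parallel class labels.
from collections import defaultdict

def em_impute(values, classes, n_iter=10):
    domain = sorted({v for v in values if v is not None})
    # Initialize with a uniform class-conditional distribution.
    dist = {c: {v: 1.0 / len(domain) for v in domain} for c in set(classes)}
    for _ in range(n_iter):
        counts = defaultdict(lambda: defaultdict(float))
        for v, c in zip(values, classes):
            if v is None:                       # E-step: soft assignment
                for d, p in dist[c].items():
                    counts[c][d] += p
            else:
                counts[c][v] += 1.0
        for c in counts:                        # M-step: renormalize
            total = sum(counts[c].values())
            dist[c] = {v: k / total for v, k in counts[c].items()}
    # Hard-fill each missing cell with the most probable value for its class.
    return [v if v is not None else max(dist[c], key=dist[c].get)
            for v, c in zip(values, classes)]
```

As the slide notes, this works poorly when most cells are missing: the soft counts are then dominated by the initial guess rather than the data.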

Our Knowledge Discovery Approach:

Fig. 5: Implementation architecture for generating missing information.
(Diagram: keywords derived from the incomplete data are sent to search engines; C1 (Documents Collection & Classification) takes the WWW search results -- documents containing the keywords -- and separates positive documents (containing probability information) from negative (other) documents; C2 (Text Analysis & Information Extraction) turns the positive documents into probability information; C3 (Missing Information Restoration) combines that information with the available data to produce the missing data for the incomplete attributes; the completed data feeds model construction.)

C1: Documents Collection and Classification :

Document Collection Phase:
Collect N documents from Google using keywords
E.g. 'Cholesterol' is an incomplete attribute. Keywords input to search engine would be about 'Cholesterol' and other attributes: 'cholesterol age', 'cholesterol gender', 'cholesterol heart disease' …
Document Classification Phase:
Naïve Bayes text classifier (Rainbow)
Bag-of-words representation.
Outperforms KNN (k-nearest neighbor) in our case.
Divide collected documents into two classes:
positive class – documents containing information on heart disease causes and probability information.
negative class – all other documents.

C1 Cont.:

Training data and parameter settings for the classifier
Training data:
292 documents from professional web sites that specialize in heart disease, hand-labeled as positive and negative (78: positive class, 214: negative class).
Parameter setting:
Token option: skip-html, stop-list, no stemming
Event model: word-based (multinomial)
Smoothing method: none
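A hedged sketch of a word-based (multinomial) Naïve Bayes text classifier in the spirit of Rainbow; unlike the setting above it uses Laplace smoothing (the slides used none), and the two toy documents are invented.

```python
# Sketch of a multinomial (word-based) Naive Bayes text classifier over a
# bag-of-words representation, with Laplace smoothing. Toy documents only.
import math
from collections import Counter

class TextNB:
    def fit(self, docs, labels):
        self.vocab = {w for d in docs for w in d.split()}
        self.prior, self.word_counts, self.totals = {}, {}, {}
        for c in set(labels):
            cdocs = [d for d, l in zip(docs, labels) if l == c]
            self.prior[c] = len(cdocs) / len(docs)
            cnt = Counter(w for d in cdocs for w in d.split())
            self.word_counts[c] = cnt
            self.totals[c] = sum(cnt.values())
        return self

    def predict(self, doc):
        V = len(self.vocab)
        def score(c):  # log P(c) + sum of log word likelihoods, smoothed
            s = math.log(self.prior[c])
            for w in doc.split():
                s += math.log((self.word_counts[c][w] + 1) /
                              (self.totals[c] + V))
            return s
        return max(self.prior, key=score)

clf = TextNB().fit(["cholesterol heart risk", "login page error"],
                   ["pos", "neg"])
```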

C2: Text Analysis and Information Extraction :

Rules are generated to extract the probability information.
Two categories of probability information:
Point probabilities
P(Ai=ai | Aj=aj) = c, where c is some constant.
Qualitative influence
Positive influence from Ai to Aj: choosing a higher value for Ai makes a higher value for Aj more likely.
P(Aj > aj | Ai=ai1) > P(Aj > aj | Ai=ai2), given ai1 > ai2
Other forms of probability information (e.g. comparisons, qualitative synergies)

An example of Point Probability information:

'In general, total cholesterol is considered high when 240 or more, borderline when 200-239, and desirable when 200 or less.'
Output:
P(Outcome=1 | Cholesterol < 160) = optimal [degree1],
P(Outcome=1 | Cholesterol < 200) = desirable [degree2],
P(Outcome=1 | 200 < Cholesterol < 239) = borderline [degree3],
P(Outcome=1 | Cholesterol >= 240) = high [degree4].
a) Point probabilities expressed in a table.  b) Point probabilities expressed in regular sentences.
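A hypothetical extraction rule (not the authors' actual one): a single regex that pulls "label when range" fragments out of sentences like the cholesterol example and turns them into point-probability conditions.

```python
# Hypothetical extraction rule: match "<label> when <number>[-<number>]"
# or "<label> when <number> or more/less" and emit condition strings.
import re

PATTERN = re.compile(
    r"(\w+)\s+when\s+(\d+)(?:\s*-\s*(\d+)|\s+or\s+(more|less))")

def extract_point_probs(sentence, attribute):
    rules = []
    for label, lo, hi, bound in PATTERN.findall(sentence):
        if hi:                               # "200-239" style range
            cond = f"{lo} < {attribute} < {hi}"
        elif bound == "more":                # "240 or more"
            cond = f"{attribute} >= {lo}"
        else:                                # "200 or less"
            cond = f"{attribute} <= {lo}"
        rules.append((cond, label))
    return rules

s = ("In general, total cholesterol is considered high when 240 or more, "
     "borderline when 200-239, and desirable when 200 or less.")
```

Real extraction would need many such rules plus normalization of the attribute names; this shows only the shape of one.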

An example of Qualitative Influence information:

'As people get older, their cholesterol levels rise.'
'Cholesterol levels naturally rise as men and women age.'
'Old people have higher cholesterol levels than the youth.'
Output: positive influence from 'Age' to 'Cholesterol':
P(Cholesterol > v | Age=a1) > P(Cholesterol > v | Age=a2), given a1 > a2
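Such a qualitative influence can be checked against data by estimating P(Cholesterol > v | Age group) for two groups and comparing; the (age group, cholesterol) pairs below are invented.

```python
# Sketch of verifying a positive influence on toy (age_group, cholesterol)
# pairs: estimate P(Cholesterol > v | group) as an empirical frequency.
def exceed_prob(pairs, group, v):
    vals = [chol for g, chol in pairs if g == group]
    return sum(c > v for c in vals) / len(vals)

pairs = [("old", 250), ("old", 210), ("young", 180), ("young", 230)]
# Positive influence holds here: P(>200 | old) > P(>200 | young).
```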

C3: Missing Information Restoration:

Inputs: the probability outputs from C2 and the available (incomplete) data
Outputs: the missing data
Approach:
1) Calculate probabilities from the available data, e.g. P(Age, Sex, Outcome) = v;
2) Convert all the probability information (the outputs from C2 and the constraints from step 1) into a linear system of equalities and inequalities, compute bounds on the probabilities of interest, and elicit the required probabilities -- P(Cholesterol | Age, Sex, Outcome) [DG '95];
3) Fill in the missing values in the data set based on these probabilities.

C3 cont.:

An example for step 3). Suppose that after steps 1) and 2) we have:
P(Cholesterol<200 | Age<50, Sex=female, HeartDisease=1) = v1,
P(200<Cholesterol<240 | Age<50, Sex=female, HeartDisease=1) = v2,
P(Cholesterol>240 | Age<50, Sex=female, HeartDisease=1) = v3
If
Age(patient)<50 && Sex(patient)=female && HeartDisease(patient)=1
Then
Cholesterol(patient) = one of the values from the set {<200, 200-240, >240}, with probabilities v1/v, v2/v and v3/v respectively, where v = v1 + v2 + v3.
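The filling step can be sketched as weighted sampling over the value bins; the bin labels and the numbers standing in for v1/v, v2/v, v3/v below are made up.

```python
# Sketch of step 3): draw a value for a missing cell from the normalized
# probabilities v1/v, v2/v, v3/v. Bin labels and weights are made up.
import random

def fill_missing(bins_with_probs, rng=random):
    labels = [b for b, _ in bins_with_probs]
    weights = [p for _, p in bins_with_probs]
    return rng.choices(labels, weights=weights, k=1)[0]

bins = [("<200", 0.5), ("200-240", 0.3), (">240", 0.2)]  # v1/v, v2/v, v3/v
```

Sampling (rather than always taking the most probable bin) keeps the filled-in column's distribution close to the elicited probabilities.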

Experiment Results:

Table 1: A comparison of prediction accuracies with complete and filled-in data.
Cleveland data (complete, no missing values):
2/3 of the data -- training; 1/3 -- testing
Construct NB and TAN on the complete training data;
Assume all the cholesterol values in the training data are missing, fill them in using our approach, and then construct NB and TAN on the filled-in training data.

Conclusion & Future Work:

Conclusion
The WWW is used as a source for filling in missing information in relational tables.
Bayesian models constructed on the restored data achieve good prediction accuracy.
Future Work
Taking the reliability of web information into account
Different automatic extraction methods
The case where little data is available: extend our method to directly estimate the Bayesian network parameters from web knowledge. Feasible only if enough probability information can be retrieved from the web.
