logging in or signing up class Natalya Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 835 Category: Entertainment License: All Rights Reserved Like it (1) Dislike it (0) Added: November 29, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript CS522 Advanced Database SystemsClassification: CS522 Advanced Database Systems Classification Chengyu Sun California State University, Los AngelesA Classification Problem: A Classification Problem Is a loan to a person who is 45 years old, divorced, renting an apartment, with two kids and annual income of 100K high risk or low risk?Terminology and Concepts ...: Terminology and Concepts ... Record (or tuple) Attributes E.g. age, marital status, # of kids, owns home or not, credit score ... Class label E.g. high risk, low risk ... Classification: predict the class label with given attribute values... Terminology and Concepts: ... Terminology and Concepts Classifier (or model) Training set: records with known class labels that are used to construct (i.e. train) the classifier Classifier Attribute values Class label Classifier Training set Step 1: Step 2:Classification vs. Regression: Classification vs. Regression Classification predicts categorical attribute values Regression predicts continuous numerical attribute values SID HW1 HW2 HW3 Final Pass/Fail 1 40 60 70 95 Passed 2 10 15 11 65 Failed 3 30 45 40 75 Passed 4 35 50 35 ? ?A Training Set: A Training Set TID Home Owner Marital Status Annual Income Defaulted Borrower 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K YesA Decision Tree: A Decision Tree Home owner Marital Status Annual income Defaulted=No Defaulted=No Defaulted=No Defaulted=Yes Yes No Married Single, Divorced <80K >=80KDecision Tree Induction: Decision Tree Induction Let D be the set of training record associated with current node If all record in D belong to the same class C, current node is a leaf node and is labeled as C. If D contains records that belong to more than one class, select an attribute to split D into subsets, and create a child node for each subset. Apply the algorithm recursively on each child node.Decision Tree Induction Example: Decision Tree Induction Example Training set Decision treeSplit on Attributes with Discrete Values: Split on Attributes with Discrete Values Marital Status Married Single, Divorced Marital Status Married Single DivorcedSplit on Attributes with Continuous Values: Split on Attributes with Continuous Values Annual income <80K >=80K Annual income 0-40K 80K-100K 40K-80K >100kTerminating Conditions: Terminating Conditions All records in D belong to the same class No more attribute to split Class label?? No records associated with the node Class label?? Splitting Attribute Selection: Splitting Attribute Selection After a split, ideally each subset would “pure”, i.e. contains only one class of records Gender Age Preferred color female 20 pink male 20 black female 15 pink male 15 blackAttribute Selection Measures: Attribute Selection Measures Entropy (Information Gain) Gini index Gain RatioEntropy: Entropy pi is the fraction of records in D that belongs to class Ci m is the number of classes in DEntropy Example: Entropy Example Preferred color 2 black and 2 pink?? 3 black and 1 pink?? 4 black??Information Gain: Information Gain Suppose D is split into v subsets on attribute AInformation Gain Example: Information Gain Example Preferred color Gain(Gender)?? Gain(Age)??Split Information: Split Information Information gain favors attributes with lots of distinct values Split information can be used to “normalized” information gainGain Ratio: Gain RatioGini Index: Gini Index Used in the CART algorithm for binary splitGini Index Example: Gini Index Example Preferred color 2 black and 2 pink?? 3 black and 1 pink?? 4 black?? Split on gender?? Split on age??Training Error and Testing Error: Training Error and Testing Error Training error Misclassification of training records Testing (Generalization) error Misclassification of testing recordsModel Overfitting and Underfitting: Model Overfitting and Underfitting # of nodes in the decision tree Error Rate Testing error Training error overfitting underfittingOverfitting Due to Outliers ...: Overfitting Due to Outliers ... 1 2 3 4 5 6 7 8 1 2 3 4... Overfitting Due to Outliers: ... Overfitting Due to Outliers 1 2 3 4 5 6 7 8 1 2 3 4 Other Reasons That Cause Overfitting : Other Reasons That Cause Overfitting Noise (mislabeled training records) Lack of representative samples Tree Pruning – Prepruning: Tree Pruning – Prepruning Prune during decision tree construction Number of records < threshold “Purity gain” < thresholdTree Pruning – Postpruning: Tree Pruning – Postpruning Buttom-up pruning of a fully constructed tree Replace a subtree with a leaf node if it reduces testing error How do we know whether it reduces testing error or not?? Pruning based on Minimum Description Length (MDL)Estimate Testing Errors: Estimate Testing Errors Use a pruning set in addition to the training set Optimistic error estimation The training set is a good representation of the overall data (optimistic!), so the training error is the testing error Pessimistic error estimation Training error + penalty term for model complexityPessimistic Error Estimation: Pessimistic Error Estimation Nt – Total number of records e(ti) – Training error at leaf node ti (ti) – Penalty term associated with tiExample of Pessimistic Error Estimation: Example of Pessimistic Error Estimation eg(T) with (ti)=0.5?? (ti)=1?? C1:3 C2:1 C1:2 C2:1 C1:0 C2:2 C1:1 C2:2 C1:3 C2:0 C1:3 C2:1 C1:0 C2:5 C1:5 C2:2 C1:1 C2:4 C1:3 C2:0 C1:3 C2:6Pruning with Estimated Testing Error: Pruning with Estimated Testing Error With optimistic error estimation?? With pessimistic error estimation?? With a pruning set?? C1:11 C2:3 C1:2 C2:4 C1:14 C2:3 C1:2 C2:2Minimum Description Length (MDL): Minimum Description Length (MDL) The best model is the one that minimizes the number of bits to encode both the model and the exceptions to the model A B Class labels Model + exceptions orMDL Example ...: MDL Example ... n records m binary attributes k classes Cost(Internal Node) = log2m Cost(Leaf node) = log2k Cost(Error) = log2n Cost = Cost(All Nodes)+Cost(All Errors)... MDL Example: ... MDL Example 16 binary attributes and 3 classes Left tree has 7 errors and right tree has 4 errors C1 C2 C3 C1 C1 C2 C2 C3About Decision Tree Classification ...: About Decision Tree Classification ... Inexpensive to construct Extremely fast at classifying unknown records Easy to interpret for small-sized trees Accuracy is comparable to other classification techniques for many simple data sets... About Decision Tree Classification: ... About Decision Tree Classification Data fragmentation Finding an optimal decision tree is NP-hard Limitation on expressiveness Tree replication 1 0 1 P Q R S 0 0 1 Q S 0 0 1 1Bayes’ Theorem: Bayes’ Theorem Prior and posterior probabilities P(A) and P(A|B) P(B) and P(B|A)Bayesian Classification: Bayesian Classification X is a given record with attribute values (x1,x2,...,xn), and Ci is a class P(Ci|X) is the probability of X belonging to class Ci given X’s attribute values We predict that X belong to Ci if P(Ci|X)>P(Cj|X) for jiCalculate P(Ci|X): Calculate P(Ci|X) P(X) does not need to be calculated Why?? P(Ci)?? P(X|Ci)??Naive Bayesian Classification: Naive Bayesian Classification X=(x1,x2,...,xn) Assume the attribute values are conditionally independent of one another (the naive assumption)Attribute Ak is Categorical: Attribute Ak is Categorical P(xk|Ci) is the fraction of number of records in Ci with value xk for attribute AkAttribute Ak is Continuous-valued ...: Attribute Ak is Continuous-valued ... Assume Ak follows a Gaussian distribution with a mean and standard deviation ... Attribute Ak is Continuous-valued : ... Attribute Ak is Continuous-valued Sample Data: Sample Data TID Home Owner Marital Status Annual Income Defaulted Borrower 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 11 No Married 120K ?? Naive Bayesian Classification Example ...: Naive Bayesian Classification Example ... P(No|HO=No,MS=M,AI=120K) vs. P(Yes|HO=No,MS=M,AI=120K)... Naive Bayesian Classification Example: ... Naive Bayesian Classification Example Annual Income, Default=No =110, =54.54 P(AI=120K|No)=0.0072 Annual Income, Default=Yes =90, =5 P(AI=120K|Yes)=1.210-9Avoid Zero P(xk|Ci): Avoid Zero P(xk|Ci) A zero P(xk|Ci) would make the whole P(X|Ci) zero To avoid this problem, add 1 to to each count, assuming the training set is sufficiently large that the effect of adding one is negligible Example Low income:0 Medium income: 990 High income: 10About Naive Bayesian Classification: About Naive Bayesian Classification The most accurate classification if the conditional independence assumption holds In practice, some attributes may be correlated E.g. education level and annual incomeBayesian Belief Network (BBN): Bayesian Belief Network (BBN) A directed acyclic graph (dag) encoding the dependencies among a set of variables A conditional probability table (CPT) for each node given its immediate parent nodesA BBN Example: A BBN Example FamilyHistory Smoker LungCancer Emphysema PositiveXRay Dyspnea CPT for LungCancerBBN Terminology: BBN Terminology If there is a directed arc from X to Y X is a parent of Y Y is a child of X If there is a directed path from X to Y X is an ancestor of Y Y is a descendent of XConditional Independence in BBN: Conditional Independence in BBN A node in a Bayesian network is conditionally independent of its non-descendants if its parents are knownNaive Bayesian is a Special Case of BBN: Naive Bayesian is a Special Case of BBN C A1 A2 A3 AnConstruct a BBN: Construct a BBN Create the structure of the network From domain knowledge From training data Calculate the CPT for each node All variables are observable Some variable may be hiddenAnother BBN Example: Another BBN ExampleBayesian Classification Examples: Bayesian Classification Examples Output node – Heart Disease Testing data () (BP=high) (BP=high,D=Healthy,E=Yes)Bayesian Classification Examples – 1: Bayesian Classification Examples – 1Bayesian Classification Examples – 2: Bayesian Classification Examples – 2Bayesian Classification Examples – 3: Bayesian Classification Examples – 3Other Classification Methods: Other Classification Methods Rule-based Artificial Neural Network (ANN) Support Vector Machine (SVM) Association rule analysis Nearest neighbor Genetic algorithms Rough Set and Fuzzy Set thoery ...Ensemble Methods: Ensemble Methods Use a number of base classifiers, and make a predication by combining the predications of all the classifiers Example Binary classification 25 classifiers, each with error rate 35% Predict by majority vote Error rate of the ensemble classifier??Construct an Ensemble Classifier: Construct an Ensemble Classifier Train k classifiers with one dataset Use the same dataset for each classifier?? Divide the dataset into k subsets?? Bagging and Boosting Bagging (Bootstrap Aggregation): Bagging (Bootstrap Aggregation) A bootstrap sampling of |D| Uniform sampling with replacement (vs. without replacement) Allow duplicates in the sample |D| samples Roughly contains 63.2% of the original records. Why?? Bagging Use a bootstrap sample as the training set for each classifierBagging Example: Bagging Example Record (x,y) x: attribute y: class label Base classifier: decision tree with one level xk Ensemble classifier: 10 classifiers, majority voteBagging Example – Dataset: Bagging Example – Dataset X Y 0.1 1 0.2 1 0.3 1 0.4 -1 0.5 -1 0.6 -1 0.7 -1 0.8 1 0.9 1 1.0 1Bagging Example – Bagging: Bagging Example – BaggingBagging Example – Classification: Bagging Example – ClassificationAbout Bagging: About Bagging Reduces the errors associated with random fluctuations in the training data for unstable classifiers, e.g. decision trees, rule-based classifiers, ANN May degrade the performance of stable classifiers, e.g. Bayesian network, SVM, k-NNIntuition for Boosting: Intuition for Boosting Sample with weights hard-to-classify records should be chosen more often Combine the prediction of the base classifiers with weights Classifiers with lower error rates get more voting power Boosting – Training: Boosting – Training For k classifiers, do k rounds of Assign a weight to each record Sample with replacement according to the weights Train a classifier Mi Calculate error(Mi) Update the weights of the records Increase the weights of the misclassified records Decrease the weights of the correctly classified recordsBoosting – Classification: Boosting – Classification For each class, sum up the weights of the classifiers that vote for that class The class that gets the highest sum is the predicted class Boosting Implementation: Boosting Implementation How the record weights are updated How the classifier weights are calculatedAdaboost: Adaboost Update the weights of the correctly classified records: Error rate of classifier Mi: Weight of classifier Mi:Evaluate the Accuracy of a Classifier: Evaluate the Accuracy of a Classifier Accuracy measures Accuracy rate and error rate Confusion matrix Precision and Recall (for binary classification)Example of Accuracy Measures: Example of Accuracy Measures Example Two classes C1 and C2 100 testing records with 50 C1 records and 50 C2 records 20 C1 records misclassified as C2, and 10 C2 records misclassified as C1 Accuracy measures Accuracy and error rates?? Confusion matrix?? Precision and Recall??Evaluate the Accuracy of a Classifier: Evaluate the Accuracy of a Classifier The Holdout Method Given a set of records with known class labels, use half of them for training and the other half for testing (or 2/3 for training and 1/3 for testing)Problems of the Holdout Method: Problems of the Holdout Method More records for training means less for testing, and vice versa Distribution of the data in the training/testing set may be different from the original dataset Some classifiers are sensitive to random fluctuations in the training dataRandom Subsampling: Random Subsampling Repeat the holdout method k times Take the average accuracy over the k iterations Random subsampling methods Cross-validation BootstrapK-fold Cross-validation: K-fold Cross-validation Divide the original dataset into k non-overlapping subsets Each iteration uses (k-1) subsets for training, and the remaining subset for testing Total errors are the sum of the errors in each iterationBootstrap (.632 Bootstrap): Bootstrap (.632 Bootstrap) Each iteration uses a bootstrap sample to train the classifier, and the remaining records for testing Calculate the overall accuracy:Predicating Continuous Values: Predicating Continuous Values Regression methods Linear regression Non-linear regression Other methods Some classification methods can be adapted to predict continuous valuesLinear Regression: Linear Regression Record (x,y) x: predictor variable y: response variable Model y = w0 + w1x Linear Regression Using Least-Squares Method: Linear Regression Using Least-Squares MethodMultiple Linear Regression: Multiple Linear Regression Record (x1,...,xn,y) Model:Summary: Summary Classification Problem definition and terminology Decision Tree Induction Information gain and gain ratio Model overfitting Tree pruning Naive Bayesian classification and BBN Ensemble methods Evaluation of classification accuracy Linear regression You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
class Natalya Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 835 Category: Entertainment License: All Rights Reserved Like it (1) Dislike it (0) Added: November 29, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript CS522 Advanced Database SystemsClassification: CS522 Advanced Database Systems Classification Chengyu Sun California State University, Los AngelesA Classification Problem: A Classification Problem Is a loan to a person who is 45 years old, divorced, renting an apartment, with two kids and annual income of 100K high risk or low risk?Terminology and Concepts ...: Terminology and Concepts ... Record (or tuple) Attributes E.g. age, marital status, # of kids, owns home or not, credit score ... Class label E.g. high risk, low risk ... Classification: predict the class label with given attribute values... Terminology and Concepts: ... Terminology and Concepts Classifier (or model) Training set: records with known class labels that are used to construct (i.e. train) the classifier Classifier Attribute values Class label Classifier Training set Step 1: Step 2:Classification vs. Regression: Classification vs. Regression Classification predicts categorical attribute values Regression predicts continuous numerical attribute values SID HW1 HW2 HW3 Final Pass/Fail 1 40 60 70 95 Passed 2 10 15 11 65 Failed 3 30 45 40 75 Passed 4 35 50 35 ? ?A Training Set: A Training Set TID Home Owner Marital Status Annual Income Defaulted Borrower 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K YesA Decision Tree: A Decision Tree Home owner Marital Status Annual income Defaulted=No Defaulted=No Defaulted=No Defaulted=Yes Yes No Married Single, Divorced <80K >=80KDecision Tree Induction: Decision Tree Induction Let D be the set of training record associated with current node If all record in D belong to the same class C, current node is a leaf node and is labeled as C. If D contains records that belong to more than one class, select an attribute to split D into subsets, and create a child node for each subset. Apply the algorithm recursively on each child node.Decision Tree Induction Example: Decision Tree Induction Example Training set Decision treeSplit on Attributes with Discrete Values: Split on Attributes with Discrete Values Marital Status Married Single, Divorced Marital Status Married Single DivorcedSplit on Attributes with Continuous Values: Split on Attributes with Continuous Values Annual income <80K >=80K Annual income 0-40K 80K-100K 40K-80K >100kTerminating Conditions: Terminating Conditions All records in D belong to the same class No more attribute to split Class label?? No records associated with the node Class label?? Splitting Attribute Selection: Splitting Attribute Selection After a split, ideally each subset would “pure”, i.e. contains only one class of records Gender Age Preferred color female 20 pink male 20 black female 15 pink male 15 blackAttribute Selection Measures: Attribute Selection Measures Entropy (Information Gain) Gini index Gain RatioEntropy: Entropy pi is the fraction of records in D that belongs to class Ci m is the number of classes in DEntropy Example: Entropy Example Preferred color 2 black and 2 pink?? 3 black and 1 pink?? 4 black??Information Gain: Information Gain Suppose D is split into v subsets on attribute AInformation Gain Example: Information Gain Example Preferred color Gain(Gender)?? Gain(Age)??Split Information: Split Information Information gain favors attributes with lots of distinct values Split information can be used to “normalized” information gainGain Ratio: Gain RatioGini Index: Gini Index Used in the CART algorithm for binary splitGini Index Example: Gini Index Example Preferred color 2 black and 2 pink?? 3 black and 1 pink?? 4 black?? Split on gender?? Split on age??Training Error and Testing Error: Training Error and Testing Error Training error Misclassification of training records Testing (Generalization) error Misclassification of testing recordsModel Overfitting and Underfitting: Model Overfitting and Underfitting # of nodes in the decision tree Error Rate Testing error Training error overfitting underfittingOverfitting Due to Outliers ...: Overfitting Due to Outliers ... 1 2 3 4 5 6 7 8 1 2 3 4... Overfitting Due to Outliers: ... Overfitting Due to Outliers 1 2 3 4 5 6 7 8 1 2 3 4 Other Reasons That Cause Overfitting : Other Reasons That Cause Overfitting Noise (mislabeled training records) Lack of representative samples Tree Pruning – Prepruning: Tree Pruning – Prepruning Prune during decision tree construction Number of records < threshold “Purity gain” < thresholdTree Pruning – Postpruning: Tree Pruning – Postpruning Buttom-up pruning of a fully constructed tree Replace a subtree with a leaf node if it reduces testing error How do we know whether it reduces testing error or not?? Pruning based on Minimum Description Length (MDL)Estimate Testing Errors: Estimate Testing Errors Use a pruning set in addition to the training set Optimistic error estimation The training set is a good representation of the overall data (optimistic!), so the training error is the testing error Pessimistic error estimation Training error + penalty term for model complexityPessimistic Error Estimation: Pessimistic Error Estimation Nt – Total number of records e(ti) – Training error at leaf node ti (ti) – Penalty term associated with tiExample of Pessimistic Error Estimation: Example of Pessimistic Error Estimation eg(T) with (ti)=0.5?? (ti)=1?? C1:3 C2:1 C1:2 C2:1 C1:0 C2:2 C1:1 C2:2 C1:3 C2:0 C1:3 C2:1 C1:0 C2:5 C1:5 C2:2 C1:1 C2:4 C1:3 C2:0 C1:3 C2:6Pruning with Estimated Testing Error: Pruning with Estimated Testing Error With optimistic error estimation?? With pessimistic error estimation?? With a pruning set?? C1:11 C2:3 C1:2 C2:4 C1:14 C2:3 C1:2 C2:2Minimum Description Length (MDL): Minimum Description Length (MDL) The best model is the one that minimizes the number of bits to encode both the model and the exceptions to the model A B Class labels Model + exceptions orMDL Example ...: MDL Example ... n records m binary attributes k classes Cost(Internal Node) = log2m Cost(Leaf node) = log2k Cost(Error) = log2n Cost = Cost(All Nodes)+Cost(All Errors)... MDL Example: ... MDL Example 16 binary attributes and 3 classes Left tree has 7 errors and right tree has 4 errors C1 C2 C3 C1 C1 C2 C2 C3About Decision Tree Classification ...: About Decision Tree Classification ... Inexpensive to construct Extremely fast at classifying unknown records Easy to interpret for small-sized trees Accuracy is comparable to other classification techniques for many simple data sets... About Decision Tree Classification: ... About Decision Tree Classification Data fragmentation Finding an optimal decision tree is NP-hard Limitation on expressiveness Tree replication 1 0 1 P Q R S 0 0 1 Q S 0 0 1 1Bayes’ Theorem: Bayes’ Theorem Prior and posterior probabilities P(A) and P(A|B) P(B) and P(B|A)Bayesian Classification: Bayesian Classification X is a given record with attribute values (x1,x2,...,xn), and Ci is a class P(Ci|X) is the probability of X belonging to class Ci given X’s attribute values We predict that X belong to Ci if P(Ci|X)>P(Cj|X) for jiCalculate P(Ci|X): Calculate P(Ci|X) P(X) does not need to be calculated Why?? P(Ci)?? P(X|Ci)??Naive Bayesian Classification: Naive Bayesian Classification X=(x1,x2,...,xn) Assume the attribute values are conditionally independent of one another (the naive assumption)Attribute Ak is Categorical: Attribute Ak is Categorical P(xk|Ci) is the fraction of number of records in Ci with value xk for attribute AkAttribute Ak is Continuous-valued ...: Attribute Ak is Continuous-valued ... Assume Ak follows a Gaussian distribution with a mean and standard deviation ... Attribute Ak is Continuous-valued : ... Attribute Ak is Continuous-valued Sample Data: Sample Data TID Home Owner Marital Status Annual Income Defaulted Borrower 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 11 No Married 120K ?? Naive Bayesian Classification Example ...: Naive Bayesian Classification Example ... P(No|HO=No,MS=M,AI=120K) vs. P(Yes|HO=No,MS=M,AI=120K)... Naive Bayesian Classification Example: ... Naive Bayesian Classification Example Annual Income, Default=No =110, =54.54 P(AI=120K|No)=0.0072 Annual Income, Default=Yes =90, =5 P(AI=120K|Yes)=1.210-9Avoid Zero P(xk|Ci): Avoid Zero P(xk|Ci) A zero P(xk|Ci) would make the whole P(X|Ci) zero To avoid this problem, add 1 to to each count, assuming the training set is sufficiently large that the effect of adding one is negligible Example Low income:0 Medium income: 990 High income: 10About Naive Bayesian Classification: About Naive Bayesian Classification The most accurate classification if the conditional independence assumption holds In practice, some attributes may be correlated E.g. education level and annual incomeBayesian Belief Network (BBN): Bayesian Belief Network (BBN) A directed acyclic graph (dag) encoding the dependencies among a set of variables A conditional probability table (CPT) for each node given its immediate parent nodesA BBN Example: A BBN Example FamilyHistory Smoker LungCancer Emphysema PositiveXRay Dyspnea CPT for LungCancerBBN Terminology: BBN Terminology If there is a directed arc from X to Y X is a parent of Y Y is a child of X If there is a directed path from X to Y X is an ancestor of Y Y is a descendent of XConditional Independence in BBN: Conditional Independence in BBN A node in a Bayesian network is conditionally independent of its non-descendants if its parents are knownNaive Bayesian is a Special Case of BBN: Naive Bayesian is a Special Case of BBN C A1 A2 A3 AnConstruct a BBN: Construct a BBN Create the structure of the network From domain knowledge From training data Calculate the CPT for each node All variables are observable Some variable may be hiddenAnother BBN Example: Another BBN ExampleBayesian Classification Examples: Bayesian Classification Examples Output node – Heart Disease Testing data () (BP=high) (BP=high,D=Healthy,E=Yes)Bayesian Classification Examples – 1: Bayesian Classification Examples – 1Bayesian Classification Examples – 2: Bayesian Classification Examples – 2Bayesian Classification Examples – 3: Bayesian Classification Examples – 3Other Classification Methods: Other Classification Methods Rule-based Artificial Neural Network (ANN) Support Vector Machine (SVM) Association rule analysis Nearest neighbor Genetic algorithms Rough Set and Fuzzy Set thoery ...Ensemble Methods: Ensemble Methods Use a number of base classifiers, and make a predication by combining the predications of all the classifiers Example Binary classification 25 classifiers, each with error rate 35% Predict by majority vote Error rate of the ensemble classifier??Construct an Ensemble Classifier: Construct an Ensemble Classifier Train k classifiers with one dataset Use the same dataset for each classifier?? Divide the dataset into k subsets?? Bagging and Boosting Bagging (Bootstrap Aggregation): Bagging (Bootstrap Aggregation) A bootstrap sampling of |D| Uniform sampling with replacement (vs. without replacement) Allow duplicates in the sample |D| samples Roughly contains 63.2% of the original records. Why?? Bagging Use a bootstrap sample as the training set for each classifierBagging Example: Bagging Example Record (x,y) x: attribute y: class label Base classifier: decision tree with one level xk Ensemble classifier: 10 classifiers, majority voteBagging Example – Dataset: Bagging Example – Dataset X Y 0.1 1 0.2 1 0.3 1 0.4 -1 0.5 -1 0.6 -1 0.7 -1 0.8 1 0.9 1 1.0 1Bagging Example – Bagging: Bagging Example – BaggingBagging Example – Classification: Bagging Example – ClassificationAbout Bagging: About Bagging Reduces the errors associated with random fluctuations in the training data for unstable classifiers, e.g. decision trees, rule-based classifiers, ANN May degrade the performance of stable classifiers, e.g. Bayesian network, SVM, k-NNIntuition for Boosting: Intuition for Boosting Sample with weights hard-to-classify records should be chosen more often Combine the prediction of the base classifiers with weights Classifiers with lower error rates get more voting power Boosting – Training: Boosting – Training For k classifiers, do k rounds of Assign a weight to each record Sample with replacement according to the weights Train a classifier Mi Calculate error(Mi) Update the weights of the records Increase the weights of the misclassified records Decrease the weights of the correctly classified recordsBoosting – Classification: Boosting – Classification For each class, sum up the weights of the classifiers that vote for that class The class that gets the highest sum is the predicted class Boosting Implementation: Boosting Implementation How the record weights are updated How the classifier weights are calculatedAdaboost: Adaboost Update the weights of the correctly classified records: Error rate of classifier Mi: Weight of classifier Mi:Evaluate the Accuracy of a Classifier: Evaluate the Accuracy of a Classifier Accuracy measures Accuracy rate and error rate Confusion matrix Precision and Recall (for binary classification)Example of Accuracy Measures: Example of Accuracy Measures Example Two classes C1 and C2 100 testing records with 50 C1 records and 50 C2 records 20 C1 records misclassified as C2, and 10 C2 records misclassified as C1 Accuracy measures Accuracy and error rates?? Confusion matrix?? Precision and Recall??Evaluate the Accuracy of a Classifier: Evaluate the Accuracy of a Classifier The Holdout Method Given a set of records with known class labels, use half of them for training and the other half for testing (or 2/3 for training and 1/3 for testing)Problems of the Holdout Method: Problems of the Holdout Method More records for training means less for testing, and vice versa Distribution of the data in the training/testing set may be different from the original dataset Some classifiers are sensitive to random fluctuations in the training dataRandom Subsampling: Random Subsampling Repeat the holdout method k times Take the average accuracy over the k iterations Random subsampling methods Cross-validation BootstrapK-fold Cross-validation: K-fold Cross-validation Divide the original dataset into k non-overlapping subsets Each iteration uses (k-1) subsets for training, and the remaining subset for testing Total errors are the sum of the errors in each iterationBootstrap (.632 Bootstrap): Bootstrap (.632 Bootstrap) Each iteration uses a bootstrap sample to train the classifier, and the remaining records for testing Calculate the overall accuracy:Predicating Continuous Values: Predicating Continuous Values Regression methods Linear regression Non-linear regression Other methods Some classification methods can be adapted to predict continuous valuesLinear Regression: Linear Regression Record (x,y) x: predictor variable y: response variable Model y = w0 + w1x Linear Regression Using Least-Squares Method: Linear Regression Using Least-Squares MethodMultiple Linear Regression: Multiple Linear Regression Record (x1,...,xn,y) Model:Summary: Summary Classification Problem definition and terminology Decision Tree Induction Information gain and gain ratio Model overfitting Tree pruning Naive Bayesian classification and BBN Ensemble methods Evaluation of classification accuracy Linear regression