class

Uploaded from authorPOINTLite
Views:
 
Category: Entertainment
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

CS522 Advanced Database Systems Classification: 

CS522 Advanced Database Systems Classification Chengyu Sun California State University, Los Angeles

A Classification Problem: 

A Classification Problem Is a loan to a person who is 45 years old, divorced, renting an apartment, with two kids and annual income of 100K high risk or low risk?

Terminology and Concepts ...: 

Terminology and Concepts ... Record (or tuple) Attributes E.g. age, marital status, # of kids, owns home or not, credit score ... Class label E.g. high risk, low risk ... Classification: predict the class label with given attribute values

... Terminology and Concepts: 

... Terminology and Concepts Classifier (or model) Training set: records with known class labels that are used to construct (i.e. train) the classifier Classifier Attribute values Class label Classifier Training set Step 1: Step 2:

Classification vs. Regression: 

Classification vs. Regression Classification predicts categorical attribute values Regression predicts continuous numerical attribute values SID HW1 HW2 HW3 Final Pass/Fail 1 40 60 70 95 Passed 2 10 15 11 65 Failed 3 30 45 40 75 Passed 4 35 50 35 ? ?

A Training Set: 

A Training Set TID Home Owner Marital Status Annual Income Defaulted Borrower 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes

A Decision Tree: 

A Decision Tree Home owner Marital Status Annual income Defaulted=No Defaulted=No Defaulted=No Defaulted=Yes Yes No Married Single, Divorced <80K >=80K

Decision Tree Induction: 

Decision Tree Induction Let D be the set of training record associated with current node If all record in D belong to the same class C, current node is a leaf node and is labeled as C. If D contains records that belong to more than one class, select an attribute to split D into subsets, and create a child node for each subset. Apply the algorithm recursively on each child node.

Decision Tree Induction Example: 

Decision Tree Induction Example Training set  Decision tree

Split on Attributes with Discrete Values: 

Split on Attributes with Discrete Values Marital Status Married Single, Divorced Marital Status Married Single Divorced

Split on Attributes with Continuous Values: 

Split on Attributes with Continuous Values Annual income <80K >=80K Annual income 0-40K 80K-100K 40K-80K >100k

Terminating Conditions: 

Terminating Conditions All records in D belong to the same class No more attribute to split Class label?? No records associated with the node Class label??

Splitting Attribute Selection: 

Splitting Attribute Selection After a split, ideally each subset would “pure”, i.e. contains only one class of records Gender Age Preferred color female 20 pink male 20 black female 15 pink male 15 black

Attribute Selection Measures: 

Attribute Selection Measures Entropy (Information Gain) Gini index Gain Ratio

Entropy: 

Entropy pi is the fraction of records in D that belongs to class Ci m is the number of classes in D

Entropy Example: 

Entropy Example Preferred color 2 black and 2 pink?? 3 black and 1 pink?? 4 black??

Information Gain: 

Information Gain Suppose D is split into v subsets on attribute A

Information Gain Example: 

Information Gain Example Preferred color Gain(Gender)?? Gain(Age)??

Split Information: 

Split Information Information gain favors attributes with lots of distinct values Split information can be used to “normalized” information gain

Gain Ratio: 

Gain Ratio

Gini Index: 

Gini Index Used in the CART algorithm for binary split

Gini Index Example: 

Gini Index Example Preferred color 2 black and 2 pink?? 3 black and 1 pink?? 4 black?? Split on gender?? Split on age??

Training Error and Testing Error: 

Training Error and Testing Error Training error Misclassification of training records Testing (Generalization) error Misclassification of testing records

Model Overfitting and Underfitting: 

Model Overfitting and Underfitting # of nodes in the decision tree Error Rate Testing error Training error overfitting underfitting

Overfitting Due to Outliers ...: 

Overfitting Due to Outliers ... 1 2 3 4 5 6 7 8 1 2 3 4

... Overfitting Due to Outliers: 

... Overfitting Due to Outliers 1 2 3 4 5 6 7 8 1 2 3 4

Other Reasons That Cause Overfitting : 

Other Reasons That Cause Overfitting Noise (mislabeled training records) Lack of representative samples

Tree Pruning – Prepruning: 

Tree Pruning – Prepruning Prune during decision tree construction Number of records < threshold “Purity gain” < threshold

Tree Pruning – Postpruning: 

Tree Pruning – Postpruning Buttom-up pruning of a fully constructed tree Replace a subtree with a leaf node if it reduces testing error How do we know whether it reduces testing error or not?? Pruning based on Minimum Description Length (MDL)

Estimate Testing Errors: 

Estimate Testing Errors Use a pruning set in addition to the training set Optimistic error estimation The training set is a good representation of the overall data (optimistic!), so the training error is the testing error Pessimistic error estimation Training error + penalty term for model complexity

Pessimistic Error Estimation: 

Pessimistic Error Estimation Nt – Total number of records e(ti) – Training error at leaf node ti (ti) – Penalty term associated with ti

Example of Pessimistic Error Estimation: 

Example of Pessimistic Error Estimation eg(T) with (ti)=0.5?? (ti)=1?? C1:3 C2:1 C1:2 C2:1 C1:0 C2:2 C1:1 C2:2 C1:3 C2:0 C1:3 C2:1 C1:0 C2:5 C1:5 C2:2 C1:1 C2:4 C1:3 C2:0 C1:3 C2:6

Pruning with Estimated Testing Error: 

Pruning with Estimated Testing Error With optimistic error estimation?? With pessimistic error estimation?? With a pruning set?? C1:11 C2:3 C1:2 C2:4 C1:14 C2:3 C1:2 C2:2

Minimum Description Length (MDL): 

Minimum Description Length (MDL) The best model is the one that minimizes the number of bits to encode both the model and the exceptions to the model A B Class labels Model + exceptions or

MDL Example ...: 

MDL Example ... n records m binary attributes k classes Cost(Internal Node) = log2m Cost(Leaf node) = log2k Cost(Error) = log2n Cost = Cost(All Nodes)+Cost(All Errors)

... MDL Example: 

... MDL Example 16 binary attributes and 3 classes Left tree has 7 errors and right tree has 4 errors C1 C2 C3 C1 C1 C2 C2 C3

About Decision Tree Classification ...: 

About Decision Tree Classification ... Inexpensive to construct Extremely fast at classifying unknown records Easy to interpret for small-sized trees Accuracy is comparable to other classification techniques for many simple data sets

... About Decision Tree Classification: 

... About Decision Tree Classification Data fragmentation Finding an optimal decision tree is NP-hard Limitation on expressiveness Tree replication 1 0 1 P Q R S 0 0 1 Q S 0 0 1 1

Bayes’ Theorem: 

Bayes’ Theorem Prior and posterior probabilities P(A) and P(A|B) P(B) and P(B|A)

Bayesian Classification: 

Bayesian Classification X is a given record with attribute values (x1,x2,...,xn), and Ci is a class P(Ci|X) is the probability of X belonging to class Ci given X’s attribute values We predict that X belong to Ci if P(Ci|X)>P(Cj|X) for ji

Calculate P(Ci|X): 

Calculate P(Ci|X) P(X) does not need to be calculated Why?? P(Ci)?? P(X|Ci)??

Naive Bayesian Classification: 

Naive Bayesian Classification X=(x1,x2,...,xn) Assume the attribute values are conditionally independent of one another (the naive assumption)

Attribute Ak is Categorical: 

Attribute Ak is Categorical P(xk|Ci) is the fraction of number of records in Ci with value xk for attribute Ak

Attribute Ak is Continuous-valued ...: 

Attribute Ak is Continuous-valued ... Assume Ak follows a Gaussian distribution with a mean  and standard deviation 

... Attribute Ak is Continuous-valued : 

... Attribute Ak is Continuous-valued

Sample Data: 

Sample Data TID Home Owner Marital Status Annual Income Defaulted Borrower 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 11 No Married 120K ??

Naive Bayesian Classification Example ...: 

Naive Bayesian Classification Example ... P(No|HO=No,MS=M,AI=120K) vs. P(Yes|HO=No,MS=M,AI=120K)

... Naive Bayesian Classification Example: 

... Naive Bayesian Classification Example Annual Income, Default=No =110, =54.54 P(AI=120K|No)=0.0072 Annual Income, Default=Yes =90, =5 P(AI=120K|Yes)=1.210-9

Avoid Zero P(xk|Ci): 

Avoid Zero P(xk|Ci) A zero P(xk|Ci) would make the whole P(X|Ci) zero To avoid this problem, add 1 to to each count, assuming the training set is sufficiently large that the effect of adding one is negligible Example Low income:0 Medium income: 990 High income: 10

About Naive Bayesian Classification: 

About Naive Bayesian Classification The most accurate classification if the conditional independence assumption holds In practice, some attributes may be correlated E.g. education level and annual income

Bayesian Belief Network (BBN): 

Bayesian Belief Network (BBN) A directed acyclic graph (dag) encoding the dependencies among a set of variables A conditional probability table (CPT) for each node given its immediate parent nodes

A BBN Example: 

A BBN Example FamilyHistory Smoker LungCancer Emphysema PositiveXRay Dyspnea CPT for LungCancer

BBN Terminology: 

BBN Terminology If there is a directed arc from X to Y X is a parent of Y Y is a child of X If there is a directed path from X to Y X is an ancestor of Y Y is a descendent of X

Conditional Independence in BBN: 

Conditional Independence in BBN A node in a Bayesian network is conditionally independent of its non-descendants if its parents are known

Naive Bayesian is a Special Case of BBN: 

Naive Bayesian is a Special Case of BBN C A1 A2 A3 An

Construct a BBN: 

Construct a BBN Create the structure of the network From domain knowledge From training data Calculate the CPT for each node All variables are observable Some variable may be hidden

Another BBN Example: 

Another BBN Example

Bayesian Classification Examples: 

Bayesian Classification Examples Output node – Heart Disease Testing data () (BP=high) (BP=high,D=Healthy,E=Yes)

Bayesian Classification Examples – 1: 

Bayesian Classification Examples – 1

Bayesian Classification Examples – 2: 

Bayesian Classification Examples – 2

Bayesian Classification Examples – 3: 

Bayesian Classification Examples – 3

Other Classification Methods: 

Other Classification Methods Rule-based Artificial Neural Network (ANN) Support Vector Machine (SVM) Association rule analysis Nearest neighbor Genetic algorithms Rough Set and Fuzzy Set thoery ...

Ensemble Methods: 

Ensemble Methods Use a number of base classifiers, and make a predication by combining the predications of all the classifiers Example Binary classification 25 classifiers, each with error rate 35% Predict by majority vote Error rate of the ensemble classifier??

Construct an Ensemble Classifier: 

Construct an Ensemble Classifier Train k classifiers with one dataset Use the same dataset for each classifier?? Divide the dataset into k subsets?? Bagging and Boosting

Bagging (Bootstrap Aggregation): 

Bagging (Bootstrap Aggregation) A bootstrap sampling of |D| Uniform sampling with replacement (vs. without replacement) Allow duplicates in the sample |D| samples Roughly contains 63.2% of the original records. Why?? Bagging Use a bootstrap sample as the training set for each classifier

Bagging Example: 

Bagging Example Record (x,y) x: attribute y: class label Base classifier: decision tree with one level xk Ensemble classifier: 10 classifiers, majority vote

Bagging Example – Dataset: 

Bagging Example – Dataset X Y 0.1 1 0.2 1 0.3 1 0.4 -1 0.5 -1 0.6 -1 0.7 -1 0.8 1 0.9 1 1.0 1

Bagging Example – Bagging: 

Bagging Example – Bagging

Bagging Example – Classification: 

Bagging Example – Classification

About Bagging: 

About Bagging Reduces the errors associated with random fluctuations in the training data for unstable classifiers, e.g. decision trees, rule-based classifiers, ANN May degrade the performance of stable classifiers, e.g. Bayesian network, SVM, k-NN

Intuition for Boosting: 

Intuition for Boosting Sample with weights hard-to-classify records should be chosen more often Combine the prediction of the base classifiers with weights Classifiers with lower error rates get more voting power

Boosting – Training: 

Boosting – Training For k classifiers, do k rounds of Assign a weight to each record Sample with replacement according to the weights Train a classifier Mi Calculate error(Mi) Update the weights of the records Increase the weights of the misclassified records Decrease the weights of the correctly classified records

Boosting – Classification: 

Boosting – Classification For each class, sum up the weights of the classifiers that vote for that class The class that gets the highest sum is the predicted class

Boosting Implementation: 

Boosting Implementation How the record weights are updated How the classifier weights are calculated

Adaboost: 

Adaboost Update the weights of the correctly classified records: Error rate of classifier Mi: Weight of classifier Mi:

Evaluate the Accuracy of a Classifier: 

Evaluate the Accuracy of a Classifier Accuracy measures Accuracy rate and error rate Confusion matrix Precision and Recall (for binary classification)

Example of Accuracy Measures: 

Example of Accuracy Measures Example Two classes C1 and C2 100 testing records with 50 C1 records and 50 C2 records 20 C1 records misclassified as C2, and 10 C2 records misclassified as C1 Accuracy measures Accuracy and error rates?? Confusion matrix?? Precision and Recall??

Evaluate the Accuracy of a Classifier: 

Evaluate the Accuracy of a Classifier The Holdout Method Given a set of records with known class labels, use half of them for training and the other half for testing (or 2/3 for training and 1/3 for testing)

Problems of the Holdout Method: 

Problems of the Holdout Method More records for training means less for testing, and vice versa Distribution of the data in the training/testing set may be different from the original dataset Some classifiers are sensitive to random fluctuations in the training data

Random Subsampling: 

Random Subsampling Repeat the holdout method k times Take the average accuracy over the k iterations Random subsampling methods Cross-validation Bootstrap

K-fold Cross-validation: 

K-fold Cross-validation Divide the original dataset into k non-overlapping subsets Each iteration uses (k-1) subsets for training, and the remaining subset for testing Total errors are the sum of the errors in each iteration

Bootstrap (.632 Bootstrap): 

Bootstrap (.632 Bootstrap) Each iteration uses a bootstrap sample to train the classifier, and the remaining records for testing Calculate the overall accuracy:

Predicating Continuous Values: 

Predicating Continuous Values Regression methods Linear regression Non-linear regression Other methods Some classification methods can be adapted to predict continuous values

Linear Regression: 

Linear Regression Record (x,y) x: predictor variable y: response variable Model y = w0 + w1x

Linear Regression Using Least-Squares Method: 

Linear Regression Using Least-Squares Method

Multiple Linear Regression: 

Multiple Linear Regression Record (x1,...,xn,y) Model:

Summary: 

Summary Classification Problem definition and terminology Decision Tree Induction Information gain and gain ratio Model overfitting Tree pruning Naive Bayesian classification and BBN Ensemble methods Evaluation of classification accuracy Linear regression