Comparison of red & white wine quality decision tree classifiers: Roberto Kao, DSCI501 Project Presentation, 5/5/2017
Goals: Compare classifier accuracy across red wine, white wine, and both wines combined. Analyze how accuracy changes as the classifiers are fine-tuned. Explore whether a red wine classifier can predict white wine quality scores, and vice versa.
The Data: The data covers red and white variants of Portuguese “Vinho Verde” wine. Predictors are physicochemical properties of the wine. The class variable “quality” is ordinal, between 0 and 10.
Methodology: Set the random seed to 3 for reproducibility and consistency. Split the data into training (70%) and testing (30%) sets. Models evaluated: baseline model, feature selection, maximum tree depth, class weights, aggregate model.
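The split and baseline steps can be sketched with scikit-learn as follows. This is a minimal illustration, not the presenter's code: the data is a synthetic stand-in generated with `make_classification` (the row count, number of classes, and variable names are assumptions), and only the seed of 3 and the 70/30 split come from the slide.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 3  # random seed stated on the slide

# Synthetic stand-in for the wine data (assumption): 1,000 rows,
# 11 numeric predictors, 3 quality classes.
X, y = make_classification(
    n_samples=1000, n_features=11, n_informative=6,
    n_classes=3, random_state=SEED,
)

# 70% training / 30% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=SEED
)

# Baseline: a decision tree with default hyper-parameters.
baseline = DecisionTreeClassifier(random_state=SEED)
baseline.fit(X_train, y_train)

# score() returns the fraction of correctly classified test rows.
accuracy = baseline.score(X_test, y_test)
print(f"baseline test accuracy: {accuracy:.2%}")
```

Fixing `random_state` in both the split and the tree keeps repeated runs identical, which is what makes the later cases comparable to the baseline.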
Methodology 2: Use the red wine data set as the training set and the white wine data set as the testing set, then vice versa.
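A sketch of this cross-data-set evaluation, under stated assumptions: the two "wine" arrays below are synthetic stand-ins carved out of one `make_classification` sample, and the helper name `cross_accuracy` is hypothetical. The point is only that no train/test split is needed here, because one full data set trains the tree and the other full data set tests it.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

SEED = 3

# Synthetic stand-ins (assumption): the first 400 rows play the role of
# the red wine data, the remaining 800 rows the white wine data.
X_all, y_all = make_classification(
    n_samples=1200, n_features=11, n_informative=6,
    n_classes=3, random_state=SEED,
)
X_red, y_red = X_all[:400], y_all[:400]
X_white, y_white = X_all[400:], y_all[400:]


def cross_accuracy(X_fit, y_fit, X_eval, y_eval):
    """Fit a tree on one data set and return its accuracy on the other."""
    tree = DecisionTreeClassifier(random_state=SEED)
    tree.fit(X_fit, y_fit)
    return tree.score(X_eval, y_eval)


red_on_white = cross_accuracy(X_red, y_red, X_white, y_white)
white_on_red = cross_accuracy(X_white, y_white, X_red, y_red)
print(f"red -> white: {red_on_white:.2%}, white -> red: {white_on_red:.2%}")
```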
Class Imbalance Issue: There is much more “normal” wine than bad or great wine. Pruning was not supported in the scikit-learn Python package at the time of this project. The solution is to remove the classes with the fewest observations. Classes 3 and 9 have the fewest members: 30 and 5, respectively.
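Removing the rarest classes can be sketched in plain Python. The helper `drop_rare_classes` is hypothetical, and in the toy labels only the counts for classes 3 and 9 (30 and 5) come from the slide; the counts for classes 5, 6, and 7 are made up for illustration.

```python
from collections import Counter


def drop_rare_classes(rows, labels, n_drop=2):
    """Remove the n_drop classes that have the fewest observations."""
    counts = Counter(labels)
    # Sort classes from rarest to most common and take the first n_drop.
    rarest = sorted(counts.items(), key=lambda item: item[1])[:n_drop]
    rare = {cls for cls, _ in rarest}
    kept = [(row, lab) for row, lab in zip(rows, labels) if lab not in rare]
    kept_rows = [row for row, _ in kept]
    kept_labels = [lab for _, lab in kept]
    return kept_rows, kept_labels, rare


# Toy labels: 30 and 5 are from the slide, the other counts are illustrative.
labels = [3] * 30 + [9] * 5 + [5] * 200 + [6] * 300 + [7] * 100
rows = list(range(len(labels)))  # placeholder feature rows

kept_rows, kept_labels, dropped = drop_rare_classes(rows, labels)
print(sorted(dropped), len(kept_labels))
```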
CASE 0: Baseline Models: Red accuracy: 64.15%. White accuracy: 58.48%. Both accuracy: 59.93%.
CASE 1: Feature Selection: Red accuracy: 58.07%. White accuracy: 58.00%. Both accuracy: 58.43%. Starting from the baseline, each feature's importance weight is compared with the mean of all features’ importance weights. If it is greater than or equal to the mean, the feature is kept; otherwise, it is discarded.
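The keep-or-discard rule can be written in a few lines. The importance weights below are hypothetical values chosen for illustration, not results from the presentation, and the function name is an assumption. In scikit-learn, `SelectFromModel` with `threshold="mean"` applies a comparable rule to a fitted tree's `feature_importances_`.

```python
from statistics import mean


def select_by_mean_importance(importances):
    """Keep features whose importance is >= the mean importance."""
    threshold = mean(importances.values())
    return [name for name, weight in importances.items() if weight >= threshold]


# Hypothetical importance weights (illustrative only, summing to 1).
importances = {
    "alcohol": 0.35,
    "sulphates": 0.25,
    "volatile acidity": 0.18,
    "pH": 0.12,
    "density": 0.10,
}

selected = select_by_mean_importance(importances)
print(selected)
```

With these weights the mean is about 0.20, so only the first two features clear the threshold.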
CASE 2: Max Depth: Red accuracy: 54.30%. White accuracy: 56.84%. Both accuracy: 60.03%. A grid search was run over the maximum tree depth (values 3 to 20).
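One way such a search might be set up is with scikit-learn's `GridSearchCV`. The slide gives only the depth range, so the 5-fold cross-validation, the accuracy scoring, and the synthetic data here are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 3

# Synthetic stand-in for the wine data (assumption).
X, y = make_classification(
    n_samples=600, n_features=11, n_informative=6,
    n_classes=3, random_state=SEED,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=SEED
)

# Candidate depths 3 to 20 inclusive, as stated on the slide.
param_grid = {"max_depth": list(range(3, 21))}

# cv=5 is an assumption; the slide does not say how candidates were scored.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=SEED),
    param_grid, cv=5, scoring="accuracy",
)
search.fit(X_train, y_train)

best_depth = search.best_params_["max_depth"]
test_accuracy = search.score(X_test, y_test)  # uses the refit best tree
print(f"best max_depth: {best_depth}, test accuracy: {test_accuracy:.2%}")
```

The depth is chosen on the training folds only, so the held-out 30% still gives an untouched estimate of test accuracy.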
CASE 3: Class Weights: Red accuracy: 59.54%. White accuracy: 56.70%. Both accuracy: 58.48%. Weights are assigned to the members of each class to address class imbalance.
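The slide does not say how the weights were computed. As one plausible scheme, the sketch below reproduces the heuristic behind scikit-learn's `class_weight="balanced"` option, `n_samples / (n_classes * class_count)`, on toy labels whose counts are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Toy labels (illustrative counts): class 5 is common, class 7 is rare.
labels = [5] * 60 + [6] * 30 + [7] * 10

weights = balanced_class_weights(labels)
for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
```

Rare classes receive larger weights, so misclassifying one of their members costs the tree more during training.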
CASE X: Aggregate of previous cases: Red accuracy: 58.28%. White accuracy: 58.07%. Both accuracy: 55.80%. This model combines the previous three adjustments.
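A combined model along these lines could be assembled as a scikit-learn `Pipeline`. This is a sketch under assumptions, not the presenter's implementation: the data is synthetic, the mean-importance selection is done with `SelectFromModel`, the weights use `class_weight="balanced"`, and the depth is searched with 5-fold cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

SEED = 3

# Synthetic stand-in for the wine data (assumption).
X, y = make_classification(
    n_samples=600, n_features=11, n_informative=6,
    n_classes=3, random_state=SEED,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=SEED
)

pipeline = Pipeline([
    # Step 1: keep features whose importance is >= the mean importance.
    ("select", SelectFromModel(
        DecisionTreeClassifier(random_state=SEED), threshold="mean")),
    # Step 2: class-weighted tree; max_depth is tuned by the grid search.
    ("tree", DecisionTreeClassifier(
        class_weight="balanced", random_state=SEED)),
])

# Step 3: search depths 3 to 20 (cv=5 is an assumption).
search = GridSearchCV(
    pipeline, {"tree__max_depth": list(range(3, 21))},
    cv=5, scoring="accuracy",
)
search.fit(X_train, y_train)

aggregate_accuracy = search.score(X_test, y_test)
selector = search.best_estimator_.named_steps["select"]
n_selected = int(selector.get_support().sum())
print(f"features kept: {n_selected}, test accuracy: {aggregate_accuracy:.2%}")
```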
CASE Y: Red Predicting White: Accuracy of the red wine classifier predicting white wine quality: 46.15%.
CASE Z: White Predicting Red: Accuracy of the white wine classifier predicting red wine quality: 23.22%.
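Highlight of Results: Tuning the decision tree classifiers yields at most a marginal gain in test performance (+0.10 points for both wines with max depth) and lowers accuracy in the other cases. The red wine classifier generally performs better than the white wine classifier. In fact, the red wine classifier predicting white wine (46.15%) is approximately twice as accurate as the white wine classifier predicting red wine (23.22%).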
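The comparison can be checked directly from the accuracies reported on the earlier slides. The short script below uses only those reported figures; the dictionary layout and variable names are illustrative.

```python
# Test accuracies (%) as reported on the CASE slides.
results = {
    "baseline":          {"red": 64.15, "white": 58.48, "both": 59.93},
    "feature selection": {"red": 58.07, "white": 58.00, "both": 58.43},
    "max depth":         {"red": 54.30, "white": 56.84, "both": 60.03},
    "class weights":     {"red": 59.54, "white": 56.70, "both": 58.48},
    "aggregate":         {"red": 58.28, "white": 58.07, "both": 55.80},
}

# Change in accuracy (percentage points) relative to the baseline.
deltas = {
    case: {wine: round(acc[wine] - results["baseline"][wine], 2) for wine in acc}
    for case, acc in results.items()
    if case != "baseline"
}
for case, change in deltas.items():
    print(case, change)

# Cross-prediction: red predicting white vs. white predicting red.
ratio = 46.15 / 23.22
print(round(ratio, 2))
```

The ratio comes out just under 2, which matches the "approximately twice as accurate" statement.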
Future Directions: Consider other hyper-parameter adjustments. Try other classifier types: support vector machines, random forests, naïve Bayes. Use a balanced data set. Include wines from all over the world.
References: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009.