Presentation Transcript
Mapping Nominal Values to Numbers for Effective Visualization : Mapping Nominal Values to Numbers for Effective Visualization Presented by Matthew O. Ward
Geraldine Rosario, Elke Rundensteiner, David Brown, Matthew Ward
Computer Science Department, Worcester Polytechnic Institute
Supported by NSF grant IIS-0119276.
Presented at InfoVis2003, October 20, 2003.
Visualizing Nominal Variables : Visualizing Nominal Variables Most data visualization tools are designed for numeric variables.
What if a variable is nominal?
Most tools designed for nominal variables cannot handle a large number of values.
Targeted Result : Targeted Result
Goals : Goals Main goal: To display data sets containing nominal variables in visual exploration tools
Sub-goals: For each nominal variable
To provide order and spacing to the values
To group similar values together
Desired Features of the Solution:
Data-driven
Multivariate
Scalable
Distance-preserving
Association-preserving
Proposed Approach : Proposed Approach Pre-process nominal variables using a Distance-Quantification-Classing (DQC) approach:
Distance: transform the data so that the distance between 2 nominal values can be calculated (based on the variable’s relationship with other variables)
Quantification: assign order and spacing to the nominal values
Classing: determine which values are similar to each other and can be grouped together
Each step can be accomplished using more than one technique:
Distance: Multiple Correspondence Analysis, Focused Correspondence Analysis
Quantification: Modified Optimal Scaling
Classing: Hierarchical Cluster Analysis
Slide6 : Distance-Quantification-Classing Approach
[Diagram: the input is a target variable and a data set with nominal variables. The DISTANCE STEP produces transformed data for distance calculation, the QUANTIFICATION STEP produces a nominal-to-numeric mapping, and the CLASSING STEP produces a classing tree.]
Example Input to Output : Example Input to Output Data:
Quality (3): good, ok, bad
Color (6): blue, green, orange, purple, red, white
Size (10): a to j
Task: Pre-process color based on its patterns across quality and size.
Distance Step: Correspondence Analysis : Distance Step: Correspondence Analysis How strong is the association between COLOR and QUALITY?
Can we find similar COLORs based on their association with QUALITY?
Row Percentages
         Good   Ok  Bad  Total
Blue       13   50   37    100
Green      23   46   31    100
Orange     31   47   22    100
Purple     16   46   38    100
Red        30   32   38    100
White      40   32   28    100
Similar profiles: (blue, purple)
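The profile comparison on this slide can be sketched in a few lines of Python. The snippet takes the row percentages from the table and computes a chi-square distance between every pair of color profiles. One assumption is needed: the slide gives percentages rather than counts, so every color is treated as equally frequent and the average profile stands in for the true column masses. The names `profiles` and `chi2_distance` are illustrative only.

```python
from itertools import combinations
from math import sqrt

# Row percentages (Good, Ok, Bad) copied from the slide.
profiles = {
    "Blue":   (13, 50, 37),
    "Green":  (23, 46, 31),
    "Orange": (31, 47, 22),
    "Purple": (16, 46, 38),
    "Red":    (30, 32, 38),
    "White":  (40, 32, 28),
}

# ASSUMPTION: equal row masses, so the average profile approximates
# the column masses that a real contingency table would provide.
n_rows = len(profiles)
avg_profile = [sum(p[j] for p in profiles.values()) / n_rows for j in range(3)]


def chi2_distance(a, b):
    """Chi-square distance between two row profiles.

    Each squared difference is divided by the average share of that column,
    so differences in rare columns weigh more than in common ones.
    """
    return sqrt(sum((x - y) ** 2 / m for x, y, m in zip(a, b, avg_profile)))


distances = {
    (u, v): chi2_distance(profiles[u], profiles[v])
    for u, v in combinations(profiles, 2)
}

# The pair of colors with the most similar quality profile.
closest = min(distances, key=distances.get)
print(closest)
```

With these numbers the closest pair is blue and purple, which matches the similar profiles named on the slide.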
Slide9 : Similar column profiles are combined to produce fewer independent dimensions. [Singular Value Decomposition, etc.]
Similar row profiles: (blue, purple), …
Similar column profiles: (ok, bad), …
Coordinates for Independent Dimensions
          Dim1    Dim2
Blue     -0.02   -0.28
Green    -0.54    0.14
Orange    0.55    0.10
Purple    0      -0.25
Red      -0.50    0.20
White     0.57    0.19
[Diagram: Focused Correspondence Analysis (FCA) and Multiple Correspondence Analysis (MCA), each shown over the variables color, quality, and size.]
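For reference, the textbook form of simple correspondence analysis, which produces coordinates of this kind through a singular value decomposition, can be written as below. The symbols ($N$, $n$, $P$, $r$, $c$, $S$, $F$) are standard notation introduced here, not taken from the slides, and the slides do not state which normalization the authors used.

```latex
% N: contingency table of counts, n: grand total of N
P = \tfrac{1}{n} N, \qquad r = P\,\mathbf{1}, \qquad c = P^{\top}\mathbf{1},
\qquad D_r = \operatorname{diag}(r), \qquad D_c = \operatorname{diag}(c)

% Standardized residuals and their singular value decomposition
S = D_r^{-1/2}\,\bigl(P - r\,c^{\top}\bigr)\,D_c^{-1/2} = U\,\Sigma\,V^{\top}

% Principal coordinates of the row categories (one row per nominal value)
F = D_r^{-1/2}\,U\,\Sigma
```

When all dimensions are kept, Euclidean distances between rows of $F$ equal the chi-square distances between the corresponding row profiles. Keeping only the leading columns of $F$ gives a low-dimensional table such as Dim1 and Dim2.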
Quantification Step: Modified Optimal Scaling : Quantification Step: Modified Optimal Scaling Coordinates for Independent Dimensions
          Dim1    Dim2
Blue     -0.02   -0.28
Green    -0.54    0.14
Orange    0.55    0.10
Purple    0      -0.25
Red      -0.50    0.20
White     0.57    0.19
[Figure: nominal-to-numeric mapping]
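As a rough illustration of how coordinates turn into an order and a spacing, the sketch below uses only the Dim1 coordinate of each color as its numeric value. This is plain first-dimension scaling, assumed here for simplicity. The slides do not spell out what the modification in Modified Optimal Scaling adds, so this should not be read as the authors' exact procedure.

```python
# Dim1 coordinates copied from the slide.
dim1 = {
    "Blue": -0.02,
    "Green": -0.54,
    "Orange": 0.55,
    "Purple": 0.0,
    "Red": -0.50,
    "White": 0.57,
}

# ASSUMPTION: the numeric value of each color is simply its Dim1 coordinate.
# Sorting by that value gives the order; the differences give the spacing.
mapping = dict(sorted(dim1.items(), key=lambda item: item[1]))
order = list(mapping)
gaps = [round(mapping[b] - mapping[a], 2) for a, b in zip(order, order[1:])]

print(order)  # colors from lowest to highest coordinate
print(gaps)   # spacing between neighbouring colors
```

Under this assumption the order is green, red, blue, purple, orange, white. Small gaps (green and red, blue and purple, orange and white) mark values that sit close together, and large gaps separate the groups.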
Classing Step: Hierarchical Cluster Analysis : Classing Step: Hierarchical Cluster Analysis Cluster analysis weighted by counts [from FCA]
[Figure: classing tree with leaves blue, purple, green, red, orange, white; axis: info loss, 0 to 100]
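A minimal sketch of the classing step, assuming a Ward-style criterion: at each step, merge the two groups whose union loses the least weighted inertia, and report the cumulative loss as a percentage of the total. The slide's counts are not in the transcript, so equal weights are assumed. The function name `class_values` and its return format are hypothetical, and the authors' exact linkage rule may differ.

```python
from itertools import combinations

# (Dim1, Dim2) coordinates copied from the slide.
coords = {
    "Blue": (-0.02, -0.28),
    "Green": (-0.54, 0.14),
    "Orange": (0.55, 0.10),
    "Purple": (0.0, -0.25),
    "Red": (-0.50, 0.20),
    "White": (0.57, 0.19),
}


def merge_cost(a, b):
    """Inertia lost by merging clusters a and b (Ward criterion)."""
    (_, wa, ca), (_, wb, cb) = a, b
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return wa * wb / (wa + wb) * d2


def class_values(coords, counts=None):
    """Return a list of (members_a, members_b, cumulative_loss_percent)."""
    if counts is None:
        # ASSUMPTION: equal weights, because the counts are not in the transcript.
        counts = {name: 1.0 for name in coords}
    # Each cluster is (member names, weight, weighted centroid).
    clusters = [((n,), float(counts[n]), tuple(coords[n])) for n in coords]
    total_w = sum(w for _, w, _ in clusters)
    dims = range(len(clusters[0][2]))
    grand = [sum(w * c[d] for _, w, c in clusters) / total_w for d in dims]
    total = sum(w * sum((c[d] - grand[d]) ** 2 for d in dims) for _, w, c in clusters)

    merges, lost = [], 0.0
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: merge_cost(*pair))
        lost += merge_cost(a, b)
        w = a[1] + b[1]
        centroid = tuple((a[1] * x + b[1] * y) / w for x, y in zip(a[2], b[2]))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append((a[0] + b[0], w, centroid))
        merges.append((a[0], b[0], 100.0 * lost / total))
    return merges


merges = class_values(coords)
for left, right, loss in merges:
    print(left, "+", right, f"cumulative info loss: {loss:.1f}%")
```

With equal weights, the first three merges pair blue with purple, green with red, and orange with white, which agrees with the leaf order on the slide. The cumulative loss reaches 100% at the final merge, when all values fall into one class.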
Experimental Evaluation : Experimental Evaluation Wrong quantification and classing can introduce artificial patterns and cause errors in interpretation.
Evaluation measures:
Perception: believability; quality of visual display
Statistical: quality of classing; quality of quantification
Computational: space (FCA uses less space); run time (MCA is faster)
Test Data Sets : Test Data Sets * UCI Repository of Machine Learning Databases
Believability and Quality of Visual Display : Believability and Quality of Visual Display Given two displays resulting from different nominal-to-numeric mappings:
Which mapping gives a more believable ordering and spacing?
Based on your domain knowledge, are the values that are positioned close together similar to each other?
Are the values that are positioned far from the rest of the values really outliers?
Which display has less clutter?
Believability and Quality of Visual Display : Believability and Quality of Visual Display Are these patterns believable?
[Figure: Automobile data, alphabetical order, equal spacing]
Believability and Quality of Visual Display : Believability and Quality of Visual Display Are these patterns believable?
[Figure: Automobile data, FCA]
Quality of Classing : Quality of Classing Classing A is better than classing B if, given a classing tree, the rate of information loss with each merging is slower.
[Figure: information loss due to classing for one variable. The lower the line, the slower the info loss, the better the classing.]
Calculate the difference between the lines, then summarize.
Depends on the data set.
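The comparison of two information-loss lines can be sketched as follows. The slide does not say how the differences are summarized, so the mean difference across merge steps is assumed here. The two curves are invented purely for illustration and are not results from the talk.

```python
def mean_gap(loss_a, loss_b):
    """Average of (loss_a - loss_b) over the merge steps.

    A negative value means classing A loses information more slowly than B.
    """
    if len(loss_a) != len(loss_b):
        raise ValueError("both curves must cover the same number of merges")
    return sum(a - b for a, b in zip(loss_a, loss_b)) / len(loss_a)


# HYPOTHETICAL cumulative information loss (%) after each merge.
curve_a = [1, 3, 6, 45, 100]
curve_b = [2, 8, 20, 60, 100]

gap = mean_gap(curve_a, curve_b)
print(gap)
```

For these made-up curves the mean gap is negative, so classing A would be judged better under the slide's rule.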
Quality of Quantification : Quality of Quantification A quantification is good if:
Data points that are close together in nominal space are also close together in numeric space.
Two variables that are highly associated with each other have quantified versions that are also highly correlated.
MCA gives better quantification for most data sets, based on the average squared correlation measure.
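One plausible reading of the average squared correlation measure is the mean of the squared Pearson correlations over all pairs of quantified variables. The slides give no formula, so that definition is an assumption, and the function names and the sample columns below are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def avg_squared_correlation(columns):
    """Mean of r^2 over every pair of columns (ASSUMED definition)."""
    pairs = list(combinations(columns.values(), 2))
    return sum(pearson(x, y) ** 2 for x, y in pairs) / len(pairs)


# HYPOTHETICAL quantified columns, one numeric value per record.
quantified = {
    "color": [-0.54, -0.50, -0.02, 0.0, 0.55, 0.57],
    "quality": [-0.40, -0.35, 0.05, -0.05, 0.30, 0.45],
    "size": [0.10, -0.20, 0.30, -0.10, 0.20, -0.30],
}

score = avg_squared_correlation(quantified)
print(round(score, 3))
```

Under this reading, a higher score means the numeric versions preserve more of the association between variables, so two candidate mappings for the same data set can be compared by their scores.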
Summary : Summary DQC is a general-purpose approach for pre-processing nominal variables for data analysis techniques requiring numeric variables (e.g., linear regression) or low-cardinality nominal variables (e.g., association rules)
DQC – multivariate, data-driven, scalable, distance-preserving, association-preserving
FCA is a viable alternative to MCA when memory space is limited
Quality of classing and quantification depends on the strength of associations within the data set and is in the eye of the user
Next Steps : Next Steps Stress test the technique with more experiments
Perform user study that measures the quality of the visual display resulting from MCA vs. FCA
Further investigate tuning parameters and sensitivity to characteristics of the data set
Mixed or numeric variables as analysis variables
Cascaded Focused Correspondence Analysis
Related Work : Related Work Visualizing nominal data:
CA plots [Fri99], sieve diagrams, mosaic displays, fourfold displays, Dimensional Stacking, TreeMaps
Quantification:
optimal scaling, homogeneity analysis [Gre93]
Classing nominal variables:
loss of inertia [Gre93], decision trees, concept hierarchy
Clustering nominal variables:
k-prototypes [Hua97b]
For further information : For further information
XmdvTool Homepage:
http://davis.wpi.edu/~xmdv
xmdv@cs.wpi.edu
Code is free for research and education.
Contact author: ger@wpi.edu