infovis03 talk slides


Mapping Nominal Values to Numbers for Effective Visualization : 

Presented by Matthew O. Ward.
Authors: Geraldine Rosario, Elke Rundensteiner, David Brown, Matthew Ward.
Computer Science Department, Worcester Polytechnic Institute.
Supported by NSF grant IIS-0119276.
Presented at InfoVis2003, October 20, 2003.

Visualizing Nominal Variables: 

Most data visualization tools are designed for numeric variables. What if a variable is nominal? Most tools that are designed for nominal variables cannot handle a large number of values.

Targeted Result: 

Targeted Result

Goals: 

Main goal: to display data sets containing nominal variables in visual exploration tools.

Sub-goals, for each nominal variable:
- Provide order and spacing to the values
- Group similar values together

Desired features of the solution:
- Data-driven
- Multivariate
- Scalable
- Distance-preserving
- Association-preserving

Proposed Approach: 

Pre-process nominal variables using a Distance-Quantification-Classing (DQC) approach:
- Distance: transform the data so that the distance between 2 nominal values can be calculated (based on the variable's relationship with other variables). Techniques: Multiple Correspondence Analysis, Focused Correspondence Analysis.
- Quantification: assign order and spacing to the nominal values. Technique: Modified Optimal Scaling.
- Classing: determine which values are similar to each other and can be grouped together. Technique: Hierarchical Cluster Analysis.

Each step can be accomplished using more than one technique.

Slide6: 

Distance-Quantification-Classing Approach

Input: target variable & data set with nominal variables
  -> DISTANCE STEP -> transformed data for distance calculation
  -> QUANTIFICATION STEP -> nominal-to-numeric mapping
  -> CLASSING STEP -> classing tree

Example Input to Output: 

Data:
- Quality (3): good, ok, bad
- Color (6): blue, green, orange, purple, red, white
- Size (10): a to j

Task: pre-process Color based on its patterns across Quality and Size.

Distance Step: Correspondence Analysis: 

How strong is the association between COLOR and QUALITY? Can we find similar COLORs based on their association with QUALITY?

Row percentages:

          Good    Ok   Bad   Total
Blue        13    50    37     100
Green       23    46    31     100
Orange      31    47    22     100
Purple      16    46    38     100
Red         30    32    38     100
White       40    32    28     100

Similar profiles: (blue, purple)
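The profile comparison on this slide can be sketched in a few lines. This is an illustration only: it uses plain Euclidean distance on the row percentages, whereas correspondence analysis works with a chi-square distance that also needs the column totals, which the slide does not give. The names `ROW_PCT`, `profile_distances`, and `closest_pair` are made up for this sketch.

```python
from itertools import combinations
from math import dist

# Row percentages (Good, Ok, Bad) copied from the slide.
ROW_PCT = {
    "Blue":   (13, 50, 37),
    "Green":  (23, 46, 31),
    "Orange": (31, 47, 22),
    "Purple": (16, 46, 38),
    "Red":    (30, 32, 38),
    "White":  (40, 32, 28),
}


def profile_distances(profiles):
    """Return {(a, b): Euclidean distance} for every pair of row profiles."""
    return {
        (a, b): dist(profiles[a], profiles[b])
        for a, b in combinations(profiles, 2)
    }


def closest_pair(profiles):
    """Return the pair of values whose profiles are most alike."""
    distances = profile_distances(profiles)
    return min(distances, key=distances.get)


if __name__ == "__main__":
    ranked = sorted(profile_distances(ROW_PCT).items(), key=lambda kv: kv[1])
    for pair, d in ranked[:3]:
        print(pair, round(d, 2))
```

Even with this simplified distance, blue and purple come out as the closest pair, which agrees with the slide.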

Slide9: 

Similar column profiles are combined to produce fewer independent dimensions (Singular Value Decomposition, etc.).

Similar row profiles: (blue, purple), …
Similar column profiles: (ok, bad), …

Coordinates for independent dimensions:

           Dim1    Dim2
Blue      -0.02   -0.28
Green     -0.54    0.14
Orange     0.55    0.10
Purple     0      -0.25
Red       -0.50    0.20
White      0.57    0.19

[Diagram: Focused Correspondence Analysis (FCA) and Multiple Correspondence Analysis (MCA), each drawn over the variables color, quality, and size.]
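For reference, the textbook formulation of simple correspondence analysis via SVD is sketched below. The notation is the standard one and is not taken from the slides: N is the contingency table of counts (here colors by quality levels), n is its grand total, and the rows of F are the coordinates of the row values on the independent dimensions. The slides do not say whether FCA or MCA follow exactly this form.

```latex
\begin{align}
P &= \tfrac{1}{n} N, \qquad r = P\mathbf{1}, \qquad c = P^{\top}\mathbf{1} \\
D_r &= \operatorname{diag}(r), \qquad D_c = \operatorname{diag}(c) \\
S &= D_r^{-1/2}\,\bigl(P - r c^{\top}\bigr)\,D_c^{-1/2} = U \Sigma V^{\top} \\
F &= D_r^{-1/2}\, U \Sigma
\end{align}
```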

Quantification Step: Modified Optimal Scaling: 

Input: coordinates for independent dimensions

           Dim1    Dim2
Blue      -0.02   -0.28
Green     -0.54    0.14
Orange     0.55    0.10
Purple     0      -0.25
Red       -0.50    0.20
White      0.57    0.19

Output: nominal-to-numeric mapping
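A minimal sketch of the quantification idea follows. It assumes the simplest rule, taking the first-dimension coordinate as the numeric value, which is the classical optimal scaling choice; the slide does not spell out how Modified Optimal Scaling differs, so treat this as a stand-in rather than the talk's method. The function names are hypothetical.

```python
# Coordinates (Dim1, Dim2) copied from the slide.
COORDS = {
    "Blue":   (-0.02, -0.28),
    "Green":  (-0.54, 0.14),
    "Orange": (0.55, 0.10),
    "Purple": (0.0, -0.25),
    "Red":    (-0.50, 0.20),
    "White":  (0.57, 0.19),
}


def quantify_by_first_dim(coords):
    """Map each nominal value to its Dim1 coordinate (assumed rule)."""
    return {value: xy[0] for value, xy in coords.items()}


def ordered_values(mapping):
    """Return the nominal values sorted by their assigned number."""
    return sorted(mapping, key=mapping.get)


if __name__ == "__main__":
    mapping = quantify_by_first_dim(COORDS)
    for value in ordered_values(mapping):
        print(f"{value:<7}{mapping[value]:6.2f}")
```

The mapping supplies both an order (position in the sorted list) and a spacing (the gaps between the numbers), so values with similar profiles end up next to each other on an axis.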

Classing Step: Hierarchical Cluster Analysis: 

Cluster analysis weighted by counts, applied to the coordinates from FCA.

[Dendrogram: leaves in the order blue, purple, green, red, orange, white; axis: information loss, 0 to 100.]
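The classing step can be sketched as a small agglomerative clustering over the two-dimensional coordinates shown on the earlier slides. Two assumptions are made here: centroid linkage is used (the slide does not name a linkage), and every value gets weight 1 because the counts are not given. Pass real counts through the hypothetical `weights` argument to weight the centroids. The information-loss scale of the dendrogram is not reproduced.

```python
from math import dist

# Coordinates (Dim1, Dim2) copied from the slides.
COORDS = {
    "Blue":   (-0.02, -0.28),
    "Green":  (-0.54, 0.14),
    "Orange": (0.55, 0.10),
    "Purple": (0.0, -0.25),
    "Red":    (-0.50, 0.20),
    "White":  (0.57, 0.19),
}


def agglomerate(coords, weights=None):
    """Greedy centroid-linkage clustering.

    Returns the merges in order as (members_a, members_b, centroid_distance).
    """
    weights = weights or {name: 1.0 for name in coords}
    # Each cluster is keyed by its member tuple and stores (centroid, weight).
    clusters = {(name,): (coords[name], weights[name]) for name in coords}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        # Pick the two clusters whose centroids are closest.
        a, b = min(
            ((p, q) for i, p in enumerate(keys) for q in keys[i + 1:]),
            key=lambda pq: dist(clusters[pq[0]][0], clusters[pq[1]][0]),
        )
        (ca, wa), (cb, wb) = clusters.pop(a), clusters.pop(b)
        total = wa + wb
        centroid = tuple((wa * x + wb * y) / total for x, y in zip(ca, cb))
        clusters[tuple(sorted(a + b))] = (centroid, total)
        merges.append((a, b, dist(ca, cb)))
    return merges


if __name__ == "__main__":
    for a, b, d in agglomerate(COORDS):
        print(a, "+", b, f"at distance {d:.3f}")
```

With these coordinates the first three merges pair blue with purple, green with red, and orange with white, matching the leaf order of the dendrogram.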

Experimental Evaluation: 

Wrong quantification and classing can introduce artificial patterns and cause errors in interpretation.

Evaluation measures:
- Believability
- Quality of visual display
- Quality of classing
- Quality of quantification
- Space: FCA uses less space
- Run time: MCA is faster

Categories of measures: perceptual, statistical, computational.

Test Data Sets: 

* UCI Repository of Machine Learning Databases

Believability and Quality of Visual Display: 

Given two displays resulting from different nominal-to-numeric mappings:
- Which mapping gives a more believable ordering and spacing?
- Based on your domain knowledge, are the values that are positioned close together similar to each other?
- Are the values that are positioned far from the rest of the values really outliers?
- Which display has less clutter?

Believability and Quality of Visual Display: 

Automobile Data: alphabetical order, equal spacing. Are these patterns believable?

Believability and Quality of Visual Display: 

Automobile Data: FCA. Are these patterns believable?

Quality of Classing: 

Classing A is better than classing B if, given a classing tree, the rate of information loss with each merging is slower. The result depends on the data set.

[Plot: information loss due to classing for one variable. The lower the line, the slower the information loss and the better the classing.]

To compare two classings, calculate the difference between the lines, then summarize.
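One way to read "calculate the difference between the lines, then summarize" is sketched below. It assumes each classing is described by its information loss after each merge and that the summary is a plain mean of the per-merge differences; the talk does not state which summary was used. The curves `toy_a` and `toy_b` are invented purely to exercise the function and are not results from the talk.

```python
def mean_info_loss_gap(loss_a, loss_b):
    """Mean of (loss_a - loss_b) over merge steps.

    A negative result means curve A lies below curve B on average,
    i.e. classing A loses information more slowly.
    """
    if len(loss_a) != len(loss_b):
        raise ValueError("curves must cover the same number of merges")
    return sum(a - b for a, b in zip(loss_a, loss_b)) / len(loss_a)


# Hypothetical curves: percent information lost after each merge.
toy_a = [2, 5, 12, 40, 100]
toy_b = [6, 15, 30, 60, 100]

if __name__ == "__main__":
    gap = mean_info_loss_gap(toy_a, toy_b)
    print("A is better" if gap < 0 else "B is better or equal", gap)
```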

Quality of Quantification: 

A quantification is good if:
- data points that are close together in nominal space are also close together in numeric space;
- two variables that are highly associated with each other have quantified versions that are also highly correlated.

MCA gives better quantification for most data sets, based on the average squared correlation measure.
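The average squared correlation idea can be sketched as follows. The exact definition used in the talk is not given, so this assumes one plausible reading: quantify every nominal variable with its mapping, compute the Pearson correlation for each pair of quantified variables, square it, and average. The records and mappings are toy values invented for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def average_squared_correlation(rows, mappings):
    """rows: list of {variable: nominal value};
    mappings: {variable: {nominal value: number}}."""
    columns = {
        var: [mapping[row[var]] for row in rows]
        for var, mapping in mappings.items()
    }
    pairs = list(combinations(columns, 2))
    return sum(pearson(columns[a], columns[b]) ** 2 for a, b in pairs) / len(pairs)


# Toy records in which color and quality are perfectly associated.
rows = [
    {"color": "blue", "quality": "bad"},
    {"color": "white", "quality": "good"},
    {"color": "green", "quality": "ok"},
]
quality_map = {"bad": 0, "ok": 1, "good": 2}
good_map = {"color": {"blue": 0, "green": 1, "white": 2}, "quality": quality_map}
poor_map = {"color": {"blue": 0, "green": 2, "white": 1}, "quality": quality_map}

if __name__ == "__main__":
    print(average_squared_correlation(rows, good_map))
    print(average_squared_correlation(rows, poor_map))
```

A mapping that respects the association between the variables scores higher than one that scrambles the order, which is the property the slide asks of a good quantification.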

Summary: 

- DQC is a general-purpose approach for pre-processing nominal variables for data analysis techniques requiring numeric variables (e.g. linear regression) or low-cardinality nominal variables (e.g. association rules).
- DQC is multivariate, data-driven, scalable, distance-preserving, and association-preserving.
- FCA is a viable alternative to MCA when memory space is limited.
- Quality of classing and quantification depends on the strength of associations within the data set.
- [...] is in the eye of the user.

Next Steps: 

- Stress test the technique with more experiments
- Perform a user study that measures the quality of the visual display resulting from MCA vs. FCA
- Further investigate tuning parameters and sensitivity to characteristics of the data set
- Mixed or numeric variables as analysis variables
- Cascaded Focused Correspondence Analysis

Related Work: 

- Visualizing nominal data: CA plots [Fri99], sieve diagrams, mosaic displays, fourfold displays, Dimensional Stacking, TreeMaps
- Quantification: optimal scaling, homogeneity analysis [Gre93]
- Classing nominal variables: loss of inertia [Gre93], decision trees, concept hierarchy
- Clustering nominal variables: k-prototypes [Hua97b]

For further information: 

XmdvTool Homepage: http://davis.wpi.edu/~xmdv
[email protected]
Code is free for research and education.
Contact author: [email protected]
