larsen jsm2003
Added: October 29, 2007

Presentation Transcript

Comparison of Alternative Latent Class Clusterings of Survey Data:
  Michael D. Larsen
  University of Chicago / Iowa State University

Outline:
  Survey and variables
  Latent class models
  Comparing clusterings
  Some comparisons
  Conclusions and future plans

Survey:
  1997 Survey of Doctoral Recipients
  NSF survey every 2 years
  1 of 3 surveys in SESTAT database
  Respondents: PhDs 1990-1996
  Physical (n=2216) and biological (n=1019) sciences, engineering (n=516)
  Work in higher educational institutions

Variables:
  Demographics: Sex, Race, Ethnicity, Age, etc.
    %F: biology (49%), physical (33%), eng. (23%)
  Several sets on career preparation
    Limitations on career path job searches
    Work activities
    Job search resources (which used?)
    Adequacy of PhD program career preparation
  Assorted other questions (e.g., postdoc?)

One set of variables example:
  Adequacy of career preparation
  Very adequate vs. Somewhat or not adeq.
  11 areas (2^11 table)
  Biology, 3 significant differences, F vs. M
    Communication (F>M) z = 2.73
    Ethics (F>M) z = 2.48
    Computer (M>F) z = -2.58

Why cluster?:
  Interest in clusters themselves
  Are there identifiable groups?
  Are clusters stable over time?
  Are the clusters related to demographic subpopulations?
  How do outcomes vary across clusters?

Latent Class Models:
  G latent classes (subpopulations)
  K categorical variables define contingency table, each person in one cell of table
  Observed pattern of responses in table is mixture of patterns from latent classes.
  Response probability on each variable (conditionally) independent within each class (prob’s differ across classes).

Latent Class Models, cont.:
  P(response pattern) = sum over classes of [ P(class) P(response pattern | class) ]
  EM algorithm (Dempster, Laird, Rubin 1977)
  Compute P(class | response pattern).

Comparing clusterings:
  Different sets of variables will group respondents differently.
  Cross tabulations
  Adjusted Rand Index (ARI)
    Rand Index = # of pairs in same cluster
    ARI = (Rand - Exp.)/(Max - Exp.); assumes hypergeometric distribution

Calibrating the ARI (or other):
  Simulation
  Generate 1000 samples from the hypergeometric distribution, which corresponds to null of no association
  Compute ARI for 1000 samples
  Report # of samples >= observed ARI

A comparison:
  Biology, Adequacy of Career Preparation
    Communication, ARI = 0.002, tail = 0.015
    Ethics, ARI = 0.039, tail = 0.039
    Computer, ARI = 0.002, tail = 0.021
  4 latent classes (interesting patterns)
    ARI value is lower, tail area is larger

Comments:
  1. ARI values are not large (not near 1) for tables with large n
  2. Simulated values are similar to P-values from standard tests
  3. Small ARI values can be significant in the way that small log odds (near 0) can be significant for large n
  4. Latent classes fit better than simple classifications, but ARI doesn’t increase.

More on comment 4.:
  Two classes (females, males) and CI. vs. Four latent classes (based on BCI) and CI.
  Latter fits (much) better.
  ARI not larger than largest on individual variables.

Future plans:
  1. Repeat on next waves (1999, 2001)
  2. Additional comparison methods:
     Diversity measures
     Slight modification of ARI
     Machine Learning, Stats, Discovery, 2003, Marina Meila, U of Washington
  3. Missing data (DK, RF, Missing)

References:
  Larsen, Statistics in Transition, 2003
  Larsen, submitted to “Retaining Women in Early Academic SMET Careers,” 2002, under revision
  Hubert and Arabie, 1985, J. of Classification
  NSF, EIA-0089930, ITWF

Contact Information:
  Mike Larsen, U of Chicago, Statistics
  larsen@galton.uchicago.edu
  http://galton.uchicago.edu/~larsen/jsm03
  Email for contact at Iowa State University, Statistics
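
Sketch for the "Comparing clusterings" and "Calibrating the ARI" slides. This is one possible reading of the procedure, not the author's code. It computes the Hubert and Arabie (1985) adjusted index from the cross tabulation of two labelings, where "Rand" is taken as the number of pairs placed together in both clusterings. For the null samples it shuffles one labeling, which keeps both sets of cluster sizes fixed and so draws tables from the hypergeometric null of no association. The function names, the default of 1000 samples, and the seed are assumptions made for illustration.

```python
import random
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI = (Rand - Exp.) / (Max - Exp.) for two labelings of the same people."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two items to form a pair")
    cells = Counter(zip(labels_a, labels_b))  # cross tabulation counts
    rows = Counter(labels_a)                  # cluster sizes, first clustering
    cols = Counter(labels_b)                  # cluster sizes, second clustering
    together_both = sum(comb(c, 2) for c in cells.values())
    together_a = sum(comb(c, 2) for c in rows.values())
    together_b = sum(comb(c, 2) for c in cols.values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Degenerate case (assumed convention): both clusterings are trivial.
        return 1.0
    return (together_both - expected) / (maximum - expected)


def ari_tail_count(labels_a, labels_b, n_samples=1000, seed=0):
    """Return (observed ARI, number of null samples with ARI >= observed)."""
    rng = random.Random(seed)
    observed = adjusted_rand_index(labels_a, labels_b)
    shuffled = list(labels_b)
    count = 0
    for _ in range(n_samples):
        rng.shuffle(shuffled)  # margins stay fixed, association is broken
        if adjusted_rand_index(labels_a, shuffled) >= observed:
            count += 1
    return observed, count
```

Dividing the returned count by n_samples gives a simulated tail area of the kind reported on the "A comparison" slide.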
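
Sketch for the "Latent Class Models, cont." slide. It illustrates the mixture formula and the step "Compute P(class | response pattern)" for binary items (for example, very adequate vs. not), plus one EM update. It is not the author's implementation. All function names and any parameter values passed in are hypothetical, and the code assumes every probability is strictly between 0 and 1.

```python
from math import prod


def pattern_given_class(pattern, item_probs):
    """P(response pattern | class), assuming independence within the class.

    pattern: tuple of 0/1 responses, one per variable.
    item_probs: P(response = 1 | class) for each variable.
    """
    return prod(p if x == 1 else 1.0 - p for x, p in zip(pattern, item_probs))


def class_posterior(pattern, class_probs, item_probs_by_class):
    """Return (P(pattern), [P(class | pattern) for each class])."""
    joint = [
        pi * pattern_given_class(pattern, probs)
        for pi, probs in zip(class_probs, item_probs_by_class)
    ]
    marginal = sum(joint)  # sum over classes of P(class) P(pattern | class)
    return marginal, [j / marginal for j in joint]


def em_step(patterns, class_probs, item_probs_by_class):
    """One EM iteration: E-step posteriors, then M-step weighted proportions."""
    n_items = len(item_probs_by_class[0])
    posteriors = [
        class_posterior(x, class_probs, item_probs_by_class)[1] for x in patterns
    ]
    new_class_probs, new_item_probs = [], []
    for g in range(len(class_probs)):
        weight = sum(post[g] for post in posteriors)  # expected class size
        new_class_probs.append(weight / len(patterns))
        new_item_probs.append([
            sum(post[g] * x[k] for post, x in zip(posteriors, patterns)) / weight
            for k in range(n_items)
        ])
    return new_class_probs, new_item_probs
```

Repeating em_step until the estimates stop changing, then assigning each respondent to the class with the largest posterior, yields a clustering that can be compared with others.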