Text classification for political science research:
  Bei Yu, Daniel Diermeier, Stefan Kaufmann
  Northwestern University
  October 22, 2007

The boom in using text data for political science research:

Text classification for political science research:
  Ideology and party classification
    Party manifestos (Laver, Benoit and Garry 2003)
    Senatorial speeches (Diermeier et al. 2006)
    Newsgroup discussions (Mullen and Malouf 2006)
  Opinion classification
    Newsgroup discussions (Agrawal et al. 2003)
    Public comments (Kwon et al. 2006)
    Congressional floor debate (Thomas et al. 2006)

A data mining process:
  Graph taken from http://alg.ncsa.uiuc.edu/tools/docs/d2k/manual/dataMining.html

Political text classification:
  [Diagram; labels: all political texts of interest, sampling method, text samples, text representation model, doc vectors, class labels, training set, classification methods, classifier, X, Y, (X, Y), generalization]

Assumptions in classification methods:
  Clear external criteria for class definitions
  Zero-noise class labels
  Independently and identically distributed data from a fixed distribution (X, Y)
  A model’s generalizability is problematic when the assumptions are violated

Assumption violations in real data:
  Subjective class definitions
  Unreliable class labels
    Errors in manually coded labels
    Mismatch between convenient labels and true labels
  Drifted distribution
  Non-i.i.d. data
    Debate: an interactive process
  Sample bias
    Small number of examples vs. large number of features
    The hidden classes in a convenient data set

Text classification for political scientists:
  An analytical tool for hypothesis testing
  An example of a political science problem
    Argument: ideology is a belief system which rules a person’s view on various issues.
    Evidence: voting records
    Concern: many factors affect voting decisions
    More evidence? Can we predict someone’s ideology based on what they said instead of how they voted?

Text classification for political scientists:
  Ideology classification
    Training data: Senatorial speeches, 101st-107th Congresses
    Test data: Senatorial speeches, 108th Congress
  Result interpretation
    High accuracy serves as a confidence level for the evidence that the target concept generalizes to the whole text collection of interest.
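The train/test design on the ideology classification slide (train on earlier Congresses, test on the 108th) can be sketched roughly as below. This is only an illustration under stated assumptions: the record fields (`congress`, `speaker`, `label`, `text`), the toy speeches, the labels "L" and "C", and the helper names `bag_of_words` and `split_by_congress` are all hypothetical and do not come from the presentation.

```python
from collections import Counter

# Hypothetical toy records; real data would be Senate speeches tagged with a Congress number.
SPEECHES = [
    {"congress": 101, "speaker": "A", "label": "L", "text": "we must protect workers and families"},
    {"congress": 105, "speaker": "B", "label": "C", "text": "we must cut taxes and spending"},
    {"congress": 107, "speaker": "A", "label": "L", "text": "health care for working families"},
    {"congress": 108, "speaker": "C", "label": "C", "text": "lower taxes help small business"},
]

def bag_of_words(text):
    """Map a text to token counts (lowercased, whitespace-split)."""
    return Counter(text.lower().split())

def split_by_congress(records, train_range, test_congress):
    """Return (train, test): train covers train_range inclusive, test is a single Congress."""
    first, last = train_range
    train = [r for r in records if first <= r["congress"] <= last]
    test = [r for r in records if r["congress"] == test_congress]
    return train, test

# Train on the 101st-107th Congresses, test on the 108th.
train, test = split_by_congress(SPEECHES, (101, 107), 108)
X_train = [bag_of_words(r["text"]) for r in train]
y_train = [r["label"] for r in train]
```

Splitting on the Congress number alone says nothing about whether the same speaker appears on both sides of the split; in this toy data the speakers happen to be disjoint, but that is an accident of the example.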
Impact of assumption violation on generalization:
  Possible violations
    Biased sample
    Changing distribution
    Non-i.i.d. data
  They are not rare in political text classification
  It is not easy to foresee them
  How can we find them out?
    Black-box approach?
      Some classifiers never used for prediction
    White-box approach?
      Feature weighting for linear text classifiers
      Checking unexpected features

Feature analysis for linear text classifier interpretation:
  I am not sure what I’m looking for, but I am sure what I’m NOT looking for.

Case study: party classification:
  Ideology and party are highly similar classes according to available labels
  Choose party as the labels
  Experiment round 1
    Training data: 101st-107th Senators
    Testing data: 108th Senators
    SVM and naïve Bayes algorithms
    Accuracy > 90%
    Success?
  Feature analysis
    Unexpected features: senator names and state names

Problems in experiment round 1:
  Possible coincidence
  Person classifier?
  The target concept should be person-independent
  Possible revision in experiment design
    Remove the names
    Use different senators for training and testing

Experiment round 2:
  Remove the names
    Accuracy > 90%
    Concerns: not completely cleaned
  Train and test on different senators
    Accuracy < 60%
    Failure? Maybe not
    Training senators: 101st-104th
    Test senators: 108th

Problems in experiment round 2:
  Can’t control person coincidence
  The concept of party membership is possibly time-dependent
    BOW representation
    Vocabulary change

Experiment round 3:
  Control time, cross person
    Training set: 2005 House representatives
    Test set: 2005 Senators
    Accuracy ~ 80%
    Conclusion: cross person
  Cross person, cross time
    Training set: 2005 House representatives
    Test set: 2005-1989 Senators

Accuracy curve over time:

The accuracy change over time:
  Why?
    Vocabulary change over time?
    The Chamber is more partisan than before?
  More experiments
    Same-senate party classification

Same-senate party classification:

The lesson:
  Multiple hidden factors might lead to a high classification accuracy
    Personal characteristics in speech
    Vocabulary similarity during a period of time
    Topic similarity
  Our real goal: find evidence that ideology is a concept cross-person, cross-issue, and cross-time

Conclusion:
  The importance of scrutinizing assumption violations when using text classification as an analytical tool for political science research
  Series of experiments and careful result interpretation for assumption validation
  One accuracy number is not enough
    Accuracy measure: black box
    Feature analysis: white box
    Explanation needed for “unexpected” relevant features
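The white-box check from the case study (look at the most heavily weighted features of a linear classifier and ask whether any are ones you are NOT looking for, such as senator or state names) might be sketched as follows. This is a toy illustration, not the authors' code: it uses the per-token log-odds of a multinomial naïve Bayes model as the feature weights, and the corpus, the name list, and the helpers `log_odds_weights`, `flag_unexpected`, and `speakers_disjoint` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (speaker, party, text). Names are left in the text on purpose.
DOCS = [
    ("smith", "D", "senator smith of ohio supports health care"),
    ("smith", "D", "smith of ohio backs workers and health care"),
    ("jones", "R", "senator jones of texas supports tax cuts"),
    ("jones", "R", "jones of texas backs business and tax cuts"),
]

def log_odds_weights(docs, pos="D", neg="R", alpha=1.0):
    """Per-token weight log P(w|pos) - log P(w|neg), with add-alpha smoothing.

    These are the feature weights of multinomial naive Bayes viewed as a linear model.
    """
    counts = {pos: Counter(), neg: Counter()}
    for _, party, text in docs:
        counts[party].update(text.split())
    vocab = set(counts[pos]) | set(counts[neg])
    totals = {c: sum(counts[c].values()) + alpha * len(vocab) for c in counts}
    return {
        w: math.log((counts[pos][w] + alpha) / totals[pos])
        - math.log((counts[neg][w] + alpha) / totals[neg])
        for w in vocab
    }

def flag_unexpected(weights, stoplist, top_k=5):
    """Return stoplisted tokens (e.g. names) found among the top_k tokens by |weight|."""
    ranked = sorted(weights, key=lambda w: (-abs(weights[w]), w))
    return [w for w in ranked[:top_k] if w in stoplist]

def speakers_disjoint(train_docs, test_docs):
    """True if no speaker appears in both the training and the test documents."""
    return not ({d[0] for d in train_docs} & {d[0] for d in test_docs})

weights = log_odds_weights(DOCS)
names = {"smith", "jones", "ohio", "texas"}  # assumed list of senator and state names
suspicious = flag_unexpected(weights, names, top_k=8)
```

If `suspicious` is non-empty, a high accuracy figure may partly reflect a person classifier rather than a party classifier, which is the concern raised after experiment round 1. `speakers_disjoint` is a small guard for the revised design in which training and testing use different senators.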