CS 595 Presentation

Added: October 17, 2007

Presentation Transcript

Classifying Gender on Shakespeare’s Characters :
By Sobhan
Advisor: Dr. Argamon


Outline :
- Introduction to Problem
- Data Collection, Meta Data Generation, Feature Description File Selection
- Importing Data, Generating ARFF File - ATMan
- Vector Calculation
- ML Algorithms Used for this Classification Problem
- Experiments and Results
- Top - Bottom 20 Features – Responsible for Gender Classification
- Machine/OS/Tools Used
- Future Work - References


Introduction :
- Research in gender classification: email authorship, written text, authorship of novels
- Do male and female playwrights write the same way for the male and female characters in their plays, or do they write differently?
- Finding the accuracy of gender classification for Shakespeare’s characters
- Features used by character gender in the plays
- Finding the accuracy of social class classification for Shakespeare’s characters


Data Collection :
- Version used: Moby Shakespeare
- Available at: http://www-tech.mit.edu/Shakespeare/
- Collected all HTML files using “wget”
- Class used (html2txt): converted the HTML files to text files, one per individual play and also one per scene
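The HTML-to-text step can be illustrated with a minimal sketch. This is not the html2txt class named on the slide, whose code is not shown; `TextExtractor` and `html_to_text` are hypothetical names, and the sketch simply keeps the text between tags using Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags; skip pure whitespace.
        if data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text content of `html`, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical input resembling one speech in a play page.
sample = "<html><body><h3>ACT I</h3><b>THESEUS</b><blockquote>Now, fair Hippolyta</blockquote></body></html>"
text = html_to_text(sample)
```

Splitting the resulting text into one file per scene would depend on the markup of the downloaded pages, which the slide does not describe.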


Data Cleaning :
- Unwanted data were removed from each scene, for example:
  - exeunt
  - Exit
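A minimal sketch of this cleaning step follows. It assumes that the unwanted data are stage-direction lines beginning with "Exit" or "Exeunt"; the slide lists only those two words, so the exact rule and the name `clean_scene` are assumptions.

```python
import re

# Assumption: a stage direction is a whole line that begins with the word
# "Exit" or "Exeunt" (any capitalisation), optionally followed by names.
STAGE_DIRECTION = re.compile(r"^\s*(exit|exeunt)\b", re.IGNORECASE)


def clean_scene(text):
    """Drop stage-direction lines from the plain text of one scene."""
    kept = [line for line in text.splitlines() if not STAGE_DIRECTION.match(line)]
    return "\n".join(kept)


scene = "THESEUS\nGo, Philostrate.\nExit PHILOSTRATE\nHippolyta, come.\nExeunt"
cleaned = clean_scene(scene)
```

A rule this simple would also drop a spoken line that happens to start with "Exit", so a real cleaning pass would likely need to use the markup of the source files.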


Meta Data Generation :
- Meta Data: data about data
- For each character in a play, the following 6 pieces of information are captured.
Data about a Character
- Type of Play: Comedy
- Name of the Play: Midsummer Night’s Dream
- Name of the Character: CLOWN
- Speech Length: 1024.0
- Gender: Male
- Social Class: Low
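The six metadata fields can be pictured as one record per character. The sketch below is only an illustration: the class name `CharacterMeta` and its field names are hypothetical, and the actual on-disk format of the metadata is not given in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharacterMeta:
    """One metadata record per character (field names are assumptions)."""
    play_type: str        # Comedy, History or Tragedy
    play_name: str
    character: str
    speech_length: float  # shown as a decimal number on the slide
    gender: str           # Male or Female
    social_class: str     # High or Low


# The example record from the slide.
clown = CharacterMeta(
    play_type="Comedy",
    play_name="Midsummer Night’s Dream",
    character="CLOWN",
    speech_length=1024.0,
    gender="Male",
    social_class="Low",
)
```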


Corpus Selection :
- Initially, all scenes were selected.
- The speech length for each character was added to the metadata, and then the following selections were made:
- Characters with a speech length of more than 100, 200, 300, 400, or 500 were taken into consideration (per scene, per act, and per play).
- Separate files per character were created for speech lengths of more than 500 and more than 200.
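The threshold-based selection can be sketched as follows. The function names and the dictionary input are assumptions made for illustration, and "more than" is read as a strict comparison.

```python
# Speech-length thresholds named on the slide.
THRESHOLDS = (100, 200, 300, 400, 500)


def select_by_speech_length(speech_lengths, threshold):
    """Return the characters whose speech length is strictly above `threshold`."""
    return sorted(name for name, length in speech_lengths.items() if length > threshold)


def build_subsets(speech_lengths, thresholds=THRESHOLDS):
    """Map each threshold to the list of characters that pass it."""
    return {t: select_by_speech_length(speech_lengths, t) for t in thresholds}


# Hypothetical speech lengths; only CLOWN's value comes from the slides.
lengths = {"CLOWN": 1024.0, "A": 150.0, "B": 90.0}
subsets = build_subsets(lengths)
```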


Features File Selection :
- Most Frequent 500 Words from Plays (FDescMostFrequentAttr - Sterling)
- Function Words (Standard FWs from Bar-Ilan University - #471)
- Function Words Collected from ARFF received from Bar-Ilan (#364)
- Shakespearean Function Words from Plays (#491)
- All Stop Words (#645)
- Appraisal Features (#47)
- Systemic Features (#94)


System Architecture : [diagram; labels as extracted] Corpus; ATMAN Importer; ImportShakespeareData; ARFF FILE; Cdesc; ATXT; TOKEN; Atxt, Token; Fdesc; ATMAN QuickARFF
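The end product of this pipeline is an ARFF file for Weka. The sketch below does not reproduce ATMan or its QuickARFF step, whose interfaces are not shown in the slides; it only illustrates the standard ARFF layout (a `@relation` line, one `@attribute` line per feature, then `@data` rows). The function name `to_arff`, the relation name, and the class attribute name `gender` are assumptions.

```python
def to_arff(relation, feature_names, rows, classes=("Male", "Female")):
    """Render numeric feature vectors plus a nominal class label as ARFF text.

    `rows` is a list of (values, label) pairs. Assumes that feature names
    contain no single quotes and that none of them is named "gender".
    """
    lines = ["@relation " + relation, ""]
    for name in feature_names:
        lines.append("@attribute '%s' numeric" % name)
    lines.append("@attribute gender {%s}" % ",".join(classes))
    lines.append("")
    lines.append("@data")
    for values, label in rows:
        lines.append(",".join("%.6f" % v for v in values) + "," + label)
    return "\n".join(lines) + "\n"


# Hypothetical two-feature example with one character.
arff = to_arff("shakespeare_gender", ["the", "and"], [([0.05, 0.03], "Male")])
```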


A Meta-Info Tag from an Atxt File : [slide content not captured in transcript]


Vector Calculation :
- C(w,c) = # of occurrences of FW w for character c
- N(c) = total # of word occurrences for character c (number of tokens)
- Vector_Value(w) = C(w,c)/N(c)
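The relative-frequency formula can be sketched in a few lines. Tokenisation by whitespace and the function name `feature_vector` are assumptions; the slides do not say how tokens were produced.

```python
from collections import Counter


def feature_vector(tokens, function_words):
    """Return C(w,c)/N(c) for each function word w, given one character's tokens."""
    counts = Counter(tokens)  # C(w,c) for every word that occurs
    n = len(tokens)           # N(c): total number of tokens
    if n == 0:
        return [0.0] * len(function_words)
    return [counts[w] / n for w in function_words]


# Hypothetical five-token speech and three function words.
tokens = "the king and the queen".split()
vector = feature_vector(tokens, ["the", "and", "of"])
```

Here "the" occurs twice in five tokens, "and" once, and "of" not at all, so the vector holds 2/5, 1/5 and 0.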


Algorithms :
- Decision Trees
  - J48
  - Decision Stump
- Functions
  - SMO e-1
  - SMO e-2
- Rules
  - PART
- Meta
  - AdaBoostM1 + J48 (- 30 I)
  - AdaBoostM1 + DecisionStump (- 30 I)
  - MultiBoost + J48 (- 30 I)
  - MultiBoost + DecisionStump (- 30 I)


Experiments :
- Strategy used: 10 different partitions for each of the following categories.
- Each experiment used all female characters together with an equal number of randomly chosen male characters.
- Categories: All, Comedy, History, Tragedy, High, Low
- Testing option: 10-fold CV
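The sampling strategy can be sketched as below, assuming there are at least as many male as female characters in each category. The function names and seeds are hypothetical, and the fold split is a plain unstratified one written for illustration; the experiments themselves used Weka's own 10-fold cross-validation.

```python
import random


def balanced_partitions(females, males, n_partitions=10, seed=0):
    """Build partitions of all females plus an equal-sized random male sample."""
    rng = random.Random(seed)
    partitions = []
    for _ in range(n_partitions):
        # Raises ValueError if there are fewer males than females.
        sampled_males = rng.sample(males, len(females))
        partitions.append(list(females) + sampled_males)
    return partitions


def k_fold_indices(n_items, k=10, seed=0):
    """Return k (train_indices, test_indices) pairs covering every item once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    return [(sorted(set(indices) - set(fold)), sorted(fold)) for fold in folds]


# Hypothetical character names.
females = ["F%d" % i for i in range(5)]
males = ["M%d" % i for i in range(20)]
parts = balanced_partitions(females, males)
folds = k_fold_indices(len(parts[0]), k=10)
```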


All & Comedy – MF - 500 : [slide content not captured in transcript]


Tragedy & History - MF - 500 : [slide content not captured in transcript]


High & Low - MF - 500 : [slide content not captured in transcript]


Bar-Ilan FWs (#471) : [slide content not captured in transcript]


364 FWs for Characters with Speech Length more than 100 – Acts Based : [slide content not captured in transcript]


364 FWs Characters with Speech Length > 500 : [slide content not captured in transcript]


364 FWs + Quote Features Characters with Speech Length > 500 : [slide content not captured in transcript]


BAR-ILAN Results F - 55 - M : [slide content not captured in transcript]


364 FWs (F - 89 - V M) Characters with Speech Length > 200 : [slide content not captured in transcript]


Stop Words-Appraisal-Systemic : [slide content not captured in transcript]


Machine/OS/Tools :
- Altaic – Linux OS – 4GB RAM – importing data, generating ARFF using ATMan
- My PC – Windows XP – 1GB RAM – running experiments in Weka-3-4
- HLL – Java 1.4.2
- FileZilla – transferring files remotely
- PuTTY – running commands remotely on the server
- TextPad – tool for text processing
- EditPlus – IDE for generating scripts and programs


Future Work :
- Experiments with individual categories of play type and social class
- Accuracy for social class
- Features, combination of features
- Get subtle features to distinguish character gender
- Get subtle features to distinguish social class
- Combination of features for gender/social class classification
- Combinations of features that allow predicting characteristics of Appraisal or Systemic behavior


Reference :
- Authorship Verification as a One-Class Classification Problem – Moshe Koppel, Jonathan Schler
- Automatic Authorship Attribution – E. Stamatatos, N. Fakotakis, G. Kokkinakis
- Gender Preferential Text Mining of E-mail Discourse – Malcolm Corney, Olivier de Vel, Alison Anderson, George Mohay
- Mining E-mail Authorship – Olivier de Vel
- Style Mining of Electronic Messages for Multiple Authorship Discrimination: First Results – S3, Shlomo Argamon, Marin
- Automatically Categorizing Written Texts by Author Gender – Moshe Koppel, Shlomo Argamon, Anat Rachel Shimoni
- Gender, Genre and Writing Style in Formal Written Texts – Moshe Koppel, Shlomo Argamon, Anat Rachel Shimoni, Jonathan Fine


References :
- Measuring the Usefulness of Function Words for Authorship Attribution – Shlomo Argamon, Shlomo Levitan
- A Short Introduction to Boosting – Yoav Freund, Robert E. Schapire
- A Competitive Analysis of Automated Authorship Attribution Techniques – Jason Sorenson
- Text Categorization with Support Vector Machines: Learning with Many Relevant Features – Thorsten Joachims