CS 595 Presentation

Classifying Gender on Shakespeare’s Characters:

By Sobhan
Advisor: Dr. Argamon

Outline:

Introduction to Problem
Data Collection, Meta Data Generation, Feature Description File Selection
Importing Data and Generating ARFF File with ATMan
Vector Calculation
ML Algorithms Used for this Classification Problem
Experiments and Results
Top and Bottom 20 Features Responsible for Gender Classification
Machine/OS/Tools Used
Future Work
References

Introduction:

Prior research in gender classification: email authorship, written text, authorship of novels
Do male and female playwrights write their male and female characters in the same way, or in a different manner?
Measuring the accuracy of gender classification for Shakespeare’s characters
Finding the features used by characters of each gender in the plays
Measuring the accuracy of social class classification for Shakespeare’s characters

Data Collection:

Version used: Moby Shakespeare
Available at: http://www-tech.mit.edu/Shakespeare/
Collected all HTML files using “wget”
Class used (html2txt): converted the HTML files to text files, one per play and one per scene
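The HTML-to-text step can be pictured with a short sketch. The html2txt class from the slide is not shown in the transcript, so the code below is only a stand-in built on Python's standard html.parser module; the names TextExtractor and html_to_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML page, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped block.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling html_to_text on the contents of one downloaded scene file would give the plain text for that scene; splitting a play into scenes is left out here because the transcript does not say how it was done.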

Data Cleaning:

Unwanted data were removed from each scene, for example stage directions such as “exeunt” and “Exit”.
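One way to drop such stage directions is a line filter. This is a minimal sketch that assumes each stage direction sits on its own line and begins with "Exit" or "Exeunt"; the slide does not describe the actual cleaning rules, and remove_stage_directions is a hypothetical name.

```python
import re

# Matches lines that start with "Exit" or "Exeunt" (any case),
# optionally preceded by whitespace or an opening bracket.
STAGE_DIRECTION = re.compile(r"^\s*\[?\s*(exit|exeunt)\b", re.IGNORECASE)


def remove_stage_directions(scene_text):
    """Return the scene text without lines that look like exit directions."""
    kept = [
        line
        for line in scene_text.splitlines()
        if not STAGE_DIRECTION.match(line)
    ]
    return "\n".join(kept)
```

A spoken line that happens to begin with the word "Exit" would also be dropped, so a real cleaner would need a stricter rule.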

Meta Data Generation:

Meta data: data about data
For each character in a play, the following six pieces of information are captured.
Data about a character:
    Type of Play: Comedy
    Name of the Play: Midsummer Night’s Dream
    Name of the Character: CLOWN
    Speech Length: 1024.0
    Gender: Male
    Social Class: Low
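The six fields map naturally onto a small record type. The sketch below is only an illustration: the class name CharacterMeta and the field names are assumptions, not the format used in the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharacterMeta:
    """Metadata captured for one character (field names are illustrative)."""

    play_type: str        # Comedy, History or Tragedy
    play_name: str
    character: str
    speech_length: float  # unit as recorded in the metadata
    gender: str
    social_class: str     # High or Low


# The example record from the slide.
clown = CharacterMeta(
    play_type="Comedy",
    play_name="Midsummer Night's Dream",
    character="CLOWN",
    speech_length=1024.0,
    gender="Male",
    social_class="Low",
)
```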

Corpus Selection:

Initially all scenes were selected.
The speech length of each character was added to the metadata, and then the following selections were made:
Characters with a speech length of more than 100, 200, 300, 400 and 500 were taken into consideration (per scene, per act and per play).
Separate files per character were created for speech lengths of more than 500 and more than 200.
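The threshold-based selection can be sketched as a simple filter. The function name and the input shape (a mapping from character name to speech length) are assumptions made for the example; the unit of speech length is whatever the metadata records.

```python
THRESHOLDS = (100, 200, 300, 400, 500)


def select_by_speech_length(speech_lengths, thresholds=THRESHOLDS):
    """Map each threshold to the sorted names whose speech length exceeds it."""
    return {
        threshold: sorted(
            name
            for name, length in speech_lengths.items()
            if length > threshold
        )
        for threshold in thresholds
    }
```

For example, a character with speech length 1024.0 appears under every threshold, while one with speech length 250 appears only under 100 and 200.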

Features File Selection:

Most Frequent 500 Words from Plays (FDescMostFrequentAttr - Sterling)
Function Words (standard FWs from Bar-Ilan University, #471)
Function Words collected from the ARFF received from Bar-Ilan (#364)
Shakespearean Function Words from Plays (#491)
All Stop Words (#645)
Appraisal Features (#47)
Systemic Features (#94)

System Architecture:

Diagram labels: Corpus; ATMAN Importer; ImportShakespeareData; Cdesc; ATXT; TOKEN; Atxt, Token; Fdesc; ATMAN QuickARFF; ARFF FILE
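The end product of this pipeline is an ARFF file, the text format Weka reads. The sketch below shows roughly what such a file contains for numeric features plus a nominal gender class. It is not ATMan's QuickARFF: the function to_arff, the relation name and the class labels are assumptions, and it assumes attribute names are plain words that need no quoting.

```python
def to_arff(relation, feature_names, rows, class_values=("female", "male")):
    """Build ARFF text from (feature_values, class_label) rows."""
    lines = [f"@relation {relation}", ""]
    for name in feature_names:
        lines.append(f"@attribute {name} numeric")
    # The class is a nominal attribute listing its allowed values.
    lines.append("@attribute gender {" + ",".join(class_values) + "}")
    lines.append("")
    lines.append("@data")
    for values, label in rows:
        lines.append(",".join(f"{v:.6f}" for v in values) + f",{label}")
    return "\n".join(lines) + "\n"
```

Each data row holds one character: its feature values in attribute order, followed by the class label.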

A Meta-Info Tag from an Atxt File: 


Vector Calculation:

C(w,c) = number of occurrences of function word (FW) w for character c
N(c) = total number of word occurrences (tokens) for character c
Vector_Value(w) = C(w,c) / N(c)
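The relative-frequency formula translates directly into code. This is a minimal sketch assuming the character's speech is already tokenized; feature_vector is a hypothetical name, and returning zeros for a character with no tokens is a choice made here, not something the slide specifies.

```python
from collections import Counter


def feature_vector(tokens, function_words):
    """Return C(w,c) / N(c) for each function word w, in the given order."""
    n_tokens = len(tokens)            # N(c)
    if n_tokens == 0:
        return [0.0] * len(function_words)
    counts = Counter(tokens)          # C(w,c) for every word w
    return [counts[w] / n_tokens for w in function_words]
```

For the five tokens "the king and the queen" and the function words "the", "and", "of", the counts are 2, 1 and 0, so the vector is 2/5, 1/5 and 0.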

Algorithms:

Decision Trees
    J48
    Decision Stump
Functions
    SMO e-1
    SMO e-2
Rules
    PART
Meta
    AdaBoostM1 + J48 (-I 30)
    AdaBoostM1 + DecisionStump (-I 30)
    MultiBoost + J48 (-I 30)
    MultiBoost + DecisionStump (-I 30)
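To give a feel for the simplest learner in this list, here is a sketch of a decision stump: a one-level tree that splits on a single feature at a single threshold. It picks the split with the fewest training errors, which is a simplification and not Weka's DecisionStump implementation; train_stump and predict_stump are hypothetical names.

```python
def train_stump(rows, labels):
    """Return (feature_index, threshold, left_label, right_label)."""
    classes = sorted(set(labels))
    best = None
    for j in range(len(rows[0])):
        values = sorted(set(row[j] for row in rows))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            # Each side predicts its majority class.
            left_label = max(classes, key=left.count)
            right_label = max(classes, key=right.count)
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, j, threshold, left_label, right_label)
    if best is None:
        raise ValueError("no feature has two distinct values")
    return best[1:]


def predict_stump(stump, row):
    """Classify one feature vector with a trained stump."""
    j, threshold, left_label, right_label = stump
    return left_label if row[j] <= threshold else right_label
```

Boosting methods such as AdaBoostM1 combine many weak learners of this kind, each trained on a reweighted version of the data.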

Experiments:

Strategy used: 10 different partitions for each of the following categories. Each experiment used all female characters together with an equal number of randomly chosen male characters.
    All
    Comedy
    History
    Tragedy
    High
    Low
Testing option: 10-fold cross-validation (CV)
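The sampling strategy and the testing option can be sketched as follows. This is an illustration under assumptions: the function names and seeds are hypothetical, there must be at least as many male as female characters, and the folds are plain shuffled folds rather than whatever Weka does internally.

```python
import random


def balanced_partitions(females, males, n_partitions=10, seed=0):
    """Pair all female characters with equally many randomly drawn males."""
    rng = random.Random(seed)
    females = list(females)
    males = list(males)
    return [
        females + rng.sample(males, len(females))
        for _ in range(n_partitions)
    ]


def k_fold_indices(n_items, k=10, seed=0):
    """Return k (train_indices, test_indices) pairs covering every item once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((sorted(train), sorted(test)))
    return splits
```

Each of the 10 partitions would then be evaluated with its own 10-fold split, and the accuracies compared across partitions.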

All & Comedy – MF - 500:

Tragedy & History - MF - 500:

High & Low - MF - 500:

Bar-Ilan FWs (#471):

364 FWs for Characters with Speech Length more than 100 – Acts Based:

364 FWs, Characters with Speech Length > 500:

364 FWs + Quote Features, Characters with Speech Length > 500:

Bar-Ilan Results F - 55 - M:

364 FWs (F - 89 - V M), Characters with Speech Length > 200:

Stop Words - Appraisal - Systemic:

Machine/OS/Tools:

Altaic (Linux OS, 4GB RAM): importing data and generating ARFF using ATMan
My PC (Windows XP, 1GB RAM): running experiments in Weka-3-4
HLL: Java1.4.2
File Zilla: transferring files remotely
Putty: running commands remotely on the server
TextPad: tool for text processing
Edit Plus: IDE for writing scripts and programs

Future Work:

Experiments with individual categories of play type and social class
Accuracy for social class
Features and combinations of features
Find subtle features that distinguish character gender
Find subtle features that distinguish social class
Combinations of features for gender and social class classification
Whether combinations of features allow predicting characteristics of Appraisal or Systemic behavior

References:

Authorship Verification as a One-Class Classification Problem - Moshe Koppel, Jonathan Schler
Automatic Authorship Attribution - E. Stamatatos, N. Fakotakis, G. Kokkinakis
Gender Preferential Text Mining of E-mail Discourse - Malcolm Corney, Olivier de Vel, Alison Anderson, George Mohay
Mining E-mail Authorship - Olivier de Vel
Style Mining of Electronic Messages for Multiple Authorship Discrimination: First Results - S3, Shlomo Argamon, Marin
Automatically Categorizing Written Texts by Author Gender - Moshe Koppel, Shlomo Argamon, Anat Rachel Shimoni
Gender, Genre and Writing Style in Formal Written Texts - Moshe Koppel, Shlomo Argamon, Anat Rachel Shimoni, Jonathan Fine
Measuring the Usefulness of Function Words for Authorship Attribution - Shlomo Argamon, Shlomo Levitan
A Short Introduction to Boosting - Yoav Freund, Robert E. Schapire
A Competitive Analysis of Automated Authorship Attribution Techniques - Jason Sorenson
Text Categorization with Support Vector Machines: Learning with Many Relevant Features - Thorsten Joachims