Classifying Gender on Shakespeare’s Characters : Classifying Gender on Shakespeare’s Characters By Sobhan
Advisor: Dr. Argamon
Outline : Outline Introduction to Problem
Data Collection, Meta Data Generation, Feature Description File Selection
Importing Data, Generating ARFF File - ATMan
Vector Calculation
ML Algorithms Used for this Classification Problem
Experiments and Results
Top - Bottom 20 Features – Responsible for Gender Classification
Machine/OS/Tools Used
Future Work - References
Introduction : Introduction Research in Gender Classification – Email Authorship, Written Text, Authorship on Novels
Do male and female playwrights write the same way for the male and female characters in their plays, or do they write differently?
Finding Accuracy on Gender Classification for Shakespeare’s Characters
Features used by Character Gender from Plays
Finding Accuracy on Social Class Classification for Shakespeare’s Characters
Data Collection : Data Collection Version used : Moby Shakespeare
Available at: http://www-tech.mit.edu/Shakespeare/
Collected all HTML files using “wget”
Class used (html2txt): converted the HTML files to text files, one per play and one per scene
Data Cleaning : Data Cleaning Unwanted data, such as the following stage directions, were removed from each scene:
exeunt
Exit
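A minimal sketch of this cleaning step, assuming each scene is held as a plain-text string and that the unwanted data are lines beginning with "Exit" or "Exeunt". The function name and the regular expression are illustrative assumptions, not taken from the original html2txt class.

```python
import re

# Assumption: a stage direction is a line that starts with "Exit" or "Exeunt"
# (any capitalization), optionally followed by names, e.g. "Exeunt all".
STAGE_DIRECTION = re.compile(r"^\s*(exit|exeunt)\b.*$", re.IGNORECASE)


def clean_scene(text):
    """Return the scene text with stage-direction lines removed."""
    kept = [line for line in text.splitlines()
            if not STAGE_DIRECTION.match(line)]
    return "\n".join(kept)
```

A spoken line that happens to begin with "Exit" would also be dropped by this pattern, so a real cleaner would need a stricter rule.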
Meta Data Generation : Meta Data Generation Meta Data: Data about Data
For each character in a play, the following 6 pieces of information are captured.
Data about a Character
Type of Play: Comedy
Name of the Play: Midsummer Night’s Dream
Name of the Character: CLOWN
Speech Length: 1024.0
Gender: Male
Social Class: Low
Corpus Selection : Corpus Selection Initially All Scenes were selected.
Speech Length for each character was added to the Metadata, and then the following selections were made:
Characters with speech length greater than 100, 200, 300, 400, or 500 were taken into consideration (per scene, per act, and per play).
Separate files per character were created for speech lengths greater than 500 and greater than 200.
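The metadata record and the speech-length selection can be sketched as follows. This is only an illustration: the class name, field names, and function name are assumptions, and the thresholds are the ones listed on the slide.

```python
from dataclasses import dataclass


@dataclass
class CharacterMeta:
    """The six metadata fields captured for each character."""
    play_type: str
    play_name: str
    character: str
    speech_length: float
    gender: str
    social_class: str


# Thresholds named on the slide.
THRESHOLDS = (100, 200, 300, 400, 500)


def select_by_speech_length(characters, thresholds=THRESHOLDS):
    """Map each threshold to the characters whose speech length exceeds it."""
    return {t: [c for c in characters if c.speech_length > t]
            for t in thresholds}
```

For example, a record such as `CharacterMeta("Comedy", "Midsummer Night's Dream", "CLOWN", 1024.0, "Male", "Low")` would be kept at every threshold, while a hypothetical character with speech length 150.0 would be kept only at the 100 threshold.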
Features File Selection : Features File Selection Most Frequent 500 Words from Plays (FDescMostFrequentAttr - Sterling)
Function Words (standard FWs from Bar-Ilan University, #471)
Function Words collected from the ARFF received from Bar-Ilan (#364)
Shakespearean Function Words from Plays (#491)
All Stop Words (#645)
Appraisal Features (#47)
Systemic Features (#94)
System Architecture : System Architecture [Diagram; labels recovered from the slide: Corpus; ATMan Importer (ImportShakespeareData); Atxt and Token files; Cdesc; Fdesc; ATMan QuickARFF; ARFF file]
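For orientation, ARFF is the plain-text input format read by Weka: a header that declares the attributes, followed by one comma-separated data row per instance. The sketch below writes such a file for per-character feature vectors. It is not ATMan's QuickARFF; the function name, the `fw_` attribute prefix, and the relation name are assumptions, and the feature words are assumed to be plain alphanumeric tokens that need no quoting.

```python
def to_arff(relation, feature_words, rows):
    """Build ARFF text for numeric word features plus a gender class.

    rows is a list of (vector, gender) pairs, where vector holds one
    relative frequency per entry of feature_words.
    """
    lines = [f"@relation {relation}", ""]
    for word in feature_words:
        # Prefix is an illustrative choice to keep attribute names distinct.
        lines.append(f"@attribute fw_{word} numeric")
    lines.append("@attribute gender {Male,Female}")
    lines += ["", "@data"]
    for vector, gender in rows:
        values = ",".join(f"{v:.6f}" for v in vector)
        lines.append(f"{values},{gender}")
    return "\n".join(lines) + "\n"
```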
A Meta-Info Tag from an Atxt File : A Meta-Info Tag from an Atxt File
Vector Calculation : Vector Calculation C(w,c) = # of occurrences of FW w for character c
N(c) = total # of word occurrences for character c (number of tokens)
Vector_Value(w) = C(w,c)/N(c), i.e. the relative frequency of FW w in the speech of character c
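The calculation above can be sketched in a few lines, assuming the character's speech has already been tokenized into a list of words. The function name is illustrative, and returning zeros for an empty token list is an assumption made here to avoid division by zero.

```python
from collections import Counter


def feature_vector(tokens, function_words):
    """Return C(w,c)/N(c) for each function word w, in the given order."""
    n = len(tokens)              # N(c): total number of tokens for character c
    if n == 0:
        return [0.0] * len(function_words)
    counts = Counter(tokens)     # C(w,c) for every word w
    return [counts[w] / n for w in function_words]
```

For the tokens `["the", "king", "and", "the", "queen"]` and the function words `["the", "and", "of"]`, N(c) is 5, so the vector is `[0.4, 0.2, 0.0]`.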
Algorithms : Algorithms Decision Trees
J48
Decision Stump
Functions
SMO e-1
SMO e-2
Rules
PART
Meta
AdaBoostM1 + J48 (-I 30)
AdaBoostM1 + DecisionStump (-I 30)
MultiBoost + J48 (-I 30)
MultiBoost + DecisionStump (-I 30)
Experiments : Experiments Strategy Used:
10 different partitions for each of the following categories. Each experiment used all Female characters together with an equal number of randomly selected Male characters
All
Comedy
History
Tragedy
High
Low
Testing Option – 10 Fold CV
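The sampling strategy above can be sketched as follows, assuming the characters of one category are already split into a female list and a larger male list. The function names and the fixed seed are assumptions. The fold generator is a plain, unstratified 10-fold split for illustration only; the experiments themselves used Weka's built-in cross-validation.

```python
import random


def balanced_partitions(females, males, n_partitions=10, seed=0):
    """Pair all female characters with an equal-sized random male sample.

    Assumes len(males) >= len(females); random.sample raises ValueError
    otherwise.
    """
    rng = random.Random(seed)
    partitions = []
    for _ in range(n_partitions):
        sampled_males = rng.sample(males, len(females))
        partitions.append(list(females) + sampled_males)
    return partitions


def k_fold_indices(n_items, n_folds=10, seed=0):
    """Yield (train, test) index lists for a simple k-fold split."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test
```

Each of the 10 partitions would then be evaluated separately with 10-fold cross-validation, and this would be repeated for each category (All, Comedy, History, Tragedy, High, Low).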
All & Comedy – MF - 500 : All & Comedy – MF - 500
Tragedy & History - MF - 500 : Tragedy & History - MF - 500
High & Low - MF - 500 : High & Low - MF - 500
Bar-Ilan FWs (#471) : Bar-Ilan FWs (#471)
364 FWs for Characters with Speech Length more than 100 – Acts Based : 364 FWs for Characters with Speech Length more than 100 – Acts Based
364 FWs, Characters with Speech Length > 500 : 364 FWs, Characters with Speech Length > 500
364 FWs + Quote Features, Characters with Speech Length > 500 : 364 FWs + Quote Features, Characters with Speech Length > 500
BAR-ILAN Results F - 55 - M : BAR-ILAN Results F - 55 - M
364 FWs (F - 89 - V M), Characters with Speech Length > 200 : 364 FWs (F - 89 - V M), Characters with Speech Length > 200
Stop Words-Appraisal-Systemic : Stop Words-Appraisal-Systemic
Machine/OS/Tools : Machine/OS/Tools Altaic – Linux OS – 4GB RAM – Importing, Generating ARFF using ATMan
My PC – Windows XP - 1GB RAM - Running Experiments in Weka-3-4
HLL – Java 1.4.2
FileZilla – Transferring Files Remotely
PuTTY – Running Commands Remotely on the Server
TextPad – Tool for Text Processing
EditPlus – IDE for Writing Scripts and Programs
Future Work : Future Work Experiments with Individual Category of Play Type, Social Class
Accuracy for Social Class
Features, Combination of Features
Find subtle features that distinguish Character Gender
Find subtle features that distinguish Social Class
Combinations of Features for Gender/Social Class Classification
Combinations of Features may allow predicting characteristics of Appraisal or Systemic behavior
Reference : Reference Authorship Verification as a One-Class Classification Problem – Moshe Koppel, Jonathan Schler
Automatic Authorship Attribution – E.Stamatatos, N. Fakotakis, G. Kokkinakis
Gender Preferential Text Mining of E-mail Discourse – Malcolm Corney, Olivier de Vel, Alison Anderson, George Mohay
Mining E-mail Authorship – Olivier de Vel
Style Mining of Electronic Messages for Multiple Authorship Discrimination: First Results - S3, Shlomo Argamon, Marin
Automatically Categorizing Written Texts by Author Gender - Moshe Koppel, Shlomo Argamon, Anat Rachel Shimoni
Gender, Genre and Writing Style in Formal Written Texts - Moshe Koppel, Shlomo Argamon, Anat Rachel Shimoni, Jonathan Fine
References : References Measuring the Usefulness of Function Words for Authorship Attribution – Shlomo Argamon, Shlomo Levitan
A Short Introduction to Boosting – Yoav Freund, Robert E. Schapire
A competitive Analysis of Automated Authorship Attribution Techniques – Jason Sorenson
Text Categorization with Support Vector Machines: Learning with Many Relevant Features - Thorsten Joachims