logging in or signing up lecture15 Laurence Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 350 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: November 20, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript SIMS 290-2: Applied Natural Language Processing: SIMS 290-2: Applied Natural Language Processing Marti Hearst Oct 23, 2006 (Slides developed by Preslav Nakov) Today: Today Feature selection TF.IDF Term Weighting Weka Input File Format Slide3: Features for Text Categorization Linguistic features Words lowercase? (should we convert to?) normalized? (e.g. “texts” “text”) Phrases Word-level n-grams Character-level n-grams Punctuation Part of Speech Non-linguistic features document formatting informative character sequences (e.g. <)Slide4: If the algorithm cannot handle all possible features e.g. language identification for 100 languages using all words text classification using n-grams Good features can result in higher accuracy What if we just keep all features? Even the unreliable features can be helpful. But we need to weight them: In the extreme case, the bad features can have a weight of 0 (or very close), which is… a form of feature selection! When Do We Need Feature Selection?Slide5: Why Feature Selection? Not all features are equally good! Bad features: best to remove Infrequent unlikely to be seen again co-occurrence with a class can be due to chance Too frequent mostly function words Uniform across all categories Good features: should be kept Co-occur with a particular category Do not co-occur with other categories The rest: good to keep Slide6: Types Of Feature Selection? Feature selection reduces the number of features Usually: Eliminating features Weighting features Normalizing features Sometimes by transforming parameters e.g. Latent Semantic Indexing using Singular Value Decomposition Method may depend on problem type For classification and filtering, may want to use information from example documents to guide selection. Feature Selection: Feature Selection Task independent methods Document Frequency (DF) Term Strength (TS) Task-dependent methods Information Gain (IG) Mutual Information (MI) 2 statistic (CHI) Empirically compared by Yang & Pedersen (1997) Pedersen & Yang Experiments: Pedersen & Yang Experiments Compared feature selection methods for text categorization 5 feature selection methods: DF, MI, CHI, (IG, TS) Features were just words, not phrases 2 classifiers: kNN: k-Nearest Neighbor LLSF: Linear Least Squares Fit 2 data collections: Reuters-22173 OHSUMED: subset of MEDLINE (1990&1991 used) Slide9: DF: number of documents a term appears in Based on Zipf’s Law Remove the rare terms: (seen 1-2 times) Spurious Unreliable – can be just noise Unlikely to appear in new documents Plus Easy to compute Task independent: do not need to know the classes Minus Ad hoc criterion For some applications, rare terms can be good discriminators (e.g., in IR) Document Frequency (DF)Slide10: Common words from a predefined list Mostly from closed-class categories: unlikely to have a new word added include: auxiliaries, conjunctions, determiners, prepositions, pronouns, articles But also some open-class words like numerals Bad discriminators uniformly spread across all classes can be safely removed from the vocabulary Is this always a good idea? (e.g. author identification) Stop Word RemovalSlide11: 2 statistic (pronounced “kai square”) A commonly used method of comparing proportions. Measures the lack of independence between a term and a category (Yang & Pedersen) 2 statistic (CHI) Slide12: Is “jaguar” a good predictor for the “auto” class? We want to compare: the observed distribution above; and null hypothesis: that jaguar and auto are independent 2 statistic (CHI) Slide13: Under the null hypothesis: (jaguar and auto independent): How many co-occurrences of jaguar and auto do we expect? If independent: Pr(j,a) = Pr(j) Pr(a) So, there would be: N Pr(j,a), i.e. N Pr(j) Pr(a) Pr(j) = (2+3)/N; Pr(a) = (2+500)/N; N=2+3+500+9500 Which = N(5/N)(502/N)=2510/N=2510/10005 0.25 2 statistic (CHI) Slide14: Under the null hypothesis: (jaguar and auto independent): How many co-occurrences of jaguar and auto do we expect? 2 statistic (CHI) expected: fe observed: foSlide15: Under the null hypothesis: (jaguar and auto – independent): How many co-occurrences of jaguar and auto do we expect? 2 statistic (CHI) expected: fe observed: foSlide16: 2 is interested in (fo – fe)2/fe summed over all table entries: The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the value for .999 confidence). 2 statistic (CHI) expected: fe observed: foSlide17: There is a simpler formula for 2: 2 statistic (CHI) N = A + B + C + DSlide18: How to use 2 for multiple categories? Compute 2 for each category and then combine: To require a feature to discriminate well across all categories, then we need to take the expected value of 2: Or to weight for a single category, take the maximum: 2 statistic (CHI) Slide19: Pluses normalized and thus comparable across terms 2(t,c) is 0, when t and c are independent can be compared to 2 distribution, 1 degree of freedom Minuses unreliable for low frequency terms 2 statistic (CHI) Information Gain: Information Gain A measure of importance of the feature for predicting the presence of the class. Has an information theoretic justification Defined as: The number of “bits of information” gained by knowing the term is present or absent Based on Information Theory We won’t go into this in detail here. Slide21: Information Gain (IG) IG: number of bits of information gained by knowing the term is present or absent t is the term being scored, ci is a class variable entropy: H(c) specific conditional entropy H(c|t) specific conditional entropy H(c|¬t)Slide22: The probability of seeing x and y together vs The probably of seeing x anywhere times the probability of seeing y anywhere (independently). MI = log ( P(x,y) / P(x)P(y) ) = log(P(x,y)) – log(P(x)P(y)) From Bayes law: P(x,y) = P(x|y)P(y) = log(P(x|y)P(y)) – log(P(x)P(y)) MI = log(P(x|y) – log(P(x)) Mutual Information (MI) Slide23: Approximation: Mutual Information (MI) rare terms get higher scores does not use term absence N = A + B + C + DSlide24: Compute MI for each category and then combine If we want to discriminate well across all categories, then we need to take the expected value of MI: To discriminate well for a single category, then we take the maximum: Using Mutual Information Mutual Information: Mutual Information Pluses I(t,c) is 0, when t and c are independent Has a sound information-theoretic interpretation Minuses Small numbers produce unreliable results Does not use term absenceSlide26: Mutual information Term strength From Yang & Pedersen ‘97 CHI max, IG, DFSlide27: DF, IG and CHI are good and strongly correlated thus using DF is good, cheap and task independent can be used when IG and CHI are too expensive MI is bad favors rare terms (which are typically bad) Feature Comparison Term Weighting: Term Weighting In the study just shown, terms were (mainly) treated as binary features If a term occurred in a document, it was assigned 1 Else 0 Often it us useful to weight the selected features Standard technique: tf.idfSlide29: TF: term frequency definition: TF = tij frequency of term i in document j purpose: makes the frequent words for the document more important IDF: inverted document frequency definition: IDF = log(N/ni) ni : number of documents containing term i N : total number of documents purpose: makes rare words across documents more important TF.IDF (for term i in document j) definition: tij log(N/ni) TF.IDF Term WeightingTerm Normalization: Term Normalization Combine different words into a single representation Stemming/morphological analysis bought, buy, buys -> buy General word categories $23.45, 5.30 Yen -> MONEY 1984, 10,000 -> DATE, NUM PERSON ORGANIZATION (Covered in Information Extraction segment) Generalize with lexical hierarchies WordNet, MeSH (Covered later in the semester) Slide31: Feature selection infrequent term removal infrequent across the whole collection (i.e. DF) seen in a single document most frequent term removal (i.e. stop words) Normalization: Stemming. (often) Word classes (sometimes) Feature weighting: TF.IDF or IDF Dimensionality reduction (sometimes) What Do People Do In Practice?Weka: Weka Java-based tool for large-scale machine-learning problems Tailored towards text analysis http://weka.sourceforge.net/wekadoc/ Weka Input Format: Weka Input Format Expects a particular input file format Called ARFF: Attribute-Relation File Format Consists of a Header and a Data section http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_(3.4.6) Slide34: @relation heart-disease-simplified @attribute age numeric @attribute sex { female, male} @attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina} @attribute cholesterol numeric @attribute exercise_induced_angina { no, yes} @attribute class { present, not_present} @data 63,male,typ_angina,233,no,not_present 67,male,asympt,286,yes,present 67,male,asympt,229,yes,present 38,female,non_anginal,?,no,not_present ... WEKA File Format: ARFF Other attribute types: String Date Numerical attribute Nominal attribute Missing value http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_(3.4.6) Slide35: Value 0 is not represented explicitly Same header (i.e @relation and @attribute tags) the @data section is different Instead of @data 0, X, 0, Y, "class A" 0, 0, W, 0, "class B" We have @data {1 X, 3 Y, 4 "class A"} {2 W, 4 "class B"} This saves LOTS of space for text applications. Why? WEKA Sparse File FormatNext Time: Next Time Wed: Guest lecture by Peter Jackson: Pure and Applied Research in NLP: The Good, the Bad, and the Lucky. Following week: Text Categorization Algorithms How to use Weka You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
lecture15 Laurence Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 350 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: November 20, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript SIMS 290-2: Applied Natural Language Processing: SIMS 290-2: Applied Natural Language Processing Marti Hearst Oct 23, 2006 (Slides developed by Preslav Nakov) Today: Today Feature selection TF.IDF Term Weighting Weka Input File Format Slide3: Features for Text Categorization Linguistic features Words lowercase? (should we convert to?) normalized? (e.g. “texts” “text”) Phrases Word-level n-grams Character-level n-grams Punctuation Part of Speech Non-linguistic features document formatting informative character sequences (e.g. <)Slide4: If the algorithm cannot handle all possible features e.g. language identification for 100 languages using all words text classification using n-grams Good features can result in higher accuracy What if we just keep all features? Even the unreliable features can be helpful. But we need to weight them: In the extreme case, the bad features can have a weight of 0 (or very close), which is… a form of feature selection! When Do We Need Feature Selection?Slide5: Why Feature Selection? Not all features are equally good! Bad features: best to remove Infrequent unlikely to be seen again co-occurrence with a class can be due to chance Too frequent mostly function words Uniform across all categories Good features: should be kept Co-occur with a particular category Do not co-occur with other categories The rest: good to keep Slide6: Types Of Feature Selection? Feature selection reduces the number of features Usually: Eliminating features Weighting features Normalizing features Sometimes by transforming parameters e.g. Latent Semantic Indexing using Singular Value Decomposition Method may depend on problem type For classification and filtering, may want to use information from example documents to guide selection. Feature Selection: Feature Selection Task independent methods Document Frequency (DF) Term Strength (TS) Task-dependent methods Information Gain (IG) Mutual Information (MI) 2 statistic (CHI) Empirically compared by Yang & Pedersen (1997) Pedersen & Yang Experiments: Pedersen & Yang Experiments Compared feature selection methods for text categorization 5 feature selection methods: DF, MI, CHI, (IG, TS) Features were just words, not phrases 2 classifiers: kNN: k-Nearest Neighbor LLSF: Linear Least Squares Fit 2 data collections: Reuters-22173 OHSUMED: subset of MEDLINE (1990&1991 used) Slide9: DF: number of documents a term appears in Based on Zipf’s Law Remove the rare terms: (seen 1-2 times) Spurious Unreliable – can be just noise Unlikely to appear in new documents Plus Easy to compute Task independent: do not need to know the classes Minus Ad hoc criterion For some applications, rare terms can be good discriminators (e.g., in IR) Document Frequency (DF)Slide10: Common words from a predefined list Mostly from closed-class categories: unlikely to have a new word added include: auxiliaries, conjunctions, determiners, prepositions, pronouns, articles But also some open-class words like numerals Bad discriminators uniformly spread across all classes can be safely removed from the vocabulary Is this always a good idea? (e.g. author identification) Stop Word RemovalSlide11: 2 statistic (pronounced “kai square”) A commonly used method of comparing proportions. Measures the lack of independence between a term and a category (Yang & Pedersen) 2 statistic (CHI) Slide12: Is “jaguar” a good predictor for the “auto” class? We want to compare: the observed distribution above; and null hypothesis: that jaguar and auto are independent 2 statistic (CHI) Slide13: Under the null hypothesis: (jaguar and auto independent): How many co-occurrences of jaguar and auto do we expect? If independent: Pr(j,a) = Pr(j) Pr(a) So, there would be: N Pr(j,a), i.e. N Pr(j) Pr(a) Pr(j) = (2+3)/N; Pr(a) = (2+500)/N; N=2+3+500+9500 Which = N(5/N)(502/N)=2510/N=2510/10005 0.25 2 statistic (CHI) Slide14: Under the null hypothesis: (jaguar and auto independent): How many co-occurrences of jaguar and auto do we expect? 2 statistic (CHI) expected: fe observed: foSlide15: Under the null hypothesis: (jaguar and auto – independent): How many co-occurrences of jaguar and auto do we expect? 2 statistic (CHI) expected: fe observed: foSlide16: 2 is interested in (fo – fe)2/fe summed over all table entries: The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the value for .999 confidence). 2 statistic (CHI) expected: fe observed: foSlide17: There is a simpler formula for 2: 2 statistic (CHI) N = A + B + C + DSlide18: How to use 2 for multiple categories? Compute 2 for each category and then combine: To require a feature to discriminate well across all categories, then we need to take the expected value of 2: Or to weight for a single category, take the maximum: 2 statistic (CHI) Slide19: Pluses normalized and thus comparable across terms 2(t,c) is 0, when t and c are independent can be compared to 2 distribution, 1 degree of freedom Minuses unreliable for low frequency terms 2 statistic (CHI) Information Gain: Information Gain A measure of importance of the feature for predicting the presence of the class. Has an information theoretic justification Defined as: The number of “bits of information” gained by knowing the term is present or absent Based on Information Theory We won’t go into this in detail here. Slide21: Information Gain (IG) IG: number of bits of information gained by knowing the term is present or absent t is the term being scored, ci is a class variable entropy: H(c) specific conditional entropy H(c|t) specific conditional entropy H(c|¬t)Slide22: The probability of seeing x and y together vs The probably of seeing x anywhere times the probability of seeing y anywhere (independently). MI = log ( P(x,y) / P(x)P(y) ) = log(P(x,y)) – log(P(x)P(y)) From Bayes law: P(x,y) = P(x|y)P(y) = log(P(x|y)P(y)) – log(P(x)P(y)) MI = log(P(x|y) – log(P(x)) Mutual Information (MI) Slide23: Approximation: Mutual Information (MI) rare terms get higher scores does not use term absence N = A + B + C + DSlide24: Compute MI for each category and then combine If we want to discriminate well across all categories, then we need to take the expected value of MI: To discriminate well for a single category, then we take the maximum: Using Mutual Information Mutual Information: Mutual Information Pluses I(t,c) is 0, when t and c are independent Has a sound information-theoretic interpretation Minuses Small numbers produce unreliable results Does not use term absenceSlide26: Mutual information Term strength From Yang & Pedersen ‘97 CHI max, IG, DFSlide27: DF, IG and CHI are good and strongly correlated thus using DF is good, cheap and task independent can be used when IG and CHI are too expensive MI is bad favors rare terms (which are typically bad) Feature Comparison Term Weighting: Term Weighting In the study just shown, terms were (mainly) treated as binary features If a term occurred in a document, it was assigned 1 Else 0 Often it us useful to weight the selected features Standard technique: tf.idfSlide29: TF: term frequency definition: TF = tij frequency of term i in document j purpose: makes the frequent words for the document more important IDF: inverted document frequency definition: IDF = log(N/ni) ni : number of documents containing term i N : total number of documents purpose: makes rare words across documents more important TF.IDF (for term i in document j) definition: tij log(N/ni) TF.IDF Term WeightingTerm Normalization: Term Normalization Combine different words into a single representation Stemming/morphological analysis bought, buy, buys -> buy General word categories $23.45, 5.30 Yen -> MONEY 1984, 10,000 -> DATE, NUM PERSON ORGANIZATION (Covered in Information Extraction segment) Generalize with lexical hierarchies WordNet, MeSH (Covered later in the semester) Slide31: Feature selection infrequent term removal infrequent across the whole collection (i.e. DF) seen in a single document most frequent term removal (i.e. stop words) Normalization: Stemming. (often) Word classes (sometimes) Feature weighting: TF.IDF or IDF Dimensionality reduction (sometimes) What Do People Do In Practice?Weka: Weka Java-based tool for large-scale machine-learning problems Tailored towards text analysis http://weka.sourceforge.net/wekadoc/ Weka Input Format: Weka Input Format Expects a particular input file format Called ARFF: Attribute-Relation File Format Consists of a Header and a Data section http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_(3.4.6) Slide34: @relation heart-disease-simplified @attribute age numeric @attribute sex { female, male} @attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina} @attribute cholesterol numeric @attribute exercise_induced_angina { no, yes} @attribute class { present, not_present} @data 63,male,typ_angina,233,no,not_present 67,male,asympt,286,yes,present 67,male,asympt,229,yes,present 38,female,non_anginal,?,no,not_present ... WEKA File Format: ARFF Other attribute types: String Date Numerical attribute Nominal attribute Missing value http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_(3.4.6) Slide35: Value 0 is not represented explicitly Same header (i.e @relation and @attribute tags) the @data section is different Instead of @data 0, X, 0, Y, "class A" 0, 0, W, 0, "class B" We have @data {1 X, 3 Y, 4 "class A"} {2 W, 4 "class B"} This saves LOTS of space for text applications. Why? WEKA Sparse File FormatNext Time: Next Time Wed: Guest lecture by Peter Jackson: Pure and Applied Research in NLP: The Good, the Bad, and the Lucky. Following week: Text Categorization Algorithms How to use Weka