logging in or signing up Lecture17 Bernadette Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 145 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: November 19, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Introduction to Content Analysis: Introduction to Content Analysis SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 1999 Today: Today Finish Boolean Intro to Content Analysis Tokenization Stemming/Morphological analysis Statistical Characteristics of Text Collections Zipf distribution Query Languages: Query Languages A way to express the question (information need) Types: Boolean Natural Language Stylized Natural Language Form-Based (GUI)Simple query language: Boolean: Simple query language: Boolean Terms + Connectors terms words normalized (stemmed) words phrases thesaurus terms connectors AND OR NOTBoolean Queries: Boolean Queries Cat Cat OR Dog Cat AND Dog (Cat AND Dog) (Cat AND Dog) OR Collar (Cat AND Dog) OR (Collar AND Leash) (Cat OR Dog) AND (Collar OR Leash) Boolean Queries: Boolean Queries (Cat OR Dog) AND (Collar OR Leash) Each of the following combinations satisfies this statement: Cat x x x x x Dog x x x x Collar x x x x Leash x x x x xBoolean Queries: Boolean Queries (Cat OR Dog) AND (Collar OR Leash) None of the following combinations work: Cat x Dog x x Collar x x Leash x x Boolean Queries: Boolean Queries Usually expressed as INFIX operators in IR ((a AND b) OR (c AND b)) NOT is UNARY PREFIX operator ((a AND b) OR (c AND (NOT b))) AND and OR can be n-ary operators (a AND b AND c AND d) Some rules - (De Morgan revisited) NOT(a) AND NOT(b) = NOT(a OR b) NOT(a) OR NOT(b)= NOT(a AND b) NOT(NOT(a)) = a Boolean Searching: Boolean Searching “Measurement of the width of cracks in prestressed concrete beams” Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete Cracks Beams Width measurement Prestressed concrete Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)Boolean Logic: Boolean Logic t3 t1 t2 D1 D2 D3 D4 D5 D6 D8 D7 D9 D10 D11 m1 m2 m3 m5 m4 m7 m8 m6 m2 = t1 t2 t3 m1 = t1 t2 t3 m4 = t1 t2 t3 m3 = t1 t2 t3 m6 = t1 t2 t3 m5 = t1 t2 t3 m8 = t1 t2 t3 m7 = t1 t2 t3 Precedence Ordering: Precedence Ordering In what order do we evaluate the components of the Boolean expression? Parenthesis get done first (a or b) and (c or d) (a or (b and c) or d) Usually start from the left and work right (in case of ties) Usually (if there are no parentheses) NOT before AND AND before OR Pseudo-Boolean Queries: Pseudo-Boolean Queries A new notation, from web search +cat dog +collar leash These are prefix operators Does not mean the same thing as AND/OR! + means “mandatory, must be in document” - means “cannot be in the document” Phrases: “stray cat” AND “frayed collar” is equivalent to +“stray cat” +“frayed collar”Boolean: Boolean Advantages simple queries are easy to understand relatively easy to implement Disadvantages difficult to specify what is wanted too much returned, or too little ordering not well determined Dominant language in commercial systems until the WWWResult Sets: Result Sets Run a query, get a result set Two choices Reformulate query, run on entire collection Reformulate query, run on result set Example: Dialog query (Redford AND Newman) -> S1 1450 documents (S1 AND Sundance) ->S2 898 documentsFaceted Boolean Query: Faceted Boolean Query Strategy: break query into facets (polysemous with earlier meaning of facets) conjunction of disjunctions (a1 OR a2 OR a3) (b1 OR b2) (c1 OR c2 OR c3 OR c4) each facet expresses a topic (“rain forest” OR jungle OR amazon) (medicine OR remedy OR cure) (Smith OR Zhou) AND AND Ordering of Retrieved Documents: Ordering of Retrieved Documents Pure Boolean has no ordering In practice: order chronologically order by total number of “hits” on query terms What if one term has more hits than others? Is it better to one of each term or many of one term? Fancier methods have been investigated p-norm is most famous usually impractical to implement usually hard for user to understandFaceted Boolean Query: Faceted Boolean Query Query still fails if one facet missing Alternative: Coordination level ranking Order results in terms of how many facets (disjuncts) are satisfied Also called Quorum ranking, Overlap ranking, and Best Match Problem: Facets still undifferentiated Alternative: Assign weights to facets Proximity Searches: Proximity Searches Proximity: terms occur within K positions of one another pen w/5 paper A “Near” function can be more vague near(pen, paper) Sometimes order can be specified Also, Phrases and Collocations “United Nations” “Bill Clinton” Phrase Variants “retrieval of information” “information retrieval” Filters: Filters Filters: Reduce set of candidate docs Often specified simultaneous with query Usually restrictions on metadata restrict by: date range internet domain (.edu .com .berkeley.edu) author size limit number of documents returnedContent Analysis: Content Analysis Automated Transformation of raw text into a form that represent some aspect(s) of its meaning Including, but not limited to: Automated Thesaurus Generation Phrase Detection Categorization Clustering SummarizationTechniques for Content Analysis: Techniques for Content Analysis Statistical Single Document Full Collection Linguistic Syntactic Semantic Pragmatic Knowledge-Based (Artificial Intelligence) Hybrid (Combinations)Text Processing: Text Processing Standard Steps: Recognize document structure titles, sections, paragraphs, etc. Break into tokens usually space and punctuation delineated special issues with Asian languages Stemming/morphological analysis Store in inverted index (to be discussed later) Document Processing Steps: Document Processing StepsStemming and Morphological Analysis: Stemming and Morphological Analysis Goal: “normalize” similar words Morphology (“form” of words) Inflectional Morphology E.g,. inflect verb endings and noun number Never change grammatical class dog, dogs tengo, tienes, tiene, tenemos, tienen Derivational Morphology Derive one word from another, Often change grammatical class build, building; health, healthy Errors Generated by Porter Stemmer (Krovetz 93): Errors Generated by Porter Stemmer (Krovetz 93)Automated Methods: Automated Methods Stemmers: Very dumb rules work well (for English) Porter Stemmer: Iteratively remove suffixes Improvement: pass results through a lexicon Powerful multilingual tools exist for morphological analysis PCKimmo, Xerox Lexical technology Require a grammar and dictionary Use “two-level” automataStatistical Properties of Text: Statistical Properties of Text Token occurrences in text are not uniformly distributed They are also not normally distributed They do exhibit a Zipf distribution Plotting Word Frequency by Rank: Plotting Word Frequency by Rank Main idea: count How many tokens occur 1 time How many tokens occur 2 times How many tokens occur 3 times … Now rank these according to how of they occur. This is called the rank.Plotting Word Frequency by Rank: Plotting Word Frequency by Rank Say for a text with 100 tokens Count How many tokens occur 1 time (50) How many tokens occur 2 times (20) … How many tokens occur 7 times (10) … How many tokens occur 12 times (1) How many tokens occur 14 times (1) So things that occur the most time shave the highest rank (rank 1). Things that occur the fewest times have the lowest rank (rank n).A Standard Collection: A Standard Collection 8164 the 4771 of 4005 to 2834 a 2827 and 2802 in 1592 The 1370 for 1326 is 1324 s 1194 that 973 by 969 on 915 FT 883 Mr 860 was 855 be 849 Pounds 798 TEXT 798 PUB 798 PROFILE 798 PAGE 798 HEADLINE 798 DOCNO 1 ABC 1 ABFT 1 ABOUT 1 ACFT 1 ACI 1 ACQUI 1 ACQUISITIONS 1 ACSIS 1 ADFT 1 ADVISERS 1 AE Government documents, 157734 tokens, 32259 uniqueHousing Listing Frequency Data: Housing Listing Frequency Data 6208 tokens, 1318 unique (very small collection)Observation: MANY phenomena can be characterized this way.: Observation: MANY phenomena can be characterized this way. Words in a text collection Library book checkout patterns Incoming Web Page Requests (Nielsen) Outgoing Web Page Requests (Cunha & Crovella) Document Size on Web (Cunha & Crovella) Zipf Distribution(linear and log scale): Zipf Distribution (linear and log scale) Zipf Distribution: Zipf Distribution The product of the frequency of words (f) and their rank (r) is approximately constant Rank = order of words’ frequency of occurrence Another way to state this is with an approximately correct rule of thumb: Say the most common term occurs C times The second most common occurs C/2 times The third most common occurs C/3 times …Zipf Distribution: Zipf Distribution The Important Points: a few elements occur very frequently a medium number of elements have medium frequency many elements occur very infrequently Very frequent word stems (Cha-Cha Web Index): Very frequent word stems (Cha-Cha Web Index)Words that occur few times (Cha-Cha Web Index): Words that occur few times (Cha-Cha Web Index)Word Frequency vs. Resolving Power (from van Rijsbergen 79): Word Frequency vs. Resolving Power (from van Rijsbergen 79) The most frequent words are not the most descriptive. Upper cut-off Significant words Words by rank order Frequency of words Lower cut-off Resolving power of Significant words Summary: Summary Content Analysis: transforming raw text into more computationally useful forms Words in text collections exhibit interesting statistical properties Word frequencies have a Zipf distribution Word co-occurrences exhibit dependencies Text documents are transformed to vectors Pre-processing includes tokenization, stemming, collocations/phrases Documents occupy multi-dimensional space.Next Time: Next Time Statistical Properties, Cont. Statistical Dependence Word Associations Implementation Issues Inverted Indices You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
Lecture17 Bernadette Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 145 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: November 19, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Introduction to Content Analysis: Introduction to Content Analysis SIMS 202 Profs. Hearst & Larson UC Berkeley SIMS Fall 1999 Today: Today Finish Boolean Intro to Content Analysis Tokenization Stemming/Morphological analysis Statistical Characteristics of Text Collections Zipf distribution Query Languages: Query Languages A way to express the question (information need) Types: Boolean Natural Language Stylized Natural Language Form-Based (GUI)Simple query language: Boolean: Simple query language: Boolean Terms + Connectors terms words normalized (stemmed) words phrases thesaurus terms connectors AND OR NOTBoolean Queries: Boolean Queries Cat Cat OR Dog Cat AND Dog (Cat AND Dog) (Cat AND Dog) OR Collar (Cat AND Dog) OR (Collar AND Leash) (Cat OR Dog) AND (Collar OR Leash) Boolean Queries: Boolean Queries (Cat OR Dog) AND (Collar OR Leash) Each of the following combinations satisfies this statement: Cat x x x x x Dog x x x x Collar x x x x Leash x x x x xBoolean Queries: Boolean Queries (Cat OR Dog) AND (Collar OR Leash) None of the following combinations work: Cat x Dog x x Collar x x Leash x x Boolean Queries: Boolean Queries Usually expressed as INFIX operators in IR ((a AND b) OR (c AND b)) NOT is UNARY PREFIX operator ((a AND b) OR (c AND (NOT b))) AND and OR can be n-ary operators (a AND b AND c AND d) Some rules - (De Morgan revisited) NOT(a) AND NOT(b) = NOT(a OR b) NOT(a) OR NOT(b)= NOT(a AND b) NOT(NOT(a)) = a Boolean Searching: Boolean Searching “Measurement of the width of cracks in prestressed concrete beams” Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete Cracks Beams Width measurement Prestressed concrete Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)Boolean Logic: Boolean Logic t3 t1 t2 D1 D2 D3 D4 D5 D6 D8 D7 D9 D10 D11 m1 m2 m3 m5 m4 m7 m8 m6 m2 = t1 t2 t3 m1 = t1 t2 t3 m4 = t1 t2 t3 m3 = t1 t2 t3 m6 = t1 t2 t3 m5 = t1 t2 t3 m8 = t1 t2 t3 m7 = t1 t2 t3 Precedence Ordering: Precedence Ordering In what order do we evaluate the components of the Boolean expression? Parenthesis get done first (a or b) and (c or d) (a or (b and c) or d) Usually start from the left and work right (in case of ties) Usually (if there are no parentheses) NOT before AND AND before OR Pseudo-Boolean Queries: Pseudo-Boolean Queries A new notation, from web search +cat dog +collar leash These are prefix operators Does not mean the same thing as AND/OR! + means “mandatory, must be in document” - means “cannot be in the document” Phrases: “stray cat” AND “frayed collar” is equivalent to +“stray cat” +“frayed collar”Boolean: Boolean Advantages simple queries are easy to understand relatively easy to implement Disadvantages difficult to specify what is wanted too much returned, or too little ordering not well determined Dominant language in commercial systems until the WWWResult Sets: Result Sets Run a query, get a result set Two choices Reformulate query, run on entire collection Reformulate query, run on result set Example: Dialog query (Redford AND Newman) -> S1 1450 documents (S1 AND Sundance) ->S2 898 documentsFaceted Boolean Query: Faceted Boolean Query Strategy: break query into facets (polysemous with earlier meaning of facets) conjunction of disjunctions (a1 OR a2 OR a3) (b1 OR b2) (c1 OR c2 OR c3 OR c4) each facet expresses a topic (“rain forest” OR jungle OR amazon) (medicine OR remedy OR cure) (Smith OR Zhou) AND AND Ordering of Retrieved Documents: Ordering of Retrieved Documents Pure Boolean has no ordering In practice: order chronologically order by total number of “hits” on query terms What if one term has more hits than others? Is it better to one of each term or many of one term? Fancier methods have been investigated p-norm is most famous usually impractical to implement usually hard for user to understandFaceted Boolean Query: Faceted Boolean Query Query still fails if one facet missing Alternative: Coordination level ranking Order results in terms of how many facets (disjuncts) are satisfied Also called Quorum ranking, Overlap ranking, and Best Match Problem: Facets still undifferentiated Alternative: Assign weights to facets Proximity Searches: Proximity Searches Proximity: terms occur within K positions of one another pen w/5 paper A “Near” function can be more vague near(pen, paper) Sometimes order can be specified Also, Phrases and Collocations “United Nations” “Bill Clinton” Phrase Variants “retrieval of information” “information retrieval” Filters: Filters Filters: Reduce set of candidate docs Often specified simultaneous with query Usually restrictions on metadata restrict by: date range internet domain (.edu .com .berkeley.edu) author size limit number of documents returnedContent Analysis: Content Analysis Automated Transformation of raw text into a form that represent some aspect(s) of its meaning Including, but not limited to: Automated Thesaurus Generation Phrase Detection Categorization Clustering SummarizationTechniques for Content Analysis: Techniques for Content Analysis Statistical Single Document Full Collection Linguistic Syntactic Semantic Pragmatic Knowledge-Based (Artificial Intelligence) Hybrid (Combinations)Text Processing: Text Processing Standard Steps: Recognize document structure titles, sections, paragraphs, etc. Break into tokens usually space and punctuation delineated special issues with Asian languages Stemming/morphological analysis Store in inverted index (to be discussed later) Document Processing Steps: Document Processing StepsStemming and Morphological Analysis: Stemming and Morphological Analysis Goal: “normalize” similar words Morphology (“form” of words) Inflectional Morphology E.g,. inflect verb endings and noun number Never change grammatical class dog, dogs tengo, tienes, tiene, tenemos, tienen Derivational Morphology Derive one word from another, Often change grammatical class build, building; health, healthy Errors Generated by Porter Stemmer (Krovetz 93): Errors Generated by Porter Stemmer (Krovetz 93)Automated Methods: Automated Methods Stemmers: Very dumb rules work well (for English) Porter Stemmer: Iteratively remove suffixes Improvement: pass results through a lexicon Powerful multilingual tools exist for morphological analysis PCKimmo, Xerox Lexical technology Require a grammar and dictionary Use “two-level” automataStatistical Properties of Text: Statistical Properties of Text Token occurrences in text are not uniformly distributed They are also not normally distributed They do exhibit a Zipf distribution Plotting Word Frequency by Rank: Plotting Word Frequency by Rank Main idea: count How many tokens occur 1 time How many tokens occur 2 times How many tokens occur 3 times … Now rank these according to how of they occur. This is called the rank.Plotting Word Frequency by Rank: Plotting Word Frequency by Rank Say for a text with 100 tokens Count How many tokens occur 1 time (50) How many tokens occur 2 times (20) … How many tokens occur 7 times (10) … How many tokens occur 12 times (1) How many tokens occur 14 times (1) So things that occur the most time shave the highest rank (rank 1). Things that occur the fewest times have the lowest rank (rank n).A Standard Collection: A Standard Collection 8164 the 4771 of 4005 to 2834 a 2827 and 2802 in 1592 The 1370 for 1326 is 1324 s 1194 that 973 by 969 on 915 FT 883 Mr 860 was 855 be 849 Pounds 798 TEXT 798 PUB 798 PROFILE 798 PAGE 798 HEADLINE 798 DOCNO 1 ABC 1 ABFT 1 ABOUT 1 ACFT 1 ACI 1 ACQUI 1 ACQUISITIONS 1 ACSIS 1 ADFT 1 ADVISERS 1 AE Government documents, 157734 tokens, 32259 uniqueHousing Listing Frequency Data: Housing Listing Frequency Data 6208 tokens, 1318 unique (very small collection)Observation: MANY phenomena can be characterized this way.: Observation: MANY phenomena can be characterized this way. Words in a text collection Library book checkout patterns Incoming Web Page Requests (Nielsen) Outgoing Web Page Requests (Cunha & Crovella) Document Size on Web (Cunha & Crovella) Zipf Distribution(linear and log scale): Zipf Distribution (linear and log scale) Zipf Distribution: Zipf Distribution The product of the frequency of words (f) and their rank (r) is approximately constant Rank = order of words’ frequency of occurrence Another way to state this is with an approximately correct rule of thumb: Say the most common term occurs C times The second most common occurs C/2 times The third most common occurs C/3 times …Zipf Distribution: Zipf Distribution The Important Points: a few elements occur very frequently a medium number of elements have medium frequency many elements occur very infrequently Very frequent word stems (Cha-Cha Web Index): Very frequent word stems (Cha-Cha Web Index)Words that occur few times (Cha-Cha Web Index): Words that occur few times (Cha-Cha Web Index)Word Frequency vs. Resolving Power (from van Rijsbergen 79): Word Frequency vs. Resolving Power (from van Rijsbergen 79) The most frequent words are not the most descriptive. Upper cut-off Significant words Words by rank order Frequency of words Lower cut-off Resolving power of Significant words Summary: Summary Content Analysis: transforming raw text into more computationally useful forms Words in text collections exhibit interesting statistical properties Word frequencies have a Zipf distribution Word co-occurrences exhibit dependencies Text documents are transformed to vectors Pre-processing includes tokenization, stemming, collocations/phrases Documents occupy multi-dimensional space.Next Time: Next Time Statistical Properties, Cont. Statistical Dependence Word Associations Implementation Issues Inverted Indices