radev

Uploaded from authorPOINTLite
Views:
 
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Information Retrieval: 

Information Retrieval Dragomir R. Radev University of Michigan radev@umich.edu September 19, 2005

About the instructor: 

About the instructor Dragomir R. Radev Associate Professor, University of Michigan School of Information Department of Electrical Engineering and Computer Science Department of Linguistics Head of CLAIR (Computational Linguistics And Information Retrieval) at U. Michigan Treasurer, North American Chapter of the ACL Ph.D., 1998, Computer Science, Columbia University Email: radev@umich.edu Home page: http://tangra.si.umich.edu/~radev

Introduction: 

Introduction

IR systems: 

IR systems Google Vivísimo AskJeeves NSIR Lemur MG Nutch

Examples of IR systems: 

Examples of IR systems Conventional (library catalog). Search by keyword, title, author, etc. Text-based (Lexis-Nexis, Google, FAST). Search by keywords. Limited search using queries in natural language. Multimedia (QBIC, WebSeek, SaFe) Search by visual appearance (shapes, colors,… ). Question answering systems (AskJeeves, NSIR, Answerbus) Search in (restricted) natural language

Need for IR: 

Need for IR Advent of WWW - more than 8 Billion documents indexed on Google How much information? 200TB according to Lyman and Varian 2003. http://www.sims.berkeley.edu/research/projects/how-much-info/ Search, routing, filtering User’s information need

Some definitions of Information Retrieval (IR): 

Some definitions of Information Retrieval (IR) Salton (1989): “Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests.” Kowalski (1997): “An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects).”

Sample queries (from Excite): 

Sample queries (from Excite) In what year did baseball become an offical sport? play station codes . com birth control and depression government "WorkAbility I"+conference kitchen appliances where can I find a chines rosewood tiger electronics 58 Plymouth Fury How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero? emeril Lagasse Hubble M.S Subalaksmi running

Mappings and abstractions: 

Mappings and abstractions Reality Data Information need Query From Korfhage’s book

Typical IR system: 

Typical IR system (Crawling) Indexing Retrieval User interface

Key Terms Used in IR: 

Key Terms Used in IR QUERY: a representation of what the user is looking for - can be a list of words or a phrase. DOCUMENT: an information entity that the user wants to retrieve COLLECTION: a set of documents INDEX: a representation of information that makes querying easier TERM: word or concept that appears in a document or a query

Documents: 

Documents

Documents: 

Documents Not just printed paper collections vs. documents data structures: representations Bag of words method document surrogates: keywords, summaries encoding: ASCII, Unicode, etc.

Document preprocessing: 

Document preprocessing Formatting Tokenization (Paul’s, Willow Dr., Dr. Willow, 555-1212, New York, ad hoc) Casing (cat vs. CAT) Stemming (computer, computation) Soundex

Document representations: 

Document representations Term-document matrix (m x n) term-term matrix (m x m x n) document-document matrix (n x n) Example: 3,000,000 documents (n) with 50,000 terms (m) sparse matrices Boolean vs. integer matrices

Document representations: 

Document representations Term-document matrix Evaluating queries (e.g., (AB)C) Storage issues Inverted files Storage issues Evaluating queries Advantages and disadvantages

IR models: 

IR models

Major IR models: 

Major IR models Boolean Vector Probabilistic Language modeling Fuzzy retrieval Latent semantic indexing

Major IR tasks: 

Major IR tasks Ad-hoc Filtering and routing Question answering Spoken document retrieval Multimedia retrieval

Venn diagrams: 

Venn diagrams x w y z D1 D2

Boolean model: 

Boolean model A B

Boolean queries: 

restaurants AND (Mideastern OR vegetarian) AND inexpensive Boolean queries What types of documents are returned? Stemming thesaurus expansion inclusive vs. exclusive OR confusing uses of AND and OR dinner AND sports AND symphony 4 OF (Pentium, printer, cache, PC, monitor, computer, personal)

Boolean queries: 

Boolean queries Weighting (Beethoven AND sonatas) precedence coffee AND croissant OR muffin raincoat AND umbrella OR sunglasses Use of negation: potential problems Conjunctive and Disjunctive normal forms Full CNF and DNF

Transformations: 

Transformations De Morgan’s Laws: NOT (A AND B) = (NOT A) OR (NOT B) NOT (A OR B) = (NOT A) AND (NOT B) CNF or DNF? Reference librarians prefer CNF - why?

Boolean model: 

Boolean model Partition Partial relevance? Operators: AND, NOT, OR, parentheses

Exercise: 

Exercise D1 = “computer information retrieval” D2 = “computer retrieval” D3 = “information” D4 = “computer information” Q1 = “information  retrieval” Q2 = “information  ¬computer”

Exercise: 

Exercise ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))

Stop lists: 

Stop lists 250-300 most common words in English account for 50% or more of a given text. Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%. Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%). Token/type ratio: 2256/859 = 2.63

Vector models: 

Vector models Term 1 Term 2 Term 3 Doc 1 Doc 2 Doc 3

Vector queries: 

Vector queries Each document is represented as a vector non-efficient representations (bit vectors) dimensional compatibility

The matching process: 

The matching process Document space Matching is done between a document and a query (or between two documents) distance vs. similarity Euclidean distance, Manhattan distance, Word overlap, Jaccard coefficient, etc.

Miscellaneous similarity measures: 

Miscellaneous similarity measures The Cosine measure  (D,Q) = =  (di x qi)  (di)2 *  (qi)2 |X  Y| |X| * |Y|  (D,Q) = |X  Y| |X  Y| The Jaccard coefficient

Exercise: 

Exercise Compute the cosine measures  (D1,D2) and  (D1,D3) for the documents: D1 = <1,3>, D2 = <100,300> and D3 = <3,1> Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.

Evaluation: 

Evaluation

Relevance: 

Relevance Difficult to change: fuzzy, inconsistent Methods: exhaustive, sampling, pooling, search-based

Contingency table: 

Contingency table w x y z n2 = w + y n1 = w + x N relevant not relevant retrieved not retrieved

Precision and Recall: 

Precision and Recall Recall: Precision: w w+y w+x w

Exercise: 

Exercise Go to Google (www.google.com) and search for documents on Tolkien’s “Lord of the Rings”. Try different ways of phrasing the query: e.g., Tolkien, “JRR Melville”, +”JRR Tolkien” +Lord of the Rings”, etc. For each query, compute the precision (P) based on the first 10 documents returned by AltaVista. Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like. Try different information needs. Later, try different queries.

Slide41: 

[From Salton’s book]

Slide42: 

Interpolated average precision (e.g., 11pt) Interpolation – what is precision at recall=0.5?

Issues: 

Issues Why not use accuracy A=(w+z)/N? Average precision Average P at given “document cutoff values” Report when P=R F measure: F=(b2+1)PR/(b2P+R) F1 measure: F1 = 2/(1/R+1/P) : harmonic mean of P and R

Kappa: 

Kappa N: number of items (index i) n: number of categories (index j) k: number of annotators

Kappa example (from Manning, Schuetze, Raghavan): 

Kappa example (from Manning, Schuetze, Raghavan)

Kappa (cont’d): 

Kappa (cont’d) P(A) = 370/400 P (-) = (10+20+20+70)/800 = 0.2125 P (+) = (10+20+300+300)/800 = 0.7878 P (E) = 0.2125 * 0.2125 + 0.7878 * 0.7878 = 0.665 K = (0.925-0.665)/(1-0.665) = 0.776 Kappa higher than 0.67 is tentatively acceptable; higher than 0.8 is good

Relevance collections: 

Relevance collections TREC ad hoc collections, 2-6 GB TREC Web collections, 2-100GB

Sample TREC query: 

Sample TREC query <top> <num> Number: 305 <title> Most Dangerous Vehicles <desc> Description: Which are the most crashworthy, and least crashworthy, passenger vehicles? <narr> Narrative: A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example. </top> LA031689-0177 FT922-1008 LA090190-0126 LA101190-0218 LA082690-0158 LA112590-0109 FT944-136 LA020590-0119 FT944-5300 LA052190-0048 LA051689-0139 FT944-9371 LA032390-0172 LA042790-0172 LA021790-0136 LA092289-0167 LA111189-0013 LA120189-0179 LA020490-0021 LA122989-0063 LA091389-0119 LA072189-0048 FT944-15615 LA091589-0101 LA021289-0208

Slide49: 

<DOCNO> LA031689-0177 </DOCNO> <DOCID> 31701 </DOCID> <DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE> <SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION> <LENGTH><P>586 words </P></LENGTH> <HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE> <BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE> <TEXT> <P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P> <P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. </P> <P>Several Fatalities </P> <P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P> <P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai. </P> ... </TEXT> <GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said. </P></GRAPHIC> <SUBJECT> <P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P> </SUBJECT> </DOC>

TREC (cont’d): 

TREC (cont’d) http://trec.nist.gov/tracks.html http://trec.nist.gov/presentations/presentations.html

Word distribution models: 

Word distribution models

Shakespeare: 

Shakespeare Romeo and Juliet: And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; You, 289; Thou, 277; Me, 262; Not, 257; With, 234; It, 224; For, 223; This, 215; Be, 207; But, 181; Thy, 167; What, 163; O, 160; As, 156; Her, 150; Will, 147; So, 145; Thee, 139; Love, 135; His, 128; Have, 127; He, 120; Romeo, 115; By, 114; She, 114; Shall, 107; Your, 103; No, 102; Come, 96; Him, 96; All, 92; Do, 89; From, 86; Then, 83; Good, 82; Now, 82; Here, 80; If, 80; An, 78; Go, 76; On, 76; I'll, 71; Death, 69; Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63; Which, 60; … A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; Abound'st, 1; Abroach, 1; Absolved, 1; Abuse, 1; Abused, 1; Abuses, 1; Accents, 1; Access, 1; Accident, 1; Accidents, 1; According, 1; Accursed, 1; Accustom'd, 1; Ache, 1; Aches, 1; Aching, 1; Acknowledge, 1; Acquaint, 1; Acquaintance, 1; Acted, 1; Acting, 1; Action, 1; Acts, 1; Adam, 1; Add, 1; Added, 1; Adding, 1; Addle, 1; Adjacent, 1; Admired, 1; Ado, 1; Advance, 1; Adversary, 1; Adversity's, 1; Advise, 1; Afeard, 1; Affecting, 1; Afflicted, 1; Affliction, 1; Affords, 1; Affray, 1; Affright, 1; Afire, 1; Agate-stone, 1; Agile, 1; Agree, 1; Agrees, 1; Aim'd, 1; Alderman, 1; All-cheering, 1; All-seeing, 1; Alla, 1; Alliance, 1; Alligator, 1; Allow, 1; Ally, 1; Although, 1; http://www.mta75.org/curriculum/english/Shakes/indexx.html

The BNC (Adam Kilgarriff): 

The BNC (Adam Kilgarriff) 1 6187267 the det 2 4239632 be v 3 3093444 of prep 4 2687863 and conj 5 2186369 a det 6 1924315 in prep 7 1620850 to infinitive-marker 8 1375636 have v 9 1090186 it pron 10 1039323 to prep 11 887877 for prep 12 884599 i pron 13 760399 that conj 14 695498 you pron 15 681255 he pron 16 680739 on prep 17 675027 with prep 18 559596 do v 19 534162 at prep 20 517171 by prep Kilgarriff, A. Putting Frequencies in the Dictionary. International Journal of Lexicography 10 (2) 1997. Pp 135--155

Stop lists: 

Stop lists 250-300 most common words in English account for 50% or more of a given text. Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%. Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%). Token/type ratio: 2256/859 = 2.63

Zipf’s law: 

Zipf’s law Rank x Frequency  Constant

Zipf's law is fairly general!: 

Zipf's law is fairly general! Frequency of accesses to web pages in particular the access counts on the Wikipedia page, with s approximately equal to 0.3 page access counts on Polish Wikipedia (data for late July 2003) approximately obey Zipf's law with s about 0.5 Words in the English language for instance, in Shakespeare’s play Hamlet with s approximately 0.5 Sizes of settlements Income distributions amongst individuals Size of earthquakes Notes in musical performances http://en.wikipedia.org/wiki/Zipf's_law

Zipf’s law (cont’d): 

Zipf’s law (cont’d) Limitations: Low and high frequencies Lack of convergence Power law with coefficient c = -1 Y=kxc Li (1992) – typing words one letter at a time, including spaces

Heap’s law: 

Heap’s law Size of vocabulary: V(n) = Knb In English, K is between 10 and 100, β is between 0.4 and 0.6. n V(n) http://en.wikipedia.org/wiki/Heaps%27_law

Heap’s law (cont’d): 

Heap’s law (cont’d) Related to Zipf’s law: generative models Zipf’s and Heap’s law coefficients change with language Alexander Gelbukh, Grigori Sidorov. Zipf and Heaps Laws’ Coefficients Depend on Language. Proc. CICLing-2001, Conference on Intelligent Text Processing and Computational Linguistics, February 18–24, 2001, Mexico City. Lecture Notes in Computer Science N 2004, ISSN 0302-9743, ISBN 3-540-41687-0, Springer-Verlag, pp. 332–335.

Indexing: 

Indexing

Methods: 

Methods Manual: e.g., Library of Congress subject headings, MeSH Automatic

LOC subject headings: 

LOC subject headings http://www.loc.gov/catdir/cpso/lcco/lcco.html A -- GENERAL WORKS B -- PHILOSOPHY. PSYCHOLOGY. RELIGION C -- AUXILIARY SCIENCES OF HISTORY D -- HISTORY (GENERAL) AND HISTORY OF EUROPE E -- HISTORY: AMERICA F -- HISTORY: AMERICA G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION H -- SOCIAL SCIENCES J -- POLITICAL SCIENCE K -- LAW L -- EDUCATION M -- MUSIC AND BOOKS ON MUSIC N -- FINE ARTS P -- LANGUAGE AND LITERATURE Q -- SCIENCE R -- MEDICINE S -- AGRICULTURE T -- TECHNOLOGY U -- MILITARY SCIENCE V -- NAVAL SCIENCE Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)

Medicine: 

Medicine CLASS R - MEDICINE Subclass R R5-920 Medicine (General) R5-130.5 General works R131-687 History of medicine. Medical expeditions R690-697 Medicine as a profession. Physicians R702-703 Medicine and the humanities. Medicine and disease in relation to history, literature, etc. R711-713.97 Directories R722-722.32 Missionary medicine. Medical missionaries R723-726 Medical philosophy. Medical ethics R726.5-726.8 Medicine and disease in relation to psychology. Terminal care. Dying R727-727.5 Medical personnel and the public. Physician and the public R728-733 Practice of medicine. Medical practice economics R735-854 Medical education. Medical schools. Research R855-855.5 Medical technology R856-857 Biomedical engineering. Electronics. Instrumentation R858-859.7 Computer applications to medicine. Medical informatics R864 Medical records R895-920 Medical physics. Medical radiology. Nuclear medicine

Finding the most frequent terms in a document: 

Finding the most frequent terms in a document Typically stop words: the, and, in Not content-bearing Terms vs. words Luhn’s method

Luhn’s method: 

Luhn’s method WORDS FREQUENCY E

Computing term salience: 

Computing term salience Term frequency (IDF) Document frequency (DF) Inverse document frequency (IDF)

Applications of TFIDF: 

Applications of TFIDF Cosine similarity Indexing Clustering

Vector-based matching: 

Vector-based matching The cosine measure sim (D,C) = (dk . ck . idf(k)) (dk)2 . (ck)2 k S S S k k

IDF: Inverse document frequency: 

IDF: Inverse document frequency N: number of documents dk: number of documents containing term k fik: absolute frequency of term k in document i wik: weight of term k in document i idfk = log2(N/dk) + 1 = log2N - log2dk + 1 TF * IDF is used for automated indexing and for topic discrimination:

Asian and European news: 

Asian and European news 622.941 deng 306.835 china 196.725 beijing 153.608 chinese 152.113 xiaoping 124.591 jiang 108.777 communist 102.894 body 85.173 party 71.898 died 68.820 leader 43.402 state 38.166 people 97.487 nato 92.151 albright 74.652 belgrade 46.657 enlargement 34.778 alliance 34.778 french 33.803 opposition 32.571 russia 14.095 government 9.389 told 9.154 would 8.459 their 6.059 which

Other topics: 

Other topics 120.385 shuttle 99.487 space 90.128 telescope 70.224 hubble 59.992 rocket 50.160 astronauts 49.722 discovery 47.782 canaveral 47.782 cape 40.889 mission 35.778 florida 27.063 center 74.652 compuserve 65.321 massey 55.989 salizzoni 29.996 bob 27.994 online 27.198 executive 15.890 interim 15.271 chief 11.647 service 11.174 second 6.781 world 6.315 president

Compression: 

Compression

Compression: 

Compression Methods Fixed length codes Huffman coding Ziv-Lempel codes

Fixed length codes: 

Fixed length codes Binary representations ASCII Representational power (2k symbols where k is the number of bits)

Variable length codes: 

Variable length codes Alphabet: A .-  N -.  0 ----- B -...  O ---  1 .---- C -.-.  P .--.  2 ..--- D -..  Q --.-  3 ...— E .  R .-. 4 ....- F ..-. S ... 5 ..... G --. T -  6 -.... H .... U ..-  7 --... I ..  V ...-  8 ---.. J .---  W .--  9 ----. K -.-  X -..- L .-..  Y -.— M --  Z --.. Demo: http://www.babbage.demon.co.uk/morse.html http://www.scphillips.com/morse/

Most frequent letters in English: 

Most frequent letters in English Most frequent letters: E T A O I N S H R D L U http://www.math.cornell.edu/~mec/modules/cryptography/subs/frequencies.html Demo: http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm Also: bigrams: TH HE IN ER AN RE ND AT ON NT http://www.math.cornell.edu/~mec/modules/cryptography/subs/digraphs.html

Useful links about cryptography: 

Useful links about cryptography http://world.std.com/~franl/crypto.html http://www.faqs.org/faqs/cryptography-faq/ http://en.wikipedia.org/wiki/Cryptography

Huffman coding: 

Huffman coding Developed by David Huffman (1952) Average of 5 bits per character (37.5% compression) Based on frequency distributions of symbols Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols

Slide80: 

0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 c b d f g i j h e a

Exercise: 

Exercise Consider the bit string: 01101101111000100110001110100111000110101101011101 Use the Huffman code from the example to decode it. Try inserting, deleting, and switching some bits at random locations and try decoding.

Ziv-Lempel coding: 

Ziv-Lempel coding Two types - one is known as LZ77 (used in GZIP) Code: set of triples <a,b,c> a: how far back in the decoded text to look for the upcoming text segment b: how many characters to copy c: new character to add to complete segment

Slide84: 

<0,0,p> p <0,0,e> pe <0,0,t> pet <2,1,r> peter <0,0,_> peter_ <6,1,i> peter_pi <8,2,r> peter_piper <6,3,c> peter_piper_pic <0,0,k> peter_piper_pick <7,1,d> peter_piper_picked <7,1,a> peter_piper_picked_a <9,2,e> peter_piper_picked_a_pe <9,2,_> peter_piper_picked_a_peck_ <0,0,o> peter_piper_picked_a_peck_o <0,0,f> peter_piper_picked_a_peck_of <17,5,l> peter_piper_picked_a_peck_of_pickl <12,1,d> peter_piper_picked_a_peck_of_pickled <16,3,p> peter_piper_picked_a_peck_of_pickled_pep <3,2,r> peter_piper_picked_a_peck_of_pickled_pepper <0,0,s> peter_piper_picked_a_peck_of_pickled_peppers

Links on text compression: 

Links on text compression Data compression: http://www.data-compression.info/ Calgary corpus: http://links.uwaterloo.ca/calgary.corpus.html Huffman coding: http://www.compressconsult.com/huffman/ http://en.wikipedia.org/wiki/Huffman_coding LZ http://en.wikipedia.org/wiki/LZ77

Relevance feedback and query expansion: 

Relevance feedback and query expansion

Relevance feedback: 

Relevance feedback Problem: initial query may not be the most appropriate to satisfy a given information need. Idea: modify the original query so that it gets closer to the right documents in the vector space

Relevance feedback: 

Relevance feedback Automatic Manual Method: identifying feedback terms Q’ = a1Q + a2R - a3N Often a1 = 1, a2 = 1/|R| and a3 = 1/|N|

Example: 

Example Q = “safety minivans” D1 = “car safety minivans tests injury statistics” - relevant D2 = “liability tests safety” - relevant D3 = “car passengers injury reviews” - non-relevant R = ? S = ? Q’ = ?

Pseudo relevance feedback: 

Pseudo relevance feedback Automatic query expansion Thesaurus-based expansion (e.g., using latent semantic indexing – later…) Distributional similarity Query log mining

Examples: 

Examples Book: publication, product, fact, dramatic composition, record Computer: machine, expert, calculator, reckoner, figurer Fruit: reproductive structure, consequence, product, bear Politician: leader, schemer Newspaper: press, publisher, product, paper, newsprint Distributional clustering: Lexical semantics (Hypernymy): Book: autobiography, essay, biography, memoirs, novels Computer: adobe, computing, computers, developed, hardware Fruit: leafy, canned, fruits, flowers, grapes Politician: activist, campaigner, politicians, intellectuals, journalist Newspaper: daily, globe, newspapers, newsday, paper

Examples (query logs): 

Examples (query logs) Book: booksellers, bookmark, blue Computer: sales, notebook, stores, shop Fruit: recipes cake salad basket company Games: online play gameboy free video Politician: careers federal office history Newspaper: online website college information Schools: elementary high ranked yearbook California: berkeley san francisco southern French: embassy dictionary learn

Problems with automatic query expansion: 

Problems with automatic query expansion Adding frequent words may dilute the results (example?)

Stemming: 

Stemming

Goals: 

Goals Motivation: Computer, computers, computerize, computational, computerization User, users, using, used Representing related words as one token Simplify matching Reduce storage and computation Also known as: term conflation

Methods: 

Methods Manual (tables) Achievement  achiev Achiever  achiev Etc. Affix removal (Harman 1991, Frakes 1992) if a word ends in “ies” but not “eies” or “aies” then “ies”  “y” If a word ends in “es” but not “aes”, “ees”, or “oes”, then “es”  “e” If a word ends in “s” but not “us” or “ss” then “s”  NULL (apply only the first applicable rule)

Porter’s algorithm (Porter 1980): 

Porter’s algorithm (Porter 1980) Home page: http://www.tartarus.org/~martin/PorterStemmer Reading assignment: http://www.tartarus.org/~martin/PorterStemmer/def.txt Consonant-vowel sequences: CVCV ... C CVCV ... V VCVC ... C VCVC ... V Shorthand: [C]VCVC ... [V]

Porter’s algorithm (cont’d): 

Porter’s algorithm (cont’d) [C](VC){m}[V] {m} indicates repetition Examples: m=0 TR, EE, TREE, Y, BY m=1 TROUBLE, OATS, TREES, IVY m=2 TROUBLES, PRIVATE, OATEN Conditions: *S - the stem ends with S (and similarly for the other letters). *v* - the stem contains a vowel. *d - the stem ends with a double consonant (e.g. -TT, -SS). *o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).

Slide99: 

Step 1a SSES -> SS caresses -> caress IES -> I ponies -> poni ties -> ti SS -> SS caress -> caress S -> cats -> cat Step 1b (m>0) EED -> EE feed -> feed agreed -> agree (*v*) ED -> plastered -> plaster bled -> bled (*v*) ING -> motoring -> motor sing -> sing Step 1b1 If the second or third of the rules in Step 1b is successful, the following is done: AT -> ATE conflat(ed) -> conflate BL -> BLE troubl(ed) -> trouble IZ -> IZE siz(ed) -> size (*d and not (*L or *S or *Z)) -> single letter hopp(ing) -> hop tann(ed) -> tan fall(ing) -> fall hiss(ing) -> hiss fizz(ed) -> fizz (m=1 and *o) -> E fail(ing) -> fail fil(ing) -> file

Slide100: 

Step 1c (*v*) Y -> I happy -> happi sky -> sky Step 2 (m>0) ATIONAL -> ATE relational -> relate (m>0) TIONAL -> TION conditional -> condition rational -> rational (m>0) ENCI -> ENCE valenci -> valence (m>0) ANCI -> ANCE hesitanci -> hesitance (m>0) IZER -> IZE digitizer -> digitize (m>0) ABLI -> ABLE conformabli -> conformable (m>0) ALLI -> AL radicalli -> radical (m>0) ENTLI -> ENT differentli -> different (m>0) ELI -> E vileli - > vile (m>0) OUSLI -> OUS analogousli -> analogous (m>0) IZATION -> IZE vietnamization -> vietnamize (m>0) ATION -> ATE predication -> predicate (m>0) ATOR -> ATE operator -> operate (m>0) ALISM -> AL feudalism -> feudal (m>0) IVENESS -> IVE decisiveness -> decisive (m>0) FULNESS -> FUL hopefulness -> hopeful (m>0) OUSNESS -> OUS callousness -> callous (m>0) ALITI -> AL formaliti -> formal (m>0) IVITI -> IVE sensitiviti -> sensitive (m>0) BILITI -> BLE sensibiliti -> sensible

Slide101: 

Step 3 (m>0) ICATE -> IC triplicate -> triplic (m>0) ATIVE -> formative -> form (m>0) ALIZE -> AL formalize -> formal (m>0) ICITI -> IC electriciti -> electric (m>0) ICAL -> IC electrical -> electric (m>0) FUL -> hopeful -> hope (m>0) NESS -> goodness -> good Step 4 (m>1) AL -> revival -> reviv (m>1) ANCE -> allowance -> allow (m>1) ENCE -> inference -> infer (m>1) ER -> airliner -> airlin (m>1) IC -> gyroscopic -> gyroscop (m>1) ABLE -> adjustable -> adjust (m>1) IBLE -> defensible -> defens (m>1) ANT -> irritant -> irrit (m>1) EMENT -> replacement -> replac (m>1) MENT -> adjustment -> adjust (m>1) ENT -> dependent -> depend (m>1 and (*S or *T)) ION -> adoption -> adopt (m>1) OU -> homologou -> homolog (m>1) ISM -> communism -> commun (m>1) ATE -> activate -> activ (m>1) ITI -> angulariti -> angular (m>1) OUS -> homologous -> homolog (m>1) IVE -> effective -> effect (m>1) IZE -> bowdlerize -> bowdler

Slide102: 

Step 5a (m>1) E -> probate -> probat rate -> rate (m=1 and not *o) E -> cease -> ceas Step 5b (m > 1 and *d and *L) -> single letter controll -> control roll -> roll

Porter’s algorithm (cont’d): 

Porter’s algorithm (cont’d) Example: the word “duplicatable” duplicat rule 4 duplicate rule 1b1 duplic rule 3 The application of another rule in step 4, removing “ic,” cannot be applied since one rule from each step is allowed to be applied. % cd /clair4/class/ir-w03/tf-idf % ./stem.pl computers computers comput

Porter’s algorithm: 

Porter’s algorithm

Stemming: 

Stemming Not always appropriate (e.g., proper names, titles) The same applies to casing (e.g., CAT vs. cat)

String matching: 

String matching

String matching methods: 

String matching methods Index-based Full or approximate E.g., theater = theatre

Index-based matching: 

Index-based matching Inverted files Position-based inverted files Block-based inverted files 1 6 9 11 1719 24 28 33 40 46 50 55 60 This is a text. A text has many words. Words are made from letters. Text: 11, 19 Words: 33, 40 From: 55

Inverted index (trie): 

Inverted index (trie) Letters: 60 Text: 11, 19 Words: 33, 40 Made: 50 Many: 28 l m t w a d n

Sequential searching: 

Sequential searching No indexing structure given Given: database d and search pattern p. Example: find “words” in the earlier example Brute force method try all possible starting positions O(n) positions in the database and O(m) characters in the pattern so the total worst-case runtime is O(mn) Typical runtime is actually O(n) given that mismatches are easy to notice

Knuth-Morris-Pratt: 

Knuth-Morris-Pratt Average runtime similar to BF Worst case runtime is linear: O(n) Idea: reuse knowledge Need preprocessing of the pattern

Knuth-Morris-Pratt (cont’d): 

Knuth-Morris-Pratt (cont’d) Example (http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm) database: ABC ABC ABC ABDAB ABCDABCDABDE pattern: ABCDABD index 0 1 2 3 4 5 6 7 char A B C D A B D – pos -1 0 0 0 0 1 2 0 1234567 ABCDABD ABCDABD

Knuth-Morris-Pratt (cont’d): 

Knuth-Morris-Pratt (cont’d) ABC ABC ABC ABDAB ABCDABCDABDE ABCDABD ^ ABC ABC ABC ABDAB ABCDABCDABDE ABCDABD ^ ABC ABC ABC ABDAB ABCDABCDABDE ABCDABD ^ ABC ABC ABC ABDAB ABCDABCDABDE ABCDABD ^ ABC ABC ABC ABDAB ABCDABCDABDE ABCDABD ^

Boyer-Moore: 

Boyer-Moore Used in text editors Demos http://www-sr.informatik.uni-tuebingen.de/~buehler/BM/BM.html http://www.blarg.com/~doyle/pages/bmi.html

Word similarity: 

Word similarity Hamming distance - when words are of the same length Levenshtein distance - number of edits (insertions, deletions, replacements) color --> colour (1) survey --> surgery (2) com puter --> computer ? Longest common subsequence (LCS) lcs (survey, surgery) = surey

Levenshtein edit distance: 

Levenshtein edit distance Examples: Theatre-> theater Ghaddafi->Qadafi Computer->counter Edit distance (inserts, deletes, substitutions) Edit transcript Done through dynamic programming

Recurrence relation: 

Recurrence relation Three dependencies D(i,0)=i D(0,j)=j D(i,j)=min[D(i-1,j)+1,D(1,j-1)+1,D(i-1,j-1)+t(i,j)] Simple edit distance: t(i,j) = 0 iff S1(i)=S2(j)

Example: 

Example Gusfield 1997

Example (cont’d): 

Example (cont’d) Gusfield 1997

Tracebacks: 

Tracebacks Gusfield 1997

Weighted edit distance: 

Weighted edit distance Used to emphasize the relative cost of different edit operations Useful in bioinformatics Homology information BLAST Blosum http://eta.embl-heidelberg.de:8000/misc/mat/blosum50.html

Slide122: 

Web sites: http://www.merriampark.com/ld.htm http://odur.let.rug.nl/~kleiweg/lev/

Clustering: 

Clustering

Clustering: 

Clustering Exclusive/overlapping clusters Hierarchical/flat clusters The cluster hypothesis Documents in the same cluster are relevant to the same query

Representations for document clustering: 

Representations for document clustering Typically: vector-based Words: “cat”, “dog”, etc. Features: document length, author name, etc. Each document is represented as a vector in an n-dimensional space Similar documents appear nearby in the vector space (distance measures are needed)

Hierarchical clustering Dendrograms: 

Hierarchical clustering Dendrograms http://odur.let.rug.nl/~kleiweg/clustering/clustering.html E.g., language similarity:

Another example: 

Another example Kingdom = animal Phylum = Chordata Subphylum = Vertebrata Class = Osteichthyes Subclass = Actinoptergyii Order = Salmoniformes Family = Salmonidae Genus = Oncorhynchus Species = Oncorhynchus kisutch (Coho salmon)

Clustering using dendrograms: 

Clustering using dendrograms REPEAT Compute pairwise similarities Identify closest pair Merge pair into single node UNTIL only one node left Q: what is the equivalent Venn diagram representation? Example: cluster the following sentences: A B C B A A D C C A D E C D E F C D A E F G F D A A C D A B A

Methods: 

Methods Single-linkage One common pair is sufficient disadvantages: long chains Complete-linkage All pairs have to match Disadvantages: too conservative Average-linkage Centroid-based (online) Look at distances to centroids Demo: /clair4/class/ir-w05/clustering

k-means: 

k-means Needed: small number k of desired clusters hard vs. soft decisions Example: Weka

k-means: 

k-means 1 initialize cluster centroids to arbitrary vectors 2 while further improvement is possible do 3 for each document d do 4 find the cluster c whose centroid is closest to d 5 assign d to cluster c 6 end for 7 for each cluster c do 8 recompute the centroid of cluster c based on its documents 9 end for 10 end while

Example: 

Example Cluster the following vectors into two groups: A = <1,6> B = <2,2> C = <4,0> D = <3,3> E = <2,5> F = <2,1>

Complexity: 

Complexity Complexity = O(kn) because at each step, n documents have to be compared to k centroids.

Weka: 

Weka A general environment for machine learning (e.g. for classification and clustering) Book by Witten and Frank www.cs.waikato.ac.nz/ml/weka

Demos: 

Demos http://vivisimo.com/ http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html http://cgm.cs.mcgill.ca/~godfried/student_projects/bonnef_k-means http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster http://www.cc.gatech.edu/~dellaert/html/software.html http://www-2.cs.cmu.edu/~awm/tutorials/kmeans11.pdf http://www.ece.neu.edu/groups/rpl/projects/kmeans/

Human clustering: 

Human clustering Significant disagreement in the number of clusters, overlap of clusters, and the composition of clusters (Maczkassy et al. 1998).

Lexical networks: 

Lexical networks

Lexical Networks: 

Lexical Networks Used to represent relationships between words Example: WordNet - created by George Miller’s team at Princeton Based on synsets (synonyms, interchangeable words) and lexical matrices

Lexical matrix: 

Lexical matrix

Synsets: 

Synsets Disambiguation {board, plank} {board, committee} Synonyms substitution weak substitution synonyms must be of the same part of speech

Slide141: 

$ ./wn board -hypen Synonyms/Hypernyms (Ordered by Frequency) of noun board 9 senses of board Sense 1 board => committee, commission => administrative unit => unit, social unit => organization, organisation => social group => group, grouping Sense 2 board => sheet, flat solid => artifact, artefact => object, physical object => entity, something Sense 3 board, plank => lumber, timber => building material => artifact, artefact => object, physical object => entity, something

Slide142: 

Sense 4 display panel, display board, board => display => electronic device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something Sense 5 board, gameboard => surface => artifact, artefact => object, physical object => entity, something Sense 6 board, table => fare => food, nutrient => substance, matter => object, physical object => entity, something

Slide143: 

Sense 7 control panel, instrument panel, control board, board, panel => electrical device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something Sense 8 circuit board, circuit card, board, card => printed circuit => computer circuit => circuit, electrical circuit, electric circuit => electrical device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something Sense 9 dining table, board => table => furniture, piece of furniture, article of furniture => furnishings => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something

Antonymy: 

Antonymy “x” vs. “not-x” “rich” vs. “poor”? {rise, ascend} vs. {fall, descend}

Other relations: 

Other relations Meronymy: X is a meronym of Y when native speakers of English accept sentences similar to “X is a part of Y”, “X is a member of Y”. Hyponymy: {tree} is a hyponym of {plant}. Hierarchical structure based on hyponymy (and hypernymy).

Other features of WordNet: 

Other features of WordNet Index of familiarity Polysemy

Familiarity and polysemy: 

board used as a noun is familiar (polysemy count = 9) bird used as a noun is common (polysemy count = 5) cat used as a noun is common (polysemy count = 7) house used as a noun is familiar (polysemy count = 11) information used as a noun is common (polysemy count = 5) retrieval used as a noun is uncommon (polysemy count = 3) serendipity used as a noun is very rare (polysemy count = 1) Familiarity and polysemy

Compound nouns: 

Compound nouns advisory board appeals board backboard backgammon board baseboard basketball backboard big board billboard binder's board binder board blackboard board game board measure board meeting board member board of appeals board of directors board of education board of regents board of trustees

Overview of senses: 

Overview of senses 1. board -- (a committee having supervisory powers; "the board has seven members") 2. board -- (a flat piece of material designed for a special purpose; "he nailed boards across the windows") 3. board, plank -- (a stout length of sawn timber; made in a wide variety of sizes and used for many purposes) 4. display panel, display board, board -- (a board on which information can be displayed to public view) 5. board, gameboard -- (a flat portable surface (usually rectangular) designed for board games; "he got out the board and set up the pieces") 6. board, table -- (food or meals in general; "she sets a fine table"; "room and board") 7. control panel, instrument panel, control board, board, panel -- (an insulated panel containing switches and dials and meters for controlling electrical devices; "he checked the instrument panel"; "suddenly the board lit up like a Christmas tree") 8. circuit board, circuit card, board, card -- (a printed circuit that can be inserted into expansion slots in a computer to increase the computer's capabilities) 9. dining table, board -- (a table at which meals are served; "he helped her clear the dining table"; "a feast was spread upon the board")

Top-level concepts: 

Top-level concepts {act, action, activity} {animal, fauna} {artifact} {attribute, property} {body, corpus} {cognition, knowledge} {communication} {event, happening} {feeling, emotion} {food} {group, collection} {location, place} {motive} {natural object} {natural phenomenon} {person, human being} {plant, flora} {possession} {process} {quantity, amount} {relation} {shape} {state, condition} {substance} {time}

WordNet and DistSim: 

WordNet and DistSim wn reason -hypen - hypernyms wn reason -synsn - synsets wn reason -simsn - synonyms wn reason -over - overview of senses wn reason -famln - familiarity/polysemy wn reason -grepn - compound nouns /data2/tools/relatedwords/relate reason

System comparison: 

System comparison

Comparing two systems: 

Comparing two systems Comparing A and B One query? Average performance? Need: A to consistently outperform B [this slide: courtesy James Allan]

The sign test: 

The sign test Example 1: A > B (12 times) A = B (25 times) A < B (3 times) p < 0.035 (significant at the 5% level) Example 2: A > B (18 times) A < B (9 times) p < 0.122 (not significant at the 5% level) [this slide: courtesy James Allan]

Other tests: 

Other tests The t test: Takes into account the actual performances, not just which system is better http://nimitz.mcs.kent.edu/~blewis/stat/tTest.html The sign test: http://www.fon.hum.uva.nl/Service/Statistics/Sign_Test.html

Techniques for dimensionality reduction: SVD and LSI: 

Techniques for dimensionality reduction: SVD and LSI

Techniques for dimensionality reduction: 

Techniques for dimensionality reduction Based on matrix decomposition (goal: preserve clusters, explain away variance) A quick review of matrices Vectors Matrices Matrix multiplication

SVD: Singular Value Decomposition: 

SVD: Singular Value Decomposition A=USVT This decomposition exists for all matrices, dense or sparse If A has 5 columns and 3 rows, then U will be 5x5 and V will be 3x3 In Matlab, use [U,S,V] = svd (A)

Term matrix normalization: 

Term matrix normalization D1 D2 D3 D4 D5 D1 D2 D3 D4 D5

Example (Berry and Browne): 

Example (Berry and Browne) T1: baby T2: child T3: guide T4: health T5: home T6: infant T7: proofing T8: safety T9: toddler D1: infant & toddler first aid D2: babies & children’s room (for your home) D3: child safety at home D4: your baby’s health and safety: from infant to toddler D5: baby proofing basics D6: your guide to easy rust proofing D7: beanie babies collector’s guide

Document term matrix: 

Document term matrix

Decomposition: 

Decomposition u = -0.6976 -0.0945 0.0174 -0.6950 0.0000 0.0153 0.1442 -0.0000 0 -0.2622 0.2946 0.4693 0.1968 -0.0000 -0.2467 -0.1571 -0.6356 0.3098 -0.3519 -0.4495 -0.1026 0.4014 0.7071 -0.0065 -0.0493 -0.0000 0.0000 -0.1127 0.1416 -0.1478 -0.0734 0.0000 0.4842 -0.8400 0.0000 -0.0000 -0.2622 0.2946 0.4693 0.1968 0.0000 -0.2467 -0.1571 0.6356 -0.3098 -0.1883 0.3756 -0.5035 0.1273 -0.0000 -0.2293 0.0339 -0.3098 -0.6356 -0.3519 -0.4495 -0.1026 0.4014 -0.7071 -0.0065 -0.0493 0.0000 -0.0000 -0.2112 0.3334 0.0962 0.2819 -0.0000 0.7338 0.4659 -0.0000 0.0000 -0.1883 0.3756 -0.5035 0.1273 -0.0000 -0.2293 0.0339 0.3098 0.6356 v = -0.1687 0.4192 -0.5986 0.2261 0 -0.5720 0.2433 -0.4472 0.2255 0.4641 -0.2187 0.0000 -0.4871 -0.4987 -0.2692 0.4206 0.5024 0.4900 -0.0000 0.2450 0.4451 -0.3970 0.4003 -0.3923 -0.1305 0 0.6124 -0.3690 -0.4702 -0.3037 -0.0507 -0.2607 -0.7071 0.0110 0.3407 -0.3153 -0.5018 -0.1220 0.7128 -0.0000 -0.0162 -0.3544 -0.4702 -0.3037 -0.0507 -0.2607 0.7071 0.0110 0.3407

Decomposition: 

Decomposition s = 1.5849 0 0 0 0 0 0 0 1.2721 0 0 0 0 0 0 0 1.1946 0 0 0 0 0 0 0 0.7996 0 0 0 0 0 0 0 0.7100 0 0 0 0 0 0 0 0.5692 0 0 0 0 0 0 0 0.1977 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Spread on the v1 axis

Rank-4 approximation: 

Rank-4 approximation s4 = 1.5849 0 0 0 0 0 0 0 1.2721 0 0 0 0 0 0 0 1.1946 0 0 0 0 0 0 0 0.7996 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Rank-4 approximation: 

Rank-4 approximation u*s4*v' -0.0019 0.5985 -0.0148 0.4552 0.7002 0.0102 0.7002 -0.0728 0.4961 0.6282 0.0745 0.0121 -0.0133 0.0121 0.0003 -0.0067 0.0052 -0.0013 0.3584 0.7065 0.3584 0.1980 0.0514 0.0064 0.2199 0.0535 -0.0544 0.0535 -0.0728 0.4961 0.6282 0.0745 0.0121 -0.0133 0.0121 0.6337 -0.0602 0.0290 0.5324 -0.0008 0.0003 -0.0008 0.0003 -0.0067 0.0052 -0.0013 0.3584 0.7065 0.3584 0.2165 0.2494 0.4367 0.2282 -0.0360 0.0394 -0.0360 0.6337 -0.0602 0.0290 0.5324 -0.0008 0.0003 -0.0008

Rank-4 approximation: 

Rank-4 approximation u*s4 -1.1056 -0.1203 0.0207 -0.5558 0 0 0 -0.4155 0.3748 0.5606 0.1573 0 0 0 -0.5576 -0.5719 -0.1226 0.3210 0 0 0 -0.1786 0.1801 -0.1765 -0.0587 0 0 0 -0.4155 0.3748 0.5606 0.1573 0 0 0 -0.2984 0.4778 -0.6015 0.1018 0 0 0 -0.5576 -0.5719 -0.1226 0.3210 0 0 0 -0.3348 0.4241 0.1149 0.2255 0 0 0 -0.2984 0.4778 -0.6015 0.1018 0 0 0

Rank-4 approximation: 

Rank-4 approximation s4*v' -0.2674 -0.7087 -0.4266 -0.6292 -0.7451 -0.4996 -0.7451 0.5333 0.2869 0.5351 0.5092 -0.3863 -0.6384 -0.3863 -0.7150 0.5544 0.6001 -0.4686 -0.0605 -0.1457 -0.0605 0.1808 -0.1749 0.3918 -0.1043 -0.2085 0.5700 -0.2085 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Rank-2 approximation: 

Rank-2 approximation s2 = 1.5849 0 0 0 0 0 0 0 1.2721 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Rank-2 approximation: 

Rank-2 approximation u*s2*v' 0.1361 0.4673 0.2470 0.3908 0.5563 0.4089 0.5563 0.2272 0.2703 0.2695 0.3150 0.0815 -0.0571 0.0815 -0.1457 0.1204 -0.0904 -0.0075 0.4358 0.4628 0.4358 0.1057 0.1205 0.1239 0.1430 0.0293 -0.0341 0.0293 0.2272 0.2703 0.2695 0.3150 0.0815 -0.0571 0.0815 0.2507 0.2412 0.2813 0.3097 -0.0048 -0.1457 -0.0048 -0.1457 0.1204 -0.0904 -0.0075 0.4358 0.4628 0.4358 0.2343 0.2454 0.2685 0.3027 0.0286 -0.1073 0.0286 0.2507 0.2412 0.2813 0.3097 -0.0048 -0.1457 -0.0048

Rank-2 approximation: 

Rank-2 approximation u*s2 -1.1056 -0.1203 0 0 0 0 0 -0.4155 0.3748 0 0 0 0 0 -0.5576 -0.5719 0 0 0 0 0 -0.1786 0.1801 0 0 0 0 0 -0.4155 0.3748 0 0 0 0 0 -0.2984 0.4778 0 0 0 0 0 -0.5576 -0.5719 0 0 0 0 0 -0.3348 0.4241 0 0 0 0 0 -0.2984 0.4778 0 0 0 0 0

Rank-2 approximation: 

Rank-2 approximation s2*v' -0.2674 -0.7087 -0.4266 -0.6292 -0.7451 -0.4996 -0.7451 0.5333 0.2869 0.5351 0.5092 -0.3863 -0.6384 -0.3863 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Documents to concepts and terms to concepts: 

Documents to concepts and terms to concepts A(:,1)'*u*s -0.4238 0.6784 -0.8541 0.1446 -0.0000 -0.1853 0.0095 >> A(:,1)'*u*s4 -0.4238 0.6784 -0.8541 0.1446 0 0 0 >> A(:,1)'*u*s2 -0.4238 0.6784 0 0 0 0 0 >> A(:,2)'*u*s2 -1.1233 0.3650 0 0 0 0 0 >> A(:,3)'*u*s2 -0.6762 0.6807 0 0 0 0 0

Documents to concepts and terms to concepts: 

Documents to concepts and terms to concepts >> A(:,4)'*u*s2 -0.9972 0.6478 0 0 0 0 0 >> A(:,5)'*u*s2 -1.1809 -0.4914 0 0 0 0 0 >> A(:,6)'*u*s2 -0.7918 -0.8121 0 0 0 0 0 >> A(:,7)'*u*s2 -1.1809 -0.4914 0 0 0 0 0

Cont’d: 

Cont’d >> (s2*v'*A(1,:)')' -1.7523 -0.1530 0 0 0 0 0 0 0 >> (s2*v'*A(2,:)')' -0.6585 0.4768 0 0 0 0 0 0 0 >> (s2*v'*A(3,:)')' -0.8838 -0.7275 0 0 0 0 0 0 0 >> (s2*v'*A(4,:)')' -0.2831 0.2291 0 0 0 0 0 0 0 >> (s2*v'*A(5,:)')' -0.6585 0.4768 0 0 0 0 0 0 0

Cont’d: 

Cont’d >> (s2*v'*A(6,:)')' -0.4730 0.6078 0 0 0 0 0 0 0 >> (s2*v'*A(7,:)')' -0.8838 -0.7275 0 0 0 0 0 0 0 >> (s2*v'*A(8,:)')' -0.5306 0.5395 0 0 0 0 0 0 0 >> (s2*v'*A(9,:)')‘ -0.4730 0.6078 0 0 0 0 0 0 0

Properties: 

Properties A*A' 1.5471 0.3364 0.5041 0.2025 0.3364 0.2025 0.5041 0.2025 0.2025 0.3364 0.6728 0 0 0.6728 0 0 0.3364 0 0.5041 0 1.0082 0 0 0 0.5041 0 0 0.2025 0 0 0.2025 0 0.2025 0 0.2025 0.2025 0.3364 0.6728 0 0 0.6728 0 0 0.3364 0 0.2025 0 0 0.2025 0 0.7066 0 0.2025 0.7066 0.5041 0 0.5041 0 0 0 1.0082 0 0 0.2025 0.3364 0 0.2025 0.3364 0.2025 0 0.5389 0.2025 0.2025 0 0 0.2025 0 0.7066 0 0.2025 0.7066 A'*A 1.0082 0 0 0.6390 0 0 0 0 1.0092 0.6728 0.2610 0.4118 0 0.4118 0 0.6728 1.0092 0.2610 0 0 0 0.6390 0.2610 0.2610 1.0125 0.3195 0 0.3195 0 0.4118 0 0.3195 1.0082 0.5041 0.5041 0 0 0 0 0.5041 1.0082 0.5041 0 0.4118 0 0.3195 0.5041 0.5041 1.0082 A is a document to term matrix. What is A*A’, what is A’*A?

Latent semantic indexing (LSI): 

Latent semantic indexing (LSI) Dimensionality reduction = identification of hidden (latent) concepts Query matching in latent space

Useful pointers: 

Useful pointers http://lsa.colorado.edu http://lsi.research.telcordia.com/ http://www.cs.utk.edu/~lsi/ http://javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm http://citeseer.nj.nec.com/deerwester90indexing.html http://www.pcug.org.au/~jdowling/

Models of the Web: 

Models of the Web

Size: 

Size The Web is the largest repository of data and it grows exponentially. 320 Million Web pages [Lawrence & Giles 1998] 800 Million Web pages, 15 TB [Lawrence & Giles 1999] 8 Billion Web pages indexed [Google 2005] Amount of data roughly 200 TB [Lyman et al. 2003]

Bow-tie model of the Web: 

Bow-tie model of the Web SCC 56 M OUT 44 M IN 44 M Bröder & al. WWW 2000, Dill & al. VLDB 2001 DISC 17 M TEND 44M 24% of pages reachable from a given page

Power laws: 

Power laws Web site size (Huberman and Adamic 1999) Power-law connectivity (Barabasi and Albert 1999): exponents 2.45 for out-degree and 2.1 for the in-degree Others: call graphs among telephone carriers, citation networks (Redner 1998), e.g., Erdos, collaboration graph of actors, metabolic pathways (Jeong et al. 2000), protein networks (Maslov and Sneppen 2002). All values of gamma are around 2-3.

Small-world networks: 

Small-world networks Diameter = average length of the shortest path between all pairs of nodes. Example… Milgram experiment (1967) Kansas/Omaha --> Boston (42/160 letters) diameter = 6 Albert et al. 1999 – average distance between two verstices is d = 0.35 + 2.06 log10n. For n = 109, d=18.89. Six degrees of separation

Clustering coefficient: 

Clustering coefficient Cliquishness (c): between the kv (kv – 1)/2 pairs of neighbors. Examples:

Models of the Web: 

Models of the Web Erdös/Rényi 59, 60 Barabási/Albert 99 Watts/Strogatz 98 Kleinberg 98 Menczer 02 Radev 03 Evolving networks: fundamental object of statistical physics, social networks, mathematical biology, and epidemiology

Self-triggerability across hyperlinks: 

Self-triggerability across hyperlinks Document closures for information retrieval Self-triggerability [Mosteller&Wallace 84]  Poisson distribution Two-Poisson [Bookstein&Swanson 74] Negative Binomial, K-mixture [Church&Gale 95] Triggerability across hyperlinks?

Evolving Word-based Web: 

Evolving Word-based Web Observations: Links are made based on topics Topics are expressed with words Words are distributed very unevenly (Zipf, Benford, self-triggerability laws) Model Pick n Generate n lengths according to a power-law distribution Generate n documents using a trigram model Model (cont’d) Pick words in decreasing order of r. Generate hyperlinks with random directionality Outcome Generates power-law degree distributions Generates topical communities Natural variation of PageRank: LexRank

Social network analysis for IR: 

Social network analysis for IR

Social networks: 

Social networks Induced by a relation Symmetric or not Examples: Friendship networks Board membership Citations Power grid of the US WWW

Slide190: 

Krebs 2004

Prestige and centrality: 

Prestige and centrality Degree centrality: how many neighbors each node has. Closeness centrality: how close a node is to all of the other nodes Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects. Prestige = same as centrality but for directed graphs.

Graph-based representations: 

Graph-based representations Square connectivity (incidence) matrix Graph G (V,E)

Markov chains: 

Markov chains A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel E. Path = sequence (x0, x1, …, xn). Xi = xi-1*E The probability of a path can be computed as a product of probabilities for each step i. Random walk = find Xj given x0, E, and j.

Stationary solutions: 

Stationary solutions The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions: E is stochastic E is irreducible E is aperiodic To make these conditions true: All rows of E add up to 1 (and no value is negative) Make sure that E is strongly connected Make sure that E is not bipartite Example: PageRank [Brin and Page 1998]: use “teleportation”

Slide195: 

Example This graph E has a second graph E’ (not drawn) superimposed on it: E’ is the uniform transition graph.

Eigenvectors: 

Eigenvectors An eigenvector is an implicit “direction” for a matrix. Mv = λv, where v is non-zero, though λ can be any complex number in principle. The largest eigenvalue of a stochastic matrix E is real: λ1 = 1. For λ1, the left (principal) eigenvector is p, the right eigenvector = 1 In other words, ETp = p.

Computing the stationary distribution: 

Computing the stationary distribution function PowerStatDist (E): begin p(0) = u; (or p(0) = [1,0,…0]) i=1; repeat p(i) = ETp(i-1) L = ||p(i)-p(i-1)||1; i = i + 1; until L <  return p(i) end Solution for the stationary distribution

Slide198: 

Example

How Google works: 

How Google works Crawling Anchor text Fast query processing Pagerank

More about PageRank: 

More about PageRank Named after Larry Page, founder of Google (and UM alum) Reading “The anatomy of a large-scale hypertextual web search engine” by Brin and Page. Independent of query (although more recent work by Haveliwala (WWW 2002) has also identified topic-based PageRank.

HITS: 

HITS Query-dependent model (Kleinberg 97) Hubs and authorities (e.g., cars, Honda) Algorithm obtain root set using input query expanded the root set by radius one run iterations on the hub and authority scores together report top-ranking authorities and hubs

The link-content hypothesis: 

The link-content hypothesis Topical locality: page is similar () to the page that points to it (). Davison (TF*IDF, 100K pages) 0.31 same domain 0.23 linked pages 0.19 sibling 0.02 random Menczer (373K pages, non-linear least squares fit) Chakrabarti (focused crawling) - prob. of losing the topic Van Rijsbergen 1979, Chakrabarti & al. WWW 1999, Davison SIGIR 2000, Menczer 2001 1=1.8, 2=0.6,

Measuring the Web: 

Measuring the Web

Bharat and Broder 1998: 

Bharat and Broder 1998 Based on crawls of HotBot, Altavista, Excite, and InfoSeek 10,000 queries in mid and late 1997 Estimate is 200M pages Only 1.4% are indexed by all of them

Example (from Bharat&Broder): 

Example (from Bharat&Broder) A similar approach by Lawrence and Giles yields 320M pages (Lawrence and Giles 1998).

Crawling the web: 

Crawling the web

Basic principles: 

Basic principles The HTTP/HTML protocols Following hyperlinks Some problems: Link extraction Link normalization Robot exclusion Loops Spider traps Server overload

Example: 

Example U-M’s root robots.txt file: http://www.umich.edu/robots.txt User-agent: * Disallow: /~websvcs/projects/ Disallow: /%7Ewebsvcs/projects/ Disallow: /~homepage/ Disallow: /%7Ehomepage/ Disallow: /~smartgl/ Disallow: /%7Esmartgl/ Disallow: /~gateway/ Disallow: /%7Egateway/

Example crawler: 

Example crawler E.g., poacher http://search.cpan.org/~neilb/Robot-0.011/examples/poacher /data0/projects/perltree-index

Slide210: 

&ParseCommandLine(); &Initialise(); $robot->run($siteRoot) #======================================================================= # Initialise() - initialise global variables, contents, tables, etc # This function sets up various global variables such as the version number # for WebAssay, the program name identifier, usage statement, etc. #======================================================================= sub Initialise { $robot = new WWW::Robot( 'NAME' => $BOTNAME, 'VERSION' => $VERSION, 'EMAIL' => $EMAIL, 'TRAVERSAL' => $TRAVERSAL, 'VERBOSE' => $VERBOSE, ); $robot->addHook('follow-url-test', \&follow_url_test); $robot->addHook('invoke-on-contents', \&process_contents); $robot->addHook('invoke-on-get-error', \&process_get_error); } #======================================================================= # follow_url_test() - tell the robot module whether is should follow link #======================================================================= sub follow_url_test {} #======================================================================= # process_get_error() - hook function invoked whenever a GET fails #======================================================================= sub process_get_error {} #======================================================================= # process_contents() - process the contents of a URL we've retrieved #======================================================================= sub process_contents { run_command($COMMAND, $filename) if defined $COMMAND; }

Focused crawling: 

Focused crawling Topical locality Pages that are linked are similar in content (and vice-versa: Davison 00, Menczer 02, 04, Radev et al. 04) The radius-1 hypothesis given that page i is relevant to a query and that page i points to page j, then page j is also likely to be relevant (at least, more so than a random web page) Focused crawling Keeping a priority queue of the most relevant pages

Question answering: 

Question answering

People ask questions: 

People ask questions Excite corpus of 2,477,283 queries (one day’s worth) 8.4% of them are questions 43.9% factual (what is the country code for Belgium) 56.1% procedural (how do I set up TCP/IP) or other In other words, 100 K questions per day

People ask questions: 

People ask questions In what year did baseball become an offical sport? Who is the largest man in the world? Where can i get information on Raphael? where can i find information on puritan religion? Where can I find how much my house is worth? how do i get out of debt? Where can I found out how to pass a drug test? When is the Super Bowl? who is California's District State Senator? where can I buy extra nibs for a foutain pen? how do i set up tcp/ip ? what time is it in west samoa? Where can I buy a little kitty cat? what are the symptoms of attention deficit disorder? Where can I get some information on Michael Jordan? How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero? When did the Neanderthal man live? Which Frenchman declined the Nobel Prize for Literature for ideological reasons? What is the largest city in Northern Afghanistan? How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero? When did the Neanderthal man live?

Question answering: 

Question answering What is the largest city in Northern Afghanistan?

Possible approaches: 

Possible approaches Map? Knowledge base Find x: city (x)  located (x,”Northern Afghanistan”)   ¬exists (y): city (y)  located (y,”Northern Afghanistan”)   greaterthan (population (y), population (x)) Database? World factbook? Search engine?

The TREC Q&A evaluation: 

The TREC Q&A evaluation Run by NIST [Voorhees and Tice 2000] 2GB of input 200 questions Essentially fact extraction Who was Lincoln’s secretary of state? What does the Peugeot company manufacture? Questions are based on text Answers are assumed to be present No inference needed

Question answering: 

Q: When did Nelson Mandela become president of South Africa? A: 10 May 1994 Q: How tall is the Matterhorn? A: The institute revised the Matterhorn 's height to 14,776 feet 9 inches Q: How tall is the replica of the Matterhorn at Disneyland? A: In fact he has climbed the 147-foot Matterhorn at Disneyland every week end for the last 3 1/2 years Q: If Iraq attacks a neighboring country, what should the US do? A: ?? Question answering

NSIR: 

NSIR Current project at U-M http://tangra.si.umich.edu/clair/NSIR/html/nsir.cgi Reading: [Radev et al., 2005a] Dragomir R. Radev, Weiguo Fan, Hong Qi, Harris Wu, and Amardeep Grewal. Probabilistic question answering on the web. Journal of the American Society for Information Science and Technology 56(3), March 2005 http://tangra.si.umich.edu/~radev/bib2html/radev-bib.html

Slide223: 

... Afghanistan, Kabul, 2,450 ... Administrative capital and largest city (1997 est ... Undetermined. Panama, Panama City, 450,668. ... of the Gauteng, Northern Province, Mpumalanga ... www.infoplease.com/cgi-bin/id/A0855603 ... died in Kano, northern Nigeria's largest city, during two days of anti-American riots led by Muslims protesting the US-led bombing of Afghanistan, according to ... www.washingtonpost.com/wp-dyn/print/world/ ... air strikes on the city. ... the Taliban militia in northern Afghanistan in a significant blow ... defection would be the largest since the United States ... www.afgha.com/index.php - 60k ... Kabul is the capital and largest city of Afghanistan. . ... met. area pop. 2,029,889), is the largest city in Uttar Pradesh, a state in northern India. . ... school.discovery.com/homeworkhelp/worldbook/atozgeography/ k/k1menu.html ... Gudermes, Chechnya's second largest town. The attack ... location in Afghanistan's outlying regions ... in the city of Mazar-i-Sharif, a Northern Alliance-affiliated ... english.pravda.ru/hotspots/2001/09/17/ ... Get Worse By RICK BRAGG Pakistan's largest city is getting a jump on the ... Region: Education Offers Women in Northern Afghanistan a Ray of Hope. ... www.nytimes.com/pages/world/asia/ ... within three miles of the airport at Mazar-e-Sharif, the largest city in northern Afghanistan, held since 1998 by the Taliban. There was no immediate comment ... uk.fc.yahoo.com/photos/a/afghanistan.html Google

Slide224: 

Document retrieval Query modulation Sentence retrieval Answer extraction Answer ranking What is the largest city in Northern Afghanistan? (largest OR biggest) city “Northern Afghanistan” www.infoplease.com/cgi-bin/id/A0855603 www.washingtonpost.com/wp-dyn/print/world/ Gudermes, Chechnya's second largest town … location in Afghanistan's outlying regions within three miles of the airport at Mazar-e-Sharif, the largest city in northern Afghanistan Gudermes Mazer-e-Sharif Mazer-e-Sharif Gudermes

Research problems: 

Research problems Source identification: semi-structured vs. text sources Query modulation: best paraphrase of a NL question given the syntax of a search engine? Compare two approaches: noisy channel model and rule-based Sentence ranking n-gram matching, Okapi, co-reference? Answer extraction question type identification phrase chunking no general-purpose named entity tagger available Answer ranking what are the best predictors of a phrase being the answer to a given question: question type, proximity to query words, frequency Evaluation (MRDR) accuracy, reliability, timeliness

Document retrieval: 

Document retrieval Use existing search engines: Google, AlltheWeb, NorthernLight No modifications to question CF: work on QASM (ACM CIKM 2001)

Sentence ranking: 

Sentence ranking Weighted N-gram matching: Weights are determined empirically, e.g., 0.6, 0.3, and 0.1

Probabilistic phrase reranking: 

Probabilistic phrase reranking Answer extraction: probabilistic phrase reranking. What is: p(ph is answer to q | q, ph) Evaluation: TRDR Example: (2,8,10) gives .725 Document, sentence, or phrase level Criterion: presence of answer(s) High correlation with manual assessment

Phrase types: 

Phrase types PERSON PLACE DATE NUMBER DEFINITION ORGANIZATION DESCRIPTION ABBREVIATION KNOWNFOR RATE LENGTH MONEY REASON DURATION PURPOSE NOMINAL OTHER

Question Type Identification: 

Question Type Identification Wh-type not sufficient: Who: PERSON 77, DESCRIPTION 19, ORG 6 What: NOMINAL 78, PLACE 27, DEF26, PERSON 18, ORG 16, NUMBER 14, etc. How: NUMBER 33, LENGTH 6, RATE 2, etc. Ripper: 13 features: Question-Words, Wh-Word, Word-Beside-Wh-Word, Is-Noun-Length, Is-Noun-Person, etc. Top 2 question types Heuristic algorithm: About 100 regular expressions based on words and parts of speech

Ripper performance: 

Ripper performance

Regex performance: 

Regex performance

Phrase ranking: 

Phrase ranking Phrases are identified by a shallow parser (ltchunk from Edinburgh) Four features: Proximity POS (part-of-speech) signature (qtype) Query overlap Frequency

Proximity: 

Proximity Phrasal answers tend to appear near words from the query Average distance = 7 words, range = 1 to 50 words Use linear rescaling of scores

Part of speech signature: 

Part of speech signature Example: “Hugo/NNP Young/NNP” P (PERSON | “NNP NNP”) = .458 Example: “the/DT Space/NNP Flight/NNP Operations/NNP contractor/NN” P (PERSON | “DT NNP NNP NNP NN”) = 0 Penn Treebank tagset (DT = determiner, JJ = adjective)

Query overlap and frequency: 

Query overlap and frequency Query overlap: What is the capital of Zimbabwe? Possible choices: Mugabe, Zimbabwe, Luanda, Harare Frequency: Not necessarily accurate but rather useful

Reranking: 

Reranking Rank Probability and phrase 1 0.599862 the_DT Space_NNP Flight_NNP Operations_NNP contractor_NN ._. 2 0.598564 International_NNP Space_NNP Station_NNP Alpha_NNP 3 0.598398 International_NNP Space_NNP Station_NNP 4 0.598125 to_TO become_VB 5 0.594763 a_DT joint_JJ venture_NN United_NNP Space_NNP Alliance_NNP 6 0.593933 NASA_NNP Johnson_NNP Space_NNP Center_NNP 7 0.587140 will_MD form_VB 8 0.585410 The_DT purpose_NN 9 0.576797 prime_JJ contracts_NNS 10 0.568013 First_NNP American_NNP 11 0.567361 this_DT bulletin_NN board_NN 12 0.565757 Space_NNP :_: 13 0.562627 'Spirit_NN '_'' of_IN ... 41 0.516368 Alan_NNP Shepard_NNP Proximity = .5164

Reranking : 

Reranking Rank Probability and phrase 1 0.465012 Space_NNP Administration_NNP ._. 2 0.446466 SPACE_NNP CALENDAR_NNP _. 3 0.413976 First_NNP American_NNP 4 0.399043 International_NNP Space_NNP Station_NNP Alpha_NNP 5 0.396250 her_PRP$ third_JJ space_NN mission_NN 6 0.395956 NASA_NNP Johnson_NNP Space_NNP Center_NNP 7 0.394122 the_DT American_NNP Commercial_NNP Launch_NNP Industry_NNP 8 0.390163 the_DT Red_NNP Planet_NNP ._. 9 0.379797 First_NNP American_NNP 10 0.376336 Alan_NNP Shepard_NNP 11 0.375669 February_NNP 12 0.374813 Space_NNP 13 0.373999 International_NNP Space_NNP Station_NNP Qtype = .7288 Proximity * qtype = .3763

Reranking: 

Reranking Rank Probability and phrase 1 0.478857 Neptune_NNP Beach_NNP ._. 2 0.449232 February_NNP 3 0.447075 Go_NNP 4 0.437895 Space_NNP 5 0.431835 Go_NNP 6 0.424678 Alan_NNP Shepard_NNP 7 0.423855 First_NNP American_NNP 8 0.421133 Space_NNP May_NNP 9 0.411065 First_NNP American_NNP woman_NN 10 0.401994 Life_NNP Sciences_NNP 11 0.385763 Space_NNP Shuttle_NNP Discovery_NNP STS-60_NN 12 0.381865 the_DT Moon_NNP International_NNP Space_NNP Station_NNP 13 0.370030 Space_NNP Research_NNP A_NNP Session_NNP All four features

Document level performance: 

Document level performance TREC 8 corpus (200 questions)

Sentence level performance: 

Sentence level performance

Phrase level performance: 

Phrase level performance Experiments performed Oct-Nov. 2001

Text classification: 

Text classification

Introduction: 

Introduction Text classification: assigning documents to predefined categories Hierarchical vs. flat Many techniques: generative (maxent, knn, Naïve Bayes) vs. discriminative (SVM, regression) Generative: model joint prob. p(x,y) and use Bayesian prediction to compute p(y|x) Discriminative: model p(y|x) directly.

Generative models: knn: 

Generative models: knn K-nearest neighbors Very easy to program Issues: choosing k, b?

Feature selection: The 2 test: 

Feature selection: The 2 test For a term t: Testing for independence: P(C=0,It=0) should be equal to P(C=0) P(It=0) P(C=0) = (k00+k01)/n P(C=1) = 1-P(C=0) = (k10+k11)/n P(It=0) = (k00+K10)/n P(It=1) = 1-P(It=0) = (k01+k11)/n

Feature selection: The 2 test: 

Feature selection: The 2 test High values of 2 indicate lower belief in independence. In practice, compute 2 for all words and pick the top k among them.

Feature selection: mutual information: 

Feature selection: mutual information No document length scaling is needed Documents are assumed to be generated according to the multinomial model

Naïve Bayesian classifiers: 

Naïve Bayesian classifiers Naïve Bayesian classifier Assuming statistical independence

Spam recognition: 

Spam recognition Return-Path: <ig_esq@rediffmail.com> X-Sieve: CMU Sieve 2.2 From: "Ibrahim Galadima" <ig_esq@rediffmail.com> Reply-To: galadima_esq@netpiper.com To: webmaster@aclweb.org Date: Tue, 14 Jan 2003 21:06:26 -0800 Subject: Gooday DEAR SIR FUNDS FOR INVESTMENTS THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A

Well-known datasets: 

Well-known datasets 20 newsgroups http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/ Reuters-21578 Cats: grain, acquisitions, corn, crude, wheat, trade… WebKB http://www-2.cs.cmu.edu/~webkb/ course, student, faculty, staff, project, dept, other NB performance (2000) P=26,43,18,6,13,2,94 R=83,75,77,9,73,100,35

Support vector machines: 

Support vector machines Introduced by Vapnik in the early 90s.

Semi-supervised learning: 

Semi-supervised learning EM Co-training Graph-based

Additional topics: 

Additional topics Soft margins VC dimension Kernel methods

Slide262: 

SVMs are widely considered to be the best method for text classification (look at papers by Sebastiani, Christianini, Joachims), e.g. 86% accuracy on Reuters. NB also good in many circumstances

Readings: 

Readings Books: 1. Ricardo Baeza-Yates and Berthier Ribeiro-Neto; Modern Information Retrieval, Addison-Wesley/ACM Press, 1999. 2. Pierre Baldi, Paolo Frasconi, Padhraic Smyth; Modeling the Internet and the Web: Probabilistic Methods and Algorithms; Wiley, 2003, ISBN: 0-470-84906-1 Papers: Barabasi and Albert "Emergence of scaling in random networks" Science (286) 509-512, 1999 Bharat and Broder "A technique for measuring the relative size and overlap of public Web search engines" WWW 1998 Brin and Page "The Anatomy of a Large-Scale Hypertextual Web Search Engine" WWW 1998 Bush "As we may thing" The Atlantic Monthly 1945 Chakrabarti, van den Berg, and Dom "Focused Crawling" WWW 1999 Cho, Garcia-Molina, and Page "Efficient Crawling Through URL Ordering" WWW 1998 Davison "Topical locality on the Web" SIGIR 2000 Dean and Henzinger "Finding related pages in the World Wide Web" WWW 1999 Deerwester, Dumais, Landauer, Furnas, Harshman "Indexing by latent semantic analysis" JASIS 41(6) 1990

Readings: 

Readings Erkan and Radev "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization" JAIR 22, 2004 Jeong and Barabasi "Diameter of the world wide web" Nature (401) 130-131, 1999 Hawking, Voorhees, Craswell, and Bailey "Overview of the TREC-8 Web Track" TREC 2000 Haveliwala "Topic-sensitive pagerank" WWW 2002 Kumar, Raghavan, Rajagopalan, Sivakumar, Tomkins, Upfal "The Web as a graph" PODS 2000 Lawrence and Giles "Accessibility of information on the Web" Nature (400) 107-109, 1999 Lawrence and Giles "Searching the World-Wide Web" Science (280) 98-100, 1998 Menczer "Links tell us about lexical and semantic Web content" arXiv 2001 Page, Brin, Motwani, and Winograd "The PageRank citation ranking: Bringing order to the Web" Stanford TR, 1998 Radev, Fan, Qi, Wu and Grewal "Probabilistic Question Answering on the Web" JASIST 2005 Singhal "Modern Information Retrieval: an Overview" IEEE 2001

More readings: 

More readings Gerard Salton, Automatic Text Processing, Addison-Wesley (1989) Gerald Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer (1997) Gerard Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill (1983) C. J. an Rijsbergen, Information Retrieval, Buttersworths (1979) Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes, Van Nostrand Reinhold (1994) ACM SIGIR Proceedings, SIGIR Forum ACM conferences in Digital Libraries

Slide266: 

Thank you! Благодаря!