Information Retrieval: Information Retrieval
Dragomir R. Radev
University of Michigan
radev@umich.edu September 19, 2005
About the instructor: About the instructor Dragomir R. Radev
Associate Professor, University of Michigan
School of Information
Department of Electrical Engineering and Computer Science
Department of Linguistics
Head of CLAIR (Computational Linguistics And Information Retrieval) at U. Michigan
Treasurer, North American Chapter of the ACL
Ph.D., 1998, Computer Science, Columbia University
Email: radev@umich.edu
Home page: http://tangra.si.umich.edu/~radev
Introduction: Introduction
IR systems: IR systems Google
Vivísimo
AskJeeves
NSIR
Lemur
MG
Nutch
Examples of IR systems: Examples of IR systems Conventional (library catalog). Search by keyword, title, author, etc.
Text-based (Lexis-Nexis, Google, FAST). Search by keywords. Limited search using queries in natural language.
Multimedia (QBIC, WebSeek, SaFe) Search by visual appearance (shapes, colors,… ).
Question answering systems (AskJeeves, NSIR, Answerbus) Search in (restricted) natural language
Need for IR: Need for IR Advent of WWW - more than 8 Billion documents indexed on Google
How much information? 200TB according to Lyman and Varian 2003.
http://www.sims.berkeley.edu/research/projects/how-much-info/
Search, routing, filtering
User’s information need
Some definitions of Information Retrieval (IR): Some definitions of Information Retrieval (IR) Salton (1989): “Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests.” Kowalski (1997): “An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects).”
Sample queries (from Excite): Sample queries (from Excite) In what year did baseball become an offical sport?
play station codes . com
birth control and depression
government
"WorkAbility I"+conference
kitchen appliances
where can I find a chines rosewood
tiger electronics
58 Plymouth Fury
How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
emeril Lagasse
Hubble
M.S Subalaksmi
running
Mappings and abstractions: Mappings and abstractions Reality Data Information need Query From Korfhage’s book
Typical IR system: Typical IR system (Crawling)
Indexing
Retrieval
User interface
Key Terms Used in IR: Key Terms Used in IR QUERY: a representation of what the user is looking for - can be a list of words or a phrase.
DOCUMENT: an information entity that the user wants to retrieve
COLLECTION: a set of documents
INDEX: a representation of information that makes querying easier
TERM: word or concept that appears in a document or a query
Documents: Documents
Documents: Documents Not just printed paper
collections vs. documents
data structures: representations
Bag of words method
document surrogates: keywords, summaries
encoding: ASCII, Unicode, etc.
Document preprocessing: Document preprocessing Formatting
Tokenization (Paul’s, Willow Dr., Dr. Willow, 555-1212, New York, ad hoc)
Casing (cat vs. CAT)
Stemming (computer, computation)
Soundex
Document representations: Document representations Term-document matrix (m x n)
term-term matrix (m x m x n)
document-document matrix (n x n)
Example: 3,000,000 documents (n) with 50,000 terms (m)
sparse matrices
Boolean vs. integer matrices
Document representations: Document representations Term-document matrix
Evaluating queries (e.g., (AB)C)
Storage issues
Inverted files
Storage issues
Evaluating queries
Advantages and disadvantages
IR models: IR models
Major IR models: Major IR models Boolean
Vector
Probabilistic
Language modeling
Fuzzy retrieval
Latent semantic indexing
Major IR tasks: Major IR tasks Ad-hoc
Filtering and routing
Question answering
Spoken document retrieval
Multimedia retrieval
Venn diagrams: Venn diagrams x w y z D1 D2
Boolean model: Boolean model A B
Boolean queries: restaurants AND (Mideastern OR vegetarian) AND inexpensive Boolean queries What types of documents are returned?
Stemming
thesaurus expansion
inclusive vs. exclusive OR
confusing uses of AND and OR dinner AND sports AND symphony 4 OF (Pentium, printer, cache, PC, monitor, computer, personal)
Boolean queries: Boolean queries Weighting (Beethoven AND sonatas)
precedence
coffee AND croissant OR muffin raincoat AND umbrella OR sunglasses Use of negation: potential problems
Conjunctive and Disjunctive normal forms
Full CNF and DNF
Transformations: Transformations De Morgan’s Laws: NOT (A AND B) = (NOT A) OR (NOT B) NOT (A OR B) = (NOT A) AND (NOT B) CNF or DNF?
Reference librarians prefer CNF - why?
Boolean model: Boolean model Partition
Partial relevance?
Operators: AND, NOT, OR, parentheses
Exercise: Exercise D1 = “computer information retrieval”
D2 = “computer retrieval”
D3 = “information”
D4 = “computer information”
Q1 = “information retrieval”
Q2 = “information ¬computer”
Exercise: Exercise ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
Stop lists: Stop lists 250-300 most common words in English account for 50% or more of a given text.
Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%.
Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%).
Token/type ratio: 2256/859 = 2.63
Vector models: Vector models Term 1 Term 2 Term 3 Doc 1 Doc 2 Doc 3
Vector queries: Vector queries Each document is represented as a vector
non-efficient representations (bit vectors)
dimensional compatibility
The matching process: The matching process Document space
Matching is done between a document and a query (or between two documents)
distance vs. similarity
Euclidean distance, Manhattan distance, Word overlap, Jaccard coefficient, etc.
Miscellaneous similarity measures: Miscellaneous similarity measures The Cosine measure (D,Q) = = (di x qi) (di)2 * (qi)2 |X Y| |X| * |Y| (D,Q) = |X Y| |X Y| The Jaccard coefficient
Exercise: Exercise Compute the cosine measures (D1,D2) and (D1,D3) for the documents: D1 = , D2 = and D3 =
Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.
Evaluation: Evaluation
Relevance: Relevance Difficult to change: fuzzy, inconsistent
Methods: exhaustive, sampling, pooling, search-based
Contingency table: Contingency table w x y z n2 = w + y n1 = w + x N relevant not relevant retrieved not retrieved
Precision and Recall: Precision and Recall Recall: Precision: w w+y w+x w
Exercise: Exercise Go to Google (www.google.com) and search for documents on Tolkien’s “Lord of the Rings”. Try different ways of phrasing the query: e.g., Tolkien, “JRR Melville”, +”JRR Tolkien” +Lord of the Rings”, etc. For each query, compute the precision (P) based on the first 10 documents returned by AltaVista.
Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like. Try different information needs.
Later, try different queries.
Slide41: [From Salton’s book]
Slide42: Interpolated average precision (e.g., 11pt)
Interpolation – what is precision at recall=0.5?
Issues: Issues Why not use accuracy A=(w+z)/N?
Average precision
Average P at given “document cutoff values”
Report when P=R
F measure: F=(b2+1)PR/(b2P+R)
F1 measure: F1 = 2/(1/R+1/P) : harmonic mean of P and R
Kappa: Kappa
N: number of items (index i)
n: number of categories (index j)
k: number of annotators
Kappa example (from Manning, Schuetze, Raghavan): Kappa example (from Manning, Schuetze, Raghavan)
Kappa (cont’d): Kappa (cont’d) P(A) = 370/400
P (-) = (10+20+20+70)/800 = 0.2125
P (+) = (10+20+300+300)/800 = 0.7878
P (E) = 0.2125 * 0.2125 + 0.7878 * 0.7878 = 0.665
K = (0.925-0.665)/(1-0.665) = 0.776
Kappa higher than 0.67 is tentatively acceptable; higher than 0.8 is good
Relevance collections: Relevance collections TREC ad hoc collections, 2-6 GB
TREC Web collections, 2-100GB
Sample TREC query: Sample TREC query
Number: 305
Most Dangerous Vehicles
Description:
Which are the most crashworthy, and least crashworthy,
passenger vehicles?
Narrative:
A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example.
LA031689-0177
FT922-1008
LA090190-0126
LA101190-0218
LA082690-0158
LA112590-0109
FT944-136
LA020590-0119
FT944-5300
LA052190-0048
LA051689-0139
FT944-9371
LA032390-0172 LA042790-0172 LA021790-0136 LA092289-0167 LA111189-0013 LA120189-0179 LA020490-0021 LA122989-0063 LA091389-0119 LA072189-0048 FT944-15615 LA091589-0101 LA021289-0208
Slide49: LA031689-0177
31701
March 16, 1989, Thursday, Home Edition
Business; Part 4; Page 1; Column 5; Financial Desk
586 words
AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS
By LINDA WILLIAMS, Times Staff Writer
The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over
accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents.
The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the
Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after
Consumer Reports magazine charged that the vehicle had basic design flaws.
Several Fatalities
However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs,
particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation
conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said.
According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle
roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving
the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the
accident report, NHTSA declined to investigate the Samurai.
...
Photo, The Ford Bronco II "appears to have a higher
number of single-vehicle, first event roll-overs," a federal official
said.
TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS;
RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY
TREC (cont’d): TREC (cont’d) http://trec.nist.gov/tracks.html
http://trec.nist.gov/presentations/presentations.html
Word distribution models: Word distribution models
Shakespeare: Shakespeare Romeo and Juliet:
And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; You, 289; Thou, 277; Me, 262; Not, 257; With, 234; It, 224; For, 223; This, 215; Be, 207; But, 181; Thy, 167; What, 163; O, 160; As, 156; Her, 150; Will, 147; So, 145; Thee, 139; Love, 135; His, 128; Have, 127; He, 120; Romeo, 115; By, 114; She, 114; Shall, 107; Your, 103; No, 102; Come, 96; Him, 96; All, 92; Do, 89; From, 86; Then, 83; Good, 82; Now, 82; Here, 80; If, 80; An, 78; Go, 76; On, 76; I'll, 71; Death, 69; Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63; Which, 60;
…
A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; Abound'st, 1; Abroach, 1; Absolved, 1; Abuse, 1; Abused, 1; Abuses, 1; Accents, 1; Access, 1; Accident, 1; Accidents, 1; According, 1; Accursed, 1; Accustom'd, 1; Ache, 1; Aches, 1; Aching, 1; Acknowledge, 1; Acquaint, 1; Acquaintance, 1; Acted, 1; Acting, 1; Action, 1; Acts, 1; Adam, 1; Add, 1; Added, 1; Adding, 1; Addle, 1; Adjacent, 1; Admired, 1; Ado, 1; Advance, 1; Adversary, 1; Adversity's, 1; Advise, 1; Afeard, 1; Affecting, 1; Afflicted, 1; Affliction, 1; Affords, 1; Affray, 1; Affright, 1; Afire, 1; Agate-stone, 1; Agile, 1; Agree, 1; Agrees, 1; Aim'd, 1; Alderman, 1; All-cheering, 1; All-seeing, 1; Alla, 1; Alliance, 1; Alligator, 1; Allow, 1; Ally, 1; Although, 1; http://www.mta75.org/curriculum/english/Shakes/indexx.html
The BNC (Adam Kilgarriff): The BNC (Adam Kilgarriff) 1 6187267 the det
2 4239632 be v
3 3093444 of prep
4 2687863 and conj
5 2186369 a det
6 1924315 in prep
7 1620850 to infinitive-marker
8 1375636 have v
9 1090186 it pron
10 1039323 to prep
11 887877 for prep
12 884599 i pron
13 760399 that conj
14 695498 you pron
15 681255 he pron
16 680739 on prep
17 675027 with prep
18 559596 do v
19 534162 at prep
20 517171 by prep Kilgarriff, A. Putting Frequencies in the Dictionary. International Journal of Lexicography 10 (2) 1997. Pp 135--155
Stop lists: Stop lists 250-300 most common words in English account for 50% or more of a given text.
Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%.
Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%).
Token/type ratio: 2256/859 = 2.63
Zipf’s law: Zipf’s law Rank x Frequency Constant
Zipf's law is fairly general!: Zipf's law is fairly general! Frequency of accesses to web pages
in particular the access counts on the Wikipedia page,
with s approximately equal to 0.3
page access counts on Polish Wikipedia (data for late July 2003)
approximately obey Zipf's law with s about 0.5
Words in the English language
for instance, in Shakespeare’s play Hamlet with s approximately 0.5
Sizes of settlements
Income distributions amongst individuals
Size of earthquakes
Notes in musical performances http://en.wikipedia.org/wiki/Zipf's_law
Zipf’s law (cont’d): Zipf’s law (cont’d) Limitations:
Low and high frequencies
Lack of convergence
Power law with coefficient c = -1
Y=kxc
Li (1992) – typing words one letter at a time, including spaces
Heap’s law: Heap’s law Size of vocabulary: V(n) = Knb
In English, K is between 10 and 100, β is between 0.4 and 0.6. n V(n) http://en.wikipedia.org/wiki/Heaps%27_law
Heap’s law (cont’d): Heap’s law (cont’d) Related to Zipf’s law: generative models
Zipf’s and Heap’s law coefficients change with language Alexander Gelbukh, Grigori Sidorov. Zipf and Heaps Laws’ Coefficients Depend on Language. Proc. CICLing-2001, Conference on Intelligent Text Processing and Computational Linguistics, February 18–24, 2001, Mexico City. Lecture Notes in Computer Science N 2004, ISSN 0302-9743, ISBN 3-540-41687-0, Springer-Verlag, pp. 332–335.
Indexing: Indexing
Methods: Methods Manual: e.g., Library of Congress subject headings, MeSH
Automatic
LOC subject headings: LOC subject headings http://www.loc.gov/catdir/cpso/lcco/lcco.html A -- GENERAL WORKS B -- PHILOSOPHY. PSYCHOLOGY. RELIGION C -- AUXILIARY SCIENCES OF HISTORY D -- HISTORY (GENERAL) AND HISTORY OF EUROPE E -- HISTORY: AMERICA F -- HISTORY: AMERICA G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION H -- SOCIAL SCIENCES J -- POLITICAL SCIENCE K -- LAW L -- EDUCATION M -- MUSIC AND BOOKS ON MUSIC N -- FINE ARTS P -- LANGUAGE AND LITERATURE Q -- SCIENCE R -- MEDICINE S -- AGRICULTURE T -- TECHNOLOGY U -- MILITARY SCIENCE V -- NAVAL SCIENCE Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)
Medicine: Medicine CLASS R - MEDICINE
Subclass R
R5-920 Medicine (General)
R5-130.5 General works
R131-687 History of medicine. Medical expeditions
R690-697 Medicine as a profession. Physicians
R702-703 Medicine and the humanities. Medicine and disease in relation to history, literature, etc.
R711-713.97 Directories
R722-722.32 Missionary medicine. Medical missionaries
R723-726 Medical philosophy. Medical ethics
R726.5-726.8 Medicine and disease in relation to psychology. Terminal care. Dying
R727-727.5 Medical personnel and the public. Physician and the public
R728-733 Practice of medicine. Medical practice economics
R735-854 Medical education. Medical schools. Research
R855-855.5 Medical technology
R856-857 Biomedical engineering. Electronics. Instrumentation
R858-859.7 Computer applications to medicine. Medical informatics
R864 Medical records
R895-920 Medical physics. Medical radiology. Nuclear medicine
Finding the most frequent terms in a document: Finding the most frequent terms in a document Typically stop words: the, and, in
Not content-bearing
Terms vs. words
Luhn’s method
Luhn’s method: Luhn’s method WORDS FREQUENCY E
Computing term salience: Computing term salience Term frequency (IDF)
Document frequency (DF)
Inverse document frequency (IDF)
Applications of TFIDF: Applications of TFIDF Cosine similarity
Indexing
Clustering
Vector-based matching: Vector-based matching The cosine measure sim (D,C) = (dk . ck . idf(k)) (dk)2 . (ck)2 k S S S k k
IDF: Inverse document frequency: IDF: Inverse document frequency N: number of documents dk: number of documents containing term k fik: absolute frequency of term k in document i wik: weight of term k in document i idfk = log2(N/dk) + 1 = log2N - log2dk + 1 TF * IDF is used for automated indexing and for topic discrimination:
Asian and European news: Asian and European news 622.941 deng
306.835 china
196.725 beijing
153.608 chinese
152.113 xiaoping
124.591 jiang
108.777 communist
102.894 body
85.173 party
71.898 died
68.820 leader
43.402 state
38.166 people 97.487 nato
92.151 albright
74.652 belgrade
46.657 enlargement
34.778 alliance
34.778 french
33.803 opposition
32.571 russia
14.095 government
9.389 told
9.154 would
8.459 their
6.059 which
Other topics: Other topics 120.385 shuttle
99.487 space
90.128 telescope
70.224 hubble
59.992 rocket
50.160 astronauts
49.722 discovery
47.782 canaveral
47.782 cape
40.889 mission
35.778 florida
27.063 center 74.652 compuserve
65.321 massey
55.989 salizzoni
29.996 bob
27.994 online
27.198 executive
15.890 interim
15.271 chief
11.647 service
11.174 second
6.781 world
6.315 president
Compression: Compression
Compression: Compression Methods
Fixed length codes
Huffman coding
Ziv-Lempel codes
Fixed length codes: Fixed length codes Binary representations
ASCII
Representational power (2k symbols where k is the number of bits)
Variable length codes: Variable length codes Alphabet:
A .- N -. 0 -----
B -... O --- 1 .----
C -.-. P .--. 2 ..---
D -.. Q --.- 3 ...—
E . R .-. 4 ....-
F ..-. S ... 5 .....
G --. T - 6 -....
H .... U ..- 7 --...
I .. V ...- 8 ---..
J .--- W .-- 9 ----.
K -.- X -..-
L .-.. Y -.—
M -- Z --..
Demo:
http://www.babbage.demon.co.uk/morse.html
http://www.scphillips.com/morse/
Most frequent letters in English: Most frequent letters in English Most frequent letters:
E T A O I N S H R D L U
http://www.math.cornell.edu/~mec/modules/cryptography/subs/frequencies.html
Demo:
http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm
Also: bigrams:
TH HE IN ER AN RE ND AT ON NT
http://www.math.cornell.edu/~mec/modules/cryptography/subs/digraphs.html
Useful links about cryptography: Useful links about cryptography http://world.std.com/~franl/crypto.html
http://www.faqs.org/faqs/cryptography-faq/
http://en.wikipedia.org/wiki/Cryptography
Huffman coding: Huffman coding Developed by David Huffman (1952)
Average of 5 bits per character (37.5% compression)
Based on frequency distributions of symbols
Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols
Slide80: 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 c b d f g i j h e a
Exercise: Exercise Consider the bit string: 01101101111000100110001110100111000110101101011101
Use the Huffman code from the example to decode it.
Try inserting, deleting, and switching some bits at random locations and try decoding.
Ziv-Lempel coding: Ziv-Lempel coding Two types - one is known as LZ77 (used in GZIP)
Code: set of triples
a: how far back in the decoded text to look for the upcoming text segment
b: how many characters to copy
c: new character to add to complete segment
Slide84: p
pe
pet
peter
peter_
peter_pi
peter_piper
peter_piper_pic
peter_piper_pick
peter_piper_picked
peter_piper_picked_a
peter_piper_picked_a_pe
peter_piper_picked_a_peck_
peter_piper_picked_a_peck_o
peter_piper_picked_a_peck_of
peter_piper_picked_a_peck_of_pickl
peter_piper_picked_a_peck_of_pickled
peter_piper_picked_a_peck_of_pickled_pep
peter_piper_picked_a_peck_of_pickled_pepper
peter_piper_picked_a_peck_of_pickled_peppers
Links on text compression: Links on text compression Data compression:
http://www.data-compression.info/
Calgary corpus:
http://links.uwaterloo.ca/calgary.corpus.html
Huffman coding:
http://www.compressconsult.com/huffman/
http://en.wikipedia.org/wiki/Huffman_coding
LZ
http://en.wikipedia.org/wiki/LZ77
Relevance feedback andquery expansion: Relevance feedback and query expansion
Relevance feedback: Relevance feedback Problem: initial query may not be the most appropriate to satisfy a given information need.
Idea: modify the original query so that it gets closer to the right documents in the vector space
Relevance feedback: Relevance feedback Automatic
Manual
Method: identifying feedback terms
Q’ = a1Q + a2R - a3N
Often a1 = 1, a2 = 1/|R| and a3 = 1/|N|
Example: Example Q = “safety minivans”
D1 = “car safety minivans tests injury statistics” - relevant
D2 = “liability tests safety” - relevant
D3 = “car passengers injury reviews” - non-relevant
R = ?
S = ?
Q’ = ?
Pseudo relevance feedback: Pseudo relevance feedback Automatic query expansion
Thesaurus-based expansion (e.g., using latent semantic indexing – later…)
Distributional similarity
Query log mining
Examples: Examples Book: publication, product, fact, dramatic composition, record
Computer: machine, expert, calculator, reckoner, figurer
Fruit: reproductive structure, consequence, product, bear
Politician: leader, schemer
Newspaper: press, publisher, product, paper, newsprint Distributional clustering: Lexical semantics (Hypernymy): Book: autobiography, essay, biography, memoirs, novels
Computer: adobe, computing, computers, developed, hardware
Fruit: leafy, canned, fruits, flowers, grapes
Politician: activist, campaigner, politicians, intellectuals, journalist
Newspaper: daily, globe, newspapers, newsday, paper
Examples (query logs): Examples (query logs) Book: booksellers, bookmark, blue
Computer: sales, notebook, stores, shop
Fruit: recipes cake salad basket company
Games: online play gameboy free video
Politician: careers federal office history
Newspaper: online website college information
Schools: elementary high ranked yearbook
California: berkeley san francisco southern
French: embassy dictionary learn
Problems with automatic query expansion: Problems with automatic query expansion Adding frequent words may dilute the results (example?)
Stemming: Stemming
Goals: Goals Motivation:
Computer, computers, computerize, computational, computerization
User, users, using, used
Representing related words as one token
Simplify matching
Reduce storage and computation
Also known as: term conflation
Methods: Methods Manual (tables)
Achievement achiev
Achiever achiev
Etc.
Affix removal (Harman 1991, Frakes 1992)
if a word ends in “ies” but not “eies” or “aies” then “ies” “y”
If a word ends in “es” but not “aes”, “ees”, or “oes”, then “es” “e”
If a word ends in “s” but not “us” or “ss” then “s” NULL
(apply only the first applicable rule)
Porter’s algorithm (Porter 1980): Porter’s algorithm (Porter 1980) Home page:
http://www.tartarus.org/~martin/PorterStemmer
Reading assignment:
http://www.tartarus.org/~martin/PorterStemmer/def.txt
Consonant-vowel sequences:
CVCV ... C
CVCV ... V
VCVC ... C
VCVC ... V
Shorthand: [C]VCVC ... [V]
Porter’s algorithm (cont’d): Porter’s algorithm (cont’d) [C](VC){m}[V]
{m} indicates repetition
Examples:
m=0 TR, EE, TREE, Y, BY
m=1 TROUBLE, OATS, TREES, IVY
m=2 TROUBLES, PRIVATE, OATEN
Conditions:
*S - the stem ends with S (and similarly for the other letters).
*v* - the stem contains a vowel.
*d - the stem ends with a double consonant (e.g. -TT, -SS).
*o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).
Slide99: Step 1a
SSES -> SS caresses -> caress
IES -> I ponies -> poni ties -> ti
SS -> SS caress -> caress
S -> cats -> cat
Step 1b
(m>0) EED -> EE feed -> feed agreed -> agree
(*v*) ED -> plastered -> plaster bled -> bled
(*v*) ING -> motoring -> motor sing -> sing
Step 1b1
If the second or third of the rules in Step 1b is successful,
the following is done:
AT -> ATE conflat(ed) -> conflate
BL -> BLE troubl(ed) -> trouble
IZ -> IZE siz(ed) -> size (*d and not (*L or *S or *Z)) -> single letter
hopp(ing) -> hop
tann(ed) -> tan
fall(ing) -> fall
hiss(ing) -> hiss
fizz(ed) -> fizz
(m=1 and *o) -> E fail(ing) -> fail fil(ing) -> file
Slide100: Step 1c
(*v*) Y -> I happy -> happi sky -> sky
Step 2
(m>0) ATIONAL -> ATE relational -> relate
(m>0) TIONAL -> TION conditional -> condition rational -> rational
(m>0) ENCI -> ENCE valenci -> valence
(m>0) ANCI -> ANCE hesitanci -> hesitance
(m>0) IZER -> IZE digitizer -> digitize
(m>0) ABLI -> ABLE conformabli -> conformable
(m>0) ALLI -> AL radicalli -> radical
(m>0) ENTLI -> ENT differentli -> different
(m>0) ELI -> E vileli - > vile
(m>0) OUSLI -> OUS analogousli -> analogous
(m>0) IZATION -> IZE vietnamization -> vietnamize
(m>0) ATION -> ATE predication -> predicate
(m>0) ATOR -> ATE operator -> operate
(m>0) ALISM -> AL feudalism -> feudal
(m>0) IVENESS -> IVE decisiveness -> decisive
(m>0) FULNESS -> FUL hopefulness -> hopeful
(m>0) OUSNESS -> OUS callousness -> callous
(m>0) ALITI -> AL formaliti -> formal
(m>0) IVITI -> IVE sensitiviti -> sensitive
(m>0) BILITI -> BLE sensibiliti -> sensible
Slide101: Step 3
(m>0) ICATE -> IC triplicate -> triplic
(m>0) ATIVE -> formative -> form
(m>0) ALIZE -> AL formalize -> formal
(m>0) ICITI -> IC electriciti -> electric
(m>0) ICAL -> IC electrical -> electric
(m>0) FUL -> hopeful -> hope
(m>0) NESS -> goodness -> good
Step 4
(m>1) AL -> revival -> reviv
(m>1) ANCE -> allowance -> allow
(m>1) ENCE -> inference -> infer
(m>1) ER -> airliner -> airlin
(m>1) IC -> gyroscopic -> gyroscop
(m>1) ABLE -> adjustable -> adjust
(m>1) IBLE -> defensible -> defens
(m>1) ANT -> irritant -> irrit
(m>1) EMENT -> replacement -> replac
(m>1) MENT -> adjustment -> adjust
(m>1) ENT -> dependent -> depend
(m>1 and (*S or *T)) ION -> adoption -> adopt
(m>1) OU -> homologou -> homolog
(m>1) ISM -> communism -> commun
(m>1) ATE -> activate -> activ
(m>1) ITI -> angulariti -> angular
(m>1) OUS -> homologous -> homolog
(m>1) IVE -> effective -> effect
(m>1) IZE -> bowdlerize -> bowdler
Slide102: Step 5a
(m>1) E -> probate -> probat rate -> rate
(m=1 and not *o) E -> cease -> ceas
Step 5b
(m > 1 and *d and *L) -> single letter controll -> control roll -> roll
Porter’s algorithm (cont’d): Porter’s algorithm (cont’d) Example: the word “duplicatable” duplicat rule 4 duplicate rule 1b1 duplic rule 3 The application of another rule in step 4, removing “ic,” cannot be applied since one rule from each step is allowed to be applied. % cd /clair4/class/ir-w03/tf-idf
% ./stem.pl computers
computers comput
Porter’s algorithm: Porter’s algorithm
Stemming: Stemming Not always appropriate (e.g., proper names, titles)
The same applies to casing (e.g., CAT vs. cat)
String matching: String matching
String matching methods: String matching methods Index-based
Full or approximate
E.g., theater = theatre
Index-based matching: Index-based matching Inverted files
Position-based inverted files
Block-based inverted files 1 6 9 11 1719 24 28 33 40 46 50 55 60
This is a text. A text has many words. Words are made from letters. Text: 11, 19
Words: 33, 40
From: 55
Inverted index (trie): Inverted index (trie) Letters: 60 Text: 11, 19 Words: 33, 40 Made: 50 Many: 28 l m t w a d n
Sequential searching: Sequential searching No indexing structure given
Given: database d and search pattern p.
Example: find “words” in the earlier example
Brute force method
try all possible starting positions
O(n) positions in the database and O(m) characters in the pattern so the total worst-case runtime is O(mn)
Typical runtime is actually O(n) given that mismatches are easy to notice
Knuth-Morris-Pratt: Knuth-Morris-Pratt Average runtime similar to BF
Worst case runtime is linear: O(n)
Idea: reuse knowledge
Need preprocessing of the pattern
Knuth-Morris-Pratt (cont’d): Knuth-Morris-Pratt (cont’d) Example (http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm)
database: ABC ABC ABC ABDAB ABCDABCDABDE
pattern: ABCDABD index 0 1 2 3 4 5 6 7
char A B C D A B D –
pos -1 0 0 0 0 1 2 0 1234567
ABCDABD
ABCDABD
Knuth-Morris-Pratt (cont’d): Knuth-Morris-Pratt (cont’d) ABC ABC ABC ABDAB ABCDABCDABDE
ABCDABD
^
ABC ABC ABC ABDAB ABCDABCDABDE
ABCDABD
^
ABC ABC ABC ABDAB ABCDABCDABDE
ABCDABD
^
ABC ABC ABC ABDAB ABCDABCDABDE
ABCDABD
^
ABC ABC ABC ABDAB ABCDABCDABDE
ABCDABD
^
Boyer-Moore: Boyer-Moore Used in text editors
Demos
http://www-sr.informatik.uni-tuebingen.de/~buehler/BM/BM.html
http://www.blarg.com/~doyle/pages/bmi.html
Word similarity: Word similarity Hamming distance - when words are of the same length
Levenshtein distance - number of edits (insertions, deletions, replacements)
color --> colour (1)
survey --> surgery (2)
com puter --> computer ?
Longest common subsequence (LCS)
lcs (survey, surgery) = surey
Levenshtein edit distance: Levenshtein edit distance Examples:
Theatre-> theater
Ghaddafi->Qadafi
Computer->counter
Edit distance (inserts, deletes, substitutions)
Edit transcript
Done through dynamic programming
Recurrence relation: Recurrence relation Three dependencies
D(i,0)=i
D(0,j)=j
D(i,j)=min[D(i-1,j)+1,D(1,j-1)+1,D(i-1,j-1)+t(i,j)]
Simple edit distance:
t(i,j) = 0 iff S1(i)=S2(j)
Example: Example Gusfield 1997
Example (cont’d): Example (cont’d) Gusfield 1997
Tracebacks: Tracebacks Gusfield 1997
Weighted edit distance: Weighted edit distance Used to emphasize the relative cost of different edit operations
Useful in bioinformatics
Homology information
BLAST
Blosum
http://eta.embl-heidelberg.de:8000/misc/mat/blosum50.html
Slide122: Web sites:
http://www.merriampark.com/ld.htm
http://odur.let.rug.nl/~kleiweg/lev/
Clustering: Clustering
Clustering: Clustering Exclusive/overlapping clusters
Hierarchical/flat clusters
The cluster hypothesis
Documents in the same cluster are relevant to the same query
Representations for document clustering: Representations for document clustering Typically: vector-based
Words: “cat”, “dog”, etc.
Features: document length, author name, etc.
Each document is represented as a vector in an n-dimensional space
Similar documents appear nearby in the vector space (distance measures are needed)
Hierarchical clusteringDendrograms: Hierarchical clustering Dendrograms http://odur.let.rug.nl/~kleiweg/clustering/clustering.html E.g., language similarity:
Another example: Another example Kingdom = animal
Phylum = Chordata
Subphylum = Vertebrata
Class = Osteichthyes
Subclass = Actinoptergyii
Order = Salmoniformes
Family = Salmonidae
Genus = Oncorhynchus
Species = Oncorhynchus kisutch (Coho salmon)
Clustering using dendrograms: Clustering using dendrograms REPEAT
Compute pairwise similarities
Identify closest pair
Merge pair into single node
UNTIL only one node left
Q: what is the equivalent Venn diagram representation? Example: cluster the following sentences:
A B C B A
A D C C A D E
C D E F C D A
E F G F D A
A C D A B A
Methods: Methods Single-linkage
One common pair is sufficient
disadvantages: long chains
Complete-linkage
All pairs have to match
Disadvantages: too conservative
Average-linkage
Centroid-based (online)
Look at distances to centroids
Demo:
/clair4/class/ir-w05/clustering
k-means: k-means Needed: small number k of desired clusters
hard vs. soft decisions
Example: Weka
k-means: k-means 1 initialize cluster centroids to arbitrary vectors
2 while further improvement is possible do
3 for each document d do
4 find the cluster c whose centroid is closest to d
5 assign d to cluster c
6 end for
7 for each cluster c do
8 recompute the centroid of cluster c based on its documents
9 end for
10 end while
Example: Example Cluster the following vectors into two groups:
A =
B =
C =
D =
E =
F =
Complexity: Complexity Complexity = O(kn) because at each step, n documents have to be compared to k centroids.
Weka: Weka A general environment for machine learning (e.g. for classification and clustering)
Book by Witten and Frank
www.cs.waikato.ac.nz/ml/weka
Demos: Demos http://vivisimo.com/
http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html
http://cgm.cs.mcgill.ca/~godfried/student_projects/bonnef_k-means
http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster
http://www.cc.gatech.edu/~dellaert/html/software.html
http://www-2.cs.cmu.edu/~awm/tutorials/kmeans11.pdf
http://www.ece.neu.edu/groups/rpl/projects/kmeans/
Human clustering: Human clustering Significant disagreement in the number of clusters, overlap of clusters, and the composition of clusters (Maczkassy et al. 1998).
Lexical networks: Lexical networks
Lexical Networks: Lexical Networks Used to represent relationships between words
Example: WordNet - created by George Miller’s team at Princeton
Based on synsets (synonyms, interchangeable words) and lexical matrices
Lexical matrix: Lexical matrix
Synsets: Synsets Disambiguation
{board, plank}
{board, committee}
Synonyms
substitution
weak substitution
synonyms must be of the same part of speech
Slide141: $ ./wn board -hypen
Synonyms/Hypernyms (Ordered by Frequency) of noun board
9 senses of board
Sense 1
board
=> committee, commission
=> administrative unit
=> unit, social unit
=> organization, organisation
=> social group
=> group, grouping
Sense 2
board
=> sheet, flat solid
=> artifact, artefact
=> object, physical object
=> entity, something
Sense 3
board, plank
=> lumber, timber
=> building material
=> artifact, artefact
=> object, physical object
=> entity, something
Slide142: Sense 4
display panel, display board, board
=> display
=> electronic device
=> device
=> instrumentality, instrumentation
=> artifact, artefact
=> object, physical object
=> entity, something
Sense 5
board, gameboard
=> surface
=> artifact, artefact
=> object, physical object
=> entity, something
Sense 6
board, table
=> fare
=> food, nutrient
=> substance, matter
=> object, physical object
=> entity, something
Slide143: Sense 7
control panel, instrument panel, control board, board, panel
=> electrical device
=> device
=> instrumentality, instrumentation
=> artifact, artefact
=> object, physical object
=> entity, something
Sense 8
circuit board, circuit card, board, card
=> printed circuit
=> computer circuit
=> circuit, electrical circuit, electric circuit
=> electrical device
=> device
=> instrumentality, instrumentation
=> artifact, artefact
=> object, physical object
=> entity, something
Sense 9
dining table, board
=> table
=> furniture, piece of furniture, article of furniture
=> furnishings
=> instrumentality, instrumentation
=> artifact, artefact
=> object, physical object
=> entity, something
Antonymy: Antonymy “x” vs. “not-x”
“rich” vs. “poor”?
{rise, ascend} vs. {fall, descend}
Other relations: Other relations Meronymy: X is a meronym of Y when native speakers of English accept sentences similar to “X is a part of Y”, “X is a member of Y”.
Hyponymy: {tree} is a hyponym of {plant}.
Hierarchical structure based on hyponymy (and hypernymy).
Other features of WordNet: Other features of WordNet Index of familiarity
Polysemy
Familiarity and polysemy: board used as a noun is familiar (polysemy count = 9)
bird used as a noun is common (polysemy count = 5)
cat used as a noun is common (polysemy count = 7)
house used as a noun is familiar (polysemy count = 11)
information used as a noun is common (polysemy count = 5)
retrieval used as a noun is uncommon (polysemy count = 3)
serendipity used as a noun is very rare (polysemy count = 1) Familiarity and polysemy
Compound nouns: Compound nouns advisory board
appeals board
backboard
backgammon board
baseboard
basketball backboard
big board
billboard
binder's board
binder board blackboard
board game
board measure
board meeting
board member
board of appeals
board of directors
board of education
board of regents
board of trustees
Overview of senses: Overview of senses 1. board -- (a committee having supervisory powers; "the board has seven members")
2. board -- (a flat piece of material designed for a special purpose; "he nailed boards across the windows")
3. board, plank -- (a stout length of sawn timber; made in a wide variety of sizes and used for many purposes)
4. display panel, display board, board -- (a board on which information can be displayed to public view)
5. board, gameboard -- (a flat portable surface (usually rectangular) designed for board games; "he got out the board and set up the pieces")
6. board, table -- (food or meals in general; "she sets a fine table"; "room and board")
7. control panel, instrument panel, control board, board, panel -- (an insulated panel containing switches and dials and meters for controlling electrical devices; "he checked the instrument panel"; "suddenly the board lit up like a Christmas tree")
8. circuit board, circuit card, board, card -- (a printed circuit that can be inserted into expansion slots in a computer to increase the computer's capabilities)
9. dining table, board -- (a table at which meals are served; "he helped her clear the dining table"; "a feast was spread upon the board")
Top-level concepts: Top-level concepts {act, action, activity}
{animal, fauna}
{artifact}
{attribute, property}
{body, corpus}
{cognition, knowledge}
{communication}
{event, happening}
{feeling, emotion}
{food}
{group, collection}
{location, place}
{motive} {natural object}
{natural phenomenon}
{person, human being}
{plant, flora}
{possession}
{process}
{quantity, amount}
{relation}
{shape}
{state, condition}
{substance}
{time}
WordNet and DistSim: WordNet and DistSim wn reason -hypen - hypernyms
wn reason -synsn - synsets
wn reason -simsn - synonyms
wn reason -over - overview of senses
wn reason -famln - familiarity/polysemy
wn reason -grepn - compound nouns
/data2/tools/relatedwords/relate reason
System comparison: System comparison
Comparing two systems: Comparing two systems Comparing A and B
One query?
Average performance?
Need: A to consistently outperform B
[this slide: courtesy James Allan]
The sign test: The sign test Example 1:
A > B (12 times)
A = B (25 times)
A B (18 times)
A < B (9 times)
p < 0.122 (not significant at the 5% level) [this slide: courtesy James Allan]
Other tests: Other tests The t test:
Takes into account the actual performances, not just which system is better
http://nimitz.mcs.kent.edu/~blewis/stat/tTest.html
The sign test:
http://www.fon.hum.uva.nl/Service/Statistics/Sign_Test.html
Techniques for dimensionalityreduction: SVD and LSI: Techniques for dimensionality reduction: SVD and LSI
Techniques for dimensionality reduction: Techniques for dimensionality reduction Based on matrix decomposition (goal: preserve clusters, explain away variance)
A quick review of matrices
Vectors
Matrices
Matrix multiplication
SVD: Singular Value Decomposition: SVD: Singular Value Decomposition A=USVT
This decomposition exists for all matrices, dense or sparse
If A has 5 columns and 3 rows, then U will be 5x5 and V will be 3x3
In Matlab, use [U,S,V] = svd (A)
Term matrix normalization: Term matrix normalization D1 D2 D3 D4 D5 D1 D2 D3 D4 D5
Example (Berry and Browne): Example (Berry and Browne) T1: baby
T2: child
T3: guide
T4: health
T5: home
T6: infant
T7: proofing
T8: safety
T9: toddler D1: infant & toddler first aid
D2: babies & children’s room (for your home)
D3: child safety at home
D4: your baby’s health and safety: from infant to toddler
D5: baby proofing basics
D6: your guide to easy rust proofing
D7: beanie babies collector’s guide
Document term matrix: Document term matrix
Decomposition: Decomposition u =
-0.6976 -0.0945 0.0174 -0.6950 0.0000 0.0153 0.1442 -0.0000 0
-0.2622 0.2946 0.4693 0.1968 -0.0000 -0.2467 -0.1571 -0.6356 0.3098
-0.3519 -0.4495 -0.1026 0.4014 0.7071 -0.0065 -0.0493 -0.0000 0.0000
-0.1127 0.1416 -0.1478 -0.0734 0.0000 0.4842 -0.8400 0.0000 -0.0000
-0.2622 0.2946 0.4693 0.1968 0.0000 -0.2467 -0.1571 0.6356 -0.3098
-0.1883 0.3756 -0.5035 0.1273 -0.0000 -0.2293 0.0339 -0.3098 -0.6356
-0.3519 -0.4495 -0.1026 0.4014 -0.7071 -0.0065 -0.0493 0.0000 -0.0000
-0.2112 0.3334 0.0962 0.2819 -0.0000 0.7338 0.4659 -0.0000 0.0000
-0.1883 0.3756 -0.5035 0.1273 -0.0000 -0.2293 0.0339 0.3098 0.6356
v =
-0.1687 0.4192 -0.5986 0.2261 0 -0.5720 0.2433
-0.4472 0.2255 0.4641 -0.2187 0.0000 -0.4871 -0.4987
-0.2692 0.4206 0.5024 0.4900 -0.0000 0.2450 0.4451
-0.3970 0.4003 -0.3923 -0.1305 0 0.6124 -0.3690
-0.4702 -0.3037 -0.0507 -0.2607 -0.7071 0.0110 0.3407
-0.3153 -0.5018 -0.1220 0.7128 -0.0000 -0.0162 -0.3544
-0.4702 -0.3037 -0.0507 -0.2607 0.7071 0.0110 0.3407
Decomposition: Decomposition s =
1.5849 0 0 0 0 0 0
0 1.2721 0 0 0 0 0
0 0 1.1946 0 0 0 0
0 0 0 0.7996 0 0 0
0 0 0 0 0.7100 0 0
0 0 0 0 0 0.5692 0
0 0 0 0 0 0 0.1977
0 0 0 0 0 0 0
0 0 0 0 0 0 0 Spread on the v1 axis
Rank-4 approximation: Rank-4 approximation s4 =
1.5849 0 0 0 0 0 0
0 1.2721 0 0 0 0 0
0 0 1.1946 0 0 0 0
0 0 0 0.7996 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
Rank-4 approximation: Rank-4 approximation u*s4*v'
-0.0019 0.5985 -0.0148 0.4552 0.7002 0.0102 0.7002
-0.0728 0.4961 0.6282 0.0745 0.0121 -0.0133 0.0121
0.0003 -0.0067 0.0052 -0.0013 0.3584 0.7065 0.3584
0.1980 0.0514 0.0064 0.2199 0.0535 -0.0544 0.0535
-0.0728 0.4961 0.6282 0.0745 0.0121 -0.0133 0.0121
0.6337 -0.0602 0.0290 0.5324 -0.0008 0.0003 -0.0008
0.0003 -0.0067 0.0052 -0.0013 0.3584 0.7065 0.3584
0.2165 0.2494 0.4367 0.2282 -0.0360 0.0394 -0.0360
0.6337 -0.0602 0.0290 0.5324 -0.0008 0.0003 -0.0008
Rank-4 approximation: Rank-4 approximation u*s4
-1.1056 -0.1203 0.0207 -0.5558 0 0 0
-0.4155 0.3748 0.5606 0.1573 0 0 0
-0.5576 -0.5719 -0.1226 0.3210 0 0 0
-0.1786 0.1801 -0.1765 -0.0587 0 0 0
-0.4155 0.3748 0.5606 0.1573 0 0 0
-0.2984 0.4778 -0.6015 0.1018 0 0 0
-0.5576 -0.5719 -0.1226 0.3210 0 0 0
-0.3348 0.4241 0.1149 0.2255 0 0 0
-0.2984 0.4778 -0.6015 0.1018 0 0 0
Rank-4 approximation: Rank-4 approximation s4*v'
-0.2674 -0.7087 -0.4266 -0.6292 -0.7451 -0.4996 -0.7451
0.5333 0.2869 0.5351 0.5092 -0.3863 -0.6384 -0.3863
-0.7150 0.5544 0.6001 -0.4686 -0.0605 -0.1457 -0.0605
0.1808 -0.1749 0.3918 -0.1043 -0.2085 0.5700 -0.2085
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
Rank-2 approximation: Rank-2 approximation s2 =
1.5849 0 0 0 0 0 0
0 1.2721 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
Rank-2 approximation: Rank-2 approximation u*s2*v'
0.1361 0.4673 0.2470 0.3908 0.5563 0.4089 0.5563
0.2272 0.2703 0.2695 0.3150 0.0815 -0.0571 0.0815
-0.1457 0.1204 -0.0904 -0.0075 0.4358 0.4628 0.4358
0.1057 0.1205 0.1239 0.1430 0.0293 -0.0341 0.0293
0.2272 0.2703 0.2695 0.3150 0.0815 -0.0571 0.0815
0.2507 0.2412 0.2813 0.3097 -0.0048 -0.1457 -0.0048
-0.1457 0.1204 -0.0904 -0.0075 0.4358 0.4628 0.4358
0.2343 0.2454 0.2685 0.3027 0.0286 -0.1073 0.0286
0.2507 0.2412 0.2813 0.3097 -0.0048 -0.1457 -0.0048
Rank-2 approximation: Rank-2 approximation u*s2
-1.1056 -0.1203 0 0 0 0 0
-0.4155 0.3748 0 0 0 0 0
-0.5576 -0.5719 0 0 0 0 0
-0.1786 0.1801 0 0 0 0 0
-0.4155 0.3748 0 0 0 0 0
-0.2984 0.4778 0 0 0 0 0
-0.5576 -0.5719 0 0 0 0 0
-0.3348 0.4241 0 0 0 0 0
-0.2984 0.4778 0 0 0 0 0
Rank-2 approximation: Rank-2 approximation s2*v'
-0.2674 -0.7087 -0.4266 -0.6292 -0.7451 -0.4996 -0.7451
0.5333 0.2869 0.5351 0.5092 -0.3863 -0.6384 -0.3863
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
Documents to concepts and terms to concepts: Documents to concepts and terms to concepts A(:,1)'*u*s
-0.4238 0.6784 -0.8541 0.1446 -0.0000 -0.1853 0.0095
>> A(:,1)'*u*s4
-0.4238 0.6784 -0.8541 0.1446 0 0 0
>> A(:,1)'*u*s2
-0.4238 0.6784 0 0 0 0 0
>> A(:,2)'*u*s2
-1.1233 0.3650 0 0 0 0 0
>> A(:,3)'*u*s2
-0.6762 0.6807 0 0 0 0 0
Documents to concepts and terms to concepts: Documents to concepts and terms to concepts >> A(:,4)'*u*s2
-0.9972 0.6478 0 0 0 0 0
>> A(:,5)'*u*s2
-1.1809 -0.4914 0 0 0 0 0
>> A(:,6)'*u*s2
-0.7918 -0.8121 0 0 0 0 0
>> A(:,7)'*u*s2
-1.1809 -0.4914 0 0 0 0 0
Cont’d: Cont’d >> (s2*v'*A(1,:)')'
-1.7523 -0.1530 0 0 0 0 0 0 0
>> (s2*v'*A(2,:)')'
-0.6585 0.4768 0 0 0 0 0 0 0
>> (s2*v'*A(3,:)')'
-0.8838 -0.7275 0 0 0 0 0 0 0
>> (s2*v'*A(4,:)')'
-0.2831 0.2291 0 0 0 0 0 0 0
>> (s2*v'*A(5,:)')'
-0.6585 0.4768 0 0 0 0 0 0 0
Cont’d: Cont’d >> (s2*v'*A(6,:)')'
-0.4730 0.6078 0 0 0 0 0 0 0
>> (s2*v'*A(7,:)')'
-0.8838 -0.7275 0 0 0 0 0 0 0
>> (s2*v'*A(8,:)')'
-0.5306 0.5395 0 0 0 0 0 0 0
>> (s2*v'*A(9,:)')‘
-0.4730 0.6078 0 0 0 0 0 0 0
Properties: Properties A*A'
1.5471 0.3364 0.5041 0.2025 0.3364 0.2025 0.5041 0.2025 0.2025
0.3364 0.6728 0 0 0.6728 0 0 0.3364 0
0.5041 0 1.0082 0 0 0 0.5041 0 0
0.2025 0 0 0.2025 0 0.2025 0 0.2025 0.2025
0.3364 0.6728 0 0 0.6728 0 0 0.3364 0
0.2025 0 0 0.2025 0 0.7066 0 0.2025 0.7066
0.5041 0 0.5041 0 0 0 1.0082 0 0
0.2025 0.3364 0 0.2025 0.3364 0.2025 0 0.5389 0.2025
0.2025 0 0 0.2025 0 0.7066 0 0.2025 0.7066
A'*A
1.0082 0 0 0.6390 0 0 0
0 1.0092 0.6728 0.2610 0.4118 0 0.4118
0 0.6728 1.0092 0.2610 0 0 0
0.6390 0.2610 0.2610 1.0125 0.3195 0 0.3195
0 0.4118 0 0.3195 1.0082 0.5041 0.5041
0 0 0 0 0.5041 1.0082 0.5041
0 0.4118 0 0.3195 0.5041 0.5041 1.0082
A is a document to term matrix. What is A*A’, what is A’*A?
Latent semantic indexing (LSI): Latent semantic indexing (LSI) Dimensionality reduction = identification of hidden (latent) concepts
Query matching in latent space
Useful pointers: Useful pointers http://lsa.colorado.edu
http://lsi.research.telcordia.com/
http://www.cs.utk.edu/~lsi/
http://javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm
http://citeseer.nj.nec.com/deerwester90indexing.html
http://www.pcug.org.au/~jdowling/
Models of the Web: Models of the Web
Size: Size The Web is the largest repository of data and it grows exponentially.
320 Million Web pages [Lawrence & Giles 1998]
800 Million Web pages, 15 TB [Lawrence & Giles 1999]
8 Billion Web pages indexed [Google 2005]
Amount of data
roughly 200 TB [Lyman et al. 2003]
Bow-tie model of the Web: Bow-tie model of the Web SCC 56 M OUT 44 M IN 44 M Bröder & al. WWW 2000, Dill & al. VLDB 2001 DISC 17 M TEND 44M 24% of pages reachable from a given page
Power laws: Power laws Web site size (Huberman and Adamic 1999)
Power-law connectivity (Barabasi and Albert 1999): exponents 2.45 for out-degree and 2.1 for the in-degree
Others: call graphs among telephone carriers, citation networks (Redner 1998), e.g., Erdos, collaboration graph of actors, metabolic pathways (Jeong et al. 2000), protein networks (Maslov and Sneppen 2002). All values of gamma are around 2-3.
Small-world networks: Small-world networks Diameter = average length of the shortest path between all pairs of nodes. Example…
Milgram experiment (1967)
Kansas/Omaha --> Boston (42/160 letters)
diameter = 6
Albert et al. 1999 – average distance between two verstices is d = 0.35 + 2.06 log10n. For n = 109, d=18.89.
Six degrees of separation
Clustering coefficient: Clustering coefficient Cliquishness (c): between the kv (kv – 1)/2 pairs of neighbors.
Examples:
Models of the Web: Models of the Web Erdös/Rényi 59, 60 Barabási/Albert 99 Watts/Strogatz 98 Kleinberg 98 Menczer 02 Radev 03 Evolving networks: fundamental object of statistical physics, social networks, mathematical biology, and epidemiology
Self-triggerability across hyperlinks: Self-triggerability across hyperlinks Document closures for information retrieval
Self-triggerability [Mosteller&Wallace 84] Poisson distribution
Two-Poisson [Bookstein&Swanson 74]
Negative Binomial, K-mixture [Church&Gale 95]
Triggerability across hyperlinks?
Evolving Word-based Web: Evolving Word-based Web Observations:
Links are made based on topics
Topics are expressed with words
Words are distributed very unevenly (Zipf, Benford, self-triggerability laws)
Model
Pick n
Generate n lengths according to a power-law distribution
Generate n documents using a trigram model Model (cont’d)
Pick words in decreasing order of r.
Generate hyperlinks with random directionality
Outcome
Generates power-law degree distributions
Generates topical communities
Natural variation of PageRank: LexRank
Social network analysis for IR: Social network analysis for IR
Social networks: Social networks Induced by a relation
Symmetric or not
Examples:
Friendship networks
Board membership
Citations
Power grid of the US
WWW
Slide190: Krebs 2004
Prestige and centrality: Prestige and centrality Degree centrality: how many neighbors each node has.
Closeness centrality: how close a node is to all of the other nodes
Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes
Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects.
Prestige = same as centrality but for directed graphs.
Graph-based representations: Graph-based representations Square connectivity (incidence) matrix Graph G (V,E)
Markov chains: Markov chains A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel E.
Path = sequence (x0, x1, …, xn). Xi = xi-1*E
The probability of a path can be computed as a product of probabilities for each step i.
Random walk = find Xj given x0, E, and j.
Stationary solutions: Stationary solutions The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions:
E is stochastic
E is irreducible
E is aperiodic
To make these conditions true:
All rows of E add up to 1 (and no value is negative)
Make sure that E is strongly connected
Make sure that E is not bipartite
Example: PageRank [Brin and Page 1998]: use “teleportation”
Slide195: Example This graph E has a second graph E’ (not drawn) superimposed on it: E’ is the uniform transition graph.
Eigenvectors: Eigenvectors An eigenvector is an implicit “direction” for a matrix.
Mv = λv, where v is non-zero, though λ can be any complex number in principle.
The largest eigenvalue of a stochastic matrix E is real: λ1 = 1.
For λ1, the left (principal) eigenvector is p, the right eigenvector = 1
In other words, ETp = p.
Computing the stationary distribution: Computing the stationary distribution function PowerStatDist (E):
begin
p(0) = u; (or p(0) = [1,0,…0])
i=1;
repeat
p(i) = ETp(i-1)
L = ||p(i)-p(i-1)||1;
i = i + 1;
until L <
return p(i)
end Solution for the stationary distribution
Slide198: Example
How Google works: How Google works Crawling
Anchor text
Fast query processing
Pagerank
More about PageRank: More about PageRank Named after Larry Page, founder of Google (and UM alum)
Reading “The anatomy of a large-scale hypertextual web search engine” by Brin and Page.
Independent of query (although more recent work by Haveliwala (WWW 2002) has also identified topic-based PageRank.
HITS: HITS Query-dependent model (Kleinberg 97)
Hubs and authorities (e.g., cars, Honda)
Algorithm
obtain root set using input query
expanded the root set by radius one
run iterations on the hub and authority scores together
report top-ranking authorities and hubs
The link-content hypothesis: The link-content hypothesis Topical locality: page is similar () to the page that points to it ().
Davison (TF*IDF, 100K pages)
0.31 same domain
0.23 linked pages
0.19 sibling
0.02 random
Menczer (373K pages, non-linear least squares fit)
Chakrabarti (focused crawling) - prob. of losing the topic Van Rijsbergen 1979, Chakrabarti & al. WWW 1999, Davison SIGIR 2000, Menczer 2001 1=1.8, 2=0.6,
Measuring the Web: Measuring the Web
Bharat and Broder 1998: Bharat and Broder 1998 Based on crawls of HotBot, Altavista, Excite, and InfoSeek
10,000 queries in mid and late 1997
Estimate is 200M pages
Only 1.4% are indexed by all of them
Example (from Bharat&Broder): Example (from Bharat&Broder) A similar approach by Lawrence and Giles yields 320M pages
(Lawrence and Giles 1998).
Crawling the web: Crawling the web
Basic principles: Basic principles The HTTP/HTML protocols
Following hyperlinks
Some problems:
Link extraction
Link normalization
Robot exclusion
Loops
Spider traps
Server overload
Example: Example U-M’s root robots.txt file:
http://www.umich.edu/robots.txt
User-agent: *
Disallow: /~websvcs/projects/
Disallow: /%7Ewebsvcs/projects/
Disallow: /~homepage/
Disallow: /%7Ehomepage/
Disallow: /~smartgl/
Disallow: /%7Esmartgl/
Disallow: /~gateway/
Disallow: /%7Egateway/
Example crawler: Example crawler E.g., poacher
http://search.cpan.org/~neilb/Robot-0.011/examples/poacher
/data0/projects/perltree-index
Slide210:
&ParseCommandLine();
&Initialise();
$robot->run($siteRoot)
#=======================================================================
# Initialise() - initialise global variables, contents, tables, etc
# This function sets up various global variables such as the version number
# for WebAssay, the program name identifier, usage statement, etc.
#=======================================================================
sub Initialise
{
$robot = new WWW::Robot(
'NAME' => $BOTNAME,
'VERSION' => $VERSION,
'EMAIL' => $EMAIL,
'TRAVERSAL' => $TRAVERSAL,
'VERBOSE' => $VERBOSE,
);
$robot->addHook('follow-url-test', \&follow_url_test);
$robot->addHook('invoke-on-contents', \&process_contents);
$robot->addHook('invoke-on-get-error', \&process_get_error);
}
#=======================================================================
# follow_url_test() - tell the robot module whether is should follow link
#=======================================================================
sub follow_url_test {}
#=======================================================================
# process_get_error() - hook function invoked whenever a GET fails
#=======================================================================
sub process_get_error {}
#=======================================================================
# process_contents() - process the contents of a URL we've retrieved
#=======================================================================
sub process_contents
{
run_command($COMMAND, $filename) if defined $COMMAND;
}
Focused crawling: Focused crawling Topical locality
Pages that are linked are similar in content (and vice-versa: Davison 00, Menczer 02, 04, Radev et al. 04)
The radius-1 hypothesis
given that page i is relevant to a query and that page i points to page j, then page j is also likely to be relevant (at least, more so than a random web page)
Focused crawling
Keeping a priority queue of the most relevant pages
Question answering: Question answering
People ask questions: People ask questions Excite corpus of 2,477,283 queries (one day’s worth)
8.4% of them are questions
43.9% factual (what is the country code for Belgium)
56.1% procedural (how do I set up TCP/IP) or other
In other words, 100 K questions per day
People ask questions: People ask questions In what year did baseball become an offical sport? Who is the largest man in the world? Where can i get information on Raphael? where can i find information on puritan religion? Where can I find how much my house is worth? how do i get out of debt? Where can I found out how to pass a drug test? When is the Super Bowl? who is California's District State Senator? where can I buy extra nibs for a foutain pen? how do i set up tcp/ip ? what time is it in west samoa? Where can I buy a little kitty cat? what are the symptoms of attention deficit disorder? Where can I get some information on Michael Jordan? How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero? When did the Neanderthal man live? Which Frenchman declined the Nobel Prize for Literature for ideological reasons? What is the largest city in Northern Afghanistan? How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero? When did the Neanderthal man live?
Question answering: Question answering What is the largest city in Northern Afghanistan?
Possible approaches: Possible approaches Map?
Knowledge base Find x: city (x) located (x,”Northern Afghanistan”) ¬exists (y): city (y) located (y,”Northern Afghanistan”) greaterthan (population (y), population (x))
Database?
World factbook?
Search engine?
The TREC Q&A evaluation: The TREC Q&A evaluation Run by NIST [Voorhees and Tice 2000]
2GB of input
200 questions
Essentially fact extraction
Who was Lincoln’s secretary of state?
What does the Peugeot company manufacture?
Questions are based on text
Answers are assumed to be present
No inference needed
Question answering: Q: When did Nelson Mandela become president of South Africa?
A: 10 May 1994
Q: How tall is the Matterhorn?
A: The institute revised the Matterhorn 's height to 14,776 feet 9 inches
Q: How tall is the replica of the Matterhorn at Disneyland?
A: In fact he has climbed the 147-foot Matterhorn at Disneyland every week end for the last 3 1/2 years
Q: If Iraq attacks a neighboring country, what should the US do?
A: ?? Question answering
NSIR: NSIR Current project at U-M
http://tangra.si.umich.edu/clair/NSIR/html/nsir.cgi
Reading:
[Radev et al., 2005a]
Dragomir R. Radev, Weiguo Fan, Hong Qi, Harris Wu, and Amardeep Grewal. Probabilistic question answering on the web. Journal of the American Society for Information Science and Technology 56(3), March 2005
http://tangra.si.umich.edu/~radev/bib2html/radev-bib.html
Slide223: ... Afghanistan, Kabul, 2,450 ... Administrative capital and largest city (1997 est ... Undetermined. Panama, Panama City, 450,668. ... of the Gauteng, Northern Province, Mpumalanga ... www.infoplease.com/cgi-bin/id/A0855603
... died in Kano, northern Nigeria's largest city, during two days of anti-American riots led by Muslims protesting the US-led bombing of Afghanistan, according to ... www.washingtonpost.com/wp-dyn/print/world/
... air strikes on the city. ... the Taliban militia in northern Afghanistan in a significant blow ... defection would be the largest since the United States ... www.afgha.com/index.php - 60k
... Kabul is the capital and largest city of Afghanistan. . ... met. area pop. 2,029,889), is the largest city in Uttar Pradesh, a state in northern India. . ... school.discovery.com/homeworkhelp/worldbook/atozgeography/ k/k1menu.html
... Gudermes, Chechnya's second largest town. The attack ... location in Afghanistan's outlying regions ... in the city of Mazar-i-Sharif, a Northern Alliance-affiliated ... english.pravda.ru/hotspots/2001/09/17/
... Get Worse By RICK BRAGG Pakistan's largest city is getting a jump on the ... Region: Education Offers Women in Northern Afghanistan a Ray of Hope. ... www.nytimes.com/pages/world/asia/
... within three miles of the airport at Mazar-e-Sharif, the largest city in northern Afghanistan, held since 1998 by the Taliban. There was no immediate comment ... uk.fc.yahoo.com/photos/a/afghanistan.html Google
Slide224: Document retrieval Query modulation Sentence retrieval Answer extraction Answer ranking What is the largest city in Northern Afghanistan? (largest OR biggest) city “Northern Afghanistan” www.infoplease.com/cgi-bin/id/A0855603 www.washingtonpost.com/wp-dyn/print/world/ Gudermes, Chechnya's second largest town … location in Afghanistan's outlying regions
within three miles of the airport at Mazar-e-Sharif, the largest city in northern Afghanistan Gudermes
Mazer-e-Sharif Mazer-e-Sharif
Gudermes
Research problems: Research problems Source identification:
semi-structured vs. text sources
Query modulation:
best paraphrase of a NL question given the syntax of a search engine?
Compare two approaches: noisy channel model and rule-based
Sentence ranking
n-gram matching, Okapi, co-reference?
Answer extraction
question type identification
phrase chunking
no general-purpose named entity tagger available
Answer ranking
what are the best predictors of a phrase being the answer to a given question: question type, proximity to query words, frequency
Evaluation (MRDR)
accuracy, reliability, timeliness
Document retrieval: Document retrieval Use existing search engines: Google, AlltheWeb, NorthernLight
No modifications to question
CF: work on QASM (ACM CIKM 2001)
Sentence ranking: Sentence ranking Weighted N-gram matching:
Weights are determined empirically, e.g., 0.6, 0.3, and 0.1
Probabilistic phrase reranking: Probabilistic phrase reranking Answer extraction: probabilistic phrase reranking. What is: p(ph is answer to q | q, ph)
Evaluation: TRDR
Example: (2,8,10) gives .725
Document, sentence, or phrase level
Criterion: presence of answer(s)
High correlation with manual assessment
Phrase types: Phrase types PERSON PLACE DATE NUMBER DEFINITION ORGANIZATION DESCRIPTION ABBREVIATION KNOWNFOR RATE LENGTH MONEY REASON DURATION PURPOSE NOMINAL OTHER
Question Type Identification: Question Type Identification Wh-type not sufficient:
Who: PERSON 77, DESCRIPTION 19, ORG 6
What: NOMINAL 78, PLACE 27, DEF26, PERSON 18, ORG 16, NUMBER 14, etc.
How: NUMBER 33, LENGTH 6, RATE 2, etc.
Ripper:
13 features: Question-Words, Wh-Word, Word-Beside-Wh-Word, Is-Noun-Length, Is-Noun-Person, etc.
Top 2 question types
Heuristic algorithm:
About 100 regular expressions based on words and parts of speech
Ripper performance: Ripper performance
Regex performance: Regex performance
Phrase ranking: Phrase ranking Phrases are identified by a shallow parser (ltchunk from Edinburgh)
Four features:
Proximity
POS (part-of-speech) signature (qtype)
Query overlap
Frequency
Proximity: Proximity Phrasal answers tend to appear near words from the query
Average distance = 7 words, range = 1 to 50 words
Use linear rescaling of scores
Part of speech signature: Part of speech signature Example: “Hugo/NNP Young/NNP” P (PERSON | “NNP NNP”) = .458
Example: “the/DT Space/NNP Flight/NNP Operations/NNP contractor/NN”
P (PERSON | “DT NNP NNP NNP NN”) = 0 Penn Treebank tagset (DT = determiner, JJ = adjective)
Query overlap and frequency: Query overlap and frequency Query overlap:
What is the capital of Zimbabwe?
Possible choices: Mugabe, Zimbabwe, Luanda, Harare
Frequency:
Not necessarily accurate but rather useful
Reranking: Reranking Rank Probability and phrase
1 0.599862 the_DT Space_NNP Flight_NNP Operations_NNP contractor_NN ._. 2 0.598564 International_NNP Space_NNP Station_NNP Alpha_NNP 3 0.598398 International_NNP Space_NNP Station_NNP 4 0.598125 to_TO become_VB 5 0.594763 a_DT joint_JJ venture_NN United_NNP Space_NNP Alliance_NNP 6 0.593933 NASA_NNP Johnson_NNP Space_NNP Center_NNP 7 0.587140 will_MD form_VB 8 0.585410 The_DT purpose_NN 9 0.576797 prime_JJ contracts_NNS 10 0.568013 First_NNP American_NNP 11 0.567361 this_DT bulletin_NN board_NN 12 0.565757 Space_NNP :_: 13 0.562627 'Spirit_NN '_'' of_IN ... 41 0.516368 Alan_NNP Shepard_NNP Proximity = .5164
Reranking : Reranking Rank Probability and phrase
1 0.465012 Space_NNP Administration_NNP ._. 2 0.446466 SPACE_NNP CALENDAR_NNP _. 3 0.413976 First_NNP American_NNP 4 0.399043 International_NNP Space_NNP Station_NNP Alpha_NNP 5 0.396250 her_PRP$ third_JJ space_NN mission_NN 6 0.395956 NASA_NNP Johnson_NNP Space_NNP Center_NNP 7 0.394122 the_DT American_NNP Commercial_NNP Launch_NNP Industry_NNP 8 0.390163 the_DT Red_NNP Planet_NNP ._. 9 0.379797 First_NNP American_NNP 10 0.376336 Alan_NNP Shepard_NNP 11 0.375669 February_NNP 12 0.374813 Space_NNP 13 0.373999 International_NNP Space_NNP Station_NNP Qtype = .7288 Proximity * qtype = .3763
Reranking: Reranking Rank Probability and phrase
1 0.478857 Neptune_NNP Beach_NNP ._. 2 0.449232 February_NNP 3 0.447075 Go_NNP 4 0.437895 Space_NNP 5 0.431835 Go_NNP 6 0.424678 Alan_NNP Shepard_NNP 7 0.423855 First_NNP American_NNP 8 0.421133 Space_NNP May_NNP 9 0.411065 First_NNP American_NNP woman_NN 10 0.401994 Life_NNP Sciences_NNP 11 0.385763 Space_NNP Shuttle_NNP Discovery_NNP STS-60_NN 12 0.381865 the_DT Moon_NNP International_NNP Space_NNP Station_NNP 13 0.370030 Space_NNP Research_NNP A_NNP Session_NNP All four features
Document level performance: Document level performance TREC 8 corpus (200 questions)
Sentence level performance: Sentence level performance
Phrase level performance: Phrase level performance Experiments performed Oct-Nov. 2001
Text classification: Text classification
Introduction: Introduction Text classification: assigning documents to predefined categories
Hierarchical vs. flat
Many techniques: generative (maxent, knn, Naïve Bayes) vs. discriminative (SVM, regression)
Generative: model joint prob. p(x,y) and use Bayesian prediction to compute p(y|x)
Discriminative: model p(y|x) directly.
Generative models: knn: Generative models: knn K-nearest neighbors
Very easy to program
Issues: choosing k, b?
Feature selection: The 2 test: Feature selection: The 2 test For a term t:
Testing for independence: P(C=0,It=0) should be equal to P(C=0) P(It=0)
P(C=0) = (k00+k01)/n
P(C=1) = 1-P(C=0) = (k10+k11)/n
P(It=0) = (k00+K10)/n
P(It=1) = 1-P(It=0) = (k01+k11)/n
Feature selection: The 2 test: Feature selection: The 2 test
High values of 2 indicate lower belief in independence.
In practice, compute 2 for all words and pick the top k among them.
Feature selection: mutual information: Feature selection: mutual information No document length scaling is needed
Documents are assumed to be generated according to the multinomial model
Naïve Bayesian classifiers: Naïve Bayesian classifiers Naïve Bayesian classifier
Assuming statistical independence
Spam recognition: Spam recognition Return-Path:
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima"
Reply-To: galadima_esq@netpiper.com
To: webmaster@aclweb.org
Date: Tue, 14 Jan 2003 21:06:26 -0800
Subject: Gooday
DEAR SIR
FUNDS FOR INVESTMENTS
THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD
NO PREVIOUS CORRESPONDENCE WITH YOU
I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT
NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR
CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE
PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL
TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT
TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES
DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT
THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH
ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF
OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A
Well-known datasets: Well-known datasets 20 newsgroups
http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
Reuters-21578
Cats: grain, acquisitions, corn, crude, wheat, trade…
WebKB
http://www-2.cs.cmu.edu/~webkb/
course, student, faculty, staff, project, dept, other
NB performance (2000)
P=26,43,18,6,13,2,94
R=83,75,77,9,73,100,35
Support vector machines: Support vector machines Introduced by Vapnik in the early 90s.
Semi-supervised learning: Semi-supervised learning EM
Co-training
Graph-based
Additional topics: Additional topics Soft margins
VC dimension
Kernel methods
Slide262: SVMs are widely considered to be the best method for text classification (look at papers by Sebastiani, Christianini, Joachims), e.g. 86% accuracy on Reuters.
NB also good in many circumstances
Readings: Readings Books:
1. Ricardo Baeza-Yates and Berthier Ribeiro-Neto; Modern Information Retrieval, Addison-Wesley/ACM Press, 1999.
2. Pierre Baldi, Paolo Frasconi, Padhraic Smyth; Modeling the Internet and the Web: Probabilistic Methods and Algorithms; Wiley, 2003, ISBN: 0-470-84906-1
Papers:
Barabasi and Albert "Emergence of scaling in random networks" Science (286) 509-512, 1999
Bharat and Broder "A technique for measuring the relative size and overlap of public Web search engines" WWW 1998
Brin and Page "The Anatomy of a Large-Scale Hypertextual Web Search Engine" WWW 1998
Bush "As we may thing" The Atlantic Monthly 1945
Chakrabarti, van den Berg, and Dom "Focused Crawling" WWW 1999
Cho, Garcia-Molina, and Page "Efficient Crawling Through URL Ordering" WWW 1998
Davison "Topical locality on the Web" SIGIR 2000
Dean and Henzinger "Finding related pages in the World Wide Web" WWW 1999
Deerwester, Dumais, Landauer, Furnas, Harshman "Indexing by latent semantic analysis" JASIS 41(6) 1990
Readings: Readings Erkan and Radev "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization" JAIR 22, 2004
Jeong and Barabasi "Diameter of the world wide web" Nature (401) 130-131, 1999
Hawking, Voorhees, Craswell, and Bailey "Overview of the TREC-8 Web Track" TREC 2000
Haveliwala "Topic-sensitive pagerank" WWW 2002
Kumar, Raghavan, Rajagopalan, Sivakumar, Tomkins, Upfal "The Web as a graph" PODS 2000
Lawrence and Giles "Accessibility of information on the Web" Nature (400) 107-109, 1999
Lawrence and Giles "Searching the World-Wide Web" Science (280) 98-100, 1998
Menczer "Links tell us about lexical and semantic Web content" arXiv 2001
Page, Brin, Motwani, and Winograd "The PageRank citation ranking: Bringing order to the Web" Stanford TR, 1998
Radev, Fan, Qi, Wu and Grewal "Probabilistic Question Answering on the Web" JASIST 2005
Singhal "Modern Information Retrieval: an Overview" IEEE 2001
More readings: More readings Gerard Salton, Automatic Text Processing, Addison-Wesley (1989)
Gerald Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer (1997)
Gerard Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill (1983)
C. J. an Rijsbergen, Information Retrieval, Buttersworths (1979)
Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes, Van Nostrand Reinhold (1994)
ACM SIGIR Proceedings, SIGIR Forum
ACM conferences in Digital Libraries
Slide266: Thank you! Благодаря!