Information Access. Lecture 2: Representation of information and identification of significant text features
GSLT, Jan 2007
Barbara Gawronska, Högskolan i Skövde
Requirements on Information Representation:
Discriminating power
Descriptive power
Similarity identification
Ambiguity minimization
Conciseness
These requirements may conflict...
Traditional descriptors
Classification codes (e.g. Universal Decimal Classification)
Subject headings
Key words
Problems:
Standardized lists of subject headings needed
Different spelling conventions, spelling errors
Morphology: inflectional and derivational, compounding
Semantic relations
Problems in linking related words and phrases and strategies for solving the problems
Different spelling conventions, spelling errors: an actual problem. Data below from Dalianis 2002:
10% of all Google queries are misspelled
10% of all queries to the Swedish search engine SUNET are misspelled (Stolpe 2002)
10-12.5% of all queries to Euroling-SiteSeeker are misspelled
Solutions:
spelling checkers;
in proper names - counting the number of identical letters or identical bigrams (letter pairs). This could be improved by adding some phonological and phonotactic knowledge (metathesis etc.)
for example
BARBARA GAWRONSKA (WR is a very infrequent consonant combination in Swedish)
often misspelled as
BARBRO GRAVONSKA
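The bigram-counting idea for proper names can be sketched as follows. This is a minimal illustration, not the method of any particular search engine: it assumes a Dice-style overlap score over letter pairs, and the function names `letter_bigrams` and `bigram_similarity` as well as the unrelated comparison name are made up for the example.

```python
from collections import Counter

def letter_bigrams(name):
    """Multiset of letter pairs, taken within each word of the name."""
    pairs = Counter()
    for word in name.upper().split():
        pairs.update(word[i:i + 2] for i in range(len(word) - 1))
    return pairs

def bigram_similarity(a, b):
    """Assumed scoring: 2 * shared bigrams / total bigrams, from 0.0 to 1.0."""
    x, y = letter_bigrams(a), letter_bigrams(b)
    shared = sum((x & y).values())          # multiset intersection
    total = sum(x.values()) + sum(y.values())
    return 2 * shared / total if total else 0.0

correct = "BARBARA GAWRONSKA"
misspelled = "BARBRO GRAVONSKA"       # the misspelling quoted on the slide
unrelated = "ANNA LINDGREN"           # hypothetical unrelated name

score_misspelled = bigram_similarity(correct, misspelled)
score_unrelated = bigram_similarity(correct, unrelated)
```

The misspelled name still shares most of its letter pairs with the correct one, so it scores far higher than the unrelated name. Phonotactic knowledge (for instance, treating WR and RV as likely confusions) would have to be added on top of this plain count.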
Different causes of errors related to spelling
dyslexia
insufficient language knowledge (e.g. in foreign language learners)
”accidents”
alternative spellings
the user spells correctly, but there is a spelling error in the document
nominal compounds (key word or keyword?)
Problems in linking related words and phrases and strategies for solving the problems (2)
Different morphological forms
Truncation: finding the common part of two strings; no language-specific morphological knowledge. Problems: too many unrelated words may pass through
Example:
Swedish: ren1 (’clean’, adj or verb) and ren2 (’reindeer’)
ren#: rena (adj/v), rent (adj), rens (n)...
ren$$: renen (n), renar (n), renad (v, prt)...
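A small sketch of truncation matching on the Swedish example. It assumes that `ren#` means "ren plus exactly one more character" and `ren$$` means "ren plus exactly two more characters"; the slide does not define the operators, so this reading and the function name `truncation_matches` are assumptions.

```python
def truncation_matches(stem, vocabulary, extra_chars=None):
    """Words starting with `stem`; optionally require exactly `extra_chars` more letters."""
    hits = []
    for word in vocabulary:
        if not word.startswith(stem):
            continue
        if extra_chars is None or len(word) == len(stem) + extra_chars:
            hits.append(word)
    return hits

# Word forms taken from the slide, plus the bare stem "ren".
vocabulary = ["ren", "rena", "rent", "rens", "renen", "renar", "renad"]

one_extra = truncation_matches("ren", vocabulary, extra_chars=1)
two_extra = truncation_matches("ren", vocabulary, extra_chars=2)
```

Both result lists mix forms of ren1 ('clean') and ren2 ('reindeer'), which is the problem named above: truncation has no way of telling the two lexemes apart.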
Strategies for linking different morphological forms, cont.
Lemmatization: identifying the lexical form:
includes knowledge about irregular forms, ”Umlaut”, inflectional patterns
Stemming: a strategy between truncation and lemmatization
The general principle for English (Lovins 1968, Paice 1990):
remove the ending, and transform the ending of the remaining string, if needed
Language-dependent algorithms needed; consider e.g. Indonesian verbs:
infinitive   active     gloss
tawar        menawar    ”bargain”
pikir        memikir    ”think”
beri         memberi    ”give”
sewa         menyewa    ”rent”
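The four Indonesian pairs show why a suffix-stripping stemmer designed for English does not carry over: the active form is built with a prefix that changes, and sometimes replaces, the first consonant of the root. The sketch below encodes only the patterns visible in these four examples; the rule table and the function name `active_form` are illustrative, and real Indonesian morphology has more cases than are covered here.

```python
# Rules read off the four slide examples only (an assumption, not a full grammar):
# first letter of root -> (prefix, whether the first letter is dropped)
ACTIVE_PREFIX_RULES = {
    "t": ("men", True),    # tawar -> menawar
    "p": ("mem", True),    # pikir -> memikir
    "s": ("meny", True),   # sewa  -> menyewa
    "b": ("mem", False),   # beri  -> memberi
}

def active_form(root):
    """Build the active verb form for roots covered by the rule table."""
    rule = ACTIVE_PREFIX_RULES.get(root[:1])
    if rule is None:
        raise ValueError(f"no rule for roots starting with {root[:1]!r}")
    prefix, drop_initial = rule
    return prefix + (root[1:] if drop_initial else root)

pairs = {root: active_form(root) for root in ["tawar", "pikir", "beri", "sewa"]}
```

A stemmer has to run these rules in reverse, which is harder than generating the forms: after stripping "mem", the lost consonant may be p (memikir) or nothing at all (memberi), so a lexicon lookup is usually needed to choose the right root.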
Handling multi-word entries:
context operators, e.g.
exact distance between words
  retrieval$information: retrieval of information; retrieval with information loss
maximal distance between words
  text##retrieval: text retrieval; text and data retrieval
unspecified word order
  information#,retrieval: information retrieval; retrieval of information
+ word pair co-occurrence rate
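The three kinds of context operator can be sketched with one helper that checks how many words lie between two query words. The reading of the operators is inferred from the slide's examples and is an assumption: `$` is taken as "exactly one word in between", `##` as "at most two words in between", and `#,` as "either order, at most one word in between". The function names are hypothetical.

```python
def positions(tokens, word):
    """Indexes at which `word` occurs in the token list."""
    return [i for i, t in enumerate(tokens) if t == word]

def within_gap(text, first, second, min_gap, max_gap, ordered=True):
    """True if `second` follows `first` with min_gap..max_gap words in between.

    With ordered=False the two words may appear in either order.
    """
    tokens = text.lower().split()
    for i in positions(tokens, first):
        for j in positions(tokens, second):
            gap = abs(j - i) - 1
            if (j > i or not ordered) and min_gap <= gap <= max_gap:
                return True
    return False

# exact distance (assumed: exactly one intervening word)
exact_1 = within_gap("retrieval of information", "retrieval", "information", 1, 1)
exact_2 = within_gap("retrieval with information loss", "retrieval", "information", 1, 1)
# maximal distance (assumed: up to two intervening words)
max_1 = within_gap("text retrieval", "text", "retrieval", 0, 2)
max_2 = within_gap("text and data retrieval", "text", "retrieval", 0, 2)
# unspecified word order
any_1 = within_gap("information retrieval", "information", "retrieval", 0, 1, ordered=False)
any_2 = within_gap("retrieval of information", "information", "retrieval", 0, 1, ordered=False)
```

With `ordered=True`, the phrase "information retrieval" does not match a query for "retrieval" followed by "information", which is why a separate unordered operator is useful.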
Linking semantically related words and phrases
Semantic relations in thesauri, lexicons, semantic nets as tools for term expansion; some examples:
ERIC Thesaurus of Descriptors (the Dialog Corporation)
Roget's Thesaurus
KL-ONE
WordNet...
Normally used relations: broader/narrower term, related term, synonym, ”used for”/”use” (identifies a preferred synonym);
also entailment (WordNet), role (KL-ONE)
Thesauri: Top-down classification - monohierarchy
Thesauri: Polyhierarchy
Thesauri: Polydimensional hierarchy
Thesauri: Bottom-up classification
WordNet classification: another example from WordNet (missile, bomb)
Semantic query expansion
Expanding terms with hypernyms from general thesauri is not recommended
Nodes that are placed deep in the top-down hierarchy give better results
The use of synonyms is problematic (feedback from user?)
Knowledge of multi-word entries is important
Discovering English compounds by Lexware Culler (Dura, Gawronska, and Erlendsson 2006)
f(x) = corpus frequency of word x
f(x,y) = corpus frequency of word pair (x, y)
N = total number of words in the corpus
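The slide's own formula is not preserved in this text. One common association measure that uses exactly these three quantities is pointwise mutual information (PMI); whether Lexware Culler uses PMI or another measure is not stated here, so the sketch below is a generic illustration with invented counts.

```python
import math

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information: log2( p(x,y) / (p(x) * p(y)) ).

    Probabilities are estimated as corpus frequency divided by N.
    """
    if min(f_xy, f_x, f_y) <= 0:
        raise ValueError("frequencies must be positive")
    return math.log2((f_xy / n) / ((f_x / n) * (f_y / n)))

# Hypothetical counts in a corpus of N = 100000 words.
strong_pair = pmi(f_xy=10, f_x=100, f_y=100, n=100_000)   # pair far more frequent than chance
chance_pair = pmi(f_xy=1, f_x=100, f_y=1000, n=100_000)   # pair about as frequent as chance
```

A word pair that co-occurs much more often than its parts would predict gets a high score and is a candidate compound; a pair that co-occurs at chance level scores close to zero.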
Discovering Latin terms by Lexware Culler
H(x) = entropy of word x = -p(x) log2(p(x)),
where probability p(x) = f(x) / N
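The entropy term defined above can be computed directly from a word's corpus frequency. The sketch only evaluates the formula as given; how the values are then used to pick out Latin terms is not described in this text, and the counts are invented.

```python
import math

def word_entropy(f_x, n):
    """H(x) = -p(x) * log2(p(x)) with p(x) = f(x) / N; defined as 0 when f(x) is 0."""
    p = f_x / n
    return -p * math.log2(p) if p > 0 else 0.0

# Hypothetical counts: a word making up half of a tiny corpus, and a rare word.
h_half = word_entropy(1, 2)
h_rare = word_entropy(1, 1024)
```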
The questions of significance and similarity
The query word may match a word in many documents, but how significant is this word for the different documents?
Finding significant words
Significance as a function of rank (Luhn 1958)
A simple frequency-based indexing method: frequent words, minus stop-list words, plus truncation/stemming
Finding significant words (2)
Term weighting (Salton & McGill 1983)
The ”Tf x idf” method (term frequency x inverse document frequency):
DocFreq = the number of documents in which the word occurs
n = the total number of documents
”Tf x idf” can be combined with similarity measures, e.g. the vector space model
Tf x idf, an example
Curcumin, a major yellow pigment and active component of turmeric, has multiple
anti-cancer properties. However, its molecular targets and mechanisms of action
on human colon adenocarcinoma cells are unknown. In the present study, we
examined the effects of curcumin on the proliferation of human colon
adenocarcinoma HT-29 cells by the
3-[4,5-dimethylthiazol-2-yl]-2,5-diphenyltetrazolium bromide method and
confirmed the curcumin-induced apoptosis by morphology and DNA ladder formation.
At the same time, p53, phospho-p53 (Ser15), and other apoptosis-related proteins…
gene: occurrences 0, tf x idf = 0
curcumin: occurrences 8, tf x idf = 8 x 1 = 8
p53: occurrences 5, tf x idf = 5 x 0 = 0
The tumor suppressor gene hypermethylated in cancer 1 (HIC1), located on human
chromosome 17p13.3, is frequently silenced in cancer by epigenetic mechanisms.
Hypermethylated in cancer 1 belongs to the bric a brac/poxviruses and
zinc-finger family of transcription factors and acts by repressing target gene
expression. It has been shown that enforced p53 expression leads to increased
HIC1 mRNA, and recent data suggest that p53 and Hic1 cooperate in tumorigenesis.
In order to elucidate the regulation of HIC1 expression, we have analysed the…
gene: occurrences 7, tf x idf = 7 x 1 = 7
curcumin: occurrences 0, tf x idf = 0
p53: occurrences 10, tf x idf = 10 x 0 = 0
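The numbers in this example are consistent with the weighting tf x log2(n / DocFreq) applied to a collection of just these two abstracts (n = 2). That formula is an assumption here, since the slide's own formula is not preserved in this text. The sketch recomputes the weights from the occurrence counts given above; the labels `abstract_1` and `abstract_2` are made up.

```python
import math

def tf_idf(tf, doc_freq, n_docs):
    """Assumed weighting: tf * log2(n / DocFreq); 0 if the word is absent."""
    if tf == 0 or doc_freq == 0:
        return 0.0
    return tf * math.log2(n_docs / doc_freq)

# Occurrence counts as given on the slide.
occurrences = {
    "abstract_1": {"gene": 0, "curcumin": 8, "p53": 5},
    "abstract_2": {"gene": 7, "curcumin": 0, "p53": 10},
}
n_docs = len(occurrences)

# DocFreq = number of documents in which the word occurs at least once.
doc_freq = {
    word: sum(1 for counts in occurrences.values() if counts[word] > 0)
    for word in ["gene", "curcumin", "p53"]
}

weights = {
    doc: {word: tf_idf(tf, doc_freq[word], n_docs) for word, tf in counts.items()}
    for doc, counts in occurrences.items()
}
```

Because p53 occurs in both abstracts, its idf factor is log2(2/2) = 0 and it cannot discriminate between them, however often it occurs. The words curcumin and gene each occur in only one abstract, so their weight equals their raw frequency.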
Similarity measures
Models for comparing texts normally make use of words the texts have in common
Some models also utilize the size of the documents and/or the number of words the texts do not have in common
Similarity measures (2)
The measure is defined in terms of the weight of an occurrence of term j in document i, and T = the maximum number of terms in both documents combined
No attention is paid to the size of a document
Similarity measures (3)
Dice’s coefficient
Jaccard’s coefficient
Similarity measures (4)
The cosine coefficient
(the cosine of the angle between two vectors; the closer the documents, the larger the cosine)
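The formulas for these coefficients are not preserved in this text. The sketch below uses the standard textbook definitions: Dice and Jaccard on sets of terms, and the cosine on term-weight vectors stored as dictionaries. The slides may have used weighted variants of Dice and Jaccard, so treat the set-based versions as an assumption; the example terms are invented.

```python
import math

def dice(a, b):
    """Dice: 2 * |A and B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

def jaccard(a, b):
    """Jaccard: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a or b else 0.0

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors (dicts term -> weight)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical term sets and weight vectors.
doc_a = {"beer", "brewing", "malt"}
doc_b = {"brewing", "malt", "hops"}
dice_ab = dice(doc_a, doc_b)
jaccard_ab = jaccard(doc_a, doc_b)

same_direction = cosine({"beer": 1.0, "malt": 2.0}, {"beer": 2.0, "malt": 4.0})
no_overlap = cosine({"beer": 1.0}, {"gene": 3.0})
```

Unlike a plain count of shared terms, all three measures normalize by document size. The cosine additionally ignores vector length, so a document and a copy with doubled weights count as identical.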
Similarity measures (5)
Clustering by similarity matrices
(Jaccard’s coefficient applied to attribute/value matrices)
Document signature matching (documents coded into very compact binary representations, so-called signatures)
Discriminator words (Williams 1963): the discrimination coefficient ascribes high values to words that occur with a probability much different from the mean probability
Which words should count as common to both documents? The need for stop lists and stemming/morphological analysis
As summer turns to fall, many brewers start to plan their Oktoberfest brewing. This installment of "Brewing in Styles" looks at the materials and techniques used for brewing traditional and modern Maerzen beers and offers some radical tips for brewing Oktoberfest-like ales. Ein prosit!
Several people called in response to the last installment of "Brewing in Styles" ("American Wheat," BrewingTechniques 1 [1], May/June 1993) to say that they were confused because many pubs and micros in the Midwest brew wheat beers in the traditional German manner, complete with the 4-vinylguaiacol clovelike character. Many fine German-style Weizenbiers are brewed in America.
*****************************************************************************************************
Republished from BrewingTechniques' July/August 1993.
What to do with that unfortunate mistake of a recipe? Design another beer that is out of balance in an opposite and complementary way.
It invariably happens, even to the best of us. The beer that should have been so good ends up out of balance and undrinkable. Not being the type to accept less-than-perfect products graciously, I decided to take a page from the Belgian book of brewing.
Belgian brewers have long used the practice of blending to even out inconsistent, wild fermentations
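The effect of a stop list and stemming on "words in common" can be seen with two clauses from the texts above. The stop list and the suffix stripper below are deliberately tiny and crude, and both are assumptions made for illustration; a real system would use a full stop list and a proper stemmer or morphological analyser.

```python
# Hypothetical mini stop list.
STOP_WORDS = {"the", "of", "to", "and", "in", "a", "that", "for", "are", "many", "their", "this"}

def crude_stem(word):
    """Strip one common English ending, keeping at least three letters."""
    for suffix in ("ing", "ers", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def content_stems(text):
    """Lowercase, strip punctuation, drop stop words, then stem."""
    words = (w.strip('.,!?"()[]').lower() for w in text.split())
    return {crude_stem(w) for w in words if w and w not in STOP_WORDS}

text_a = "many brewers start to plan their Oktoberfest brewing"
text_b = "Many fine German-style Weizenbiers are brewed in America"

raw_common = set(text_a.lower().split()) & set(text_b.lower().split())
stem_common = content_stems(text_a) & content_stems(text_b)
```

On the raw word forms the only shared word is "many", which says nothing about the topic. After stop-word removal and stemming, the shared item is the stem "brew", which is the word that actually links the two texts.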
Document clustering
Clustering by predefined thesaurus categories
Choose a subject and click the "Show" button (help)
Social policy (13) Population (80) Age distribution (12) Elderly (163) Pensioners' organizations (12) Gerontology (5) Youth (808) Youth issues (27) Youth research (2) Age limits (124) Generations (214) Children (1228) Children's ombudsman (68) Adults (239) Gender (17) Women (1286) Women's studies (9) Women's organizations (20) Women's rights (18)
The general architecture of a web search engine
Brin and Page, http://www-db.stanford.edu/~backrub/google.html
Categorisation during indexing and search
Language (automatic language recognition)
Document type (HTML, Word, Excel, PDF etc.)
Date
Categories like server, domain, country
Page ranking makes use of the links pointing to a page
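How links pointing to a page can be turned into a ranking is sketched below with a simplified PageRank-style iteration. This is a toy version under stated assumptions, not the production algorithm: ranks are normalized to sum to 1, pages without outgoing links spread their rank evenly, the damping factor is set to the commonly cited 0.85, and the three-page link graph is invented.

```python
def page_rank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page without outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical graph: a and b both link to c, and c links back to a.
ranks = page_rank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

Page c has two incoming links and ends up with the highest rank. Page a has only one incoming link, but it comes from the highly ranked c, so a still ranks well above b, which nobody links to.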
Document clustering
standard similarity measures (should be applied after filtering out stop-list words and, preferably, after stemming)
word weights modified by factors like:
bold type
presence in titles and headings
capital letters
co-occurrence with numbers and symbols
semantic closeness
Clustering algorithms
Top-down (hierarchical)
The whole document collection is divided into a few big clusters; then, the algorithm makes finer and finer distinctions
Bottom-up (non-hierarchical)
a single text is taken as starting point; then, similarities with other texts are computed
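The bottom-up procedure described above can be sketched as a single-pass clustering: take one text as the seed, compute its similarity to the remaining texts, group those that are similar enough, and repeat with what is left. The choice of Jaccard's coefficient on term sets, the threshold value, and the example documents are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard's coefficient on two sets of terms."""
    return len(a & b) / len(a | b) if a | b else 0.0

def seed_clusters(docs, threshold):
    """docs: dict name -> set of terms. Returns a list of clusters (lists of names)."""
    remaining = list(docs)
    clusters = []
    while remaining:
        seed = remaining.pop(0)              # a single text is taken as starting point
        cluster = [seed]
        for name in remaining[:]:            # iterate over a copy while removing
            if jaccard(docs[seed], docs[name]) >= threshold:
                cluster.append(name)
                remaining.remove(name)
        clusters.append(cluster)
    return clusters

# Hypothetical documents, already reduced to stop-word-free stems.
docs = {
    "d1": {"beer", "brew", "malt"},
    "d2": {"beer", "brew", "hops"},
    "d3": {"gene", "p53", "cancer"},
    "d4": {"gene", "cancer", "tumor"},
}
clusters = seed_clusters(docs, threshold=0.4)
```

The result of a single-pass method depends on which text is picked as the seed and on the threshold, which is one reason such methods are usually combined with the word-weight adjustments listed earlier.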
Feedback
(Meadow et al. 2000: 246, McGraw-Hill 1971):
Feedback = information derived from the output of a process and used to control the process in the future
a shortcoming of present IR systems is that the user’s possibilities for giving feedback are very limited
more research on utilizing user feedback is needed
Query modification by relevance feedback
(picture from M.A. Hearst, http://www.sims.berkeley.edu/courses/is202/f98/Lecture25/sld005.htm)