Lecture2 2007

Added: December 10, 2007 This presentation is Public
Presentation Transcript

Information Access. Lecture 2: Representation of information and identification of significant text features
GSLT, Jan 2007
Barbara Gawronska, Högskolan i Skövde


Requirements on Information Representation:
Discriminating power
Descriptive power
Similarity identification
Ambiguity minimization
Conciseness
These requirements may conflict...


Traditional descriptors:
Classification codes (e.g. Universal Decimal Classification)
Subject headings
Key words
Problems:
  Standardized lists of subject headings are needed
  Different spelling conventions, spelling errors
  Morphology: inflectional and derivational, compounding
  Semantic relations


Problems in linking related words and phrases and strategies for solving the problems:
Different spelling conventions and spelling errors are a real problem. Data below from Dalianis 2002:
  10% of all Google queries are misspelled
  10% of all queries to the Swedish search engine SUNET are misspelled (Stolpe 2002)
  10-12.5% of all queries to Euroling-SiteSeeker are misspelled
Solutions: spelling checkers; for proper names, counting the number of identical letters or identical bigrams (letter pairs). This could be improved by adding some phonological and phonotactic knowledge (metathesis etc.). For example, BARBARA GAWRONSKA (WR is a very infrequent consonant combination in Swedish) is often misspelled as BARBRO GRAVONSKA.
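The bigram counting idea for proper names can be sketched as follows. This is a minimal illustration, not the method of any system named above: the function names, the Dice-style normalization, and the unrelated comparison name are assumptions made for the example.

```python
def letter_bigrams(name):
    """Return the set of adjacent letter pairs inside each word of a name."""
    bigrams = set()
    for word in name.upper().split():
        bigrams.update(word[i:i + 2] for i in range(len(word) - 1))
    return bigrams


def bigram_overlap(a, b):
    """Share of bigrams two names have in common (assumed Dice-style ratio)."""
    x, y = letter_bigrams(a), letter_bigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


correct = "BARBARA GAWRONSKA"
misspelled = "BARBRO GRAVONSKA"
unrelated = "ANDERS SVENSSON"  # hypothetical name, for contrast only

print(bigram_overlap(correct, misspelled))
print(bigram_overlap(correct, unrelated))
```

The misspelled form shares most of its bigrams with the correct name, while the unrelated name shares very few, so a threshold on this score could propose the correct spelling.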


Different causes of errors related to spelling:
dyslexia
insufficient language knowledge (e.g. in foreign language learners)
"accidents"
alternative spellings
the user spells correctly, but there is a spelling error in the document
nominal compounds (key word or keyword?)


Problems in linking related words and phrases and strategies for solving the problems (2):
Different morphological forms
Truncation: finding the common part of two strings; no language-specific morphological knowledge.
Problem: too many unrelated words may pass through
Example: Swedish ren1 ('clean', adj or verb) and ren2 ('reindeer')
  ren#: rena (adj/v), rent (adj), rens (n)...
  ren$$: renen (n), renar (n), renad (v, prt)...
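A minimal sketch of truncation, assuming that each wildcard symbol in the example stands for exactly one extra character after the common part. The function name and the small vocabulary are made up for the illustration.

```python
def truncation_matches(stem, extra, vocabulary):
    """Return words that start with `stem` followed by exactly `extra` characters."""
    return [w for w in vocabulary
            if w.startswith(stem) and len(w) == len(stem) + extra]


# Forms of 'clean' and 'reindeer' mixed together, plus two other words.
vocabulary = ["rena", "rent", "rens", "renen", "renar", "renad", "renhet", "bok"]

print(truncation_matches("ren", 1, vocabulary))  # one extra character
print(truncation_matches("ren", 2, vocabulary))  # two extra characters
```

Both result lists mix forms of the two unrelated lexemes, which is the weakness noted above: string matching alone cannot tell them apart.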


Strategies for linking different morphological forms, cont.:
Lemmatization: identifying the lexical form; includes knowledge about irregular forms, "Umlaut", inflectional patterns
Stemming: a strategy between truncation and lemmatization
The general principle for English (Lovins 1968, Paice 1990): remove the ending, and transform the ending of the remaining string, if needed
Language-dependent algorithms are needed; consider e.g. Indonesian verbs:
  infinitive   active     meaning
  tawar        menawar    "bargain"
  pikir        memikir    "think"
  beri         memberi    "give"
  sewa         menyewa    "rent"
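The "remove the ending, then transform what remains" principle can be illustrated with a toy stemmer. This is not the Lovins or the Paice algorithm: the four-suffix list, the minimum stem length of three letters, and the single recoding rule (undoubling a final consonant) are assumptions chosen to keep the sketch short.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # toy list, tried in this order
VOWELS = "aeiou"


def toy_stem(word):
    """Strip one known ending, then undouble a final consonant if needed."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: len(word) - len(suffix)]
            # Recoding step: "runn" -> "run", "stopp" -> "stop".
            if (suffix in ("ing", "ed")
                    and stem[-1] == stem[-2]
                    and stem[-1] not in VOWELS):
                stem = stem[:-1]
            return stem
    return word


for w in ["running", "stopped", "brewing", "brewers", "red"]:
    print(w, "->", toy_stem(w))
```

Even this small rule set over-stems some words (for example, "falling" loses its double l), which is why real stemmers carry long exception lists and why the rules cannot be reused for a language like Indonesian, where the changes happen at the start of the word.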


Handling multi-word entries:
Context operators, e.g.
  exact distance between words
    retrieval$information: retrieval of information, retrieval with information loss
  maximal distance between words
    text##retrieval: text retrieval, text and data retrieval
  unspecified word order
    information#,retrieval: information retrieval, retrieval of information
+ word pair co-occurrence rate
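The three kinds of context operators can be sketched with one function. The operator syntax on the slide is not parsed here; instead the function takes the gap limits and the word-order flag as explicit parameters, which are names assumed for this illustration.

```python
def within_distance(text, first, second, min_gap, max_gap, ordered=True):
    """True if the two words occur with min_gap..max_gap words between them."""
    tokens = text.lower().split()
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [j for j, t in enumerate(tokens) if t == second]
    for i in first_positions:
        for j in second_positions:
            if i == j or (ordered and j < i):
                continue
            gap = abs(j - i) - 1  # number of words between the two terms
            if min_gap <= gap <= max_gap:
                return True
    return False


# Exact distance: exactly one word in between.
print(within_distance("retrieval with information loss", "retrieval", "information", 1, 1))
# Maximal distance: at most two words in between.
print(within_distance("text and data retrieval", "text", "retrieval", 0, 2))
# Unspecified word order: either term may come first.
print(within_distance("retrieval of information", "information", "retrieval", 0, 1, ordered=False))
```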


Linking semantically related words and phrases:
Semantic relations in thesauri, lexicons, and semantic nets as tools for term expansion; some examples:
  ERIC Thesaurus of Descriptors (the Dialog Corporation)
  Roget's Thesaurus
  KL-ONE
  WordNet...
Normally used relations: broader/narrower term, related term, synonym, "used for"/"use" (identifies a preferred synonym); also entailment (WordNet), role (KL-ONE)


Thesauri: Top-down classification - monohierarchy


Thesauri: Polyhierarchy


Thesauri: Polydimensional hierarchy


Thesauri: Polydimensional hierarchy


Thesauri: Bottom-up classification


Another example from WordNet (WordNet classification): missile, bomb


Semantic query expansion:
Expanding terms by hypernyms from general thesauri is not recommended
Nodes that are placed deep in the top-down hierarchy give better results
The use of synonyms is problematic (feedback from the user?)
Knowledge of multi-word entries is important


Discovering English compounds by Lexware Culler (Dura, Gawronska, and Erlendsson 2006):
f(x) = corpus frequency of word x
f(x,y) = corpus frequency of word pair (x, y)
N = total number of words in the corpus
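The scoring formula itself is not preserved in this transcript. As an illustration of how the three quantities f(x), f(x,y) and N can be combined, the sketch below uses pointwise mutual information, a widely used association measure. This is a swapped-in technique for the example and is not claimed to be the measure used by Lexware Culler; the toy corpus is made up.

```python
import math
from collections import Counter


def pmi(pair, unigram_freq, pair_freq, n_words):
    """Pointwise mutual information of an adjacent word pair (x, y)."""
    x, y = pair
    # Simplification: pair counts are divided by N as well.
    p_xy = pair_freq[pair] / n_words
    p_x = unigram_freq[x] / n_words
    p_y = unigram_freq[y] / n_words
    return math.log2(p_xy / (p_x * p_y))


tokens = "key word search finds the key word in the text".split()
f = Counter(tokens)                         # f(x)
f_pairs = Counter(zip(tokens, tokens[1:]))  # f(x, y) for adjacent words
N = len(tokens)                             # N

print(pmi(("key", "word"), f, f_pairs, N))
print(pmi(("the", "key"), f, f_pairs, N))
```

The pair that always occurs together scores higher than the pair that co-occurs only once, which is the behaviour one wants from a compound candidate score.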


Discovering Latin terms by Lexware Culler:
H(x) = entropy of word x = -p(x) log2(p(x)), where the probability p(x) = f(x) / N
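The entropy term defined above can be computed directly from corpus counts. This sketch only evaluates H(x) for single words; how Lexware Culler uses the value to pick out Latin terms is not described here, and the toy token list is an assumption for the example.

```python
import math
from collections import Counter


def word_entropy(word, freq, n_words):
    """H(x) = -p(x) * log2(p(x)), with p(x) = f(x) / N."""
    p = freq[word] / n_words
    if p == 0:
        return 0.0  # an unseen word contributes nothing
    return -p * math.log2(p)


tokens = "in vitro and in vivo and in situ".split()
freq = Counter(tokens)
n_words = len(tokens)

for w in ["in", "vitro", "corpus"]:
    print(w, word_entropy(w, freq, n_words))
```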


The questions of significance and similarity:
The query word may match a word in many documents, but how significant is this word for the different documents?


Finding significant words:
Significance as a function of rank (Luhn 1958)
A simple frequency-based indexing method: frequent words minus stop-list words, plus truncation/stemming
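The simple frequency-based method can be sketched in a few lines. The stop list, the fixed-length truncation, and the sample sentence are assumptions made for the illustration; a real system would use a full stop list and a proper stemmer.

```python
from collections import Counter

STOP_LIST = {"the", "of", "and", "a", "in", "to", "is", "for"}  # hypothetical


def index_terms(text, top_k=3, prefix_len=5):
    """Rank truncated non-stop words by frequency and return the top_k."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    content = [w[:prefix_len] for w in words if w and w not in STOP_LIST]
    return Counter(content).most_common(top_k)


text = ("Brewing of beers: the brewer brews a beer and the brewers "
        "brew beers for the brewing season.")
print(index_terms(text, top_k=2, prefix_len=4))
```

Without the stop list, "the" would rank among the top terms; without truncation, the six forms of "brew" would be counted as five different words.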


Finding significant words (2):
Term weighting: Salton & McGill 1983
The "Tf x idf" method (term frequency x inverse document frequency), where:
  DocFreq = the number of documents in which the word occurs
  n = the total number of documents
"Tf x idf" can be combined with similarity measures, e.g. the vector space model


Tf x idf, an example:

Document 1:
Curcumin, a major yellow pigment and active component of turmeric, has multiple anti-cancer properties. However, its molecular targets and mechanisms of action on human colon adenocarcinoma cells are unknown. In the present study, we examined the effects of curcumin on the proliferation of human colon adenocarcinoma HT-29 cells by the 3-[4,5-dimethylthiazol-2-yl]-2,5-diphenyltetrazolium bromide method and confirmed the curcumin-induced apoptosis by morphology and DNA ladder formation. At the same time, p53, phospho-p53 (Ser15), and other apoptosis-related proteins…
  gene: occurrences 0, tf x idf = 0
  curcumin: occurrences 8, tf x idf = 8
  p53: occurrences 5, tf x idf = 5 x 0 = 0

Document 2:
The tumor suppressor gene hypermethylated in cancer 1 (HIC1), located on human chromosome 17p13.3, is frequently silenced in cancer by epigenetic mechanisms. Hypermethylated in cancer 1 belongs to the bric a brac/poxviruses and zinc-finger family of transcription factors and acts by repressing target gene expression. It has been shown that enforced p53 expression leads to increased HIC1 mRNA, and recent data suggest that p53 and Hic1 cooperate in tumorigenesis. In order to elucidate the regulation of HIC1 expression, we have analysed the…
  gene: occurrences 7, tf x idf = 7 x 1 = 7
  curcumin: occurrences 0, tf x idf = 0
  p53: occurrences 10, tf x idf = 0
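The weights in this example are consistent with tf x log2(n / DocFreq), where n = 2 documents: a word found in both documents gets idf = log2(2/2) = 0, and a word found in only one gets idf = log2(2/1) = 1. The sketch below reproduces the numbers under that assumption; the exact formula on the original slide is not preserved, and the function and variable names are made up.

```python
import math


def tf_idf(tf, doc_freq, n_docs):
    """Term frequency times log2(n / DocFreq); 0 if the word does not occur."""
    if tf == 0 or doc_freq == 0:
        return 0.0
    return tf * math.log2(n_docs / doc_freq)


# Occurrence counts taken from the two abstracts in the example.
counts = {
    "doc1": {"gene": 0, "curcumin": 8, "p53": 5},
    "doc2": {"gene": 7, "curcumin": 0, "p53": 10},
}
n_docs = len(counts)
words = ["gene", "curcumin", "p53"]
doc_freq = {w: sum(1 for c in counts.values() if c[w] > 0) for w in words}

weights = {
    doc: {w: tf_idf(c[w], doc_freq[w], n_docs) for w in words}
    for doc, c in counts.items()
}
for doc, row in weights.items():
    print(doc, row)
```

Although p53 is the most frequent of the three words in the second abstract, it cannot separate the two documents, so its weight drops to zero.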


Similarity measures:
Models for comparing texts normally make use of the words the texts have in common
Some models also use the size of the documents and/or the number of words the texts do not have in common


Similarity measures (2):
w_ij = the weight of an occurrence of term j in document i
T = the maximum number of terms in both documents combined
No attention is paid to the size of a document
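The formula on this slide is not preserved. A measure that fits the two definitions and the remark about document size is the unnormalized inner product, the sum over all T terms of the product of a term's weights in the two documents. The sketch below assumes that reading; the dictionary representation and the function name are choices made for the example.

```python
def inner_product(weights_i, weights_k):
    """Sum over all terms of weight-in-document-i times weight-in-document-k."""
    terms = set(weights_i) | set(weights_k)  # the T terms of both documents
    return sum(weights_i.get(t, 0.0) * weights_k.get(t, 0.0) for t in terms)


doc_i = {"beer": 1.0, "wheat": 2.0}
doc_k = {"wheat": 3.0, "yeast": 4.0}
print(inner_product(doc_i, doc_k))

# A document repeated twice has doubled weights and a doubled score,
# which shows that document size is not compensated for.
doc_k_twice = {t: 2 * w for t, w in doc_k.items()}
print(inner_product(doc_i, doc_k_twice))
```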


Similarity measures (3):
Dice's coefficient
Jaccard's coefficient
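For documents treated as sets of words, the two coefficients have simple standard forms: Dice is twice the number of shared words divided by the sum of the two set sizes, and Jaccard is the number of shared words divided by the size of the union. The slide's own formulas are not preserved, so the set-based version below is an assumption about which variant was shown.

```python
def dice(a, b):
    """Dice's coefficient: 2 * |A and B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard's coefficient: |A and B| / |A or B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


doc_a = {"wheat", "beer", "brewing"}
doc_b = {"beer", "brewing", "blending"}
print(dice(doc_a, doc_b), jaccard(doc_a, doc_b))
```

Both measures take the size of the documents into account, and Jaccard is never larger than Dice for the same pair.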


Similarity measures (4):
The cosine coefficient (the cosine of the angle between two vectors; the closer the documents, the larger the cosine)
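A minimal sketch of the cosine coefficient for documents stored as term-to-weight dictionaries. The representation and the sample weights are assumptions for the example.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


doc_a = {"beer": 2.0, "wheat": 1.0}
doc_b = {"beer": 4.0, "wheat": 2.0}   # same proportions, twice as long
doc_c = {"gene": 3.0, "p53": 1.0}     # no words in common with doc_a

print(cosine(doc_a, doc_b))
print(cosine(doc_a, doc_c))
```

Because the vectors are normalized by their lengths, a document and a longer document with the same word proportions point in the same direction and get the maximum score.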


Similarity measures (5):
Clustering by similarity matrices (Jaccard's coefficient applied to attribute/value matrices)
Document signature matching (documents coded into very compact binary representations, so-called signatures)
Discriminator words (Williams 1963): the discrimination coefficient ascribes high values to words that occur with a probability much different from the mean probability


Which words should count as common to both documents? The need for stop lists and stemming/morphological analysis:

Document 1:
As summer turns to fall, many brewers start to plan their Oktoberfest brewing. This installment of "Brewing in Styles" looks at the materials and techniques used for brewing traditional and modern Maerzen beers and offers some radical tips for brewing Oktoberfest-like ales. Ein prosit! Several people called in response to the last installment of "Brewing in Styles" ("American Wheat," BrewingTechniques 1 [1], May/June 1993) to say that they were confused because many pubs and micros in the Midwest brew wheat beers in the traditional German manner, complete with the 4-vinylguaiacol clovelike character. Many fine German-style Weizenbiers are brewed in America.

Document 2:
Republished from BrewingTechniques' July/August 1993. What to do with that unfortunate mistake of a recipe? Design another beer that is out of balance in an opposite and complementary way. It invariably happens, even to the best of us. The beer that should have been so good ends up out of balance and undrinkable. Not being the type to accept less-than-perfect products graciously, I decided to take a page from the Belgian book of brewing. Belgian brewers have long used the practice of blending to even out inconsistent, wild fermentations


Document clustering


Clustering by predefined thesaurus categories:
Choose a subject and click the "Show" button (help)
Social policy (13)
  Population (80)
    Age distribution (12)
      Elderly (163)
        Pensioners' organizations (12)
        Gerontology (5)
      Youth (808)
        Youth issues (27)
        Youth research (2)
      Age limits (124)
      Generations (214)
      Children (1228)
        Children's ombudsman (68)
      Adults (239)
    Gender (17)
      Women (1286)
        Women's studies (9)
        Women's organizations (20)
        Women's rights (18)


The general architecture of a web search engine (Brin and Page, http://www-db.stanford.edu/~backrub/google.html)


Categorisation during indexing and search:
Language (automatic language recognition)
Document type (HTML, Word, Excel, PDF etc.)
Date
Categories like server, domain, country
Page ranking makes use of the links pointing to a page
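How links pointing to a page can be turned into a ranking is sketched below with a simplified PageRank-style iteration. The damping factor of 0.85, the number of iterations, the handling of pages without outgoing links, and the four-page link graph are assumptions for the illustration, not a description of any production search engine.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Iteratively pass each page's rank along its outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: pages a, b and d all link to c; c links to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = page_rank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page c, which has the most incoming links, ends up with the highest score, and page a benefits from being linked to by the highly ranked page c.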


Document clustering:
standard similarity measures (should apply after filtering out stop list words and, preferably, after stemming)
word weights modified by factors like:
  bold type
  presence in titles and headings
  capital letters
  co-occurrence with numbers and symbols
  semantic closeness


Clustering algorithms:
Top-down (hierarchical): the whole document collection is divided into a few big clusters; then, the algorithm makes finer and finer distinctions
Bottom-up (non-hierarchical): a single text is taken as the starting point; then, similarities with other texts are computed
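The bottom-up idea, starting from a single text and comparing it with the others, can be sketched as a single-pass clustering. The use of Jaccard's coefficient on word sets, the threshold of 0.3, and the rule of comparing each document only with the first member of a cluster are assumptions made for this example.

```python
def jaccard(a, b):
    """Jaccard's coefficient of two word sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def single_pass_clusters(docs, threshold=0.3):
    """Assign each document to the first cluster whose seed text is similar enough."""
    clusters = []  # lists of document indices; the first index is the seed
    for i, words in enumerate(docs):
        for cluster in clusters:
            if jaccard(words, docs[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar seed: this text starts a new cluster
    return clusters


docs = [
    {"beer", "brew", "malt"},
    {"beer", "brew", "hops"},
    {"gene", "p53", "cancer"},
    {"gene", "cancer", "cell"},
]
print(single_pass_clusters(docs))
```

The result depends on the order of the documents and on the threshold, which is a known limitation of single-pass methods.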


Feedback (Meadow et al. 2000: 246, McGraw-Hill 1971):
Feedback = information derived from the output of a process and used to control the process in the future
A shortcoming of present IR systems is that the user's possibilities of giving feedback are very limited
More research on utilizing feedback from the user is needed


Query modification by relevance feedback (picture from M.A. Hearst, http://www.sims.berkeley.edu/courses/is202/f98/Lecture25/sld005.htm)
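The picture is not preserved here. One classic way to modify a query from relevance judgments is the Rocchio formula, which moves the query vector toward documents the user marked as relevant and away from those marked as non-relevant. The sketch below assumes that technique; it is not taken from the cited picture, and the weights alpha, beta and gamma are common textbook defaults rather than values from this lecture.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector; terms with non-positive weight are dropped."""
    terms = set(query)
    for doc in relevant + non_relevant:
        terms |= set(doc)
    new_query = {}
    for t in terms:
        pos = sum(d.get(t, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(t, 0.0) for d in non_relevant) / len(non_relevant) if non_relevant else 0.0
        weight = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
        if weight > 0:
            new_query[t] = weight
    return new_query


query = {"beer": 1.0}
relevant = [{"beer": 1.0, "wheat": 1.0}]      # judged relevant by the user
non_relevant = [{"beer": 1.0, "root": 1.0}]   # judged non-relevant
modified = rocchio(query, relevant, non_relevant)
print(sorted(modified.items()))
```

The modified query gains the term "wheat" from the relevant document and does not pick up "root", so the next search round is steered by the user's feedback.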