logging in or signing up week4 Sophia Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 228 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: November 21, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript IST2140Information Storage and Retrieval: IST2140 Information Storage and Retrieval Week 4 Information Retrieval Models IIInformation Retrieval Models: Information Retrieval Models A model is an embodiment of the theory in which we define a set of objects about which assertions can be made and restrict the ways in which classes of objects can interact A retrieval model specifies the representations used for documents and information needs, and how they are compared. (Turtle & Croft, 1992)Information Retrieval Models: Information Retrieval Models An IR model specifies the representations used for documents and information needs, and how they are compared Three ‘classic’ models: Boolean Model Vector Space Model Probabilistic ModelBoolean Model: Boolean Model Exact match system Document features are binary variables {0,1} Features can be document terms, also more complex information (dates, source, authors, etc.) Query is a Boolean expression using feature variables related by AND, OR, NOT operators All documents which match query are retrieved and equal Extensions to model Add additional operators, e.g. positional Introduce term weightsVector space model: Vector space model Documents and queries are represented as vectors in l-dimensional hyperspace Each dimension corresponds to possible document feature Vector elements are usually weighted 0 to 1 Matching function is a distance metric that operates on document and query vectors Documents can be ranked in order of distance from queryProbabilistic Model: Probabilistic Model First proposed by Maron & Kuhns (1960) Further developed by (inter alia) Maron & Cooper Van Rijsbergen and colleagues Croft and Turtle (InQuery) Robertson and colleagues (Okapi)Probabilistic Model: Probabilistic Model Views retrieval as an attempt to answer a basic question: “What is the probability that this document is relevant to this query?” expressed as: P(REL|D) ie. Probability of x given y (Probability that of relevance given a particular document D)Assumptions: Assumptions document here means the content representation or description, i.e. surrogate relevance is binary relevance of a document is independent of relevance of other documents terms are independent of one another Independence Assumption: Independence Assumption General model is developed by making a strong Independence Assumption: Given relevance, the attributes are statistically independent I.e.. Within each class of documents, relevant or non-relevant, the attributes are statistically independent Simplifies theoryProbability Ranking Principle: Probability Ranking Principle If retrieved documents are ordered by decreasing probability of relevance on the data available, then the system’s effectiveness is the best to be gotten for the dataFormal Model: Formal Model Given a document D and query Q, there are two possible events: REL, that D is relevant to Q notREL, that D is not relevant to Q We want to calculate the probability P(REL|D), i.e. the probability that a document is relevant given its contentFormal Model: Formal Model Assuming relevance is binary: P(notREL) = 1 – P(REL) Probability that a document is relevant is: P(REL) = n = no. of relevant docs N no. of docs Probability that a document is not relevant: P(notREL) = 1 – P(REL) = N-n NSlide13: Assuming term independence, then P(D|REL) = П(ai|REL) i.e. probability that a document is relevant is based on the probability of relevance of individual terms Can derive a matching score of a document which is the sum of the weights of the matching (present) terms For simplicity use log-odds rather than odds What term weights to use?: What term weights to use? What contribution does presence of a term in specific documents make to document’s probability of relevance? E.g. CFW: collection frequency weight (log N/ni) RW: relevance weight CW: combined weight (using in-document frequency)Formula for relevance weights: Formula for relevance weights (r+0.5)(N-n-R+r+0.5) RW = log --------------------------- (R-4+0.5)(n-r+0.5) Where r = no. of rel docs containing term i n = no of docs containing term i N = number of docs in collection Probabilistic Model: Probabilistic Model Extensively tested See for instance Sparck Jones et al, A probabilistic model of information retrieval: development and comparative experiments. Parts I and II. Information Processing & Management 36: 779-840. -based on performance figures from Cranfield, UKCIS, NPL, TRECSparck Jones’ Evaluation: Sparck Jones’ Evaluation CW>RW > CFW (idf) > UW I.e. performance gains from Unweighted terms Collection frequency weights Relevance weights Combined weights (using in-document frequency) Largest single gains from using term frequency information Also very noticeable gain from using relevance informationClassic IR Models: Classic IR Models Boolean considered weakest Vector vs. probabilistic: “Through several different measures, Salton and Buckley showed that the vector model is expected to outperform the probabilistic model with general collections. This also seems to be the dominant thought among researchers, practitioners, and the Web community, where the popularity of the vector model runs high”. –Baeza-Yates & Ribeiro, p. 34Classic IR Models: Classic IR Models Vector vs. probabilistic “Numerous experiments demonstrate that probabilistic retrieval procedures yield good results. However, the results have not been sufficiently better than those obtained using Boolean or vector techniques to convince system developers to move heavily in this direction”---Korfhage, p. 92 However, see recent Sparck Jones et al article, and success of probabilistic systems such as InQuery and Okapi at TREC. Latest Model: Language Model: Latest Model: Language Model a statistical language model is a probabilistic model for generating text In late 90’s the language model was proposed for information retrieval (not a new model but new for IR) generates scores for probability that query is generated from the document modelLanguage Models in IR: Language Models in IR Views documents as models and considers queries as strings of texts randomly sampled from models documents are ranked by probability that a query Q would be observed during repeated random sampling from the document model MD: P(Q|MD)Other IR Models: Other IR Models Extended Boolean Fuzzy Set Cluster-based RetrievalCluster-based retrieval: Cluster-based retrieval Cluster analysis: a technique in multivariate analysis used to generate a category structure which fits a set of observations Form of automatic classification but classes formed are not known prior to processing but defined by the cluster processClustering of Text: Clustering of Text Cluster documents on basis of terms they contain Cluster documents on basis of co-occurring citations Cluster terms on basis of documents they occur in Applications in IR: Applications in IR Cluster based retrieval: retrieve documents in the same cluster or hierarchy Text categorization: use to categorize documents Use of term clusters: build a “thesaurus” of like terms; use in query expansion Web IR: use post-hoc clustering to group retrieved web pages (e.g. “jaguar”: car, animal, software) Use in text mining to find patternsPracticalities of clustering: Practicalities of clustering Determine attributes clusters will be based on Choose a clustering method Choose a similarity function Create the clusters Assess validity of resultSimilarity Matrix: Similarity Matrix Many similarity measures: for text Dice, Jaccard, cosine are commonly used Calculate the similarity matrix S21 S31 S32 S41 S42 S43 …. Sn1 Sn2 Sn3… Sn(n-1)Clustering methods: Clustering methods Non-hierarchical Single pass Reallocation Hierarchical Single link Complete link Ward’s methodSingle Pass Method: Single Pass Method First document becomes representative for first cluster For Di calculate Sim with each cluster representative If Simmax > threshold SimT, add to corresponding cluster; otherwise start a new cluster If an item Di remains, return to step 2 Reallocation Method: Reallocation Method Select M cluster representatives or centroids For I = 1 to N, assign Di to most similar centroid For j = 1 to M, recalculate cluster centroid Cj Repeat steps 2 and 3 until cluster membership stabilizesHACM(Hierarchical agglomerative clustering methods): HACM (Hierarchical agglomerative clustering methods) Identify two closest points and combine them in a cluster Identify and combine the next two closest points, treating existing clusters as points If more than one cluster remains, return to step 1HACM: HACM Differ based on: Definition of most similar pair Cluster representation used High storage and time requirements Output presented as a dendrogram HACM: HACM Single link: Joins most similar pair of objects not yet in same cluster Complete link Uses least similar pair between clusters to measure intercluster similarity Ward’s method Merges pair which minimizes increase in within group error sum of squares; uses Euclidean distance; cluster center is weighted average How many clusters?: How many clusters? HACM give complete hierarchy; clusters from 1 to N Use a stopping rule to determine when to stop process -- > number of clusters Cluster-based Retrieval: Cluster-based Retrieval Nearest neighbour cluster (most similar document Rank clusters rather than documents; retrieve entire clusters Use hierarchy structure; top-down or bottom-up search You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
week4 Sophia Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 228 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: November 21, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript IST2140Information Storage and Retrieval: IST2140 Information Storage and Retrieval Week 4 Information Retrieval Models IIInformation Retrieval Models: Information Retrieval Models A model is an embodiment of the theory in which we define a set of objects about which assertions can be made and restrict the ways in which classes of objects can interact A retrieval model specifies the representations used for documents and information needs, and how they are compared. (Turtle & Croft, 1992)Information Retrieval Models: Information Retrieval Models An IR model specifies the representations used for documents and information needs, and how they are compared Three ‘classic’ models: Boolean Model Vector Space Model Probabilistic ModelBoolean Model: Boolean Model Exact match system Document features are binary variables {0,1} Features can be document terms, also more complex information (dates, source, authors, etc.) Query is a Boolean expression using feature variables related by AND, OR, NOT operators All documents which match query are retrieved and equal Extensions to model Add additional operators, e.g. positional Introduce term weightsVector space model: Vector space model Documents and queries are represented as vectors in l-dimensional hyperspace Each dimension corresponds to possible document feature Vector elements are usually weighted 0 to 1 Matching function is a distance metric that operates on document and query vectors Documents can be ranked in order of distance from queryProbabilistic Model: Probabilistic Model First proposed by Maron & Kuhns (1960) Further developed by (inter alia) Maron & Cooper Van Rijsbergen and colleagues Croft and Turtle (InQuery) Robertson and colleagues (Okapi)Probabilistic Model: Probabilistic Model Views retrieval as an attempt to answer a basic question: “What is the probability that this document is relevant to this query?” expressed as: P(REL|D) ie. Probability of x given y (Probability that of relevance given a particular document D)Assumptions: Assumptions document here means the content representation or description, i.e. surrogate relevance is binary relevance of a document is independent of relevance of other documents terms are independent of one another Independence Assumption: Independence Assumption General model is developed by making a strong Independence Assumption: Given relevance, the attributes are statistically independent I.e.. Within each class of documents, relevant or non-relevant, the attributes are statistically independent Simplifies theoryProbability Ranking Principle: Probability Ranking Principle If retrieved documents are ordered by decreasing probability of relevance on the data available, then the system’s effectiveness is the best to be gotten for the dataFormal Model: Formal Model Given a document D and query Q, there are two possible events: REL, that D is relevant to Q notREL, that D is not relevant to Q We want to calculate the probability P(REL|D), i.e. the probability that a document is relevant given its contentFormal Model: Formal Model Assuming relevance is binary: P(notREL) = 1 – P(REL) Probability that a document is relevant is: P(REL) = n = no. of relevant docs N no. of docs Probability that a document is not relevant: P(notREL) = 1 – P(REL) = N-n NSlide13: Assuming term independence, then P(D|REL) = П(ai|REL) i.e. probability that a document is relevant is based on the probability of relevance of individual terms Can derive a matching score of a document which is the sum of the weights of the matching (present) terms For simplicity use log-odds rather than odds What term weights to use?: What term weights to use? What contribution does presence of a term in specific documents make to document’s probability of relevance? E.g. CFW: collection frequency weight (log N/ni) RW: relevance weight CW: combined weight (using in-document frequency)Formula for relevance weights: Formula for relevance weights (r+0.5)(N-n-R+r+0.5) RW = log --------------------------- (R-4+0.5)(n-r+0.5) Where r = no. of rel docs containing term i n = no of docs containing term i N = number of docs in collection Probabilistic Model: Probabilistic Model Extensively tested See for instance Sparck Jones et al, A probabilistic model of information retrieval: development and comparative experiments. Parts I and II. Information Processing & Management 36: 779-840. -based on performance figures from Cranfield, UKCIS, NPL, TRECSparck Jones’ Evaluation: Sparck Jones’ Evaluation CW>RW > CFW (idf) > UW I.e. performance gains from Unweighted terms Collection frequency weights Relevance weights Combined weights (using in-document frequency) Largest single gains from using term frequency information Also very noticeable gain from using relevance informationClassic IR Models: Classic IR Models Boolean considered weakest Vector vs. probabilistic: “Through several different measures, Salton and Buckley showed that the vector model is expected to outperform the probabilistic model with general collections. This also seems to be the dominant thought among researchers, practitioners, and the Web community, where the popularity of the vector model runs high”. –Baeza-Yates & Ribeiro, p. 34Classic IR Models: Classic IR Models Vector vs. probabilistic “Numerous experiments demonstrate that probabilistic retrieval procedures yield good results. However, the results have not been sufficiently better than those obtained using Boolean or vector techniques to convince system developers to move heavily in this direction”---Korfhage, p. 92 However, see recent Sparck Jones et al article, and success of probabilistic systems such as InQuery and Okapi at TREC. Latest Model: Language Model: Latest Model: Language Model a statistical language model is a probabilistic model for generating text In late 90’s the language model was proposed for information retrieval (not a new model but new for IR) generates scores for probability that query is generated from the document modelLanguage Models in IR: Language Models in IR Views documents as models and considers queries as strings of texts randomly sampled from models documents are ranked by probability that a query Q would be observed during repeated random sampling from the document model MD: P(Q|MD)Other IR Models: Other IR Models Extended Boolean Fuzzy Set Cluster-based RetrievalCluster-based retrieval: Cluster-based retrieval Cluster analysis: a technique in multivariate analysis used to generate a category structure which fits a set of observations Form of automatic classification but classes formed are not known prior to processing but defined by the cluster processClustering of Text: Clustering of Text Cluster documents on basis of terms they contain Cluster documents on basis of co-occurring citations Cluster terms on basis of documents they occur in Applications in IR: Applications in IR Cluster based retrieval: retrieve documents in the same cluster or hierarchy Text categorization: use to categorize documents Use of term clusters: build a “thesaurus” of like terms; use in query expansion Web IR: use post-hoc clustering to group retrieved web pages (e.g. “jaguar”: car, animal, software) Use in text mining to find patternsPracticalities of clustering: Practicalities of clustering Determine attributes clusters will be based on Choose a clustering method Choose a similarity function Create the clusters Assess validity of resultSimilarity Matrix: Similarity Matrix Many similarity measures: for text Dice, Jaccard, cosine are commonly used Calculate the similarity matrix S21 S31 S32 S41 S42 S43 …. Sn1 Sn2 Sn3… Sn(n-1)Clustering methods: Clustering methods Non-hierarchical Single pass Reallocation Hierarchical Single link Complete link Ward’s methodSingle Pass Method: Single Pass Method First document becomes representative for first cluster For Di calculate Sim with each cluster representative If Simmax > threshold SimT, add to corresponding cluster; otherwise start a new cluster If an item Di remains, return to step 2 Reallocation Method: Reallocation Method Select M cluster representatives or centroids For I = 1 to N, assign Di to most similar centroid For j = 1 to M, recalculate cluster centroid Cj Repeat steps 2 and 3 until cluster membership stabilizesHACM(Hierarchical agglomerative clustering methods): HACM (Hierarchical agglomerative clustering methods) Identify two closest points and combine them in a cluster Identify and combine the next two closest points, treating existing clusters as points If more than one cluster remains, return to step 1HACM: HACM Differ based on: Definition of most similar pair Cluster representation used High storage and time requirements Output presented as a dendrogram HACM: HACM Single link: Joins most similar pair of objects not yet in same cluster Complete link Uses least similar pair between clusters to measure intercluster similarity Ward’s method Merges pair which minimizes increase in within group error sum of squares; uses Euclidean distance; cluster center is weighted average How many clusters?: How many clusters? HACM give complete hierarchy; clusters from 1 to N Use a stopping rule to determine when to stop process -- > number of clusters Cluster-based Retrieval: Cluster-based Retrieval Nearest neighbour cluster (most similar document Rank clusters rather than documents; retrieve entire clusters Use hierarchy structure; top-down or bottom-up search