logging in or signing up lecture9 Aric85 Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 112 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: November 21, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript LBSC 796/INFM 718R: Week 10Clustering, Classifying, and Filtering: LBSC 796/INFM 718R: Week 10 Clustering, Classifying, and Filtering Jimmy Lin College of Information Studies University of Maryland Monday, April 10, 2006The Information Retrieval Cycle: The Information Retrieval Cycle Source Selection Search Selection Examination Delivery Query FormulationOrganizing Search Results: Organizing Search Results Single document Collection of documents Enhanced presentation (syntax) Concepts and relationships (semantics) Last week TodayToday’s Focus: Clustering: Today’s Focus: Clustering Can we do better than a ranked list? How do we automatically group documents into clusters? What are the issues to consider? Related Topics: Related Topics Using the tools in your toolbox to tackle related problems: Classification: automatically assign labels to documents Filtering: automatically decide if a document matches my information needs Text Clustering: Text Clustering Automatically partition documents into clusters based on content Documents within each cluster should be similar Documents in different clusters should be different Discover categories in an unsupervised manner No sample category labels provided by humansVisualizing Clusters: Visualizing Clusters Centroids The Cluster Hypothesis: The Cluster Hypothesis “Closely associated documents tend to be relevant to the same requests.” van Rijsbergen 1979 “… I would claim that document clustering can lead to more effective retrieval than linear search [which] ignores the relationships that exist between documents.” van Rijsbergen 1979Outline of Clustering: Outline of Clustering How do you actually do it? Why would you want to do it? How can you build interfaces that support clustering? Two Strategies: Two Strategies Aglommerative (bottom-up) methods Start with each document in its own cluster Iteratively combine smaller clusters to form larger clusters Divisive (partitional, top-down) methods Directly separate documents into clustersHAC: HAC HAC = Hierarchical Agglomerative Clustering Start with each document in its own cluster Until there is only one cluster: Among the current clusters, determine the two clusters ci and cj, that are most similar Replace ci and cj with a single cluster ci cj The history of merging forms the hierarchyHAC: HAC A B C D E F G H What’s going on geometrically?: What’s going on geometrically? Cluster Similarity: Cluster Similarity Assume a similarity function that determines the similarity of two instances: sim(x,y) What’s appropriate for documents? What’s the similarity between two clusters? Single Link: similarity of two most similar members Complete Link: similarity of two least similar members Group Average: average similarity between membersDifferent Similarity Functions: Different Similarity Functions Single link: Uses maximum similarity of pairs: Can result in “straggly” (long and thin) clusters due to chaining effect Complete link: Use minimum similarity of pairs: Makes more “tight,” spherical clustersNon-Hierarchical Clustering: Non-Hierarchical Clustering Typically, must provide the number of desired clusters, k Randomly choose k instances as seeds, one per cluster Form initial clusters based on these seeds Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering Stop when clustering converges or after a fixed number of iterationsK-Means: K-Means Clusters are determined by centroids (center of gravity) of documents in a cluster: Reassignment of documents to clusters is based on distance to the current cluster centroidsK-Means Algorithm: K-Means Algorithm Let d be the distance measure between documents Select k random instances {s1, s2,… sk} as seeds. Until clustering converges or other stopping criterion: Assign each instance xi to the cluster cj such that d(xi, sj) is minimal Update the seeds to the centroid of each cluster For each cluster cj, sj = (cj) K-Means Clustering Example: K-Means Clustering Example Reassign clusters Converged!K-Means: Discussion: K-Means: Discussion How do you select k? Results can vary based on random seed selection Some seeds can result in poor convergence rate, or convergence to sub-optimal clustersWhy cluster for IR?: Why cluster for IR? Cluster the collection Retrieve clusters instead of documents Cluster the results “Closely associated documents tend to be relevant to the same requests.” “… I would claim that document clustering can lead to more effective retrieval than linear search [which] ignores the relationships that exist between documents.”From Clusters to Centroids: From Clusters to Centroids Centroids Clustering the Collection: Clustering the Collection Basic idea: Cluster the document collection Find the centroid of each cluster Search only on the centroids, but retrieve clusters If the cluster hypothesis is true, then this should perform better Why would you want to do this? Why doesn’t it work?Clustering the Results: Clustering the Results Scatter/Gather Swish (Hearst and Pedersen, 1996) (Chen and Dumais, 2000)Scatter/Gather: Scatter/Gather How it works The system clusters documents into general “themes” The system displays the contents of the clusters by showing topical terms and typical titles User chooses a subset of the clusters The system automatically re-clusters documents within selected cluster The new clusters have different, more refined, “themes” Originally used to give collection overview Evidence suggests more appropriate for displaying retrieval results in context Marti A. Hearst and Jan O. Pedersen. (1996) Reexaming the Cluster Hypothesis: Scatter/Gather on Retrieval Results. Proceedings of SIGIR 1996.Scatter/Gather Example: symbols 8 docs film, tv 68 docs astrophysics 97 docs astronomy 67 docs flora/fauna 10 docs Clustering and re-clustering is entirely automated sports 14 docs film, tv 47 docs music 7 docs stellar phenomena 12 docs galaxies, stars 49 docs constellations 29 docs miscellaneous 7 docs Query = “star” on encyclopedic text Scatter/Gather ExampleClustering Result Sets: Clustering Result Sets Advantages: Topically coherent sets of documents are presented to the user together User gets a sense for the range of themes in the result set Supports exploration and browsing of retrieved hits Disadvantage: Clusters might not “make sense” May be difficult to understand the theme of a cluster based on summary terms Additional computational processing required Things to ponder: What is the relationship between clusters and classification systems? Why does this work?Two Queries: Two Clusterings: Two Queries: Two Clusterings AUTO, CAR, ELECTRIC AUTO, CAR, SAFETY The main differences are the clusters that are central to the query 8 control drive accident … 25 battery california technology … 48 import j. rate honda toyota … 16 export international unit japan … 3 service employee automatic … 6 control inventory integrate … 10 investigation washington … 12 study fuel death bag air … 61 sale domestic truck import … 11 japan export defect unite … … …The SWISH System: The SWISH System Basic idea: Use an existing hierarchical category structure to organize results of Web searches Automatically classify Web pages into the relevant category Present search results grouped according to categories Research questions: How does a category interface compare with a list interface? What features of a category interface would users find useful? Hao Chen and Susan Dumais. (2000) Bringing Order to the Web: Automatically Categorizing Search Results. Proceedings of CHI 2000.Organizing Search Results: Organizing Search Results List Interface Category Interface Query: jaguarCategory Structure: Category Structure Category hierarchy taken from LookSmart Web Directory (Spring, 1999) 13 top-level categories 150 second-level categories Top-level Categories: People & Chat Reference & Education Shopping & Services Society & Politics Sports & Recreation Travel & Vacations Automotive Business & Finance Computers & Internet Entertainment & Media Health & Fitness Hobbies & Interests Home & Family Interface Characteristics: Interface Characteristics Problems Large amount of information to display Limited screen real estate Solutions Information overlay (“mouseovers”) Expandable information displayInformation Overlay: Information Overlay Use “mouseovers” to show Summaries of web pages Category hierarchy Expansion of Category Structure: Expansion of Category StructureExpansion of Web Page List: Expansion of Web Page ListInterface Conditions: Interface Conditions List Interface Category InterfaceUser Study Interface: User Study InterfaceUser Study: User Study Participants: 18 “intermediate” Web users Tasks 30 search tasks, e.g., “Find home page for Seattle Art Museum” Search terms are fixed for each task (cached Web pages) Experimental Design Category/List – within subjects (15 search tasks with each interface) Order (Category/List first) – counterbalanced between subjects Both Subjective and Objective MeasuresSubjective Results: Subjective Results 7-point rating scale (1=disagree; 7=agree) Use of Interface Features: Use of Interface Features Average number of uses of feature per task: Search Time: Search Time Category: 56 sec. List: 85 sec. (p < .002) 50% faster with category interface!Search Time by Query Difficulty: Category interface is helpful for both easy and difficult queries! Search Time by Query DifficultyVisualization of Clusters: Visualization of Clusters Feature Maps Other 2D and 3D displaysKohonen’s Feature Maps: Kohonen’s Feature Maps AKA Self-Organizing Maps Expresses complex, non-linear relationships between high dimensional data on a 2D display Geometric relationships on display preserve some relationships in original data set Map Attributes: Map Attributes Different areas correspond to different concepts in collection Size of area corresponds to its relative importance in set Neighboring regions share commonalitiesStudy of Kohonen Feature Maps: Study of Kohonen Feature Maps Comparison: Kohonen Map and Yahoo Task: “Window shop” for interesting home page Repeat with other interface Results: Starting with map could repeat in Yahoo (8/11) Starting with Yahoo unable to repeat in map (2/14) Hsinchun Chen, Andrea L. Houston, Robin R. Sewell, and Bruce R. Schatz. (1998) Journal of the American Society for Information Science, 49(7):582-603. Feature Map Study (1): Feature Map Study (1) Participants liked: Correspondence of region size to # documents Overview (but also wanted zoom) Ease of jumping from one topic to another Multiple routes to topics Use of category and subcategory labelsFeature Map Study (2): Feature Map Study (2) Participants wanted: Hierarchical organization Other ordering of concepts (alphabetical) Integration of browsing and search Correspondence of color to meaning More meaningful labels Labels at same level of abstraction Fit more labels in the given space Combined keyword and category search Multiple category assignment (sports+entertainment)WEBSOM: WEBSOM Self-organizing map of Net newsgroups and postings Galaxies: GalaxiesThemescape: ThemescapeWebTheme: WebThemeTreeMaps: TreeMaps Demos http://www.smartmoney.com/marketmap/ http://www.cs.umd.edu/hcil/treemap/ Deployment: Deployment Web Search engine that employ clustering: http://www.vivisimo.com http://www.kartoo.com/ Summary: Clustering: Summary: Clustering Advantages: Provides users with an overview of main themes in search results Helps combat polysemy Can improve retrieval effectiveness Disadvantages: Documents can be clustered in many ways Not always easy to understand the theme of a cluster What is the correct level of granularity? More information to present; requires careful design of user interfacesText Classification: Text Classification Problem: automatically sort items into bins Examples: Spam vs. non-spam Interesting vs. non-interesting Machine learning approach Obtain a training set with ground truth labels Use a machine learning algorithm to “train” a classifier kNN, Bayesian classifier, SVMs, decision trees, etc. Apply classifier to new documents System assigns labels according to patterns learned in the training set Machine Learning: Machine Learning label1 label2 label3 label4 Text Classifier Supervised Machine Learning Algorithm Unlabeled Document label1? label2? label3? label4? Testing Training Training examples Representation FunctionkNN: kNN A simple text classification algorithm: k Nearest Neighbors Select k document that are similar to the test document Have them vote on what the correct label should be How can similarity be defined? Information Access Problems: Information Access Problems Collection Information Need Stable Stable Different Each Time Retrieval Filtering Different Each TimeInformation Filtering: Information Filtering An abstract problem in which: The information need is stable: characterized by a “profile” A stream of documents is arriving: each must either be presented to the user or not Introduced by Luhn in 1958 As “Selective Dissemination of Information” Named “filtering” by Denning in 1975A Simple Filtering Strategy: A Simple Filtering Strategy Use any information retrieval system Boolean, vector space, probabilistic, … Have the user specify a “standing query” This will be the profile Limit the standing query by date For each use, show new documents since the last useSocial Filtering: Social Filtering Exploit ratings from other users as features Like personal recommendations, peer review, … Reaches beyond topicality to: Accuracy, coherence, depth, novelty, style, … Applies equally well to other modalities Movies, recorded music, … Sometimes called “collaborative” filteringUsing Positive Information: Using Positive Information Source: Jon Herlocker, SIGIR 1999 Using Negative Information: Using Negative Information Source: Jon Herlocker, SIGIR 1999Some Things We (Sort of) Know: Some Things We (Sort of) Know Treating each genre separately can be useful Separate predictions for separate tastes Negative information can be useful “I hate everything my parents like” People like to know who provided ratings Popularity provides a useful fallback People don’t like to provide ratingsThe Cold Start Problem: The Cold Start Problem Social filtering will not work in isolation Without ratings, we get no recommendations Without recommendations, we read nothing Without reading, we get no ratings An initial recommendation strategy is needed Stereotypes Content-based searchCold Start: Potential Solutions: Cold Start: Potential Solutions Provide motivation: Self-interest Altruism Economic benefit Implicit feedbackSample Observations: Sample Observations User selects an article Interpretation: Summary was interesting User quickly prints the article Interpretation: They want to read it User selects a second article Interpretation: another interesting summary User scrolls around in the article Interpretation: Parts with high dwell time and/or repeated revisits are interesting User stops scrolling for an extended period Interpretation: User was interruptedCritical Issues: Critical Issues Protecting privacy What absolute assurances can we provide? How can we make remaining risks understood? Scalable rating servers Is a fully distributed architecture practical? Non-cooperative users How can the effect of spamming be limited?Recap: Recap Clustering Automatically group documents into clusters Classification Automatically assign labels to documents Filtering Automatically decide if a document matches my information needs Many approaches to the same elephant You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
lecture9 Aric85 Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 112 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: November 21, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript LBSC 796/INFM 718R: Week 10Clustering, Classifying, and Filtering: LBSC 796/INFM 718R: Week 10 Clustering, Classifying, and Filtering Jimmy Lin College of Information Studies University of Maryland Monday, April 10, 2006The Information Retrieval Cycle: The Information Retrieval Cycle Source Selection Search Selection Examination Delivery Query FormulationOrganizing Search Results: Organizing Search Results Single document Collection of documents Enhanced presentation (syntax) Concepts and relationships (semantics) Last week TodayToday’s Focus: Clustering: Today’s Focus: Clustering Can we do better than a ranked list? How do we automatically group documents into clusters? What are the issues to consider? Related Topics: Related Topics Using the tools in your toolbox to tackle related problems: Classification: automatically assign labels to documents Filtering: automatically decide if a document matches my information needs Text Clustering: Text Clustering Automatically partition documents into clusters based on content Documents within each cluster should be similar Documents in different clusters should be different Discover categories in an unsupervised manner No sample category labels provided by humansVisualizing Clusters: Visualizing Clusters Centroids The Cluster Hypothesis: The Cluster Hypothesis “Closely associated documents tend to be relevant to the same requests.” van Rijsbergen 1979 “… I would claim that document clustering can lead to more effective retrieval than linear search [which] ignores the relationships that exist between documents.” van Rijsbergen 1979Outline of Clustering: Outline of Clustering How do you actually do it? Why would you want to do it? How can you build interfaces that support clustering? Two Strategies: Two Strategies Aglommerative (bottom-up) methods Start with each document in its own cluster Iteratively combine smaller clusters to form larger clusters Divisive (partitional, top-down) methods Directly separate documents into clustersHAC: HAC HAC = Hierarchical Agglomerative Clustering Start with each document in its own cluster Until there is only one cluster: Among the current clusters, determine the two clusters ci and cj, that are most similar Replace ci and cj with a single cluster ci cj The history of merging forms the hierarchyHAC: HAC A B C D E F G H What’s going on geometrically?: What’s going on geometrically? Cluster Similarity: Cluster Similarity Assume a similarity function that determines the similarity of two instances: sim(x,y) What’s appropriate for documents? What’s the similarity between two clusters? Single Link: similarity of two most similar members Complete Link: similarity of two least similar members Group Average: average similarity between membersDifferent Similarity Functions: Different Similarity Functions Single link: Uses maximum similarity of pairs: Can result in “straggly” (long and thin) clusters due to chaining effect Complete link: Use minimum similarity of pairs: Makes more “tight,” spherical clustersNon-Hierarchical Clustering: Non-Hierarchical Clustering Typically, must provide the number of desired clusters, k Randomly choose k instances as seeds, one per cluster Form initial clusters based on these seeds Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering Stop when clustering converges or after a fixed number of iterationsK-Means: K-Means Clusters are determined by centroids (center of gravity) of documents in a cluster: Reassignment of documents to clusters is based on distance to the current cluster centroidsK-Means Algorithm: K-Means Algorithm Let d be the distance measure between documents Select k random instances {s1, s2,… sk} as seeds. Until clustering converges or other stopping criterion: Assign each instance xi to the cluster cj such that d(xi, sj) is minimal Update the seeds to the centroid of each cluster For each cluster cj, sj = (cj) K-Means Clustering Example: K-Means Clustering Example Reassign clusters Converged!K-Means: Discussion: K-Means: Discussion How do you select k? Results can vary based on random seed selection Some seeds can result in poor convergence rate, or convergence to sub-optimal clustersWhy cluster for IR?: Why cluster for IR? Cluster the collection Retrieve clusters instead of documents Cluster the results “Closely associated documents tend to be relevant to the same requests.” “… I would claim that document clustering can lead to more effective retrieval than linear search [which] ignores the relationships that exist between documents.”From Clusters to Centroids: From Clusters to Centroids Centroids Clustering the Collection: Clustering the Collection Basic idea: Cluster the document collection Find the centroid of each cluster Search only on the centroids, but retrieve clusters If the cluster hypothesis is true, then this should perform better Why would you want to do this? Why doesn’t it work?Clustering the Results: Clustering the Results Scatter/Gather Swish (Hearst and Pedersen, 1996) (Chen and Dumais, 2000)Scatter/Gather: Scatter/Gather How it works The system clusters documents into general “themes” The system displays the contents of the clusters by showing topical terms and typical titles User chooses a subset of the clusters The system automatically re-clusters documents within selected cluster The new clusters have different, more refined, “themes” Originally used to give collection overview Evidence suggests more appropriate for displaying retrieval results in context Marti A. Hearst and Jan O. Pedersen. (1996) Reexaming the Cluster Hypothesis: Scatter/Gather on Retrieval Results. Proceedings of SIGIR 1996.Scatter/Gather Example: symbols 8 docs film, tv 68 docs astrophysics 97 docs astronomy 67 docs flora/fauna 10 docs Clustering and re-clustering is entirely automated sports 14 docs film, tv 47 docs music 7 docs stellar phenomena 12 docs galaxies, stars 49 docs constellations 29 docs miscellaneous 7 docs Query = “star” on encyclopedic text Scatter/Gather ExampleClustering Result Sets: Clustering Result Sets Advantages: Topically coherent sets of documents are presented to the user together User gets a sense for the range of themes in the result set Supports exploration and browsing of retrieved hits Disadvantage: Clusters might not “make sense” May be difficult to understand the theme of a cluster based on summary terms Additional computational processing required Things to ponder: What is the relationship between clusters and classification systems? Why does this work?Two Queries: Two Clusterings: Two Queries: Two Clusterings AUTO, CAR, ELECTRIC AUTO, CAR, SAFETY The main differences are the clusters that are central to the query 8 control drive accident … 25 battery california technology … 48 import j. rate honda toyota … 16 export international unit japan … 3 service employee automatic … 6 control inventory integrate … 10 investigation washington … 12 study fuel death bag air … 61 sale domestic truck import … 11 japan export defect unite … … …The SWISH System: The SWISH System Basic idea: Use an existing hierarchical category structure to organize results of Web searches Automatically classify Web pages into the relevant category Present search results grouped according to categories Research questions: How does a category interface compare with a list interface? What features of a category interface would users find useful? Hao Chen and Susan Dumais. (2000) Bringing Order to the Web: Automatically Categorizing Search Results. Proceedings of CHI 2000.Organizing Search Results: Organizing Search Results List Interface Category Interface Query: jaguarCategory Structure: Category Structure Category hierarchy taken from LookSmart Web Directory (Spring, 1999) 13 top-level categories 150 second-level categories Top-level Categories: People & Chat Reference & Education Shopping & Services Society & Politics Sports & Recreation Travel & Vacations Automotive Business & Finance Computers & Internet Entertainment & Media Health & Fitness Hobbies & Interests Home & Family Interface Characteristics: Interface Characteristics Problems Large amount of information to display Limited screen real estate Solutions Information overlay (“mouseovers”) Expandable information displayInformation Overlay: Information Overlay Use “mouseovers” to show Summaries of web pages Category hierarchy Expansion of Category Structure: Expansion of Category StructureExpansion of Web Page List: Expansion of Web Page ListInterface Conditions: Interface Conditions List Interface Category InterfaceUser Study Interface: User Study InterfaceUser Study: User Study Participants: 18 “intermediate” Web users Tasks 30 search tasks, e.g., “Find home page for Seattle Art Museum” Search terms are fixed for each task (cached Web pages) Experimental Design Category/List – within subjects (15 search tasks with each interface) Order (Category/List first) – counterbalanced between subjects Both Subjective and Objective MeasuresSubjective Results: Subjective Results 7-point rating scale (1=disagree; 7=agree) Use of Interface Features: Use of Interface Features Average number of uses of feature per task: Search Time: Search Time Category: 56 sec. List: 85 sec. (p < .002) 50% faster with category interface!Search Time by Query Difficulty: Category interface is helpful for both easy and difficult queries! Search Time by Query DifficultyVisualization of Clusters: Visualization of Clusters Feature Maps Other 2D and 3D displaysKohonen’s Feature Maps: Kohonen’s Feature Maps AKA Self-Organizing Maps Expresses complex, non-linear relationships between high dimensional data on a 2D display Geometric relationships on display preserve some relationships in original data set Map Attributes: Map Attributes Different areas correspond to different concepts in collection Size of area corresponds to its relative importance in set Neighboring regions share commonalitiesStudy of Kohonen Feature Maps: Study of Kohonen Feature Maps Comparison: Kohonen Map and Yahoo Task: “Window shop” for interesting home page Repeat with other interface Results: Starting with map could repeat in Yahoo (8/11) Starting with Yahoo unable to repeat in map (2/14) Hsinchun Chen, Andrea L. Houston, Robin R. Sewell, and Bruce R. Schatz. (1998) Journal of the American Society for Information Science, 49(7):582-603. Feature Map Study (1): Feature Map Study (1) Participants liked: Correspondence of region size to # documents Overview (but also wanted zoom) Ease of jumping from one topic to another Multiple routes to topics Use of category and subcategory labelsFeature Map Study (2): Feature Map Study (2) Participants wanted: Hierarchical organization Other ordering of concepts (alphabetical) Integration of browsing and search Correspondence of color to meaning More meaningful labels Labels at same level of abstraction Fit more labels in the given space Combined keyword and category search Multiple category assignment (sports+entertainment)WEBSOM: WEBSOM Self-organizing map of Net newsgroups and postings Galaxies: GalaxiesThemescape: ThemescapeWebTheme: WebThemeTreeMaps: TreeMaps Demos http://www.smartmoney.com/marketmap/ http://www.cs.umd.edu/hcil/treemap/ Deployment: Deployment Web Search engine that employ clustering: http://www.vivisimo.com http://www.kartoo.com/ Summary: Clustering: Summary: Clustering Advantages: Provides users with an overview of main themes in search results Helps combat polysemy Can improve retrieval effectiveness Disadvantages: Documents can be clustered in many ways Not always easy to understand the theme of a cluster What is the correct level of granularity? More information to present; requires careful design of user interfacesText Classification: Text Classification Problem: automatically sort items into bins Examples: Spam vs. non-spam Interesting vs. non-interesting Machine learning approach Obtain a training set with ground truth labels Use a machine learning algorithm to “train” a classifier kNN, Bayesian classifier, SVMs, decision trees, etc. Apply classifier to new documents System assigns labels according to patterns learned in the training set Machine Learning: Machine Learning label1 label2 label3 label4 Text Classifier Supervised Machine Learning Algorithm Unlabeled Document label1? label2? label3? label4? Testing Training Training examples Representation FunctionkNN: kNN A simple text classification algorithm: k Nearest Neighbors Select k document that are similar to the test document Have them vote on what the correct label should be How can similarity be defined? Information Access Problems: Information Access Problems Collection Information Need Stable Stable Different Each Time Retrieval Filtering Different Each TimeInformation Filtering: Information Filtering An abstract problem in which: The information need is stable: characterized by a “profile” A stream of documents is arriving: each must either be presented to the user or not Introduced by Luhn in 1958 As “Selective Dissemination of Information” Named “filtering” by Denning in 1975A Simple Filtering Strategy: A Simple Filtering Strategy Use any information retrieval system Boolean, vector space, probabilistic, … Have the user specify a “standing query” This will be the profile Limit the standing query by date For each use, show new documents since the last useSocial Filtering: Social Filtering Exploit ratings from other users as features Like personal recommendations, peer review, … Reaches beyond topicality to: Accuracy, coherence, depth, novelty, style, … Applies equally well to other modalities Movies, recorded music, … Sometimes called “collaborative” filteringUsing Positive Information: Using Positive Information Source: Jon Herlocker, SIGIR 1999 Using Negative Information: Using Negative Information Source: Jon Herlocker, SIGIR 1999Some Things We (Sort of) Know: Some Things We (Sort of) Know Treating each genre separately can be useful Separate predictions for separate tastes Negative information can be useful “I hate everything my parents like” People like to know who provided ratings Popularity provides a useful fallback People don’t like to provide ratingsThe Cold Start Problem: The Cold Start Problem Social filtering will not work in isolation Without ratings, we get no recommendations Without recommendations, we read nothing Without reading, we get no ratings An initial recommendation strategy is needed Stereotypes Content-based searchCold Start: Potential Solutions: Cold Start: Potential Solutions Provide motivation: Self-interest Altruism Economic benefit Implicit feedbackSample Observations: Sample Observations User selects an article Interpretation: Summary was interesting User quickly prints the article Interpretation: They want to read it User selects a second article Interpretation: another interesting summary User scrolls around in the article Interpretation: Parts with high dwell time and/or repeated revisits are interesting User stops scrolling for an extended period Interpretation: User was interruptedCritical Issues: Critical Issues Protecting privacy What absolute assurances can we provide? How can we make remaining risks understood? Scalable rating servers Is a fully distributed architecture practical? Non-cooperative users How can the effect of spamming be limited?Recap: Recap Clustering Automatically group documents into clusters Classification Automatically assign labels to documents Filtering Automatically decide if a document matches my information needs Many approaches to the same elephant