Research in Information Retrieval and Management
Susan Dumais
Microsoft Research
Library of Congress, Feb 8, 1999

Research in IR at MS:
Microsoft Research (http://research.microsoft.com)
  Decision Theory and Adaptive Systems
  Natural Language Processing
  MSR Cambridge
  User Interface
  Database
  Web Companion
  Paperless Office
Microsoft Product Groups … many IR-related

IR Themes & Directions:
Improvements in representation and content-matching
  Probabilistic/Bayesian models: p(Relevant|Document), p(Concept|Words)
  NLP: Truffle, MindNet
Beyond content-matching
  User/Task modeling
  Domain/Object modeling
  Advances in presentation and manipulation

Improvements: Using Probabilistic Model:
MSR-Cambridge (Steve Robertson)
Probabilistic Retrieval (e.g., Okapi)
  Theory-driven derivation of matching function
  Estimate: PQ(ri = Rel or NotRel | d = document)
  Using Bayes Rule and assuming conditional independence given Rel/NotRel

Improvements: Using Probabilistic Model:
Good performance for uniform-length document surrogates (e.g., abstracts)
Enhanced to take into account term frequency and document length
  'BM25' is one of the best ranking functions at TREC
Easy to incorporate relevance feedback
Now looking at adaptive filtering/routing

Improvements: Using NLP:
Current search techniques use word forms
Improvements in content-matching will come from:
  -> Identifying relations between words
  -> Identifying word meanings
Advanced NLP can provide these
http://research.microsoft.com/nlp

Slide7: NLP System Architecture (diagram)
  Diagram labels: NL Text, Dictionary, MindNet, Morphology, Sketch, Logical Form, Portrait, Discourse, Generation, Meaning Representation, Technology, Projects, Machine Translation, Search and Retrieval, Grammar & Style Checking, Document Understanding, Intelligent Summarizing, Smart Selection, Word Breaking, Indexing

'Truffle': Word Relations:
Chart: % Relevant In Top Ten Docs (relevant hits)
  Chart labels: Engine X, X + NLP
  Chart values: 21.5%, 33.1%, 63.7%
Result: 2-3 times as many relevant documents in the top 10 with Microsoft NLP

'MindNet': Word Meanings:
A huge knowledge base
Automatically created from dictionaries
Words (nodes) linked by relationships
7 million links and growing

Slide10: MindNet (figure)

Beyond Content Matching:
Domain/Object modeling
  Text classification and clustering
User/Task modeling
  Implicit queries and Lumiere
Advances in presentation and manipulation
  Combining structure and search (e.g., DM)

Broader View of IR: (figure)

Text Classification:
Text Classification: assign objects to one or more of a predefined set of categories using text features
  E.g., News feeds, Web data, OHSUMED, Email - spam/no-spam
Approaches:
  Human classification (e.g., LCSH, MeSH, Yahoo!, CyberPatrol)
  Hand-crafted knowledge engineered systems (e.g., CONSTRUE)
  Inductive learning methods
    (Semi-) automatic classification

Classifiers:
A classifier is a function: f(x) = conf(class)
  from attribute vectors, x = (x1, x2, … xd)
  to target values, confidence(class)
Example classifiers
  if (interest AND rate) OR (quarterly), then confidence(interest) = 0.9
  confidence(interest) = 0.3*interest + 0.4*rate + 0.1*quarterly

Inductive Learning Methods:
Supervised learning from examples
  Examples are easy for domain experts to provide
  Models easy to learn, update, and customize
Example learning algorithms
  Relevance Feedback, Decision Trees, Naïve Bayes, Bayes Nets, Support Vector Machines (SVMs)
Text representation
  Large vector of features (words, phrases, hand-crafted)

Text Classification Process: (diagram)
  Diagram labels: text files, word counts per file, data set, Index Server, Feature selection, Learning Methods (Decision tree, Naïve Bayes, Find similar, Bayes nets, Support vector machine), test classifier

Support Vector Machine:
Optimization Problem
  Find hyperplane, h, separating positive and negative examples
  Optimization for maximum margin:
  Classify new items using:

Support Vector Machines:
Extendable to:
  Non-separable problems (Cortes & Vapnik, 1995)
  Non-linear classifiers (Boser et al., 1992)
Good generalization performance
  Handwriting recognition (LeCun et al.)
  Face detection (Osuna et al.)
  Text classification (Joachims, Dumais et al.)
Platt’s Sequential Minimal Optimization algorithm very efficient

Reuters Data Set (21578 - ModApte split):
9603 training articles; 3299 test articles
Example 'interest' article:
  2-APR-1987 06:35:19.50
  west-germany
  b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
  FRANKFURT, March 2
  The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct.
  REUTER
Average article 200 words long

Example: Reuters news:
118 categories (article can be in more than one category)
Most common categories (#train, #test)
  Earn (2877, 1087)
  Acquisitions (1650, 179)
  Money-fx (538, 179)
  Grain (433, 149)
  Crude (389, 189)
  Trade (369, 119)
  Interest (347, 131)
  Ship (197, 89)
  Wheat (212, 71)
  Corn (182, 56)
Overall Results
  Linear SVM most accurate: 87% precision at 87% recall

Reuters ROC - Category Grain:
Plot of precision against recall for LSVM, Decision Tree, Naïve Bayes, and Find Similar
Recall: % labeled in category among those stories that are really in category
Precision: % really in category among those stories labeled in category

Text Categ Summary:
Accurate classifiers can be learned automatically from training examples
Linear SVMs are efficient and provide very good classification accuracy
Widely applicable, flexible, and adaptable representations
  Email spam/no-spam, Web, Medical abstracts, TREC

Text Clustering:
Discovering structure
  Vector-based document representation
  EM algorithm to identify clusters
Interactive user interface

Implicit Queries (IQ):
Explicit queries:
  Search is a separate, discrete task
  User types query, Gets results, Tries again …
Implicit queries:
  Search as part of normal information flow
  Ongoing query formulation based on user activities, and non-intrusive results display
  Can include explicit query or push profile, but doesn’t require either

User Modeling for IQ/IR:
IQ: Model of user interests based on actions
  Explicit search activity (query or profile)
  Patterns of scroll / dwell on text
  Copying and pasting actions
  Interaction with multiple applications
Diagram labels: User’s Short- and Long-Term Interests / Needs, 'Implicit Query (IQ)'

Implicit Query Highlights:
IQ built by tracking user’s reading behavior
  No explicit search required
  Good matches returned
IQ user model:
  Combines present context + previous interests
New interfaces for tightly coupling search results with structure -- user study

Slide33: Data Mountain with Implicit Query results shown (highlighted pages to left of selected page).

IQ Study: Experimental Details:
Store 100 Web pages
  50 popular Web pages; 50 random pages
  With or without Implicit Query
    IQ1: Co-occurrence based IQ
    IQ2: Content-based IQ
Retrieve 100 Web pages
  Title given as retrieval cue -- e.g., 'CNN Home Page'
  No implicit query highlighting at retrieval

Slide35: Find: 'CNN Home Page'

Results: Information Storage:
Filing strategies
Number of categories

Results: Retrieval Time: (figure)

Example Web Searches:
excite log; user = A1D6F19DB06BD694; date = 970916
  150052 lion
  152004 lions
  152036 lions lion
  152219 lion facts
  153747 roaring
  153848 lions roaring
  160232 africa lion
  160642 lions, tigers, leopards and cheetahs
  161042 lions, tigers, leopards and cheetahs cats
  161144 wild cats of africa
  161414 africa cat
  161602 africa lions
  161308 africa wild cats
  161823 mane
  161840 lion
  161858 lion lions
  163041 lion facts
  163919 picher of lions
  164040 lion picher
  165002 lion pictures
  165100 pictures of lions
  165211 pictures of big cats
  165311 lion photos
  170013 video in lion
  172131 pictureof a lioness
  172207 picture of a lioness
  172241 lion pictures
  172334 lion pictures cat
  172443 lions
  172450 lions

Summary:
Rich IR research tapestry
  Improving content-matching
  And, beyond ...
    Domain/Object Models
    User/Task Models
    Information Presentation and Use
http://research.microsoft.com/~sdumais
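
The two example classifiers on the Classifiers slide can be sketched in a few lines of Python. This is only an illustration: the function names, the 0/1 word-presence features, and the 0.0 returned when the rule does not fire are assumptions, since the slide gives only the rule and the weights.

```python
from typing import Mapping

# Weights copied from the slide's linear example.
WEIGHTS = {"interest": 0.3, "rate": 0.4, "quarterly": 0.1}


def rule_classifier(x: Mapping[str, int]) -> float:
    """Hand-crafted rule: if (interest AND rate) OR quarterly, confidence is 0.9."""
    if (x.get("interest", 0) and x.get("rate", 0)) or x.get("quarterly", 0):
        return 0.9
    # Assumption: the slide does not say what to return otherwise.
    return 0.0


def linear_classifier(x: Mapping[str, int]) -> float:
    """Weighted sum of features: 0.3*interest + 0.4*rate + 0.1*quarterly."""
    return sum(weight * x.get(term, 0) for term, weight in WEIGHTS.items())


# Hypothetical document that contains "interest" and "rate" but not "quarterly".
doc = {"interest": 1, "rate": 1, "quarterly": 0}
print(rule_classifier(doc))
print(linear_classifier(doc))
```

Both functions map an attribute vector x to confidence(interest). The difference is that the rule is written by hand, while the weights of the linear form are the kind of parameters an inductive learner would estimate from labeled examples.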
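
The recall and precision definitions on the Reuters ROC slide translate directly into set operations. A minimal sketch, assuming a hypothetical set of story ids that a classifier labeled as Grain and a hypothetical set that are really Grain (neither comes from the talk's data):

```python
from typing import Set, Tuple


def precision_recall(labeled: Set[str], actual: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one category.

    precision: share of stories labeled in the category that are really in it
    recall:    share of stories really in the category that were labeled in it
    """
    correct = len(labeled & actual)
    precision = correct / len(labeled) if labeled else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical story ids, for illustration only.
labeled_grain = {"d1", "d2", "d3", "d4"}
actual_grain = {"d2", "d3", "d4", "d5", "d6"}

p, r = precision_recall(labeled_grain, actual_grain)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.60
```

Sweeping a classifier's confidence threshold and recomputing the pair at each setting would trace out a precision/recall curve like the one the slide plots for the Grain category.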
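
The talk names BM25 as a ranking function that accounts for term frequency and document length but gives no formula. The sketch below uses a commonly cited textbook form of BM25, which is not necessarily the exact Okapi variant discussed in the talk. The function name, the smoothed IDF, and the defaults k1 = 1.2 and b = 0.75 are assumptions, and the toy documents are invented.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(query: Sequence[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query (docs must be non-empty)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Smoothed IDF variant that stays non-negative (an assumption).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Invented toy collection, already tokenized.
docs = [
    ["bundesbank", "leaves", "credit", "rate"],
    ["wheat", "harvest", "report"],
    ["interest", "rate", "rate", "unchanged"],
]
print(bm25_scores(["interest", "rate"], docs))
```

The parameter b controls how strongly long documents are penalized, and k1 controls how quickly repeated occurrences of a term stop adding to the score.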