ROCLING04 (uploaded by Natalya; added November 20, 2007)

Presentation Transcript

Why should I care about Computational Linguistics & Language Processing?: Hsiao-Wuen Hon (洪小文), Assistant Managing Director, Microsoft Research Asia

Agenda: Should I care? Industry cares. Microsoft cares. Speech. NLP. Web Search & Mining. Summary: we should care.

Should I care?: Medical school: a "golden rice bowl" (金饭碗, a secure, well-paid job). Electronics: stock bonuses (配股), an easy way to become a millionaire. Chip manufacturing: TSMC, UMC. Hardware: Acer, Quanta, Hon Hai (鸿海), BenQ, Inventec (英业达), MiTac. NLP? Speech? IR? HWR?

It is actually a good choice: People go on to have good careers. Many applications: IR, HWR, investment banks, bioinformatics, … With many smart people. Software industry cares. Not overproducing students.

Industry Cares: People you might know. Academics: pillars of A.I.; well funded; Taiwan professors; overseas professors: V. Zue, B.H. Juang, F. Jelinek, M. Liberman, N. Chomsky, Michael Collins, Fernando Pereira, …

Industry Cares: Industrial R&D labs. Executives: Kai-Fu Lee (MS), Qi Lu (Yahoo), … Microsoft: X.D.
Huang, Hsiao-Wuen Hon (洪小文), Wei-Ying Ma (马维英), Eric Chang, Ming Zhou (周明), Eric Brill, Ken Church, …; continuing to hire. Google: Speech: Amit Singhal, Michael Riley, …; NL: Franz Och, Krishna Bharat, Dekang Lin, …; aggressively hiring. Others…

Industry Cares: Other applications. Renaissance Technologies: hedge fund management, 4 billion in assets; time-series prediction based on S&L (speech and language) technologies; a.k.a. the ex-IBM S&L group: P. Brown, R. Mercer, P. De Souza, L. Bahl, the Della Pietra brothers, … Startups: Nuance, SpeechWorks, InfoTalk, iPhrase, Lexicus, …

Microsoft Cares: Bill Gates' vision: a PC on everyone's desktop ('75); Information at Your Fingertips ('90); Seamless Computing ('03). S&L technologies are the key. Billions of dollars invested in S&L technologies. Full-size S&L product and research groups. Multi-lingual and multi-product. Continuing to hire. Expanded investment due to search/Google.

Information Agent: "Do what I mean". "Find what I want". Example: "How to turn on Firewall in Windows?" Speech recognition: signal to text. Natural language understanding: syntax/semantics, domain knowledge. Knowledge search. AI-complete.

A Long Long Journey: Speech: ubiquitous interface; automatic speech recognition; text-to-speech. Natural language: spelling/grammar/style checking; IME; machine translation. Information retrieval & mining.

Speech: SAPI 1.0 to 6.0: Windows Sound System in '92; platform for building speech applications in Windows; accessibility support (Screen Reader). Office Dictation: Chinese, English. Microsoft Speech Server: telephony speech and multimodal platform. Other: Encarta, WinCE/Smartphone, …

Speech

MSRA Speech: TTS: multi-lingual natural TTS. ASR: Chinese LVCSR (dictation/telephony/embedded). Fundamental research. AIME: Audio Info.
Management & Extraction: audio/video file indexing/retrieval; offline transcription/extraction/summarization. More in Eric's keynote tomorrow: "From the Lab to Ubiquity: Speech Technology's Road to Mainstream".

NLP Contributes to MS Products: IME (Chinese, Japanese, …); spelling/grammar checking; spam filtering; English Writing Wizard (EWW); spoken language interface; IR and CLIR; text mining; machine translation; search engine; QA (AskMSR); SLM for speech; text analysis for TTS; …

NLP "Rainbow": diagram with the labels Source Text, Word Breaking, Morphology, Syntax, Logical Form, Discourse, Understanding, Dictionary, Knowledge base, Analysis, Transfer, Generation, Target Text, Grammar Checking, Machine Translation.

NLP at MSRA: Research; Linguistic Resources; Applications.

NLP at MSRA: TIME; email routing; spam filtering; resume routing; support routing; EWW; translation.

TIME Platform: Text Information Management & Extraction. Goal: extract information from text data. Genres: email, newspaper, report, web pages. Formats: Word document, PDF/PS, HTML/XML. Languages: English, Chinese, Japanese, … Applications: search, question answering, data mining, machine translation.

TIME Components: Linguistic processing (TIME linguistic platform). Text normalization: sentence splitting, tokenization, morphological analysis. Entity extraction: person name, company name, time expression, phrases. Relation learning: syntactic/semantic dependencies between entities. Information extraction. Document property extraction: title, author, key term, summary. Domain knowledge extraction: concept, concept relation, glossary, taxonomy, event. Cross-lingual information exchange. Translation at word, entity, term, skeleton, and text levels. Reading, writing, cross-language information retrieval.

TIME Demo

Multi-lingual linguistic unit processing: Word: tokenization, named entity
recognition (NER), POS. Sentence: chunking (VP/NP). Source-channel models.

TIME (linguistic unit processing)

Chinese Tokenization & NEI

English Chunking and POS Tagging

English Chunking and POS Tagging

Skeleton Parser: Skeleton == <subject V object>. Input: He is succeeded by Ivan Allen Jr. Output: [He] is succeeded by [Ivan Allen Jr.] (bracketed spans labeled Sub and Obj). More robust and faster than a traditional parser. Adequate for most applications: collocation checking, spell checking, grammar checking, QA, search.

Skeleton Parser: Key dependency relations: a set of the most important relations (e.g. subject, object, …); definition based on application. Our target: a robust and fast dependency extractor. Does not rely on high-quality (hand-annotated) training data. High efficiency in dealing with large-scale data (e.g. web data). Potential applications: information extraction, Q/A, TDT: who (subject-verb), whom (verb-object), what (adj-noun). Machine translation: skeleton translation. NL-based information retrieval. Cross-language IR: re-ranking by triple matching.

Proposed approach: diagram with the labels Shallow Parser, NLPWin Parser, Training Data, Parsed corpus, Raw corpus, Noise Filtering, Training, Input Sentence, PoS Tagging, Key Dependency Triples, Chunking.

The proposed approach: diagram with the labels Shallow Parser, NLPWin Parser, Training Data, Parsed corpus, Raw corpus, Noise Filtering, Training, Input Sentence, PoS Tagging, Key Dependency Triples, Chunking.

The proposed approach: diagram with the labels NLPWin Parser, Training Data, Parsed corpus, Raw corpus, Noise Filtering, Training, Input Sentence, PoS Tagging, Key Dependency Triples, Chunking, Feature Extraction, Classification.

Skeleton Parser

Skeleton Parser

Term Extraction: Candidate generation. Options: boundary determination, BaseNP, pattern filtering. Ranking. Diagram labels: Text, Term List, Terms. Options: term
frequency, TF-IDF, entropy reduction, ER-IDF.

Term Extraction

Term Extraction

Text Mining Roadmap: SQL Text Mining. Key technologies: metadata extraction, ranking algorithm, multi-language support. Text Miner. Meta Data for SharePoint. Information Desk.

Information Desk: http://msra-nlc-tm1

Slide38: http://msra-nlc-tm1/

Machine Translation Roadmap: Office. EWW. Key technologies: skeleton parser, collocation checker, paraphrase, knowledge acquisition, adaptive to new language pairs. Mobility. Search engine. Direction: template based; linguistic data acquisition from Web mining. TIME.

Slide47: EWW (English Writing Wizard). Features: idiomatic usages, synonymous collocation, collocation translations, bilingual example sentences. Technology highlights: automatic extraction of idiomatic usage, automatic extraction of synonymous collocation, automatic extraction of collocation translations, example sentence retrieval.
Idiomatic Usage: Objective: make your English writing as good as a native speaker's. Input: question.
question (Noun): Verb+question: raise ~, ask ~, resolve ~, pose ~. Adj+question: unanswered ~, serious ~, big ~, real ~.
question (Verb): question+Noun: ~ motive, ~ value, ~ truth, ~ boy. question+Adv: ~ intensely, ~ orally, ~ closely, ~ at_all. Adv+question: privately ~, cautiously ~, hardly ~.
Synonymous Collocation: attain~dobj~level, achieve~dobj~level; attract~dobj~fan, draw~dobj~fan; take~dobj~reins, assume~dobj~reins|hold~dobj~reins; bad~Intnsifs~extremely, risky~Intnsifs~extremely; unusual~Intnsifs~quite, unusual~Intnsifs~rather; vision~Attrib~unusual, sight~Attrib~unusual; Improve~Mod~greatly, Improve~Mod~considerably.
Collocation Translation: 克服~困难 (overcome ~ difficulty): conquer difficulty, overcome difficulty, master~difficulty, overcome~adversity, surmount~difficulty.

Web Search & Mining: Internet + Data + Information -> Search, Mining, Sharing, & Intelligence. Lots of text: text-based IR; text mining; semantic/structure mining. Media search:
Surrounding text; audio/video transcription. Make billions of dollars from trillions of words.

Information Retrieval: Text processing: tokenization; normalization (stemming, …). Precision/recall: beyond first-order statistics (TF-IDF); n-gram for adaptive indexing; better model of P(Doc|Query); classification vs. term frequency. Result summarization: query sensitive; U盘 (优盘) vs. 大拇哥 (different Chinese terms for a USB flash drive). Result clustering & classification.

Search Long Result List: A user searches for information about "jaguar", the Mac OS release. However, the relevant results are mixed with other pages. The user needs to go through a long list to find the desired information.

Clustering vs. Classification: Clustering results for "jaguar". Classification results for "jaguar".

Document Clustering & Sub-topic Identification: http://msra-idss-04:8080/prototype1 Search result grouping: overview of the returned documents; locate useful information quickly; word sense disambiguation.

Text Mining: New research area. Highly statistically based. TIME on the internet. Improving precision/recall. Title extraction: 10% improvement in ranking. XP Help & Support (support.microsoft.com). Aggregate TF from Newsgroup Support emails.

Text Mining: Location finder. Entity location: the physical address of the entity (e.g. organization, corporation, or person) owning the web resource; crucial for geographical web retrieval and navigation (Yellow Pages, map services). Content location: the location that the content of the web resource describes; crucial for location-based search and services. Context location: the geographical scope that the web resource reaches.
Crucial for B2C applications like local advertising and e-commerce.

Three Types of Page Locations

Distribution of Geographical Keywords: Demo

Text Mining: AskMSR: providing answers inline instead of links to answers. USPS, UPC, vehicle #s, product IDs, addresses, stock & financial #s, etc.

AskMSR: Leverage redundant web information. N-gram locator in results pages.

Semantic Mining: Beyond document retrieval. Web mining & knowledge discovery. Hierarchical clustering -> mining. From non-structure to structure: entity identification; relation discovery. Mining on relation graph: clustering multi-typed interrelated objects; ranking; graph evolving. Relation visualization: graph matching/morphing/embedding. http://msra-idss-04:8080/prototype1/(r0l5ivbnvijh4y45d5nyewee)/clustermain.aspx

Structure Paper Search

Relevant Term Mining: Search Term Suggestion (STS). Document terms may not match real queries. Cluster the query terms into semantic topics. Classify document terms into semantic topics. Rank the suggested terms by popularity. http://msra-mm650-06/demo Diagram labels: Web-page, Hyperlink, Query, Thesaurus, Query, Query Log.

Media Search: Rely mostly on text! Surrounding text mining/extraction. Transcription from ASR: audio/video, AIME. Result presentation: clustering/classification. Rely on text again! Image and keyword co-occurrence matrix.

Image Clustering: 1710 JPG images in 1287 pages were crawled within the website. Six categories: fish, bird, mammal, reptile, amphibian, insect.

Web Image Thesaurus: Example: coyote. Basic idea: use abundant annotated images on the Web as training data.

Media Search

Cross-lingual Information Access: diagram with the labels Chinese Query, Query Translation, English Query, Query Processing, Ontology, Search, Web Page, Chs.
Doc, Reading Assistant, Eng Docs, Search Engine, Query Translation, Reading Assistant.

Cross-lingual Information Access: Important for non-English surfers: access to English content; using English content for ranking. Web-based data acquisition: vast, noisy, parallel text.

Cross-Lingual Information Retrieval: 微软研究院 (Microsoft Research)

Cross-Lingual Reading Assistant: 量子计算 (quantum computing), 平板电脑 (tablet PC)

Cross-Lingual Summarization: Title: Talking computers nearing reality. Author: Michael Kanellos. Time: July 9, 2003. …… Summary: Microsoft on Wednesday released the first public beta of its Speech Server, which will let servers better handle oral commands.
Title: 说话的计算机临近现实. Author: Michael Kanellos. Time: 2003.7.9. …… Summary: 星期三微软发布了它的第一个说话服务器,将让服务器更好处理口头命令。 (In English: "On Wednesday Microsoft released its first speech server, which will let servers better handle spoken commands.")

Summary: Industry will continue to build products using speech, NL, IR, … and to hire people in speech, NL, IR. Requires more software to drive the market (quoting Barry Lam of Quanta). We should all care about these technologies.
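
The Term Extraction slide lists TF-IDF as one option for ranking candidate terms. The slides do not show how the ranking was implemented, so the following is only a minimal sketch of a standard TF-IDF score under stated assumptions: documents are already tokenized into candidate terms, the function name `tf_idf_rank` and the sample data are hypothetical, and the plain `tf * log(N / df)` weighting is one common variant, not necessarily the one used in TIME.

```python
import math
from collections import Counter


def tf_idf_rank(target_doc, background_docs):
    """Rank the terms of target_doc by TF-IDF against background_docs.

    Each document is a list of already-tokenized candidate terms
    (an assumption; the slides do not specify the input format).
    """
    docs = [target_doc] + list(background_docs)
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    # Term frequency inside the target document, normalized by its length.
    term_freq = Counter(target_doc)
    scores = {
        term: (count / len(target_doc)) * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
    }

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy data: "the" occurs in every document, so its IDF is zero.
target = ["speech", "recognition", "speech", "server", "the", "the"]
background = [["the", "web", "search"], ["the", "image", "search"]]
ranked = tf_idf_rank(target, background)

for term, score in ranked:
    print(f"{term}\t{score:.3f}")
```

In this toy run a term that is frequent in the target text but rare elsewhere ("speech") ranks first, while a term that appears in every document ("the") scores zero, which is the behaviour that makes TF-IDF a reasonable baseline before trying the other listed options such as entropy reduction.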