ROCLING04 1

Added: November 20, 2007 This Presentation is Public 
Presentation Category : Entertainment All Rights Reserved
Presentation Transcript

Why should I care about Computational Linguistics & Language Processing?: Hsiao-Wuen Hon (洪小文), Assistant Managing Director, Microsoft Research Asia


Agenda: Should I care?; Industry cares; Microsoft cares; Speech; NLP; Web Search & Mining; Summary: we should care


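Should I care?: Medical school: a "golden rice bowl" (金饭碗), i.e. a secure, well-paid career. Electronics: employee stock bonuses (配股), an easy way to become a millionaire; chip manufacturing (TSMC, UMC); hardware (Acer, Quanta, 鸿海 (Hon Hai), BenQ, 英业达 (Inventec), MiTac). NLP? Speech? IR? HWR?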


It is actually a good choice: People go on to have good careers; many applications (IR, HWR, investment banks, bioinformatics, …); many smart people; the software industry cares; not overproducing students


Industry Cares: People you might know. Academics: pillars of A.I.; well funded; Taiwan professors; overseas professors (V. Zue, B.H. Juang, F. Jelinek, M. Liberman, N. Chomsky, Michael Collins, Fernando Pereira, …)


Industry Cares: Industrial R&D labs. Executives: Kai-Fu Lee (MS), Qi Lu (Yahoo), … Microsoft: X.D. Huang, 洪小文 (Hsiao-Wuen Hon), 马维英 (Wei-Ying Ma), Eric Chang, 周明 (Ming Zhou), Eric Brill, Ken Church, …; continues hiring. Google: Speech (Amit Singhal, Michael Riley, … etc.), NL (Franz Och, Krishna Bharat, Dekang Lin, …); aggressively hiring. Others…


Industry Cares: Other applications. Renaissance Technologies: hedge fund management, 4 billion in assets; time-series prediction based on S&L technologies; a.k.a. the ex-IBM S&L group (P. Brown, R. Mercer, P. De Souza, L. Bahl, the Della Pietra brothers, …). Startups: Nuance, SpeechWorks, InfoTalk, iPhrase, Lexicus, …


Microsoft Cares: Bill Gates’ vision: PC on everyone’s desktop (’75); Information at your fingertips (’90); Seamless Computing (’03). S&L technologies are the key: billions of $ invested in S&L technologies; full-size S&L product & research groups; multi-lingual & multi-product; continued hiring; expanded investment due to search/Google


Information Agent: “Do what I mean”; “Find what I want”. Example: “How to turn on Firewall in Windows?” Speech recognition (signal to text); natural language understanding (syntax/semantics, domain knowledge); knowledge search; AI-complete


A Long Long Journey: Speech: ubiquitous interface; automatic speech recognition; text-to-speech. Natural language: spelling/grammar/style checking; IME; machine translation. Information retrieval & mining


Speech: SAPI 1.0 – 6.0; Windows Sound System in ’92; platform for building speech apps in Windows; accessibility support (Screen Reader); Office Dictation (Chinese, English); Microsoft Speech Server (telephony speech & multimodal platform); other: Encarta, WinCE/Smartphone, …


Speech: Speech


MSRA Speech: TTS: multi-lingual natural TTS. ASR: Chinese LVCSR (dictation/telephony/embedded). Fundamental research. AIME (Audio Info. Management & Extraction): audio/video file indexing/retrieval; offline transcription/extraction/summarization. More in Eric’s keynote tomorrow: “From the Lab to Ubiquity: Speech Technology's Road to Mainstream”


NLP Contributes to MS Products: IME (Chinese, Japanese, …); spelling/grammar checking; spam filtering; English Writing Wizard (EWW); spoken language interface; IR and CLIR; text mining; machine translation; search engine; QA (AskMSR); SLM for speech; text analysis for TTS; …


NLP “Rainbow”: [Diagram labels: Source Text, Target Text; Analysis, Generation; Morphology, Syntax, Logical Form, Discourse (on both the analysis and generation sides); Dictionary, Knowledge base; Word Breaking, Grammar Checking, Machine Translation, Transfer, Understanding]


NLP at MSRA: Research; linguistic resources; applications


NLP at MSRA: TIME; email routing; spam filtering; resume routing; support routing; EWW; translation


TIME Platform: Text Information Management & Extraction. Goal: extract information from text data. Genres: email, newspaper, report, web pages. Formats: Word document, PDF/PS, HTML/XML. Languages: English, Chinese, Japanese, … Applications: search, question answering, data mining, machine translation


TIME Components: Linguistic processing -> TIME linguistic platform. Text normalization: sentence splitting, tokenization, morphological analysis. Entity extraction: person name, company name, time expression, phrases. Relation learning: syntactic/semantic dependencies between entities. Information extraction: document property extraction (title, author, key term, summary); domain knowledge extraction (concept, concept relation, glossary, taxonomy, event). Cross-lingual information exchange: translation at word, entity, term, skeleton, and text levels; reading, writing, cross-language information retrieval


TIME Demo: TIME Demo


Multi-lingual linguistic unit processing: Word: tokenization, named entity recognition (NER), POS. Sentence: chunking (VP/NP). Approach: source-channel models
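
The formula that followed "source-channel models" on this slide did not survive in the transcript. As a reminder of what the term usually means, the generic textbook (noisy-channel) form is sketched below; it is an assumption and not necessarily the exact model shown in the talk. Here S is the observed input string and W is a candidate sequence of linguistic units (words, entities, tags), with P(W) the source model and P(S | W) the channel model. Both symbols are introduced here for illustration only.

```latex
\hat{W} = \arg\max_{W} P(W \mid S) = \arg\max_{W} P(W)\, P(S \mid W)
```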


TIME (linguistic unit processing): TIME (linguistic unit processing)


Chinese Tokenization & NEI: Chinese Tokenization & NEI


English Chunking and POS Tagging: English Chunking and POS Tagging


English Chunking and POS Tagging: English Chunking and POS Tagging


Skeleton Parser: Skeleton == Input: "He is succeeded by Ivan Allen Jr." Output: "[He] is succeeded by [Ivan Allen Jr.]" (labels: Sub, Obj). More robust & faster than a traditional parser. Adequate for most applications: collocation checking, spell checking, grammar checking, QA, search


Skeleton Parser: Key dependency relations: a set of the most important relations (e.g. subject, object, …); definition based on application. Our target: a robust & fast dependency extractor that does not rely on high-quality (hand-annotated) training data and is highly efficient on large-scale data (e.g. web data). Potential applications: information extraction, Q/A, TDT: Who (Subject-Verb), Whom (Verb-Object), What (Adj-Noun); machine translation: skeleton translation; NL-based information retrieval: cross-language IR, re-ranking by triple matching
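
To make "re-ranking by triple matching" concrete, here is a minimal Python sketch. It assumes dependency triples have already been extracted for the query and for each retrieved document (the extractor itself is not shown), and it simply promotes documents that share more triples with the query. The function name, the triple format, and the tie-breaking on the original score are all illustrative assumptions, not details taken from the talk.

```python
def rerank_by_triples(query_triples, docs):
    """Re-rank retrieved documents by dependency-triple overlap with the query.

    query_triples: iterable of (head, relation, dependent) tuples (assumed format).
    docs: list of (doc_id, base_score, triples) from a first-pass retrieval.
    Returns doc ids, best first. Ties on overlap fall back to base_score.
    """
    wanted = set(query_triples)

    def score(doc):
        _doc_id, base_score, triples = doc
        overlap = len(wanted & set(triples))
        return (overlap, base_score)

    return [doc[0] for doc in sorted(docs, key=score, reverse=True)]


# Hypothetical data: the query asks about a company acquiring a startup.
query = [("company", "subj-verb", "acquire"), ("acquire", "verb-obj", "startup")]
results = [
    ("d1", 0.9, [("startup", "subj-verb", "acquire")]),
    ("d2", 0.7, [("company", "subj-verb", "acquire"),
                 ("acquire", "verb-obj", "startup")]),
    ("d3", 0.8, [("company", "subj-verb", "acquire")]),
]
ranking = rerank_by_triples(query, results)
```

In this toy data, d1 has the highest first-pass score but gets the roles backwards (the startup does the acquiring), so triple matching moves it below the documents whose triples agree with the query.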


Proposed approach: [Diagram labels: Shallow Parser; NLPWin Parser; Raw corpus; Parsed corpus; Noise Filtering; Training Data; Training; Input Sentence; PoS Tagging; Chunking; Key Dependency Triples]


The proposed approach: [Diagram labels: Shallow Parser; NLPWin Parser; Raw corpus; Parsed corpus; Noise Filtering; Training Data; Training; Input Sentence; PoS Tagging; Chunking; Key Dependency Triples]


The proposed approach: [Diagram labels: NLPWin Parser; Raw corpus; Parsed corpus; Noise Filtering; Training Data; Training; Input Sentence; PoS Tagging; Chunking; Feature Extraction; Classification; Key Dependency Triples]


Skeleton Parser: Skeleton Parser


Skeleton Parser: Skeleton Parser


Term Extraction: Text -> Candidate Generation (options: boundary determination, BaseNP, pattern filtering) -> Term List -> Ranking (options: term frequency, TF-IDF, entropy reduction, ER-IDF) -> Terms
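
The ranking step can be illustrated with the TF-IDF option. The Python sketch below is a simplified stand-in, not the system described in the talk: candidates are single tokens rather than BaseNP chunks, each term's score is its length-normalized term frequency times log(N/df) summed over documents, and the function name and parameters are made up for this example.

```python
import math
from collections import Counter


def rank_terms(documents, top_k=5):
    """Rank candidate terms by TF-IDF summed over a small corpus.

    documents: list of token lists (tokens stand in for term candidates).
    Returns the top_k terms, highest score first.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    scores = Counter()
    for tokens in documents:
        if not tokens:
            continue  # avoid dividing by zero on empty documents
        for term, count in Counter(tokens).items():
            idf = math.log(n_docs / doc_freq[term])
            scores[term] += (count / len(tokens)) * idf
    return [term for term, _ in scores.most_common(top_k)]


corpus = [
    ["the", "parser", "the", "triple"],
    ["the", "search", "engine"],
    ["the", "parser", "parser"],
]
top_terms = rank_terms(corpus, top_k=4)
```

A word such as "the" that occurs in every document gets an IDF of zero, so it drops to the bottom of the ranking no matter how frequent it is.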


Term Extraction: Term Extraction


Term Extraction: Term Extraction


Text Mining Roadmap: SQL Text Mining; key technologies (metadata extraction, ranking algorithm, multi-language support); Text Miner; metadata for SharePoint; Information Desk


Information Desk: Information Desk http://msra-nlc-tm1


Slide38: http://msra-nlc-tm1/


Machine Translation Roadmap: Office; EWW; key technologies (skeleton parser, collocation checker, paraphrase, knowledge acquisition, adaptive to new language pairs); Mobility; Search Engine; direction (template based, linguistic data acquisition from Web mining); TIME


Slide47: EWW (English Writing Wizard). Objective: make your English writing as good as a native speaker's. Features: idiomatic usages; synonymous collocations; collocation translations; bilingual example sentences. Technology highlights: automatic extraction of idiomatic usage, synonymous collocations, and collocation translations; example sentence retrieval. Idiomatic usage, input "question": question (Noun): Verb+question: raise ~, ask ~, resolve ~, pose ~; Adj+question: unanswered ~, serious ~, big ~, real ~. question (Verb): question+Noun: ~ motive, ~ value, ~ truth, ~ boy; question+Adv: ~ intensely, ~ orally, ~ closely, ~ at_all; Adv+question: privately ~, cautiously ~, hardly ~. Synonymous collocation: attain~dobj~level <-> achieve~dobj~level; attract~dobj~fan <-> draw~dobj~fan; take~dobj~reins <-> assume~dobj~reins|hold~dobj~reins; bad~Intnsifs~extremely <-> risky~Intnsifs~extremely; unusual~Intnsifs~quite <-> unusual~Intnsifs~rather; vision~Attrib~unusual <-> sight~Attrib~unusual; Improve~Mod~greatly <-> Improve~Mod~considerably. Collocation translation: 克服~困难 ("overcome difficulty"): conquer difficulty, overcome difficulty, master~difficulty, overcome~adversity, surmount~difficulty


Web Search & Mining: Internet + Data + Information -> Search, Mining, Sharing, & Intelligence. Lots of text: text-based IR; text mining; semantic/structure mining. Media search: surrounding text; audio/video transcription. Make billions of $ from trillions of words


Information Retrieval: Text processing: tokenization; normalization (stemming, …). Precision/recall. Beyond first-order statistics (TF-IDF): N-gram for adaptive indexing; better model of P(Doc|Query); classification vs. term frequency. Result summarization: query sensitive; U盘 (优盘) vs. 大拇哥 (different Chinese terms for a USB flash drive). Result clustering & classification
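
For readers less familiar with the precision/recall measures named on this slide, here is a minimal Python sketch of the standard set-based definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The function name and the document ids are illustrative only.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, using set-based definitions."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments for a single query.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
```

Here two of the four retrieved documents are relevant and two of the three relevant documents were found, which shows why the two numbers usually have to be traded off against each other.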


Search Long Result List: A user searches for information about “jaguar”, the Mac OS release. However, the relevant results are mixed with other pages, so the user needs to go through a long list to find the desired information


Clustering vs. Classification: Clustering vs. Classification Clustering Results for “jaguar” Classification Results for “jaguar”


Document Clustering & Sub-topic Identification: Search result grouping: overview of the returned documents; locate useful information quickly; word sense disambiguation. http://msra-idss-04:8080/prototype1


Text Mining: New research area; highly statistically based; TIME on the internet. Improving precision/recall: title extraction (10% improvement in ranking); XP Help & Support (support.microsoft.com); aggregate TF from newsgroups and support emails


Text Mining: Location finder. Entity location: the physical address of the entity (e.g. organization, corporation, or person) owning the web resource; crucial for geographical web retrieval and navigation (Yellow Pages, map services). Content location: the location that the content of the web resource is about; crucial for location-based search & services. Context location: the geographical scope that the web resource reaches; crucial for B2C applications like local advertisement and e-commerce


Three Types of Page Locations: Three Types of Page Locations


Distribution of Geographical Keywords: Distribution of Geographical Keywords Demo


Text Mining: AskMSR: providing answers inline instead of links to answers (USPS, UPC, vehicle #s, product IDs, addresses, stock & financial #s, etc.)


AskMSR: Leverage redundant web information; N-gram locator in results pages
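
The idea of using redundancy with an n-gram locator can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the AskMSR implementation: the snippets stand in for search result summaries, each snippet casts one vote per n-gram it contains, and n-grams containing query words or a small hand-picked stopword list are ignored. All names, the stopword list, and the voting scheme are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the talk).
STOPWORDS = {"a", "an", "the", "is", "of", "in", "s"}


def mine_answer_ngrams(snippets, query, max_n=3, top_k=3):
    """Return the n-grams that recur across the most result snippets."""
    ignored = STOPWORDS | set(re.findall(r"\w+", query.lower()))
    votes = Counter()
    for snippet in snippets:
        tokens = re.findall(r"\w+", snippet.lower())
        seen = set()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                if any(tok in ignored for tok in gram):
                    continue  # skip n-grams with query words or stopwords
                seen.add(gram)
        votes.update(seen)  # one vote per snippet, so redundancy drives rank
    return [" ".join(gram) for gram, _ in votes.most_common(top_k)]


snippets = [
    "Paris is the capital of France",
    "The capital of France is Paris",
    "France's capital, Paris, is a major city",
]
best = mine_answer_ngrams(snippets, "capital of France", top_k=1)
```

Because the same answer string tends to be repeated across many pages, counting snippets rather than raw occurrences lets the most widely attested candidate rise to the top without deep linguistic analysis.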


Semantic Mining: Beyond document retrieval: web mining & knowledge discovery; hierarchical clustering -> mining; from unstructured to structured data; entity identification; relation discovery. Mining on the relation graph: clustering multi-typed interrelated objects; ranking; graph evolving; relation visualization; graph matching/morphing/embedding. http://msra-idss-04:8080/prototype1/(r0l5ivbnvijh4y45d5nyewee)/clustermain.aspx


Structure Paper Search: Structure Paper Search


Relevant Term Mining: Search Term Suggestion (STS): document terms may not match real queries; cluster the query terms into semantic topics; classify document terms into semantic topics; rank the suggested terms by popularity. http://msra-mm650-06/demo [Diagram labels: Web page; Hyperlink; Query; Thesaurus; Query Log]


Media Search: Rely mostly on text! Surrounding text mining/extraction; transcription from ASR (audio/video, AIME). Result presentation: clustering/classification; rely on text again! Image and keyword co-occurrence matrix
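
One way to picture an image and keyword co-occurrence matrix is as a sparse table of keyword counts per image, built from the text surrounding each image, with images compared by cosine similarity between their rows. The Python sketch below illustrates only that idea; the helper names and the sample captions are hypothetical, and the talk does not say how the actual matrix was built or used.

```python
import math
import re
from collections import Counter


def build_cooccurrence(surrounding_text):
    """Map each image id to keyword counts taken from its surrounding text."""
    return {
        image_id: Counter(re.findall(r"[a-z]+", text.lower()))
        for image_id, text in surrounding_text.items()
    }


def cosine(row_a, row_b):
    """Cosine similarity between two sparse keyword-count rows."""
    dot = sum(row_a[k] * row_b[k] for k in row_a.keys() & row_b.keys())
    norm = (math.sqrt(sum(v * v for v in row_a.values()))
            * math.sqrt(sum(v * v for v in row_b.values())))
    return dot / norm if norm else 0.0


# Hypothetical surrounding text for three crawled images.
matrix = build_cooccurrence({
    "img1.jpg": "A red salmon fish swimming",
    "img2.jpg": "Salmon fish in a river",
    "img3.jpg": "An eagle bird in flight",
})
fish_pair = cosine(matrix["img1.jpg"], matrix["img2.jpg"])
mixed_pair = cosine(matrix["img1.jpg"], matrix["img3.jpg"])
```

The two fish images share keywords and score well above the fish/bird pair, which is the signal a text-driven image clustering step could build on.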


Image Clustering: 1710 JPG images in 1287 pages were crawled from the website. Six categories: fish, bird, mammal, reptile, amphibian, insect


Web Image Thesaurus: Basic idea: use abundant annotated images on the Web as training data (example: coyote)


Media Search: Media Search


Cross-lingual Information Access: [Diagram labels: Chinese query; Query Translation; English query; Query Processing; Ontology; Search Engine; Web page; English docs; Reading Assistant; Chinese doc]


Cross-lingual Information Access: Important for non-English-speaking surfers: access to English content; using English content for ranking. Web-based data acquisition: vast, noisy parallel text


Cross-Lingual Information Retrieval: 微软研究院 (Microsoft Research)


Cross-Lingual Reading Assistant: 量子计算 (quantum computing); 平板电脑 (tablet PC)


Cross-Lingual Summarization: English: Title: Talking computers nearing reality; Author: Michael Kanellos; Time: July 9, 2003; …… Summary: Microsoft on Wednesday released the first public beta of its Speech Server, which will let servers better handle oral commands. Chinese: Title: 说话的计算机临近现实 ("Talking computers nearing reality"); Author: Michael Kanellos; Time: 2003.7.9; …… Summary: 星期三微软发布了它的第一个说话服务器,将让服务器更好处理口头命令。 ("On Wednesday Microsoft released its first Speech Server, which will let servers better handle oral commands.")


Summary: Industry will continue building products using speech, NL, IR, … and hiring people in speech, NL, IR. "Require more software to drive market" (quoted from Barry Lam of Quanta). We should all care about these technologies