Why should I care about Computational Linguistics & Language Processing?: Why should I care about Computational Linguistics & Language Processing? Hsiao-Wuen Hon
洪小文
Assistant Managing Director
Microsoft Research Asia
Agenda : Agenda Should I care?
Industry cares
Microsoft cares
Speech
NLP
Web Search & Mining
Summary - we should care
Should I care?: Should I care? Medical school
“Golden rice bowl” (a secure, well-paid job)
Electronics
Stock grants (employee stock bonuses)
Easy way to become a millionaire
Chip manufacture
TSMC, UMC
Hardware
Acer, Quanta, Hon Hai, BenQ, Inventec, MiTac
NLP? Speech? IR? HWR?
It is actually a good choice: It is actually a good choice People go on to have good careers
Many applications
IR, HWR
Investment banks
Bioinformatics
…..
With many smart people
Software Industry cares
Not overproducing students
Industry Cares: Industry Cares People you might know
Academics
Pillars of A.I.
Well funded
Taiwan professors
Overseas professors
V. Zue, B.H. Juang, F. Jelinek, M. Liberman, N. Chomsky, Michael Collins, Fernando Pereira …
Industry Cares: Industry Cares Industrial R&D Labs
Executives
Kai-Fu Lee (MS), Qi Lu (Yahoo), …
Microsoft
X.D. Huang, Hsiao-Wuen Hon, Wei-Ying Ma, Eric Chang, Ming Zhou, Eric Brill, Ken Church, …
Continue hiring
Google
Speech – Amit Singhal, Michael Riley, …
NL – Franz Och, Krishna Bharat, Dekang Lin, …
Aggressively hiring
Others…
Industry Cares: Industry Cares Other applications
Renaissance Technologies
Hedge fund management – 4 billion in assets
Time-series prediction based on S&L (speech & language) technologies
a.k.a. the ex-IBM S&L group
P. Brown, R. Mercer, P. De Souza, L. Bahl, Della Pietra brothers, …
Startups
Nuance, SpeechWorks, InfoTalk, iPhrase, Lexicus, …
Microsoft Cares : Microsoft Cares Bill Gates’ vision
PC on everyone’s desktop (’75)
Information at your fingertips (’90)
Seamless Computing (’03)
S&L technologies are the key
Billions of dollars invested in S&L technologies
Full-size S&L product & research groups
Multi-lingual & multi-products
Continue hiring
Expanded investment due to search/Google
Information Agent: Information Agent “Do what I mean”
“Find what I want”
How to turn on Firewall in Windows?
Speech recognition
Signal to text
Natural language understanding
Syntax/semantics
Domain knowledge
Knowledge search
AI-Complete
A Long Long Journey: A Long Long Journey Speech
Ubiquitous interface
Automatic Speech Recognition
Text-to-Speech
Natural Language
Spelling/grammar/style checking
IME
Machine translation
Information Retrieval & Mining
Speech: Speech SAPI 1.0 – 6.0
Windows Sound System in ’92
Platform for building speech applications in Windows
Accessibility support (Screen Reader)
Office Dictation
Chinese, English
Microsoft Speech Server
Telephony speech & multimodal platform
Other – Encarta, WinCE/Smartphone…
Speech: Speech
MSRA Speech: MSRA Speech TTS – multi-lingual natural TTS
ASR
Chinese LVCSR - dictation/telephony/embedded
Fundamental research
AIME: Audio Info. Management & Extraction
Audio/video file indexing/retrieval
Offline transcription/extraction/summarization
More in Eric’s keynote tomorrow
From the Lab to Ubiquity: Speech Technology's Road to Mainstream
NLP Contributes to MS Products: NLP Contributes to MS Products IME (Chinese, Japanese, …)
Spelling/grammar checking
Spam filtering
English Writing Wizard (EWW)
Spoken language interface
IR and CLIR
Text mining
Machine translation
Search engine
QA (AskMSR)
SLM for Speech
Text analysis for TTS
…..
NLP “Rainbow”: NLP “Rainbow” [Diagram labels. Analysis, from Source Text up to Understanding: Word Breaking, Dictionary, Morphology, Syntax, Logical Form, Discourse. Generation, down to Target Text: Discourse, Logical Form, Syntax, Morphology, Dictionary. Other labels: Knowledge base, Transfer, Grammar Checking, Machine Translation.]
NLP at MSRA: NLP at MSRA Research, Linguistic Resources, Applications
NLP at MSRA: NLP at MSRA TIME
Email Routing
Spam filtering
Resume routing
Support routing
EWW
Translation
TIME Platform: TIME Platform Text Information Management & Extraction
Goal: extract information from text data
genres: email, newspaper, report, web pages
formats: Word document, PDF/PS, HTML/XML
languages: English, Chinese, Japanese, …
Applications: search, question answering, data mining, machine translation
TIME Components: TIME Components Linguistic processing TIME linguistic platform
Text normalization: sentence splitting, tokenization, morphological analysis
Entity extraction: person name, company name, time expression, phrases
Relation learning: syntactic/semantic dependencies between entities
Information extraction
Document property extraction: title, author, key term, summary
Domain knowledge extraction: concept, concept relation, glossary, taxonomy, event
Cross-lingual information exchange
Translation at word, entity, term, skeleton, text levels
Reading, writing, cross language information retrieval
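The text normalization step listed above (sentence splitting and tokenization) can be sketched as follows. This is a minimal, rule-based illustration for English only, not the TIME platform's implementation; the function names `split_sentences` and `tokenize` and the regular expressions are assumptions made for this example. Chinese text has no spaces between words and would need a dedicated word segmenter instead.

```python
import re

def split_sentences(text):
    # Naive rule (an assumption): a sentence ends at ., ! or ? when the
    # next non-space character is an uppercase letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    # A token is a word (internal apostrophes/hyphens allowed),
    # a number (optionally with a decimal part), or one punctuation mark.
    pattern = r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+(?:\.\d+)?|[^\w\s]"
    return re.findall(pattern, sentence)

text = "Microsoft released Speech Server. It handles spoken commands!"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

A rule this simple breaks on abbreviations such as "Jr." followed by a capitalized word, which is one reason real systems treat sentence splitting as a learned or lexicon-backed step.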
TIME Demo: TIME Demo
Multi-lingual linguistic unit processing: Multi-lingual linguistic unit processing Word
Tokenization
Named entity recognition (NER)
POS
Sentence
Chunking (VP/NP)
Source-channel models:
TIME (linguistic unit processing): TIME (linguistic unit processing)
Chinese Tokenization & NEI: Chinese Tokenization & NEI
English Chunking and POS Tagging: English Chunking and POS Tagging
Skeleton Parser: Skeleton Parser Skeleton ==
Input: He is succeeded by Ivan Allen Jr.
Output: [He] is succeeded by [Ivan Allen Jr.] (labels: Sub, Obj)
More robust & faster than a traditional parser
Adequate for most applications
Collocation checking, Spell checking, Grammar checking, QA, Search
Skeleton Parser: Skeleton Parser Key Dependency Relations
A set of most important relations (e.g. subject, object…)
Definition based on application
Our Target: A Robust & Fast Dependency Extractor
Does not rely on high-quality (hand-annotated) training data
Highly efficient on large-scale data (e.g. web data)
Potential Applications
Information Extraction, Q/A, TDT
Who (Subject-Verb), Whom (Verb-object), What (Adj-Noun)
Machine translation
Skeleton translation
NL-based Information Retrieval
Cross-Language IR
Re-ranking by triple matching
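One way to picture "re-ranking by triple matching" is to count how many key dependency triples a document shares with the query and sort on that count first. The sketch below assumes triples are already extracted and represented as `(head, relation, dependent)` tuples; that representation, the function name `rerank_by_triples`, and the sample data are all hypothetical, and the slides do not specify the actual scoring used.

```python
def rerank_by_triples(query_triples, docs):
    """Re-rank documents by the number of triples shared with the query.

    docs is a list of (doc_id, base_score, triples) entries, where
    base_score is the original retrieval score (used as a tie-breaker).
    """
    query_set = set(query_triples)
    scored = []
    for doc_id, base_score, triples in docs:
        matches = len(query_set & set(triples))
        scored.append((matches, base_score, doc_id))
    # More matching triples first; fall back to the original score.
    scored.sort(key=lambda item: (-item[0], -item[1]))
    return [doc_id for _, _, doc_id in scored]

# Hypothetical query and documents.
query = [("buy", "Sub", "Microsoft"), ("buy", "Obj", "company")]
docs = [
    ("d1", 0.9, {("buy", "Sub", "Google")}),
    ("d2", 0.5, {("buy", "Sub", "Microsoft"), ("buy", "Obj", "company")}),
    ("d3", 0.7, {("buy", "Sub", "Microsoft")}),
]
print(rerank_by_triples(query, docs))
```

Here the keyword-only ranking would favor d1, while triple matching promotes d2 because it agrees with the query on who did what to whom.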
Proposed approach: Proposed approach [Diagram labels. Training: Raw corpus, NLPWin Parser, Parsed corpus, Noise Filtering, Training Data, Training. Shallow Parser: Input Sentence, PoS Tagging, Chunking, Key Dependency Triples.]
The proposed approach: The proposed approach [Same diagram with the Shallow Parser stage expanded: Input Sentence, PoS Tagging, Chunking, Feature Extraction, Classification, Key Dependency Triples.]
Skeleton Parser: Skeleton Parser
Term Extraction: Term Extraction Pipeline: Text -> Candidate Generation -> Term List -> Ranking -> Terms
Candidate Generation options:
Boundary determination
BaseNP
Pattern filtering
Ranking options:
Term frequency
TF-IDF
Entropy reduction
ER-IDF
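Of the ranking options above, TF-IDF is the easiest to show in a few lines. The sketch below ranks candidate terms from one document against a small background corpus. It is an illustration only: the function name `rank_terms_tfidf`, the smoothed IDF variant, and the toy corpus are assumptions, and the slides do not say which TF-IDF formula the system uses.

```python
import math
from collections import Counter

def rank_terms_tfidf(doc_terms, corpus):
    """Rank candidate terms of one document by TF-IDF, best first.

    doc_terms: candidate terms extracted from the document (with repeats).
    corpus: list of term lists, one per document, used for document frequency.
    """
    tf = Counter(doc_terms)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for doc in corpus if term in doc)
        # Smoothed IDF (one common variant, chosen here as an assumption).
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = count * idf
    return sorted(scores, key=lambda t: (-scores[t], t))

# Toy corpus of candidate terms (hypothetical).
corpus = [
    ["speech server", "the company", "speech server"],
    ["the company", "search engine"],
    ["the company", "machine translation"],
]
print(rank_terms_tfidf(corpus[0], corpus))
```

A candidate that appears in every document ("the company") gets the lowest IDF and sinks in the ranking, while a candidate concentrated in one document rises.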
Term Extraction: Term Extraction
Text Mining Roadmap: Text Mining Roadmap SQL Text Mining Key technologies
Metadata extraction
Ranking algorithm
Multi-languages support
Text Miner, Meta Data for SharePoint, Information Desk
Information Desk: Information Desk http://msra-nlc-tm1
Machine Translation Roadmap: Machine Translation Roadmap Office EWW Key technologies
Skeleton parser
Collocation checker
Paraphrase
Knowledge acquisition
Adaptive to new language pairs
Mobility
Search Engine
Direction
Template based
Linguistic data acquisition from Web mining
TIME
Slide47: EWW (English Writing Wizard) Features
Idiomatic usages
Synonymous collocation
Collocation translations
Bilingual example sentences
Technology Highlights
Auto extraction of idiomatic usage
Auto extraction of synonymous collocation
Auto extraction of collocation translations
Example sentence retrieval
Objectives
Make your English writing as good as a native speaker's
Idiomatic Usage
Input: question
question (Noun)
Verb+question: raise ~, ask ~, resolve ~, pose ~
Adj+question: unanswered ~, serious ~, big ~, real ~
question (Verb)
question+Noun: ~ motive, ~ value, ~ truth, ~ boy
question+Adv: ~ intensely, ~ orally, ~ closely, ~ at_all
Adv+question: privately ~, cautiously ~, hardly ~
Synonymous Collocation
attain~dobj~level = achieve~dobj~level
attract~dobj~fan = draw~dobj~fan
take~dobj~reins = assume~dobj~reins | hold~dobj~reins
bad~Intnsifs~extremely = risky~Intnsifs~extremely
unusual~Intnsifs~quite = unusual~Intnsifs~rather
vision~Attrib~unusual = sight~Attrib~unusual
Improve~Mod~greatly = Improve~Mod~considerably
Collocation Translation
克服~困难 (overcome~difficulty): conquer~difficulty, overcome~difficulty, master~difficulty, overcome~adversity, surmount~difficulty
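The `head~relation~dependent` notation used for collocations on this slide lends itself to a simple index. The sketch below parses such strings and groups heads by relation and dependent, which is enough to print a line in the "Verb+question: raise ~, ask ~" style. The helper names and the sample triples are assumptions for illustration (the slide lists the verbs for "question" but does not show them in triple form), and nothing here reflects how EWW extracts or ranks collocations.

```python
from collections import defaultdict

def parse_triple(text):
    # "attain~dobj~level" -> ("attain", "dobj", "level")
    head, relation, dependent = text.split("~")
    return head, relation, dependent

def build_index(triples):
    """Map (relation, dependent) to the list of heads seen with it."""
    index = defaultdict(list)
    for text in triples:
        head, relation, dependent = parse_triple(text)
        index[(relation, dependent)].append(head)
    return index

def usage_line(index, relation, dependent, label):
    heads = index.get((relation, dependent), [])
    return label + ": " + ", ".join(head + " ~" for head in heads)

# Hypothetical extracted triples.
triples = [
    "raise~dobj~question",
    "ask~dobj~question",
    "pose~dobj~question",
    "attain~dobj~level",
]
index = build_index(triples)
print(usage_line(index, "dobj", "question", "Verb+question"))
```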
Web Search & Mining: Web Search & Mining Internet + Data + Information -> Search, Mining, Sharing, & Intelligence
Lots of text
Text-based IR
Text Mining
Semantic/Structure Mining
Media Search
Surrounding text
Audio/video transcription
Make billions of dollars from trillions of words
Information Retrieval: Information Retrieval Text Processing
Tokenization
Normalization – stemming, …
Precision/Recall
Beyond 1st order statistics (TF-IDF)
N-gram for adaptive indexing
Better model of P(Doc|Query)
Classification vs. term frequency
Result Summarization
Query sensitive
U盘 (优盘) vs. 大拇哥: different regional Chinese words for a USB flash drive
Result clustering & classification
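Precision and recall, the two evaluation measures named on this slide, can be computed from a retrieved set and a relevant set. The sketch below uses the standard definitions; the function name and the document IDs are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# Hypothetical result list and relevance judgments.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).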
Search Long Result List: Search Long Result List A user searches for information about “jaguar”, the Mac OS release
However, the relevant results are mixed with other pages
The user needs to go through a long list to find the desired information
Clustering vs. Classification: Clustering vs. Classification Clustering Results for “jaguar” Classification Results for “jaguar”
Document Clustering & Sub-topic Identification: Document Clustering & Sub-topic Identification http://msra-idss-04:8080/prototype1 Search Result Grouping
Overview of the returned documents
Locate useful information quickly
Word sense disambiguation
Text Mining: Text Mining New research area
Highly statistically based
TIME on internet
Improving Precision/Recall
Title Extraction
10% improvement in ranking
XP Help & Support (support.microsoft.com)
Aggregate TF from
Newsgroup
Support emails
Text Mining: Text Mining Location finder
Entity location
The physical address of the entity (e.g. organization, corporation or person) that owns the web resource
Crucial for geographical web retrieval and navigation
Yellow Pages, map services
Content location
The location that the content of the web resource describes.
Crucial for location based search & services
Context location
The geographical scope that the web resource reaches.
Crucial for B2C applications like local advertisement and e-commerce.
Three Types of Page Locations: Three Types of Page Locations
Distribution of Geographical Keywords: Distribution of Geographical Keywords Demo
Text Mining: Text Mining AskMSR
Providing Answers inline instead of links to answers
USPS, UPC, Vehicle #s, Product IDs, Addresses, Stock & financial #s, etc…
AskMSR: AskMSR Leverage redundant web information
N-gram locator in results pages
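The idea of leveraging redundancy can be sketched as n-gram voting over result snippets: a string that appears in many independent snippets for the same question is a likely answer. The code below is a deliberately simplified illustration, not the AskMSR system; the stopword list, the one-vote-per-snippet rule, the tie-break toward longer n-grams, and the sample snippets are all assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the slides).
STOPWORDS = {"the", "a", "of", "in", "by", "and", "was", "who"}

def vote_ngrams(snippets, query, max_n=3):
    """Count, per snippet, every n-gram free of query words and stopwords."""
    query_words = set(re.findall(r"\w+", query.lower()))
    votes = Counter()
    for snippet in snippets:
        words = re.findall(r"\w+", snippet.lower())
        seen = set()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                if any(w in query_words or w in STOPWORDS for w in gram):
                    continue
                seen.add(gram)
        votes.update(seen)  # one vote per snippet, however often it repeats
    return votes

def best_answer(snippets, query):
    votes = vote_ngrams(snippets, query)
    if not votes:
        return None
    # Most votes wins; among equals, prefer the longer n-gram.
    gram = max(votes, key=lambda g: (votes[g], len(g)))
    return " ".join(gram)

# Made-up snippets standing in for search result text.
snippets = [
    "Microsoft was founded by Bill Gates and Paul Allen",
    "Bill Gates founded Microsoft in 1975",
    "the company Bill Gates started",
]
print(best_answer(snippets, "who founded Microsoft"))
```

Excluding query words matters: without that filter, the most frequent n-grams would simply echo the question back.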
Semantic Mining: Semantic Mining Beyond document retrieval
Web mining & knowledge discovery
Hierarchical clustering -> Mining
From non-structure to structure
Entity Identification
Relation Discovery
Mining on relation graph
Clustering Multi-typed Interrelated Objects
Ranking
Graph Evolving
Relation visualization
Graph Matching/Morphing/embedding
http://msra-idss-04:8080/prototype1/(r0l5ivbnvijh4y45d5nyewee)/clustermain.aspx
Structure Paper Search: Structure Paper Search
Relevant Term Mining: Relevant Term Mining Search Term Suggestion (STS)
Document terms may not match real queries
Cluster the query terms into semantic topics
Classify document terms into semantic topics
Rank the suggested terms by popularity
http://msra-mm650-06/demo
[Diagram labels: Web page, Hyperlink, Query Thesaurus, Query, Query Log]
Media Search: Media Search Rely mostly on Text!
Surrounding text mining/extraction
Transcription from ASR
Audio/Video
AIME
Result presentation
Clustering/classification
Rely on text again!
Image and keyword co-occurrence matrix
Image Clustering: Image Clustering 1710 JPG images in 1287 pages were crawled from the website
Six categories: Fish, Bird, Mammal, Reptile, Amphibian, Insect
Web Image Thesaurus: Web Image Thesaurus Basic idea: use abundant annotated images on the Web as training data (example keyword: coyote)
Media Search: Media Search
Cross-lingual Information Access: Cross-lingual Information Access [Diagram labels: Chinese Query, Query Translation, English Query, Query Processing, Ontology, Search Engine, Web Page, Eng Docs, Reading Assistant, Chs. Doc]
Cross-lingual Information Access: Cross-lingual Information Access Important for non-English-speaking web users
Access to English content
Using English content for ranking
Web-based Data Acquisition
Vast
Noisy
Parallel text
Cross-Lingual Information Retrieval: Cross-Lingual Information Retrieval Example query: 微软研究院 (Microsoft Research)
Cross-Lingual Reading Assistant: Cross-Lingual Reading Assistant Example terms: 量子计算 (quantum computing), 平板电脑 (tablet PC)
Cross-Lingual Summarization: Cross-Lingual Summarization Title: Talking computers nearing reality
Author: Michael Kanellos
Time: July 9, 2003
……
Summary:
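Microsoft on Wednesday released the first public beta of its Speech Server, which will let servers better handle oral commands.
Chinese output:
Title: 说话的计算机临近现实 (Talking computers nearing reality)
Author: Michael Kanellos
Time: 2003.7.9
……
Summary:
星期三微软发布了它的第一个说话服务器,将让服务器更好处理口头命令。 (On Wednesday Microsoft released its first speech server, which will let servers better handle oral commands.)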
Summary: Summary Industry will continue to
Build products using speech, NL, IR, …
Hire people in speech, NL, IR
More software is required to drive the market
As stated by Barry Lam of Quanta
We should all care about these technologies