Slide1: Digital Speech Processing
Lin-shan Lee (李琳山)
Slide2: Speech Signal Processing
Major Application Areas
Speech Coding: Digitization and Compression
Considerations: 1) bit rate (bps)
2) recovered quality
3) computational complexity/feasibility
Voice-based Network Access —
User Interface, Content Analysis, User-content Interaction
[Figure: speech coding block diagram with labels x(t), x[n], Processing, Processing Algorithms, x_k, 110101…, Storage/transmission, Inverse Processing, x[n], LPF, output]
Speech Signals Carrying Linguistic Knowledge and Human Information: Characters, Words, Phrases, Sentences, Concepts, etc.
Double Levels of Information: Acoustic Signal Level/Symbolic or Linguistic Level
Processing and Interaction of the Double-level Information
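The speech coding considerations on this slide (bit rate versus recovered quality) can be illustrated with a minimal sketch. It assumes plain uniform quantization of a test tone, which is not a method named on the slide; the 8 kHz sampling rate, the 440 Hz tone, and all function names are illustrative assumptions.

```python
import math

def bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate in bits per second of uncompressed, uniformly quantized samples."""
    return sample_rate_hz * bits_per_sample

def quantize(x, bits):
    """Uniformly quantize x in [-1, 1) to an integer code of `bits` bits."""
    levels = 2 ** bits
    code = int(math.floor((x + 1.0) / 2.0 * levels))
    return min(max(code, 0), levels - 1)

def dequantize(code, bits):
    """Inverse processing: map a code back to the midpoint of its cell."""
    levels = 2 ** bits
    return (code + 0.5) / levels * 2.0 - 1.0

def snr_db(signal, recovered):
    """Signal-to-quantization-noise ratio in dB (a crude quality measure)."""
    sig = sum(s * s for s in signal)
    noise = sum((s - r) ** 2 for s, r in zip(signal, recovered))
    return 10.0 * math.log10(sig / noise)

SAMPLE_RATE = 8000  # Hz; assumed telephone-band sampling rate
# x[n]: a 440 Hz test tone standing in for sampled speech (assumption)
x = [0.9 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(800)]

results = {}
for bits in (4, 8, 12):
    codes = [quantize(s, bits) for s in x]          # x[n] -> x_k
    x_hat = [dequantize(c, bits) for c in codes]    # x_k -> recovered x[n]
    results[bits] = (bit_rate(SAMPLE_RATE, bits), snr_db(x, x_hat))
    print(bits, "bits:", results[bits][0], "bps, SNR = %.1f dB" % results[bits][1])
```

More bits per sample raise both the bit rate and the recovered quality; practical speech coders aim for a better trade-off than this at the cost of more computation.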
Speech Signal Processing – Processing of Double-Level Information:
[Figure: Speech Signal, Sampling, Processing, Algorithm, Chips or Computers; Linguistic Structure, Linguistic Knowledge, Lexicon, Grammar]
Slide4: Voice-based Network Access
[Figure: Internet, User Interface, Content Analysis, User-Content Interaction]
User Interface
—when keyboards/mice inadequate
Content Analysis
— help in browsing/retrieval of multimedia content
User-Content Interaction
—all text-based interaction can be accomplished by spoken language
Slide5: User Interface —Wireless Communications Technologies are Creating a Whole Variety of User Terminals at Any Time, from Anywhere
Handsets, Hand-held Devices, PDAs, Personal Notebooks, Vehicular Electronics, Hands-free Interfaces, Home Appliances, Wearable Devices…
Small in Size, Light in Weight, Ubiquitous, Invisible…
Evolving towards a “Post-PC Era”
Keyboard/Mouse, Most Convenient for PCs, are No Longer Convenient
— human fingers never shrink, and application environment is changed
Service Requirements Growing Exponentially
Voice is the Only Interface Convenient for ALL User Terminals at Any Time, from Anywhere
Slide6: Content Analysis - Multimedia Technologies are Creating a New World of Multimedia Content
Most Attractive Form of the Network Content will be in Multimedia, which usually Includes Speech Information (but Probably not Text)
Multimedia Content is Difficult to Summarize and Show on the Screen, thus Difficult to Browse
The Speech Information, if Included, usually Tells the Subjects, Topics and Concepts of the Multimedia Content, thus Becomes the Key for Browsing and Retrieval
Multimedia Content Analysis based on Speech Information
Future Integrated Networks
Real-time Information: weather, traffic, flight schedule, stock price, sports scores
Electronic Commerce: virtual banking, on-line transactions, on-line investments
Knowledge Archives: digital libraries, virtual museums
Intelligent Working Environment: e-mail processors, intelligent agents, teleconferencing, distance learning
Private Services: personal notebook, business databases, home appliances, network entertainment
Slide7: User-Content Interaction - Wireless and Multimedia Technologies are Creating An Era of Network Access by Spoken Language Processing
[Figure: Multimedia Content, Internet, voice information, text information, voice input/output]
Network Access is Primarily Text-based today, but almost all Roles of Texts can be Accomplished by Speech
User-Content Interaction can be Accomplished by Spoken and Multi-modal Dialogues
Many Hand-held Devices with Multimedia Functionalities Commercially Available Today
Using Speech Instructions to Access Multimedia Content whose Key Concepts are Specified by Speech Information
[Figure: Multimedia Content Analysis, Text Information Retrieval, Text Content, Voice-based Information Retrieval, Text-to-Speech Synthesis, Spoken and Multi-modal Dialogue]
Slide8: Voice-based Information Retrieval
Speech may become a New Data Type
Both the User Instructions and Network Content Can be in the Form of Speech
Slide9: Spoken and Multi-modal Dialogues
Almost All User-Content Interaction can be Accomplished by Spoken or Multi-modal Dialogues
An Example of Client-Server Computing Environment
[Figure: Users reach a Dialogue Server over the Internet and Wireless Networks; server components: Input Speech, Speech Recognition and Understanding, User’s Intention, Dialogue Manager, Discourse Context, Databases, Response to the user, Sentence Generation and Speech Synthesis, Output Speech]
Slide10: Convergence of PSTN and Internet
PSTN (for Voice) and Internet (for Data and Multi-media Contents) are Converging
Driving Force for the Convergence
“anywhere, any time” of wireless services
voice provides the most convenient and natural interaction interface
attractive contents over the Internet
contents (human information) are why the Internet is attractive, while voice directly carries human information
Speech-enabled Access of Web-based Applications
Slide11: Wireless Access of Global Information
As Handset Size Shrinks While Required Functionalities Grow and the User Environment Changes, Voice Interface will be Useful for all Different User Terminals
As More Network Content becomes Multi-media, Content Analysis based on Speech Information will be Essential
Integration of Many Different Technologies
information processing, networking, transmission, internet, wireless, speech processing
Speech Processing is the only Major Missing Link in the Semi-mature Technology Chain
Slide12: Future World of Communications and Computing
[Figure: Speech Processing Technologies, Wireless Technologies, Communications and Networking Technologies, Information Processing Technologies]
Outline:
Both Theoretical Issues and Practical Problems will be Discussed
Starting with Fundamentals, but Entering Research Topics Gradually
Part I: Fundamental Topics
1.0 Introduction to Digital Speech Processing
2.0 Fundamentals of Speech Recognition
3.0 Map of Subject Areas
4.0 More about Hidden Markov Models
5.0 Acoustic Modeling
6.0 Language Modeling
7.0 Speech Signals and Front-end Processing
8.0 Search Algorithms for Speech Recognition
Part II: Advanced Topics
9.0 Speaker Variabilities: Adaptation and Recognition
10.0 Latent Semantic Analysis for Linguistic Processing
11.0 Spoken Document Understanding and Organization
12.0 Voice-based Information Retrieval
13.0 Robustness for Acoustic Environment
14.0 Some Fundamental Problem-solving Approaches
15.0 Utterance Verification and Keyword/Key Phrase Spotting
16.0 Spoken Dialogues
17.0 Distributed Speech Recognition and Wireless Environment
18.0 Some Recent Developments in NTU
19.0 Conclusion
Outline:
Textbook: none
Main references:
X. Huang, A. Acero, H. Hon, “Spoken Language Processing”, Prentice Hall, 2001, 松瑞
F. Jelinek, “Statistical Methods for Speech Recognition”, MIT Press, 1999
L. Rabiner, B.H. Juang, “Fundamentals of Speech Recognition”, Prentice Hall, 1993, 民全
C. Becchetti, L. Prina Ricotti, “Speech Recognition: Theory and C++ Implementation”, John Wiley and Sons, 1999, 民全
Other references will be provided in class
Course materials:
available on web before the day of class (http://speech.ee.ntu.edu.tw)
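Intended level: third- and fourth-year students (Electrical Engineering, Computer Science and Information Engineering)
Course objectives: give students the basic knowledge needed to enter this new field full of opportunities and challenges; show how mathematical models and software programs complement each other; learn the path from fundamentals to research when entering a new field; and gain experience in absorbing unstructured knowledge
Grading:
Midterm Exam 25%
Homework (I), (II), (III): 15%, 5%, 15%
Final Exam 10%
Term Project 30%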
Slide15: 1.0 Introduction - A Brief Summary of Core Technologies and Current Status
References for 1.0
1. “Voice Access of Global Information for Broadband Wireless: Technologies of Today and Challenges of Tomorrow”, Proceedings of the IEEE, Jan 2001
2. “Conversational Interfaces: Advances and Challenges”, Proceedings of the IEEE, Aug 2000
Slide16: Speech Recognition as a pattern recognition problem
Slide17: A Simplified Block Diagram
Example Input Sentence
this is speech
Acoustic Models
(th-ih-s-ih-z-s-p-iy-ch)
Lexicon (th-ih-s) → this
(ih-z) → is
(s-p-iy-ch) → speech
Language Model (this) – (is) – (speech)
P(this) P(is | this) P(speech | this is)
P(w_i | w_{i-1}): bi-gram language model
P(w_i | w_{i-1}, w_{i-2}): tri-gram language model, etc.
Basic Approach for Large Vocabulary Speech Recognition
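The lexicon and language-model steps of this block diagram can be sketched as follows. This is a minimal illustration, not the decoder used in the course: the three-sentence training corpus, the sentence-boundary tokens `<s>` and `</s>`, and the function names are assumptions, and the bigram model approximates the slide's P(speech | this is) by P(speech | is).

```python
import math
from collections import Counter

# Lexicon from the slide: phone sequence -> word
LEXICON = {
    ("th", "ih", "s"): "this",
    ("ih", "z"): "is",
    ("s", "p", "iy", "ch"): "speech",
}

def segment(phones):
    """Return every way to split `phones` into a sequence of lexicon words."""
    if not phones:
        return [[]]
    out = []
    for pron, word in LEXICON.items():
        n = len(pron)
        if tuple(phones[:n]) == pron:
            out += [[word] + rest for rest in segment(phones[n:])]
    return out

# Hypothetical training text, used only to estimate bigram counts.
CORPUS = ["this is speech", "this is text", "speech is this"]

history, bigrams = Counter(), Counter()
for line in CORPUS:
    words = ["<s>"] + line.split() + ["</s>"]
    history.update(words[:-1])               # counts of w_{i-1}
    bigrams.update(zip(words, words[1:]))    # counts of (w_{i-1}, w_i)

def bigram_log_prob(words):
    """Sum of log P(w_i | w_{i-1}) using maximum-likelihood estimates."""
    total, prev = 0.0, "<s>"
    for w in list(words) + ["</s>"]:
        count = bigrams[(prev, w)]
        if count == 0:
            return float("-inf")  # unseen bigram; real systems smooth instead
        total += math.log(count / history[prev])
        prev = w
    return total

# Phone string: concatenation of the lexicon pronunciations of the example.
phones = "th ih s ih z s p iy ch".split()
candidates = segment(phones)
best = max(candidates, key=bigram_log_prob)
print(best, math.exp(bigram_log_prob(best)))
```

A real recognizer searches over acoustic scores, lexicon, and language model jointly; here the acoustic step is taken as already having produced an error-free phone string.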
Slide18: Speech Recognition Technologies, Applications and Problems
Word Recognition
voice command/instructions
Keyword Spotting
identifying the keywords out of a pre-defined keyword set from input voice utterances
Large Vocabulary Continuous Speech Recognition
entering longer texts
remote dictation/automatic transcription
Speaker Dependent/Independent/Adaptive
Acoustic Reception/Background Noise/Channel Distortion
Read/Spontaneous/Conversational Speech
Slide19: Text-to-speech Synthesis
Transforming any input text into corresponding speech signals
E-mail/Web page reading
Prosodic modeling
Basic voice units/rule-based, non-uniform units/corpus-based
Slide20: Speech Understanding
Slide21: Speaker Verification
[Figure: input speech, Feature Extraction, Verification, yes/no]
Verifying the speaker as claimed
Applications requiring verification
Text dependent/independent
Integrated with other verification schemes
Speaker Models
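The yes/no decision against a speaker model can be sketched as a likelihood-ratio test. The slide does not say which speaker model is used, so this sketch assumes a single diagonal-covariance Gaussian per speaker plus a background model; the 2-D feature vectors, the zero threshold, and all names are hypothetical.

```python
import math

def fit_diag_gaussian(frames):
    """Estimate per-dimension mean and variance from feature vectors."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-6)
           for d in range(dim)]
    return mean, var

def log_likelihood(frame, model):
    """Log density of one frame under a diagonal-covariance Gaussian."""
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(frame, mean, var))

def verify(frames, speaker_model, background_model, threshold=0.0):
    """Return (accepted, score); score is the average log-likelihood ratio."""
    llr = sum(log_likelihood(f, speaker_model) - log_likelihood(f, background_model)
              for f in frames) / len(frames)
    return llr > threshold, llr

# Hypothetical 2-D feature vectors; real systems extract features from speech.
enroll_claimed = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1], [1.1, 2.2]]
background = [[0.0, 0.0], [1.0, 2.0], [-1.0, 1.0], [2.0, -1.0], [0.5, 3.0]]

claimed_model = fit_diag_gaussian(enroll_claimed)
background_model = fit_diag_gaussian(background)

genuine = [[1.05, 2.0], [1.0, 1.9]]      # close to the enrollment data
impostor = [[-0.8, 0.2], [2.1, -0.5]]    # far from the enrollment data
print(verify(genuine, claimed_model, background_model))
print(verify(impostor, claimed_model, background_model))
```

The threshold trades false acceptances against false rejections and would normally be tuned on held-out data.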
Slide22: Voice-based Information Retrieval
Speech Instructions
Speech Documents (or Multi-media Documents including Speech Information)
Indexing Features/Relevance Evaluation
Recall/Precision Rates
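Recall and precision rates follow their standard information-retrieval definitions, sketched below. The document identifiers and the function name are made up for illustration.

```python
def recall_precision(retrieved, relevant):
    """Recall = hits / |relevant|; precision = hits / |retrieved|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical result of one spoken query against a document collection.
retrieved = ["doc1", "doc3", "doc4", "doc7"]   # what the system returned
relevant = ["doc1", "doc2", "doc3"]            # what it should have returned
recall, precision = recall_precision(retrieved, relevant)
print("recall = %.3f, precision = %.3f" % (recall, precision))
```

Returning more documents tends to raise recall and lower precision, so the two rates are usually reported together.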
Slide23: Spoken Dialogue Systems
Almost all human-network interactions can be made by spoken dialogue
Speech understanding, speech synthesis, dialogue management
System/user/mixed initiatives
Reliability/efficiency, dialogue modeling/flow control
Transaction success rate/average number of dialogue turns
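The two evaluation measures named last on this slide can be computed from dialogue logs as in the sketch below. The log format and the four logged dialogues are hypothetical.

```python
def dialogue_metrics(dialogues):
    """Return (transaction success rate, average number of dialogue turns)."""
    n = len(dialogues)
    success_rate = sum(1 for d in dialogues if d["success"]) / n
    avg_turns = sum(d["turns"] for d in dialogues) / n
    return success_rate, avg_turns

# Hypothetical logs: one entry per completed dialogue.
logs = [
    {"turns": 6, "success": True},
    {"turns": 9, "success": False},
    {"turns": 4, "success": True},
    {"turns": 5, "success": True},
]
success_rate, avg_turns = dialogue_metrics(logs)
print("success rate = %.2f, average turns = %.1f" % (success_rate, avg_turns))
```

A higher success rate reflects reliability, and fewer turns per completed transaction reflects efficiency.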
Spoken Document Understanding and Organization:
Unlike the Written Documents which are Better Structured and Easier to Index and Browse, Spoken Documents are just Audio Signals, or a Sequence of Words if Transcribed
- the user can’t listen to (or read carefully) each one from the beginning to the end during browsing
- better approaches for understanding/organization of spoken documents become necessary
Spoken Document Segmentation
- automatically segmenting a spoken document into short paragraphs, each with a central topic
Spoken Document Summarization
- automatically generating a summary (in text or speech form) for each short paragraph
Title Generation for Spoken Documents
- automatically generating a title (in text or speech form) for each short paragraph
Semantic Structuring of Spoken Documents
- construction of semantic structure of spoken documents into graphical hierarchies
Multi-lingual Functionalities:
Code-Switching Problem
English words/phrases inserted in spoken Chinese sentences as an example
人人都用Computers,家家都上Internet (“Everyone uses computers, every household goes on the Internet”)
the whole sentence switched from Chinese to English as an example
準備好了嗎?Let’s go! (“Are you ready? Let’s go!”)
Cross-language Network Information Processing
globalized network with multi-lingual content/users
cross-language network information processing with a certain input language
Dialects/Accents
hundreds of Chinese dialects as an example
code-switching problem: Chinese dialects mixed with Mandarin (or plus English) as an example
Mandarin with a variety of strong accents as an example
Global/Local Languages
Language Dependent/Independent Technologies
Shared Acoustic Units/Integrated Linguistic Structures
Slide26: An Example Partition of Speech Recognition Processes into Client/Server
Distributed Speech Recognition (DSR) and Wireless Environment
[Figure: recognition block diagram: Input Speech, Front-end Signal Processing, Feature Vectors, Linguistic Decoding and Search Algorithm, Output Sentence; supporting blocks: Speech Corpora, Acoustic Model Training, Acoustic Models, Lexical Knowledge-base, Grammar, Text Corpora, Language Model Construction, Language Model]
Client/Server Structure: encoded feature parameters transmitted in packets between Client and Server
Distributed Speech Recognition (DSR) and Wireless Environment:
Wireless Environment
examples: Personal Area Networks (Bluetooth, etc.), Wireless LAN (IEEE 802.11), Cellular (GSM, GPRS, 3G), etc.
Link Level
time-varying fading and noise characteristics
time-varying signal level and signal-to-noise ratios
bursty errors with much higher error rates
much smaller and dynamic bandwidth, much lower and changing bit rates
Transport Level
TCP/IP: errors lead to retransmission, hence delay
UDP/IP: no retransmission, so real-time/no delay, but errors cause packet loss
packets out of sequence
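The transport-level problems listed above (packet loss and packets out of sequence when feature parameters are sent without retransmission) can be handled at the receiver roughly as sketched below. The packet layout, the 8-bit feature values, and the repeat-last-frame concealment are all assumptions made for illustration, not a scheme described on the slides.

```python
import struct

# Hypothetical header: sequence number (uint16) and value count (uint8),
# in network byte order.
HEADER = struct.Struct("!HB")

def encode_packet(seq, features):
    """Pack a sequence number and 8-bit quantized feature values into bytes."""
    return HEADER.pack(seq, len(features)) + bytes(features)

def decode_packet(packet):
    """Return (sequence number, list of feature values)."""
    seq, n = HEADER.unpack_from(packet)
    return seq, list(packet[HEADER.size:HEADER.size + n])

def reassemble(packets, total):
    """Reorder by sequence number; conceal a lost frame by repeating the
    last good one (None if nothing has been received yet)."""
    received = dict(decode_packet(p) for p in packets)
    frames, lost, last = [], [], None
    for seq in range(total):
        if seq in received:
            last = received[seq]
        else:
            lost.append(seq)
        frames.append(last)
    return frames, lost

# Client side: five frames of made-up feature values.
sent = [encode_packet(i, [i, i + 10, i + 20]) for i in range(5)]
# Simulated delivery: packet 2 is lost, the rest arrive out of order.
arrived = [sent[1], sent[0], sent[4], sent[3]]
frames, lost = reassemble(arrived, total=5)
print("lost:", lost)
print("frames:", frames)
```

Sequence numbers let the server restore frame order without waiting for retransmission, which keeps the real-time behaviour at the price of approximated frames.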