Clippers1007

Presentation Transcript

What I did on my Summer “Vacation”:

Jeremy Morris
10/06/2006

Summer at AFRL - DAGSI: 

AFRL
  Air Force Research Labs
  Wright-Patterson AFB, Dayton OH
DAGSI Student/Faculty Research Fellowship program
  Dayton Area Graduate Studies Institute
  Effort to encourage collaboration between Ohio universities and AFRL

Summer at AFRL – SCREAM Lab: 

SCREAM Lab
  Speech and Communication Research, Engineering, Analysis and Modeling Lab
  Interest in a wide variety of speech research issues for the military
  Speech-to-speech translation, rapid development of speech recognition systems, etc.

Summer at AFRL – Why us?: 

SCREAM Lab members were interested in collaborating with OSU
SCREAM Lab is researching the use of phonological features in speech recognition
Perceived overlap with the ASAT project

Review – Phonological Features: 

For the ASAT Project, we have been using phonological feature detectors
We train each detector on a particular phonological feature
  e.g. manner or place for consonants; height, frontness, etc. for vowels
We then combine these features together for ASR purposes

Phonological Features (cont.): 

SCREAM Lab very interested in phonological feature detectors
  Need for quick development of new ASR systems for new languages
  A full set of phonological feature detectors would allow reuse of acoustic data for training across new languages
  Multi-lingual detectors are clearly needed to get full coverage of all features

Phonological Features (cont.): 

Our phonological feature detectors
  Monolingual (English only)
  Trained using a set of multi-layer perceptron neural networks
  Output a set of phonological feature class probabilities
SCREAM Lab feature detectors
  Monolingual and multilingual
  Trained using Gaussian Mixture Models
  Output a set of likelihoods
  Based on work by Tanja Schultz (CMU)

Summer at AFRL - Proposal: 

Besides acoustic models, new ASR systems for new languages have other needs
An ASR system needs a lexicon mapping phones to words
  Normally hand-constructed
  Requires time and expertise

Summer at AFRL - Proposal: 

Our proposal: look at methods of bootstrapping new lexicons from:
  Acoustic data
  Word-level transcripts
  Phonological feature detector outputs
How?
  Start by looking at work on deriving Acoustic Sub-Word Units

Summer at AFRL - Proposal:

Acoustic Sub-Word Units (ASWUs)
  Similar to phones in that they are smaller pieces of words
  BUT automatically derived from acoustics instead of manually defined
  Used to derive both a sub-word unit set and a lexicon for that set simultaneously
  Research in this area has been mainly to improve ASR performance

Summer at AFRL - Proposal: 

Can we use these methods along with phonological features as inputs to induce new lexicons?
Using phonological features, the sub-word units may be mappable to standard IPA phone labels

Summer at AFRL - Proposal: 

The proposed system is inspired by an ASWU system by Singh et al. (2002)
  Notable for not requiring word boundaries to be marked for training
Start with a basic dictionary (including a starting phoneset size)
Train a set of acoustic models on the training data with that dictionary
Alter the basic dictionary in a manner that improves the pronunciations
Repeat until a stopping criterion is reached
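The train/alter/repeat cycle above can be sketched as a small loop. This is only an illustration under stated assumptions: `induce_lexicon`, `train`, and `revise` are hypothetical names, the toy stand-ins do no real acoustic training, and the stopping criterion used here (the lexicon stops changing, or an iteration cap is hit) is an assumption, since the slides do not say which criterion was used.

```python
def induce_lexicon(lexicon, train, revise, max_iters=10):
    """Iteratively refine a lexicon (word -> list of phone labels).

    train(lexicon) -> models         hypothetical: fits acoustic models
    revise(lexicon, models) -> dict  hypothetical: applies one alteration
    """
    for _ in range(max_iters):
        models = train(lexicon)
        updated = revise(lexicon, models)
        if updated == lexicon:  # assumed stopping criterion: nothing changed
            break
        lexicon = updated
    return lexicon


# Toy stand-ins so the loop runs end to end; they do no real training.
def toy_train(lexicon):
    return None


def toy_revise(lexicon, models):
    # Drop one word-final "E" per pass from words longer than two labels,
    # purely for illustration.
    updated = dict(lexicon)
    for word, pron in lexicon.items():
        if len(pron) > 2 and pron[-1] == "E":
            updated[word] = pron[:-1]
            break
    return updated


start = {"ABLE": ["A", "B", "L", "E"], "BE": ["B", "E"]}
result = induce_lexicon(start, toy_train, toy_revise)
```

Passing `train` and `revise` in as callables keeps the loop independent of whichever acoustic-model toolkit and candidate-selection metric are plugged in.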

Summer at AFRL - Proposal: 

Start with a basic dictionary
  Start with an assumption that the number of phones in a word is related to the number of letters in the orthography
  Basic dictionary maps word to sequence of letters in that word:
    ABLE → A B L E
    BANNED → B A N N E D
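A minimal sketch of building such a letter-based starting dictionary. The function name `basic_lexicon` is hypothetical, and dropping non-alphabetic characters is an assumption not covered by the slides.

```python
def basic_lexicon(words):
    """Map each word to its letter sequence as an initial 'pronunciation'."""
    lexicon = {}
    for word in words:
        key = word.upper()
        # Assumption: characters that are not letters (apostrophes,
        # hyphens) are simply dropped.
        lexicon[key] = [ch for ch in key if ch.isalpha()]
    return lexicon


lexicon = basic_lexicon(["able", "banned"])
```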

Summer at AFRL - Proposal: 

Train a set of acoustic models
  Using the basic dictionary, map words in the transcript to these “pronunciations”
  Train an HMM using the output of the feature detectors as its input, and the above mapping as training labels
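The first step, turning a word-level transcript into a label sequence through the dictionary, might look like the sketch below. The function name and the whitespace tokenisation are assumptions, and the HMM training itself is left to whatever toolkit is in use.

```python
def transcript_to_labels(transcript, lexicon):
    """Expand a word-level transcript into one flat sequence of phone labels."""
    labels = []
    for word in transcript.split():  # assumption: whitespace-separated words
        labels.extend(lexicon[word.upper()])
    return labels


toy_lexicon = {
    "ABLE": ["A", "B", "L", "E"],
    "BANNED": ["B", "A", "N", "N", "E", "D"],
}
labels = transcript_to_labels("able banned", toy_lexicon)
```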

Summer at AFRL - Proposal: 

Alter the basic dictionary
  Using some metric, find a candidate “phone” to be modified
    We’ve looked at a couple of metrics (more on this later)
  Once the phone is identified, see if the phone should be “split” or “deleted”
    A “split” indicates that the given phone label actually represents two different sounds, and so should be replaced with two different phone labels
    A “delete” indicates that for a particular word or words the model fits better if that phone label is removed from the pronunciation

Summer at AFRL - Proposal: 

Split example:
  BE → B E
  DEVELOP → D E1 V E1 L O P
Delete examples:
  ABLE → A B L E :: ABLE → A B L
  ABANDONED → A B A N D O N D
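As a sketch, the two alterations can be written as small list operations on a pronunciation. The helper names `split_phone` and `delete_phone` are hypothetical, and the positions are chosen by hand here; in the proposed system they come out of the alignment.

```python
def split_phone(pron, positions, new_label):
    """Return a copy of pron with the phones at `positions` relabelled."""
    return [new_label if i in positions else p for i, p in enumerate(pron)]


def delete_phone(pron, position):
    """Return a copy of pron with the phone at `position` removed."""
    return pron[:position] + pron[position + 1:]


# Split: both E's in DEVELOP take the new label E1 (BE would keep plain E).
develop = split_phone(list("DEVELOP"), {1, 3}, "E1")

# Delete: the final E of ABLE and the second-to-last letter of ABANDONED.
able = delete_phone(list("ABLE"), 3)
abandoned = delete_phone(list("ABANDONED"), 7)
```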

Summer at AFRL - Proposal: 

For splits, all possible alterations are added to a temporary lexicon
For deletes, we alter the HMM to add a possible deletion arc for the phone
After the lexicon or HMM is altered, the word transcript is force aligned using the new possible pronunciations
Best pronunciations are pulled from this alignment and used to build the new lexicon
Steps are repeated using the new lexicon in place of the basic lexicon
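One way to picture the temporary lexicon for a split is to enumerate every variant in which each occurrence of the candidate phone either keeps its label or takes the new one. The sketch below does that and then picks one variant per word with a scoring function. Both helper names are hypothetical, and `toy_score` is only a stand-in for the forced-alignment step, which this sketch does not implement.

```python
from itertools import product


def split_variants(pron, phone, new_label):
    """All pronunciations where each `phone` is kept or relabelled."""
    options = [(p, new_label) if p == phone else (p,) for p in pron]
    return [list(variant) for variant in product(*options)]


def pick_best(temp_lexicon, score):
    """Choose the highest-scoring variant for every word."""
    return {
        word: max(variants, key=lambda v: score(word, v))
        for word, variants in temp_lexicon.items()
    }


temp_lexicon = {
    "BE": split_variants(["B", "E"], "E", "E1"),
    "DEVELOP": split_variants(list("DEVELOP"), "E", "E1"),
}


def toy_score(word, pron):
    # Stand-in for an alignment likelihood: pretend E1 fits DEVELOP
    # and does not fit BE.
    count = pron.count("E1")
    return count if word == "DEVELOP" else -count


new_lexicon = pick_best(temp_lexicon, toy_score)
```

A word with n occurrences of the candidate phone yields 2^n variants, so the temporary lexicon can grow quickly for long words.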

Summer at AFRL - Proposal: 

How do we determine the candidate “phone label” to alter?
  Initially, modelled each phone with two Gaussians in the HMM
  Compared the two Gaussians to each other using their KL divergence
  Took the phone label with the largest KL divergence as the one to alter
  Idea was that each Gaussian described a cluster: the further these centers were from each other, the more likely they were describing two different phones
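A sketch of this selection step, assuming diagonal-covariance Gaussians and a symmetrised KL divergence (the sum of both directions). Both are assumptions: the slides do not say which covariance type or which direction was used. All function names and the toy numbers are hypothetical.

```python
import math


def kl_diag(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for two diagonal-covariance Gaussians."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )


def symmetric_kl(g1, g2):
    """Sum of the KL divergence in both directions (an assumed choice)."""
    (mu1, var1), (mu2, var2) = g1, g2
    return kl_diag(mu1, var1, mu2, var2) + kl_diag(mu2, var2, mu1, var1)


def pick_candidate(phone_gaussians):
    """Return the phone whose two Gaussians are furthest apart."""
    return max(phone_gaussians, key=lambda ph: symmetric_kl(*phone_gaussians[ph]))


# Each phone maps to two (means, variances) pairs; the values are made up.
phones = {
    "A": (([0.0, 0.0], [1.0, 1.0]), ([0.1, 0.0], [1.0, 1.0])),
    "E": (([0.0, 0.0], [1.0, 1.0]), ([2.0, 1.0], [1.0, 1.0])),
}
candidate = pick_candidate(phones)
```

Every feature dimension contributes to the sum equally, so a large mismatch in any one dimension raises a phone's score.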

Summer at AFRL - Proposal: 

KL-divergence metric did not work well
  System would pick candidates that a human would find unreasonable (such as “F” or “Q”)
  System would split or delete these phones multiple times, continually returning to the same phone label

Summer at AFRL - Proposal: 

Why did the KL divergence perform this way?
  Suspicion: large variations in the two Gaussians in areas that do not matter for that phone pushed up the scores (e.g. vowel features for consonants)
  Splitting these phones only allowed the coverage to spread wider, drawing the system back to those phones

Summer at AFRL - Proposal: 

What next?
  Tried Mahalanobis distance metric, also with poor results
  Returned to Acoustic Sub-Word Unit papers for inspiration
  Instead of looking at cluster stats, multiple papers use an average frame likelihood metric for each phone cluster to determine the candidate phone for altering
  Have started moving my code to use this framework; preliminary passes show promise, but no results quite yet
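An average frame likelihood metric could be computed along the lines below, given per-frame log-likelihoods that have already been aligned to phone labels. The input format, the function names, and the choice to treat the phone with the lowest average as the candidate are all assumptions made for illustration. The slides do not spell out these details.

```python
from collections import defaultdict


def average_frame_loglik(aligned_frames):
    """Mean per-frame log-likelihood for each phone label.

    aligned_frames: iterable of (phone_label, frame_log_likelihood) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for phone, loglik in aligned_frames:
        totals[phone] += loglik
        counts[phone] += 1
    return {phone: totals[phone] / counts[phone] for phone in totals}


def worst_modelled_phone(averages):
    """Assumed rule: the phone with the lowest average is the candidate."""
    return min(averages, key=averages.get)


frames = [("A", -1.0), ("A", -3.0), ("B", -5.0)]  # made-up values
averages = average_frame_loglik(frames)
worst = worst_modelled_phone(averages)
```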

Conclusion – It’s 75 miles to Dayton: 

Advice for those thinking of doing work at WPAFB
Working in the SCREAM Lab was great
  Hundreds of processors, tons of multi-lingual corpora
  Friendly people, decent work environment (if a bit dark)
Many hoops to jump through, even just for a summer student
  ID badges, computer usage training, etc.
Sometimes feels like you’re working at a corporation… until the guys in uniform come around
The base is built like a campus crossed with a prison: cinderblock is the building material of choice
Don’t forget your ID badge
It’s 75 miles from Columbus to Dayton