Incorporating In-domain Confidence and Discourse Coherence Measures in Utterance Verification
(Japanese title: Detection of Speech Recognition Errors Using In-domain Confidence and Discourse Coherence)
Ian R. Lane, Tatsuya Kawahara
Spoken Language Communications Research Laboratories, ATR
School of Informatics, Kyoto University

Introduction
• Current ASR technologies are not robust against:
  – Acoustic mismatch: noise, channel, speaker variance
  – Linguistic mismatch: disfluencies, OOV, OOD
• Assess the confidence of the recognition hypothesis and detect recognition errors → effective user feedback
• Select a recovery strategy based on the type of error and the specific application

Previous Works on Confidence Measures
• Feature-based
  – [Kemp] word duration, AM/LM back-off
• Explicit model-based
  – [Rahim] likelihood ratio test against a cohort model
• Posterior probability
  – [Komatani, Soong, Wessel] estimate the posterior probability given all competing hypotheses in a word graph
→ These approaches are limited to "low-level" information available during ASR decoding

Proposed Approach
• Exploit knowledge sources outside the ASR framework for estimating recognition confidence, e.g. knowledge about the application domain and the discourse flow
• Incorporate confidence measures (CMs) based on "high-level" knowledge sources:
  – In-domain confidence: degree of match between the utterance and the application domain
  – Discourse coherence: consistency between consecutive utterances in the dialogue

Utterance Verification Framework
[Framework diagram: the current utterance Xi and the preceding utterance Xi-1 each pass through the ASR front-end, topic classification, in-domain verification, and out-of-domain detection; their topic-classification outputs are compared via dist(Xi, Xi-1).]
• CMin-domain(Xi): in-domain confidence
• CMdiscourse(Xi|Xi-1): discourse coherence
• CM(Xi): joint confidence score, combining the above with the generalized posterior probability CMgpp(Xi)

In-domain Confidence
• Measure of topic consistency with the application domain
  – Previously applied to out-of-domain utterance detection
• Examples of errors detected via in-domain confidence (REF: correct transcription; ASR: speech recognition hypothesis)
  – Mismatch of domain
    REF: How can I print this WORD file double-sided
    ASR: How can I open this word on the pool-side
    → hypothesis not consistent by topic → in-domain confidence is low
  – Erroneous recognition hypothesis
    REF: I want to go to Kyoto, can I go by bus
    ASR: I want to go to Kyoto, can I take a bath
    → hypothesis not consistent by topic → in-domain confidence is low

In-domain Confidence: Computation
• Input utterance Xi (recognition hypothesis) → transformation to vector space (feature vector) → classification of multiple topics with SVMs (1~m) → topic confidence scores (C(t1|Xi), …, C(tm|Xi)) → in-domain verification → Vin-domain(Xi) → CMin-domain(Xi) (in-domain confidence)
• Worked example: Xi = "could I have a non-smoking seat"
  – Vector space: word/word-pair features (a, an, …, room, …, seat, …, I+have, …) → (1, 0, …, 0, …, 1, …, 1, …)
  – Topic confidence scores (accom., airplane, airport, …): 0.05, 0.36, 0.94, …
  – In-domain verification: CMin-domain(Xi) = 90%
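To make the computation above concrete, here is a minimal sketch of the topic-classification front end: an ASR hypothesis is mapped to binary word/word-pair features and scored by one SVM per topic, with each margin squashed through a sigmoid to give a pseudo confidence score C(tj|X). This is an illustration, not the authors' implementation; the topic set, training sentences, use of scikit-learn, and the helper name topic_confidence_scores are assumptions standing in for the classifiers trained on BTEC.

```python
import math

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy topic inventory and training sentences (placeholders for the 14 BTEC topics).
TOPICS = ["accommodation", "airplane", "airport"]
TRAIN_SENTS = [
    ("could I have a non-smoking room", "accommodation"),
    ("I would like to check out at noon", "accommodation"),
    ("could I have a window seat on the flight", "airplane"),
    ("is there an earlier flight to Osaka", "airplane"),
    ("where is the boarding gate", "airport"),
    ("how do I get to the departure lobby", "airport"),
]

# Binary unigram + bigram occurrence features, analogous to the
# (a, an, ..., seat, ..., I+have, ...) vector on the slide.
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
X_train = vectorizer.fit_transform([s for s, _ in TRAIN_SENTS])

# One binary (one-vs-rest) linear SVM per topic t_1 ... t_m.
topic_svms = {}
for topic in TOPICS:
    labels = [1 if t == topic else 0 for _, t in TRAIN_SENTS]
    topic_svms[topic] = LinearSVC().fit(X_train, labels)

def topic_confidence_scores(utterance):
    """Return a pseudo confidence score C(t_j|X) per topic for one hypothesis,
    obtained by passing each SVM margin through a sigmoid."""
    x = vectorizer.transform([utterance])
    return {t: 1.0 / (1.0 + math.exp(-clf.decision_function(x)[0]))
            for t, clf in topic_svms.items()}

if __name__ == "__main__":
    print(topic_confidence_scores("could I have a non-smoking seat"))
```

In the paper, these per-topic scores then feed the in-domain verification model described next; the plain sigmoid used here is only a stand-in for however the topic-classification confidences are actually calibrated.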
In-domain Verification Model
• A linear discriminant verification model is applied:
  $V_{\text{in-domain}}(X_i) = \sum_{j=1}^{m} \lambda_j \, C(t_j \mid X_i)$
  where C(tj|Xi) is the topic-classification confidence score of topic tj for the input utterance Xi, and λj is the discriminant weight for topic tj
• λ1, …, λm are trained on in-domain data using "deleted interpolation of topics" and GPD [Lane '04]
• $CM_{\text{in-domain}}(X_i) = \mathrm{sigmoid}\big(V_{\text{in-domain}}(X_i)\big)$

Discourse Coherence
• Topic consistency with the preceding utterance
• Example of an error detected via discourse coherence (REF: correct transcription; ASR: speech recognition hypothesis)
  – Speaker A, previous utterance [Xi-1]
    REF: What type of shirt are you looking for?
    ASR: What type of shirt are you looking for?
  – Speaker B, current utterance [Xi]
    REF: I'm looking for a white T-shirt.
    ASR: I'm looking for a white teacher.
    → topic not consistent across utterances → discourse coherence is low
• Euclidean distance between the current (Xi) and previous (Xi-1) utterances in topic-confidence space:
  $\mathrm{dist}_{\text{Euclidean}}(X_i, X_{i-1}) = \sqrt{\sum_{j=1}^{m} \big(C(t_j \mid X_i) - C(t_j \mid X_{i-1})\big)^2}$
  $CM_{\text{discourse}}(X_i \mid X_{i-1}) = \mathrm{sigmoid}\big(-\mathrm{dist}_{\text{Euclidean}}(X_i, X_{i-1})\big)$
• CMdiscourse is large when Xi and Xi-1 are topically related, and low when they differ

Joint Confidence Score
• Generalized posterior probability (GPP): confusability of the recognition hypothesis against competing hypotheses [Lo & Soong]
• At the utterance level:
  $CM_{\text{gpp}}(X) = \prod_{j=1}^{l} GWPP(x_j)$
  where GWPP(xj) is the generalized word posterior probability of xj, the j-th of the l words in the recognition hypothesis of X
• The joint confidence score combines the three measures:
  $CM(X_i) = \lambda_{\text{gpp}} CM_{\text{gpp}}(X_i) + \lambda_{\text{in-domain}} CM_{\text{in-domain}}(X_i) + \lambda_{\text{discourse}} CM_{\text{discourse}}(X_i \mid X_{i-1})$
  where $\lambda_{\text{gpp}} + \lambda_{\text{in-domain}} + \lambda_{\text{discourse}} = 1$
• For utterance verification, CM(Xi) is compared to a threshold θ
• The model weights (λgpp, λin-domain, λdiscourse) and the threshold θ are trained on the development set
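Below is a minimal sketch of the verification stage defined above, wiring the three measures into the joint score and the threshold decision. It is not the paper's implementation: the sigmoid slopes, the topic weights λj, the CM weights, and the threshold θ are placeholder values (in the paper they are trained on the development set), and the topic scores in the usage example are invented.

```python
import math

def sigmoid(x, slope=1.0, offset=0.0):
    """Trainable sigmoid transform; slope/offset here are placeholders."""
    return 1.0 / (1.0 + math.exp(-slope * (x - offset)))

def cm_in_domain(topic_scores, topic_weights):
    """CM_in-domain(X) = sigmoid( sum_j lambda_j * C(t_j|X) )."""
    v = sum(topic_weights[t] * c for t, c in topic_scores.items())
    return sigmoid(v)

def cm_discourse(topic_scores_cur, topic_scores_prev):
    """CM_discourse(X_i|X_{i-1}): sigmoid of the negated Euclidean distance
    between consecutive utterances in topic-confidence space."""
    dist = math.sqrt(sum((topic_scores_cur[t] - topic_scores_prev[t]) ** 2
                         for t in topic_scores_cur))
    return sigmoid(-dist)

def joint_cm(cm_gpp, cm_id, cm_dc, w_gpp=0.5, w_id=0.3, w_dc=0.2):
    """CM(X_i) = w_gpp*CM_gpp + w_id*CM_in-domain + w_dc*CM_discourse,
    with the three weights summing to 1 (values here are illustrative)."""
    return w_gpp * cm_gpp + w_id * cm_id + w_dc * cm_dc

def verify(cm, theta=0.5):
    """Accept the utterance when its joint confidence reaches the threshold."""
    return cm >= theta

if __name__ == "__main__":
    prev = {"accommodation": 0.05, "airplane": 0.36, "airport": 0.94}  # X_{i-1}
    cur  = {"accommodation": 0.10, "airplane": 0.40, "airport": 0.90}  # X_i
    lam  = {"accommodation": 0.3, "airplane": 0.4, "airport": 0.3}     # lambda_j
    cm = joint_cm(cm_gpp=0.8,
                  cm_id=cm_in_domain(cur, lam),
                  cm_dc=cm_discourse(cur, prev))
    print(round(cm, 3), "accept" if verify(cm) else "reject")
```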
Experimental Setup
• Training set: ATR BTEC (Basic Travel Expressions Corpus)
  – ~400k sentences (Japanese/English pairs)
  – 14 topic classes (accommodation, shopping, transit, …)
  – Used to train the topic-classification and in-domain verification models
• Evaluation data: ATR MAD (Machine Aided Dialogue) corpus
  – Natural dialogue between English and Japanese speakers via the ATR speech-to-speech translation system
  – Dialogue data collected based on a set of pre-defined scenarios
  – Development set: 270 dialogues; test set: 90 dialogues
• Trained on the development set: CM sigmoid transforms, CM weights (λgpp, λin-domain, λdiscourse), and the verification threshold θ

Speech Recognition Performance
• ASR performed with ATRASR; a 2-gram LM is applied during decoding, and the lattice is rescored with a 3-gram LM

                        Development    Test
  # dialogues                   270      90
  Japanese side
    # utterances               2674    1011
    WER                       10.5%   10.7%
    SER                       41.9%   42.3%
  English side
    # utterances               3091    1006
    WER                       17.0%   16.2%
    SER                       63.5%   55.2%

Evaluation Measure
• Utterance-based verification
  – No definite "keyword" set exists in speech-to-speech translation
  – If a recognition error occurs (one or more word errors), prompt the user to rephrase the entire utterance
• CER (confidence error rate)
  – FA: false acceptance of an incorrectly recognized utterance
  – FR: false rejection of a correctly recognized utterance
  $CER = \dfrac{\#FA + \#FR}{\#\text{utterances}}$

GPP-based Verification Performance
• Accept All: assume all utterances are correctly recognized
• GPP: generalized posterior probability
[Bar chart: CER (%) of "Accept All" vs. "GPP" on the Japanese and English sides.]
• Large reduction in verification errors compared with the "Accept All" case
• CER: 17.3% (Japanese) and 15.3% (English)

Incorporation of IC and DC Measures (Japanese)
• GPP: generalized posterior probability; IC: in-domain confidence; DC: discourse coherence
[Bar chart: CER (%) for GPP, GPP+IC, GPP+DC, and GPP+IC+DC on the Japanese side.]
• CER reduced by 5.7% and 4.6% (relative) for the "GPP+IC" and "GPP+DC" cases
• CER reduced from 17.3% to 15.9% (8.0% relative) for the "GPP+IC+DC" case

Incorporation of IC and DC Measures (English)
[Bar chart: CER (%) for GPP, GPP+IC, GPP+DC, and GPP+IC+DC on the English side.]
• Similar performance on the English side
• CER reduced from 15.3% to 14.4% for the "GPP+IC+DC" case

Conclusions
• Proposed a novel utterance verification scheme incorporating "high-level" knowledge
  – In-domain confidence: degree of match between the utterance and the application domain
  – Discourse coherence: consistency between consecutive utterances
• The two proposed measures are effective: relative reductions in CER of 8.0% (Japanese) and 6.1% (English)

Future Work
• "High-level" content-based verification
  – Ignore ASR errors that do not affect translation quality
  – Further improvement in performance
• Topic switching
  – Determine when users switch tasks (currently a single task per dialogue session is considered)
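For reference, the relative CER reductions quoted in the conclusions follow directly from the absolute figures reported above; for the Japanese side:

$\text{relative CER reduction} = \dfrac{CER_{\mathrm{GPP}} - CER_{\mathrm{GPP+IC+DC}}}{CER_{\mathrm{GPP}}} = \dfrac{17.3\% - 15.9\%}{17.3\%} \approx 8.0\%$

The English figure follows the same computation; with the rounded values quoted above (15.3% → 14.4%) it comes to roughly 5.9%, so the 6.1% in the conclusions is presumably computed from unrounded CER values.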