
Incorporating In-domain Confidence and
Discourse Coherence Measures in
Utterance Verification
(Detection of speech recognition errors using in-domain confidence and discourse coherence)
Ian R. Lane, Tatsuya Kawahara
Spoken Language Communications Research Laboratories, ATR
School of Informatics, Kyoto University
Introduction
• Current ASR technologies not robust against:
– Acoustic mismatch: noise, channel, speaker variance
– Linguistic mismatch: disfluencies, OOV, OOD
• Assess confidence of the recognition hypothesis and detect recognition errors
→ effective user feedback
• Select recovery strategy based on type of error and
specific application
Previous Work on Confidence Measures
• Feature-based
– [Kemp] word-duration, AM/LM back-off
• Explicit model-based
– [Rahim] likelihood ratio test against cohort model
• Posterior probability
– [Komatani, Soong, Wessel] estimate posterior probability
given all competing hypotheses in a word-graph
Approaches limited to “low-level” information
available during ASR decoding
Proposed Approach
• Exploit knowledge sources outside ASR framework for
estimating recognition confidence
e.g. knowledge about application domain, discourse flow
Incorporate CM based on “high-level”
knowledge sources
• In-domain confidence
– degree of match between utterance and application domain
• Discourse coherence
– consistency between consecutive utterances in dialogue
Utterance Verification Framework
[Block diagram: the current utterance Xi and the preceding utterance Xi-1 each pass through the ASR front-end and an out-of-domain detection stage (topic classification + in-domain verification); the topic distance dist(Xi, Xi-1) links the two paths.]

CMin-domain(Xi): in-domain confidence
CMdiscourse(Xi|Xi-1): discourse coherence
CM(Xi): joint confidence score, combining the above with the generalized posterior probability CMgpp(Xi)
In-domain Confidence
• Measure of topic consistency with application domain
– Previously applied in out-of-domain utterance detection
Examples of errors detected via in-domain confidence
Mismatch of domain
REF: How can I print this WORD file double-sided
ASR: How can I open this word on the pool-side
hypothesis not consistent in topic → in-domain confidence low
Erroneous recognition hypothesis
REF: I want to go to Kyoto, can I go by bus
ASR: I want to go to Kyoto, can I take a bath
hypothesis not consistent in topic → in-domain confidence low
(REF: correct transcription; ASR: speech recognition hypothesis)
In-domain Confidence
• Input utterance Xi (recognition hypothesis)
• Transformation to vector space → feature vector
• Classification of multiple topics (SVMs 1~m) → topic confidence scores (C(t1|Xi), …, C(tm|Xi))
• In-domain verification → Vin-domain(Xi) → CMin-domain(Xi): in-domain confidence
In-domain Confidence
Worked example:
• Input utterance Xi (recognition hypothesis): e.g. "could I have a non-smoking seat"
• Transformation to vector space: feature vector over (a, an, …, room, …, seat, …, I+have, …) = (1, 0, …, 0, …, 1, …, 1, …)
• Classification of multiple topics (SVMs 1~m): topic confidence scores, e.g. accom. 0.05, airplane 0.36, airport 0.94, …
• In-domain verification: Vin-domain(Xi) → CMin-domain(Xi) = 90%
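As a rough illustration of the vector-space step above, the sketch below builds a binary bag-of-words vector with adjacent word-pair features (such as "I+have") from a recognition hypothesis. The vocabulary, tokenization, and pairing scheme are simplified assumptions for illustration, not the exact feature set used in the system.

```python
def extract_features(hypothesis, vocabulary):
    """Map an ASR hypothesis to a binary feature vector over the given vocabulary."""
    words = hypothesis.lower().split()
    pairs = [f"{a}+{b}" for a, b in zip(words, words[1:])]  # adjacent word-pair features, e.g. "i+have"
    present = set(words) | set(pairs)
    return [1 if feature in present else 0 for feature in vocabulary]

# Toy vocabulary mixing unigram and word-pair features (illustrative only)
vocabulary = ["a", "an", "room", "seat", "i+have", "non-smoking"]
print(extract_features("could I have a non-smoking seat", vocabulary))
# -> [1, 0, 0, 1, 1, 1]
```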
In-domain Verification Model
• Linear discriminant verification model applied:

$V_{\text{in-domain}}(X_i) = \sum_{j=1}^{m} \lambda_j \, C(t_j \mid X_i)$

C(tj|Xi): topic classification confidence score of topic tj for input utterance Xi
λj: discriminant weight for topic tj

• λ1, …, λm trained on in-domain data using "deleted interpolation of topics" and GPD [Lane '04]

$CM_{\text{in-domain}}(X_i) = \operatorname{sigmoid}\bigl(V_{\text{in-domain}}(X_i)\bigr)$
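A minimal sketch of the in-domain confidence computation, assuming the topic confidence scores C(tj|Xi) are already available. The sigmoid form and the weight/parameter values below are illustrative placeholders; in the system the weights λj are trained with GPD and the sigmoid transform on the development set.

```python
import math

def sigmoid(x, a=1.0, b=0.0):
    """Sigmoid transform 1/(1+exp(-a(x-b))); assumed form, with a and b trained in practice."""
    return 1.0 / (1.0 + math.exp(-a * (x - b)))

def in_domain_confidence(topic_scores, weights, a=1.0, b=0.0):
    """CM_in-domain(Xi) = sigmoid(V_in-domain(Xi)), with V a weighted sum of topic scores."""
    v = sum(lam * c for lam, c in zip(weights, topic_scores))  # V_in-domain(Xi)
    return sigmoid(v, a, b)

# Placeholder topic scores C(tj|Xi) and weights lambda_j (not the trained values)
topic_scores = [0.05, 0.36, 0.94]
weights = [0.4, 0.3, 0.3]
print(in_domain_confidence(topic_scores, weights))
```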
Discourse Coherence
• Topic consistency with preceding utterance
Examples of errors detected via discourse-coherence
Erroneous recognition hypothesis
Speaker A: Previous utterance [Xi-1]
REF: What type of shirt are you looking for?
ASR: What type of shirt are you looking for?
Speaker B: Current utterance [Xi]
REF: I’m looking for a white T-shirt.
ASR: I’m looking for a white teacher.
topic not consistent across utterances → discourse coherence low
(REF: correct transcription; ASR: speech recognition hypothesis)
Discourse Coherence
• Euclidean distance between current (Xi) and
previous (Xi-1) utterances in topic confidence space
$\operatorname{dist}_{\text{Euclidean}}(X_i, X_{i-1}) = \sqrt{\sum_{j=1}^{m} \bigl(C(t_j \mid X_i) - C(t_j \mid X_{i-1})\bigr)^2}$

$CM_{\text{discourse}}(X_i \mid X_{i-1}) = \operatorname{sigmoid}\bigl(\operatorname{dist}_{\text{Euclidean}}(X_i, X_{i-1})\bigr)$
• CMdiscourse large when Xi, Xi-1 related, low when differ
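A corresponding sketch of the discourse coherence measure: the Euclidean distance between the topic-confidence vectors of the current and previous utterances, passed through a sigmoid transform. The sigmoid parameters here are placeholders; a negative scale is used so that larger distances map to lower confidence, matching the bullet above, whereas in the system the transform is trained on the development set.

```python
import math

def sigmoid(x, a=-1.0, b=0.0):
    """Sigmoid transform; a negative scale maps larger distances to lower confidence."""
    return 1.0 / (1.0 + math.exp(-a * (x - b)))

def discourse_coherence(topic_scores_cur, topic_scores_prev, a=-1.0, b=0.0):
    """CM_discourse(Xi | Xi-1) from the Euclidean distance in topic-confidence space."""
    dist = math.sqrt(sum((c - p) ** 2 for c, p in zip(topic_scores_cur, topic_scores_prev)))
    return sigmoid(dist, a, b)

# Topically similar utterances -> small distance -> higher coherence
print(discourse_coherence([0.1, 0.8, 0.1], [0.2, 0.7, 0.1]))
# Topically different utterances -> large distance -> lower coherence
print(discourse_coherence([0.1, 0.8, 0.1], [0.9, 0.1, 0.0]))
```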
Joint Confidence Score
Generalized Posterior Probability
• Confusability of recognition hypothesis against
competing hypotheses [Lo & Soong]
• At utterance level (over the l words of the hypothesis):

$CM_{\text{gpp}}(X) = \sum_{j=1}^{l} \operatorname{GWPP}(x_j)$
GWPP(xj): generalized word posterior probability of xj
xj: j-th word in recognition hypothesis of X
Joint Confidence Score
$CM(X_i) = \lambda_{\text{gpp}}\, CM_{\text{gpp}}(X_i) + \lambda_{\text{in-domain}}\, CM_{\text{in-domain}}(X_i) + \lambda_{\text{discourse}}\, CM_{\text{discourse}}(X_i \mid X_{i-1})$

where $\lambda_{\text{gpp}} + \lambda_{\text{in-domain}} + \lambda_{\text{discourse}} = 1$
• For utterance verification, compare CM(Xi) to a verification threshold
• Model weights (λgpp, λin-domain, λdiscourse) and the threshold trained on development set
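A sketch of the joint confidence score and the utterance-level accept/reject decision. The weight values and threshold below are illustrative placeholders; in the system they are trained on the development set.

```python
def joint_confidence(cm_gpp, cm_in_domain, cm_discourse, weights=(0.5, 0.3, 0.2)):
    """CM(Xi) as a weighted combination of the three measures (weights sum to 1)."""
    w_gpp, w_id, w_dc = weights
    return w_gpp * cm_gpp + w_id * cm_in_domain + w_dc * cm_discourse

def verify_utterance(cm, threshold=0.6):
    """Accept the recognition hypothesis if its joint confidence reaches the threshold."""
    return cm >= threshold

cm = joint_confidence(cm_gpp=0.72, cm_in_domain=0.90, cm_discourse=0.55)
print(cm, verify_utterance(cm))  # 0.74 -> accepted with this placeholder threshold
```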
Experimental Setup
• Training-set: ATR BTEC
(basic-travel-expressions-corpus)
– ~400k sentences (Japanese/English pairs)
– 14 topic classes (accommodation, shopping, transit, …)
– Train: topic-classification + in-domain verification models
• Evaluation data: ATR MAD (machine aided dialogue)
– Natural dialogue between English and Japanese speakers via
ATR speech-to-speech translation system
– Dialogue data collected based on set of pre-defined scenarios
– Development-set: 270 dialogues
– Test-set: 90 dialogues
On the development set, train:
– CM sigmoid transforms
– CM weights (λgpp, λin-domain, λdiscourse)
– Verification threshold
Speech Recognition Performance
• ASR performed with ATRASR; a 2-gram LM applied during decoding, and the lattice rescored with a 3-gram LM
              # dialogues   Japanese side                  English side
                            # utt.   WER     SER           # utt.   WER     SER
Development       270       2674     10.5%   41.9%         3091     17.0%   63.5%
Test               90       1011     10.7%   42.3%         1006     16.2%   55.2%
Evaluation Measure
• Utterance-based Verification
– No definite “keyword” set in S-2-S translation
– If recognition error occurs (one or more errors)
→ prompt user to rephrase entire utterance
• CER (confidence error rate)
– FA: false acceptance of incorrectly recognized utterance
– FR: false rejection of correctly recognized utterance
$\text{CER} = \dfrac{\#\text{FA} + \#\text{FR}}{\#\text{utterances}}$
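For concreteness, a small sketch of the CER computation, where each utterance is labelled as correctly recognized or not and as accepted or rejected by the verifier; the example labels are illustrative.

```python
def confidence_error_rate(results):
    """results: list of (correctly_recognized, accepted) pairs, one per utterance."""
    fa = sum(1 for correct, accepted in results if not correct and accepted)  # false acceptances
    fr = sum(1 for correct, accepted in results if correct and not accepted)  # false rejections
    return (fa + fr) / len(results)

# Four utterances: one false acceptance and one false rejection -> CER = 0.5
print(confidence_error_rate([(True, True), (False, True), (True, False), (False, False)]))
```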
GPP-based Verification Performance
• Accept All: Assume all utterances are correctly recognized
• GPP: Generalized posterior probability
[Bar chart: CER (%) of "Accept All" vs. GPP-based verification, Japanese and English sides]
Large reduction in verification errors compared with the "Accept All" case
→ CER of 17.3% (Japanese) and 15.3% (English)
Incorporation of IC and DC Measures
(Japanese)
GPP: Generalized posterior probability
IC: In-domain confidence
DC: Discourse coherence
[Bar chart: CER (%) of GPP, GPP+IC, GPP+DC, and GPP+IC+DC on the Japanese side (y-axis 12.0–18.0%)]
CER reduced by 5.7% and 4.6% for the "GPP+IC" and "GPP+DC" cases
→ CER 17.3% → 15.9% (8.0% relative) for the "GPP+IC+DC" case
Incorporation of IC and DC Measures
(English)
GPP: Generalized posterior probability
IC: In-domain confidence
DC: Discourse coherence
[Bar chart: CER (%) of GPP, GPP+IC, GPP+DC, and GPP+IC+DC on the English side (y-axis 12.0–18.0%)]
Similar performance for the English side
→ CER 15.3% → 14.4% for the "GPP+IC+DC" case
Conclusions
• Proposed novel utterance verification scheme incorporating "high-level" knowledge
– In-domain confidence: degree of match between utterance and application domain
– Discourse coherence: consistency between consecutive utterances
• Two proposed measures effective
– Relative reduction in CER of 8.0% and 6.1% (Japanese/English)
Future work
• "High-level" content-based verification
– Ignore ASR errors that do not affect translation quality
– Further improvement in performance
• Topic switching
– Determine when users switch tasks
– Currently consider a single task per dialogue session