Automatic phonetic transcription for Danish ASR - The Bridge

Andreas Søeborg Kirkedal
AT&T & CBS
31 January 2014
Outline
1. ASR
2. Automatic Speech Recognition
   What happens?
   Focus
3. Dictionaries
   Automatic transcription
4. Data
   Text
   Speech
   Systems
5. Experiments
   Kaldi ASR systems
6. Is it good enough
   Critique
Automatic Speech Recognition
What happens?
This is what happens
Focus
The topic of today
Focus
Overview
[Figure: the three components of the ASR system: acoustic model (AM), pronunciation dictionary (PD), and language model (LM)]
AM: b-0, F, 6
PD:
  AFVIST    A w v i: s d-0
  BIAVL     b-0 i A w l
  FORÆRE    f V E: 6
  TRETTEN   t-h R { d-0 @ n
LM: FORÆRE BIAVL
Dictionaries
Cost barrier
Large
  Out-of-vocabulary errors
  Domain
Expensive
  Pronunciation variants
  Quality
Genre
  Read-aloud speech
  Dictation
Automatic transcription
Automatic transcription is not as good as manual
Not always
But it is
Cheaper
Quicker
Can it be good enough?
Automatic transcription
eSpeak
Open-source TTS engine
  Downloadable
  Multiple languages supported (ca. 50)
Advanced transcription
  Includes suprasegmental features
  Variant of IPA
Quality
  Accuracy is very important
  Maturity of the specified language
Automatic transcription
eSpeak [Duddington, 2010]
Rewriting system (shallow)
  Spelling-to-phoneme rules
  Case-by-case rules
Developed by several people
  Quality depends on the developer
  Danish native speaker (current)
    Language professional
    Linguistics background?
Complexity
  8600 spelling-to-phoneme rules
  11000 word/word-class rules
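The shallow rewriting approach described above can be illustrated with a minimal sketch. The rules and phone symbols below are invented for illustration and are not taken from eSpeak's Danish rule files. The sketch only shows the general idea: a word-level exception list is consulted first, then longest-match spelling-to-phoneme rules are applied from left to right.

```python
# Toy spelling-to-phoneme rewriter. All rules and phone symbols here are
# hypothetical placeholders, not eSpeak's actual Danish rules.

# Word-level exceptions are consulted before the letter rules.
WORD_RULES = {
    "og": ["V"],
}

# Letter-sequence rules: grapheme string -> phone list.
SPELLING_RULES = {
    "tt": ["d"],
    "t": ["t"],
    "r": ["R"],
    "e": ["E"],
    "n": ["n"],
}


def transcribe(word: str) -> list[str]:
    """Return a phone list for `word` using longest-match rewriting."""
    word = word.lower()
    if word in WORD_RULES:
        return list(WORD_RULES[word])
    phones: list[str] = []
    max_len = max(len(grapheme) for grapheme in SPELLING_RULES)
    i = 0
    while i < len(word):
        # Try the longest grapheme sequence first.
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in SPELLING_RULES:
                phones.extend(SPELLING_RULES[chunk])
                i += size
                break
        else:
            raise KeyError(f"no rule for {word[i]!r} in {word!r}")
    return phones
```

A real rule set of the size quoted on the slide would also need context conditions (what precedes and follows a grapheme), which this sketch leaves out.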
Automatic transcription
Phonix [Henrichsen, 2014]
Linguistic preprocessing
  Deep morphological analysis
  Decomposition
Several fall-back strategies
Several knowledge bases
  Dictionaries
  Letter-to-sound rules
KBs must agree on transcription
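How Phonix combines its knowledge bases is not spelled out on the slide. One possible reading of "KBs must agree" together with "several fall-back strategies" is sketched below. The function names, the agreement test, and the order of fall-backs are all assumptions made for illustration, not Phonix's actual logic.

```python
# Hypothetical sketch: accept a transcription only when the knowledge bases
# that know the word agree; otherwise try fall-back strategies in order.
from typing import Callable, Optional, Sequence

# A transcriber returns a transcription string, or None if it has no answer.
Transcriber = Callable[[str], Optional[str]]


def agreed_transcription(word: str, kbs: Sequence[Transcriber]) -> Optional[str]:
    """Return the transcription if all KBs that answer give the same one."""
    answers = {kb(word) for kb in kbs}
    answers.discard(None)
    return answers.pop() if len(answers) == 1 else None


def transcribe(word: str, kbs: Sequence[Transcriber],
               fallbacks: Sequence[Transcriber]) -> Optional[str]:
    """Try KB agreement first, then each fall-back strategy in order."""
    result = agreed_transcription(word, kbs)
    if result is not None:
        return result
    for strategy in fallbacks:
        result = strategy(word)
        if result is not None:
            return result
    return None
```

A dictionary lookup such as `{"tretten": "tRAdn"}.get` and a letter-to-sound function can both be passed as `kbs`, while decomposition-based strategies would go in `fallbacks`.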
You just heard everything in more detail anyway
Text
Data
Proprietary
  General
  Commands
  Names
  Spelling
  Numbers
  Medical diagnoses
Public
  Danish parliament (Folketinget)
Text
Preparation
1. Uniform encoding
2. Cleaning up text
   Remove duplicates, e.g.:
   '4' : 'fire'
3. Handle abbreviations
4. Tokenisation
5. Expand dates and numbers
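The preparation steps above can be sketched as a small pipeline. The abbreviation and digit tables are tiny hypothetical examples, not the resources used for the talk, and the sketch only expands single digits (no dates or multi-digit numbers).

```python
# Sketch of the five preparation steps. The lookup tables are illustrative
# placeholders; a real system needs far larger resources.
import re
import unicodedata

ABBREVIATIONS = {"f.eks.": "for eksempel", "bl.a.": "blandt andet"}
DIGITS = {"0": "nul", "1": "en", "2": "to", "3": "tre", "4": "fire",
          "5": "fem", "6": "seks", "7": "syv", "8": "otte", "9": "ni"}


def prepare(line: str) -> list[str]:
    """Normalise one line of text and return its tokens."""
    # 1. Uniform encoding: normalise to a single Unicode form.
    line = unicodedata.normalize("NFC", line)
    # 2. Clean up: lower-case and collapse whitespace.
    line = re.sub(r"\s+", " ", line.lower()).strip()
    # 3. Handle abbreviations before punctuation is dropped.
    for abbreviation, expansion in ABBREVIATIONS.items():
        line = line.replace(abbreviation, expansion)
    # 4. Tokenise: keep runs of letters or runs of digits, drop punctuation.
    tokens = re.findall(r"[^\W\d_]+|\d+", line)
    # 5. Expand numbers (single digits only here), so that '4' and 'fire'
    #    end up as the same token.
    return [DIGITS.get(token, token) for token in tokens]
```

Expanding digits after tokenisation means that "4" and "fire" are counted as one word type in the language model, which is the kind of duplicate the clean-up step is meant to remove.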
Speech
Mirsk parallel data
16 kHz
Mono
Genre
  Read-aloud speech
  Dictation
Gender
Accent and Dialect
Age
350 speakers
Systems
Text databases
1. Mirsk 3gram LM
2. Mirsk 4gram LM

1. eSpeak dictionary
2. Phonix dictionary
3. Combination dictionary
   Map from one phonetic alphabet to the other
Systems
Multiple transcriptions
Tretten (13): tRAdn, tRAt@n
Systems
Multiple transcriptions
2012: totusEnVtol, tywetol
Systems
Phone mapping
{    &    ?&
{:   &    ?&
@    @    3
2    ?W   W
2:   ?W   W
6    V
6:   V:
9    W
9:   W
a    &
A    A
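A mapping table like the one above can be applied mechanically to bring two dictionaries into one phonetic alphabet before merging them. The sketch below assumes that the first column is the source symbol and that the first listed alternative is the target; neither the direction nor that choice is stated on the slide. The function names are hypothetical.

```python
# Sketch: convert transcriptions between phone alphabets and merge two
# dictionaries. Mapping direction and choice of alternative are assumptions.
PHONE_MAP = {
    "{": "&", "{:": "&", "@": "@", "2": "?W", "2:": "?W",
    "6": "V", "6:": "V:", "9": "W", "9:": "W", "a": "&", "A": "A",
}

Lexicon = dict[str, list[list[str]]]  # word -> list of pronunciation variants


def map_phones(phones: list[str]) -> list[str]:
    """Map each phone; symbols without an entry pass through unchanged."""
    return [PHONE_MAP.get(phone, phone) for phone in phones]


def combine(primary: Lexicon, secondary: Lexicon) -> Lexicon:
    """Merge two lexicons, keeping each distinct variant once per word."""
    merged = {word: [list(variant) for variant in variants]
              for word, variants in primary.items()}
    for word, variants in secondary.items():
        for variant in variants:
            mapped = map_phones(variant)
            if mapped not in merged.setdefault(word, []):
                merged[word].append(mapped)
    return merged
```

Words on which both sources agree after mapping keep a single entry, while disagreements show up as multiple transcriptions for the same word.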
Experiments
Kaldi ASR systems
Training
1. Train monophone system
2. Create triphone system from monophone system
3. Optimise systems with different metrics (WER/MMI)
4. Use different combinations of LMs
5. Use additional features
Kaldi ASR toolkit [Povey et al., 2011]
Kaldi ASR systems
Baselines
eSpeak
System  LM     WER (%)
mono    3gram  50.33
mono    4gram  50.16
tri1    3gram  27.25
tri1    4gram  26.27
tri2a   3gram  24.72
tri2a   4gram  24.25
tri2b   3gram  22.88
tri2b   4gram  22.60
tri3b   3gram  25.76
tri3b   4gram  25.48
tri4a   3gram  21.06
Phonix
System  LM
mono    3gram, 4gram
tri1    3gram, 4gram
tri2a   3gram, 4gram
tri2b   3gram, 4gram
tri3b   3gram, 4gram
tri4a   3gram
WER (%), six values in slide order: 60.09, 60.09, 37.80, 36.92, 33.77, 31.13
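The tables report word error rate (WER). The slides do not define it; the sketch below uses the standard definition, the word-level edit distance between reference and hypothesis divided by the number of reference words. It is a plain illustration and not the scoring script used in the experiments.

```python
# Word error rate via Levenshtein distance over word tokens.
def wer(reference: list[str], hypothesis: list[str]) -> float:
    """Return WER in percent: (substitutions + deletions + insertions) / N."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(reference)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.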
Kaldi ASR systems
Combination
Configuration  WER (%)
System  LM     eSpeak  Combination
mono    3gram  50.33   50.21
mono    4gram  50.16   50.23
tri1    3gram  27.25   26.92
tri1    4gram  26.27   -
tri2a   3gram  24.72   25.82
tri2a   4gram  24.25   -
tri2b   3gram  22.88   23.16
tri2b   4gram  22.60   -
tri3b   3gram  25.76   20.52
tri3b   4gram  25.48   -
tri4a   3gram  21.06   18.55
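To put the difference between the two dictionaries in perspective, the relative WER reduction can be computed from the figures in the table. The helper name below is hypothetical; the two WER values are the tri4a, 3gram results reported above.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction, in percent of the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# tri4a, 3gram LM: eSpeak 21.06 vs. combination 18.55
print(f"{relative_reduction(21.06, 18.55):.1f}%")  # prints 11.9%
```

The absolute gain of 2.51 points thus corresponds to roughly a 12% relative reduction for the best system.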
Is it good enough
Critique
Comparison
Genre
  Dictation task is more difficult
  Dictation is closer to the post-editing (PE) scenario
Danish state-of-the-art ASR: 20-50% WER [Molgaard et al., 2007]
Best system: 18.55% WER
Commercial system WER
  See e.g. META-NET white paper [Pedersen et al., 2012]
Critique
Medical dictation
More difficult task
  LM accuracy
  Very large vocabulary
  Low error tolerance
World knowledge
  Medical history
  Treatment codes
Critique
Might be good enough for ...
Post-editing
Speech-to-speech translation
  Access to additional knowledge
  Pass n-best list to MT system
  Leverage information in translation probabilities to choose the right translation
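The idea of passing an n-best list to the MT system can be sketched as a simple rescoring step. Everything below is a hypothetical illustration: the `Hypothesis` type, the `translate` callback returning a translation with its log-probability, and the log-linear combination with `mt_weight` are assumptions, not a description of an existing system.

```python
# Hypothetical sketch of n-best rescoring: combine each ASR hypothesis score
# with a translation-model score and keep the best-scoring pair.
from typing import Callable, NamedTuple


class Hypothesis(NamedTuple):
    text: str
    asr_logprob: float  # log-probability assigned by the recogniser


def pick_translation(nbest: list[Hypothesis],
                     translate: Callable[[str], tuple[str, float]],
                     mt_weight: float = 1.0) -> tuple[str, str]:
    """Return (source, translation) maximising ASR + weighted MT log-prob."""
    best_score = float("-inf")
    best_pair = ("", "")
    for hyp in nbest:
        translation, mt_logprob = translate(hyp.text)
        score = hyp.asr_logprob + mt_weight * mt_logprob
        if score > best_score:
            best_score, best_pair = score, (hyp.text, translation)
    return best_pair
```

With `mt_weight=0.0` the function simply returns the recogniser's top hypothesis, so the weight controls how much the translation probabilities are allowed to override the ASR ranking.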
References
Duddington, J. (2010). eSpeak text to speech.
Henrichsen, P. (2014). Phonix: Danish grapheme-to-phoneme transcriber. To appear in Translation in transition: between cognition, computing and technology, 1.
Molgaard, L., Jorgensen, K., and Hansen, L. K. (2007). Castsearch - context based spoken document retrieval. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 4, pages IV93. IEEE.
Pedersen, B. S., Wedekind, J., Bøhm-Andersen, S., Henrichsen, P. J., Hoffensetz-Andresen, S., Kirchmeier-Andersen, S., Kjærum, J. O., Larsen, L. B., Maegaard, B., Nimb, S., Rasmussen, J.-E., Revsbech, P., and Thomsen, H. E. (2012). Det danske sprog i den digitale tidsalder / The Danish Language in the Digital Age. META-NET White Paper Series. Georg Rehm and Hans Uszkoreit (Series Editors). Springer. Available online at http://www.meta-net.eu/whitepapers.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., and Vesely, K. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society. IEEE Catalog No.: CFP11SRW-USB.