Bootstrapping a Language-Independent Synthesizer

Craig Olinsky
Media Lab Europe / University College Dublin
15 January 2002
Introducing the Problem
Given a set of recordings and transcriptions in an
arbitrary language, can we quickly and easily build
a speech synthesizer?
YES, if we know something about the language.
However, for the majority of languages for which such resources
don’t exist…
Starting from Sample
PROS
• The existing synthesizer provides a store of “linguistic”
  knowledge we can start from.
• Analogous to speaker adaptation in speech recognition systems.
• Overall, quality should be better.
CONS
• Difficulty depends on the degree of difference between
  the sample and target languages.
• Best as a gradual process: accent/dialect, not language.
Starting from Scratch
PROS
• Difficulty directly proportional to complexity of the language.
• Common procedure based upon machine learning from
  recordings and transcripts.
CONS
• We don’t have a great deal of relevant knowledge to apply to
  the task.
• If not using a principled phone set, it is necessary to
  segment / label recordings cleanly.
The Obvious Compromise
Take what we do know from building speech synthesis,
and generalize it to an existing framework.
-- we’re not specifically learning from “scratch”
-- at the same time, we’re not making linguistic
assumptions pre-coded into the source voices
“Generic” Synthesis
Framework/Toolkit
• Set of scripts, utilities, and definition files to help
  automate the creation of reasonable speech synthesis voices
  from an arbitrary language without the need for linguistic
  or language-specific information.
• Built on top of the Festival Speech Synthesis System
  and FestVox toolkit (for waveform synthesis; most of
  text processing and pronunciation handling
  externalized to locally developed tools).
Language-Dependent
Synthesis Components
• Phone set
• Word pronunciation (lexicon and/or letter-to-sound rules)
• Token processing rules (numbers etc.)
• Durations
• Intonation (accents and F0 contour)
• Prosodic phrasing method
Phoneme Sets
• If we rely on a pre-existing set of pronunciation rules,
  lexicon, etc., we are automatically limited to using the
  phone set used in those resources (or something
  to which they can be mapped); most likely something
  language-dependent.
• IPA, SAMPA: something language-universal?
• We need to generate pronunciations: how do we
  create the relationship between our training database
  / phonetic representation / orthography?
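The "something which they can be mapped to" constraint can be made concrete with a minimal sketch. The mapping table below is hypothetical (a handful of ARPAbet-style labels mapped to SAMPA-style symbols, chosen only for illustration), and the function name is an assumption, not part of Festival or FestVox. The point is that any phone in an inherited lexicon without an entry in the table blocks reuse of that lexicon.

```python
# Hypothetical mapping from a source lexicon's phone labels to
# SAMPA-style symbols; the entries are illustrative, not a full table.
SOURCE_TO_TARGET = {"k": "k", "ae": "{", "t": "t", "d": "d", "g": "g"}


def map_pronunciation(phones, table):
    """Translate a pronunciation into the target phone set.

    Raises KeyError if any source phone has no counterpart, which is
    the case where an existing resource cannot be reused as-is.
    """
    missing = [p for p in phones if p not in table]
    if missing:
        raise KeyError(f"no mapping for phones: {missing}")
    return [table[p] for p in phones]


mapped = map_pronunciation(["k", "ae", "t"], SOURCE_TO_TARGET)
```

A phone outside the table (for example a label such as "zh" here) raises an error rather than being silently dropped, which mirrors the limitation described above.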
“Multilingual” Phoneme
Sets: IPA, SAMPA
We don’t want to be stuck with a set of phonemes
targeted for a specific language, so we instead use a
phoneme definition designed to be inclusive of all languages.
But… this still assumes we know the relationship
between the phone set and orthography of the language;
i.e. for any given text we can generate a pronunciation.
This approach still assumes linguistic knowledge!
Orthography as
Pronunciation
cf: R. Singh, B. Raj and R.M. Stern, “Automatic Generation of Phone Sets
and Lexical Transcriptions;” ..
Suppose we begin with the orthography of the written
language.
e.g. CAT = [c] [a] [t]
DOG = [d] [o] [g]
This implies
• A relation between number of characters in a spelling and
the length of the pronunciation
• The orthography of a language is consistent / efficient
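The CAT / DOG example can be sketched in a few lines, assuming the simplest possible reading of the idea: every letter of the spelling becomes one phone-like unit, and the phone set is whatever letters occur in the transcripts. The function names are hypothetical, and this is not the procedure from the cited paper, only an illustration of the starting assumption.

```python
def grapheme_pronunciation(word):
    """Treat each letter of the spelling as one phone-like unit."""
    return [ch for ch in word.lower() if ch.isalpha()]


def build_lexicon(words):
    """Map each word to its letter-by-letter 'pronunciation'."""
    return {w: grapheme_pronunciation(w) for w in words}


lexicon = build_lexicon(["CAT", "DOG"])

# The induced "phone set" is just the inventory of letters seen so far.
phone_set = sorted({p for pron in lexicon.values() for p in pron})
```

Both implications listed above are visible here: the pronunciation always has exactly as many units as the spelling has letters, and the approach only works well to the extent that each letter is used consistently in the language.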
Orthography as
Pronunciation
Implications for Data
Labeling and Training
Non-Roman Orthography:
Questions of Transcription
Difficulties in Machine
Learning of Pronunciation
“But there is a much more fundamental problem … in that it crucially assumes
that letter-to-phoneme correspondences can in general be determined on the basis
of information local to a particular portion of the letter string. While this is clearly
true in some languages (e.g. Spanish), it is simply false for others….
“…It is unreasonable to expect that good results will be obtained from a system
trained with no guidance of this kind, or … with data that is simply insufficient to the
task.”
– Sproat et al., Multilingual Text-to-Speech Synthesis: The Bell Labs Approach, pp. 76-77
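The locality assumption in the quotation can be illustrated with a small sketch. The function below is hypothetical and not taken from the Bell Labs system; it only extracts the fixed-width letter window that a purely local letter-to-sound learner would see for each letter. The English pair used is an illustration added here: the first "o" in "photograph" and in "photography" is pronounced differently, yet the local window around it is identical.

```python
def letter_windows(word, width=1, pad="#"):
    """Return, for each letter, the window of `width` letters on each side.

    Positions beyond the word boundary are filled with `pad`.
    """
    padded = pad * width + word.lower() + pad * width
    return [padded[i:i + 2 * width + 1] for i in range(len(word))]


# Window around the first "o" (index 2) in two related words.
w1 = letter_windows("photograph", width=2)[2]
w2 = letter_windows("photography", width=2)[2]
same_local_context = (w1 == w2)
```

Since both windows are the same string, a model restricted to this context cannot assign the two letters different sounds; it would need a wider window or additional information such as stress or morphology.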
Lexicon / Letter-to-Sound Rules
Token Processing
Duration and Stress
Modeling
Intonation and Phrasing
Unit Selection and
Waveform Synthesis
Overview: Adaptation for
Accent and Dialect
Final Points