
Modeling and Generation of Accentual Phrase F0 Contours Based on Discrete HMMs Synchronized at Mora-Unit Transitions
Atsuhiro Sakurai (Texas Instruments Japan, Tsukuba R&D Center)
Koji Iwano (currently with Tokyo Institute of Technology, Japan)
Keikichi Hirose (Dep. of Frontier Eng., The University of Tokyo, Japan)
Introduction to Corpus-Based Intonation Modeling
• Traditional approach: rules derived from linguistic expertise
Human-dependent (complicated and not fully satisfactory, because the phenomena involved are not completely understood)
• Corpus-based approach: modeling derived from statistical analysis of speech corpora
Automatic (potential to improve as better speech corpora become available)
Background
• HMMs are widely used in speech recognition, and fast training algorithms exist
• Macroscopic discrete HMMs associated with accentual phrases can store information such as accent type and prosodic structure
• Morae are essential for describing Japanese intonation: sequences of high and low morae characterize accent types
Overview of the Method
• Definition of HMM and alphabet:
– Accent types modeled by discrete HMMs
– 2-code mora F0 contour alphabet used as output symbols
– State transitions synchronized with mora transitions
• Classification of HMMs and training:
– HMMs classified according to linguistic attributes
– Training by the standard forward-backward (Baum-Welch) algorithm
• Generation of F0 contours:
– Best sequence of symbols generated by a modified Viterbi algorithm
The Mora-F0 Alphabet
• Two codes: stylized mora F0 contours and mora-to-mora F0: 34 symbols each
• Obtained by LBG clustering from a 500-sentence database (ATR continuous speech database, speaker MHT); a minimal clustering sketch follows
• The entire database is labeled using the 2-code symbols.
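For concreteness, here is a minimal LBG (generalized Lloyd) codebook-training sketch in Python. The function name and the NumPy formulation are ours; the slides only state that the 34-symbol codebooks were obtained by LBG clustering.

import numpy as np

def lbg_codebook(vectors, size, eps=0.01, lloyd_iters=20):
    """Grow a codebook by centroid splitting, refining with Lloyd iterations."""
    codebook = vectors.mean(axis=0, keepdims=True)   # start from the global mean
    while codebook.shape[0] < size:
        # split every centroid into a +/- perturbed pair
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(lloyd_iters):                 # Lloyd (k-means) refinement
            dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            nearest = dist.argmin(axis=1)
            for k in range(codebook.shape[0]):
                members = vectors[nearest == k]
                if len(members) > 0:
                    codebook[k] = members.mean(axis=0)
    # splitting doubles the codebook, so truncate to the requested size,
    # e.g. size=34; the truncation is a simplification of full LBG
    return codebook[:size]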
The Accentual Phrase HMM
• Accentual phrases are classified according to:
– Accent type
– Position of accentual phrase in the sentence
– (Optional: number of morae, part-of-speech,
syntactic structure)
[Figure: an accentual-phrase HMM; state transitions are synchronized with the mora transitions within the accentual phrase.]
Example: "Karewa Tookyookara kuru." (He comes from Tokyo)

Phrase             Label sequence            Accent type   Position
M1 (karewa)        [ ],[ ],[ ]               1             1
M2 (Tookyookara)   [ ],[ ],[ ],[ ],[ ],[ ]   0             2
M3 (kuru)          [ ],[ ]                   1             3

Each mora label [ ] is a pair (shape_i, F0_i), i.e. [shape1, F01], [shape2, F02], ...: the stylized-contour code and the mora F0 code.
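The labeling of this sentence can be pictured as a simple data structure. All numeric codes below are made-up placeholders for illustration, not values from the database:

# Hypothetical representation of the labeled example sentence.
sentence = [
    {"phrase": "karewa",       # M1: 3 morae
     "morae": [(12, 5), (7, 19), (3, 22)],   # (shape code, mora-F0 code) per mora
     "accent_type": 1, "position": 1},
    {"phrase": "Tookyookara",  # M2: 6 morae
     "morae": [(4, 8), (4, 9), (15, 9), (15, 10), (6, 12), (2, 14)],
     "accent_type": 0, "position": 2},
    {"phrase": "kuru",         # M3: 2 morae
     "morae": [(9, 17), (1, 20)],
     "accent_type": 1, "position": 3},
]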
HMM Topologies
(a) Accent types 0 and 1
(b) Other accent types
[Figure: the two topology diagrams]
Training Database
• ATR continuous speech database (500 sentences, speaker MHT)
• Segmented into morae and accentual phrases
• Mora labels use the mora-F0 alphabet: shape (stylized F0 contour) and mora F0
• Accentual-phrase labels: number of morae, position in the sentence
Output Code Generation
How can the HMM be used for synthesis?
A) Recognition: one output sequence is given; compute its likelihood and the best state path
B) Synthesis: no output sequence is given; find the best output sequence along the best state path
Intonation Modeling Using HMM
Viterbi Search for the Recognition Problem:

for t = 2, 3, ..., T
  for i_t = 1, 2, ..., S
    D_min(t, i_t) = min_{i_{t-1}} { D_min(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + [-log b(y(t) | i_t)] }
    ψ(t, i_t) = argmin_{i_{t-1}} { D_min(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + [-log b(y(t) | i_t)] }
  next i_t
next t
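The same recursion in runnable Python, in the negative-log domain. A uniform initial-state distribution is assumed, which the slide leaves unspecified:

import numpy as np

def viterbi(neglog_a, neglog_b, y):
    """neglog_a[i, j] = -log a(j|i); neglog_b[j, k] = -log b(k|j); y = observed symbols."""
    S, T = neglog_a.shape[0], len(y)
    D = np.empty((T, S))
    psi = np.zeros((T, S), dtype=int)
    D[0] = neglog_b[:, y[0]]                 # assumed uniform initial distribution
    for t in range(1, T):
        for j in range(S):
            cost = D[t - 1] + neglog_a[:, j] + neglog_b[j, y[t]]
            psi[t, j] = cost.argmin()        # best predecessor state
            D[t, j] = cost[psi[t, j]]
    path = [int(D[-1].argmin())]             # backtrack the best state path
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(D[-1].min())    # best path and its total -log likelihood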
Intonation Modeling Using HMM
Modified Viterbi Search for the Synthesis Problem:

for t = 2, 3, ..., T
  for i_t = 1, 2, ..., S
    D_min(t, i_t) = min_{i_{t-1}} { D_min(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + [-log b(y_max(t) | i_t)] }
    ψ(t, i_t) = argmin_{i_{t-1}} { D_min(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + [-log b(y_max(t) | i_t)] }
  next i_t
next t

where y_max(t) = argmax_y b(y | i_t) is the most probable output symbol at state i_t.
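In Python, the only change from the recognition version is that each state contributes its own most probable symbol. A sketch under the same assumptions as above:

def viterbi_synthesis(neglog_a, neglog_b, T):
    """Generate the best output sequence of length T from a trained discrete HMM."""
    S = neglog_a.shape[0]
    y_max = neglog_b.argmin(axis=1)          # most probable symbol per state
    emit = neglog_b.min(axis=1)              # its cost: -log b(y_max | j)
    D = np.empty((T, S))
    psi = np.zeros((T, S), dtype=int)
    D[0] = emit
    for t in range(1, T):
        for j in range(S):
            cost = D[t - 1] + neglog_a[:, j] + emit[j]
            psi[t, j] = cost.argmin()
            D[t, j] = cost[psi[t, j]]
    state = int(D[-1].argmin())              # backtrack, then read off symbols
    states = [state]
    for t in range(T - 1, 0, -1):
        state = int(psi[t, state])
        states.append(state)
    return [int(y_max[s]) for s in reversed(states)]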
Use of Bigram Probabilities

for t = 2, 3, ..., T
  for i_t = 1, 2, ..., S
    D_min(t, i_t) = min_{i_{t-1}} { D_min(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + min_k {[-log b(y_k(t) | y(t-1), i_t)]} }
    ψ(t, i_t) = argmin_{i_{t-1}} { D_min(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + min_k {[-log b(y_k(t) | y(t-1), i_t)]} }
  next i_t
next t

where k = 1, ..., K indexes the output alphabet (K is the dimension of y), and y(t-1) is the symbol emitted at the previous mora; taking min_k of the negative log selects the most probable symbol given that context.
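Because the emitted symbol now conditions on y(t-1), an exact search must remember the previous symbol. One way to do this (our construction, not necessarily the authors') is to fold the previous symbol into the trellis node:

def viterbi_bigram(neglog_a, neglog_b2, T, start_ctx=0):
    """neglog_b2[j, yprev, k] = -log b(k | yprev, j); start_ctx is an assumed
    start-of-phrase context row, which the slide does not specify."""
    S, K = neglog_b2.shape[0], neglog_b2.shape[2]
    D = np.full((T, S, K), np.inf)           # cost of state j after emitting symbol k
    back = np.zeros((T, S, K, 2), dtype=int)
    D[0] = neglog_b2[:, start_ctx, :]
    for t in range(1, T):
        for j in range(S):
            prev = D[t - 1] + neglog_a[:, j][:, None]     # grid over (i_{t-1}, yprev)
            for k in range(K):
                cost = prev + neglog_b2[j, :, k][None, :]
                i, yp = np.unravel_index(cost.argmin(), cost.shape)
                D[t, j, k] = cost[i, yp]
                back[t, j, k] = (i, yp)
    j, k = np.unravel_index(D[-1].argmin(), D[-1].shape)  # backtrack the symbols
    syms = [int(k)]
    for t in range(T - 1, 0, -1):
        j, k = back[t, j, k]
        syms.append(int(k))
    return syms[::-1]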
Accent Type Modeling Using HMM
[Figure: generated F0 contours, log(Hz) (3.65 to 4.15) vs. mora number (0 to 5), for accent types 0 to 3 ("Type0" through "Type3").]
Phrase Boundary Level Modeling Using HMM
[Figure: generated F0 contours, log(Hz) (3.9 to 4.08) vs. mora number (0 to 4), for boundary levels 1 to 3 ("level1.graph" through "level3.graph").]

Boundary level as a function of J-ToBI break index and pause:

J-ToBI B.I.   Pause (Y/N)   Boundary level
3             Y             1
3             N             2
2             N             3
"PH1_0"
"PH1_0"
0.4
0.4
0.2
logF0 [Hz]
logF0 [Hz]
0.2
0
0
-0.2
-0.2
-0.4
-0.4
0
50
100
150
200
250
t [msec]
300
350
400
450
500
0
50
100
150
200
250
t [msec]
300
350
400
"PH1_1"
500
"PH1_1"
0.4
0.4
0
-0.2
-0.4
0
50
100
150
200
250
t [msec]
300
PH1_0.original
350
400
450
500
0.2
logF0 [Hz]
0.2
logF0 [Hz]
450
0
-0.2
-0.4
0
50
100
150
200
PH1_0.bigram
250
t [msec]
300
350
400
450
500
The Effect of
Bigrams
"PH1_2"
"PH1_2"
0.4
0.4
0
-0.4
0
50
100
150
200
PH1_1.original
250
t [msec]
300
350
400
450
500
PH1_2.original
0.2
logF0 [Hz]
logF0 [Hz]
0.2
-0.2
0
-0.2
-0.4
0
50
100
150
200
PH1_1.bigram
250
t [msec]
300
350
400
450
500
PH1_2.bigram
Comments
• We presented a novel approach to intonation modeling for TTS synthesis based on discrete mora-synchronous HMMs.
• From now on, more features should be included in the HMM modeling (phonetic context, part of speech, etc.), and the approach should be compared with rule-based methods.
• Training-data scarcity is a major problem to overcome (by feature clustering, an F0 contour generation model, etc.).
Hidden Markov Models (HMM)
A Hidden Markov Model (HMM) is a finite state automaton in which both state transitions and outputs are stochastic. It moves to a new state at each time period, generating a new output symbol according to the output distribution of that state.
[Figure: a four-state left-to-right HMM with self-transitions a11, a22, a33, a44, forward transitions a12, a23, a34, a skip transition a13, and state output distributions b(1|i)~b(K|i) over symbols 1, 2, ..., K.]
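To make the definition concrete, here is a tiny generation sketch for a discrete HMM with the topology of the diagram. All probability values are illustrative, not trained:

import numpy as np

rng = np.random.default_rng(0)

# Transition matrix of the pictured 4-state left-to-right model:
# self loops a11..a44, forward steps a12, a23, a34, and the skip a13.
A = np.array([[0.6, 0.3, 0.1, 0.0],
              [0.0, 0.6, 0.4, 0.0],
              [0.0, 0.0, 0.7, 0.3],
              [0.0, 0.0, 0.0, 1.0]])
# Output distributions b(k | state) over a K=3 symbol alphabet.
B = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6],
              [0.3, 0.4, 0.3]])

def generate(T, state=0):
    """Emit one symbol per time period, then move according to A."""
    symbols = []
    for _ in range(T):
        symbols.append(int(rng.choice(B.shape[1], p=B[state])))
        state = int(rng.choice(A.shape[0], p=A[state]))
    return symbols

print(generate(6))   # e.g. a length-6 symbol sequence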
Step 1: Database Construction
• Uses the ATR continuous speech database (500 sentences, speaker MHT)
• Segmented into mora units
• Mora labels assigned
• F0 patterns extracted
• Clustered with the LBG algorithm
• Cluster classes assigned to the entire database
Discussion and Future Work
• Training data is scarce
• Further work is needed before integration into a TTS system:
– Consider additional linguistic information (phonemes, number of morae, part of speech, etc.)
– Devise ways to overcome the data shortage (clustering, etc.)
– Study how to concatenate the models