
Generation of F0 Contours Using
a Model-Constrained Data-Driven Method
Atsuhiro Sakurai (Texas Instruments Japan, Tsukuba R&D Center)
Nobuaki Minematsu (Dep. of Comm. Eng., The Univ. of Tokyo, Japan)
Keikichi Hirose (Dep. of Frontier Eng., The Univ. of Tokyo, Japan)
Corpus-Based Intonation Modeling
• Rule-based approach: ad-hoc rules derived from
experience
– Human-dependent, labor-intensive
• Corpus-based approach: mapping from linguistic to
prosodic features statistically derived from a database
– Automatic, potential to improve as larger corpora become
available
– The F0 model: a parametric model that reduces degrees of
freedom and improves learning efficiency
F0 Model
\[
\ln F_0(t) \;=\; \ln F_{\min}
\;+\; \sum_{i=1}^{I} A_{pi}\, G_{pi}(t - T_{0i})
\;+\; \sum_{j=1}^{J} A_{aj}\,\bigl\{ G_{aj}(t - T_{1j}) - G_{aj}(t - T_{2j}) \bigr\}
\]

\[
G_{pi}(t) =
\begin{cases}
\alpha_i^{2}\, t\, e^{-\alpha_i t} & (t \ge 0)\\
0 & (t < 0)
\end{cases}
\qquad
G_{aj}(t) =
\begin{cases}
\min\bigl[\, 1 - (1 + \beta_j t)\, e^{-\beta_j t},\ \gamma \,\bigr] & (t \ge 0)\\
0 & (t < 0)
\end{cases}
\]
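Once the commands are known, the model can be evaluated directly. Below is a minimal sketch in Python/numpy (not from the paper); the baseline Fmin = 80 Hz, the single phrase and accent commands, and the constants alpha, beta, gamma are made-up illustration values.

import numpy as np

def phrase_component(t, alpha=3.0):
    # Gp(t) = alpha^2 * t * exp(-alpha * t) for t >= 0, else 0
    tc = np.clip(t, 0.0, None)
    return np.where(t >= 0, alpha**2 * tc * np.exp(-alpha * tc), 0.0)

def accent_component(t, beta=20.0, gamma=0.9):
    # Ga(t) = min[1 - (1 + beta*t) * exp(-beta*t), gamma] for t >= 0, else 0
    tc = np.clip(t, 0.0, None)
    return np.where(t >= 0, np.minimum(1.0 - (1.0 + beta * tc) * np.exp(-beta * tc), gamma), 0.0)

t = np.linspace(0.0, 3.0, 300)                 # time axis [s]
ln_f0 = np.log(80.0)                           # ln Fmin (hypothetical baseline, Hz)
ln_f0 = ln_f0 + 0.5 * phrase_component(t - 0.1)                                # one phrase command
ln_f0 = ln_f0 + 0.4 * (accent_component(t - 0.4) - accent_component(t - 0.9))  # one accent command
f0 = np.exp(ln_f0)                             # resulting F0 contour [Hz]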
F0 Model
• Characteristics of the F0 Model:
– Direct representation of physical F0 contours
– Relatively good correspondence with syntactic structure
– Ability to express an F0 contour with a small number of
parameters
⇒ Better training efficiency by reducing degrees of freedom
Training/Generation Mechanism
1) Training Phase
Prosodic database (linguistic features + F0 model parameters) → Training module → Intonation model
2) Generation Phase
Linguistic features → Intonation model → F0 model parameters
Parameter Prediction Using a
Neural Network
• Neural networks are good for non-linear mappings
• The generalization ability of neural networks can cope with
imperfect or inconsistent databases (e.g., prosodic databases
labeled by hand)
• Feedback loops can be used to capture the relation
between accentual phrases (partial recurrent networks)
Neural Network Structure
[Figure: three network topologies, each with an input layer, a hidden layer, and an output layer. (a) Elman network: adds a context layer fed back from the hidden layer. (b) Jordan network: adds a state layer fed back from the output layer. (c) Multi-layer perceptron (MLP): no feedback.]
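As a concrete illustration of panel (a), the sketch below builds an Elman-style recurrent network whose hidden state carries context across successive accentual phrases. PyTorch is assumed (the paper does not specify a toolkit); the sizes follow the slides (8 input features, a 10-unit hidden layer as in "elman-10", 6 output features), and everything else is illustrative.

import torch
import torch.nn as nn

class ElmanF0Predictor(nn.Module):
    """Predicts F0-model commands for each accentual phrase in an utterance."""
    def __init__(self, n_inputs=8, n_hidden=10, n_outputs=6):
        super().__init__()
        # nn.RNN implements the Elman recurrence h_t = tanh(W_ih x_t + W_hh h_{t-1})
        self.rnn = nn.RNN(n_inputs, n_hidden, batch_first=True)
        self.out = nn.Linear(n_hidden, n_outputs)

    def forward(self, x):
        # x: (batch, accentual phrases per utterance, n_inputs)
        h, _ = self.rnn(x)
        return self.out(h)       # one set of F0-model commands per phrase

net = ElmanF0Predictor()
commands = net(torch.rand(1, 5, 8))    # e.g. one utterance with 5 accentual phrases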
Input Features
Input Feature                                      Max. value
Position of accentual phrase within utterance      18
Number of morae in accentual phrase                15
Accent type of accentual phrase                     9
Number of words in accentual phrase                 8
Part-of-speech of first word                       21
Conjugation form of first word                      7
POS category of last word                          21
Conjugation form of last word                       7
Input Features - Example
Chiisana unagiyani nekkinoyoona monoga minagiru
(小さな うなぎ屋に 熱気のような ものが みなぎる)
("Something like hot air fills the small eel restaurant.")

Features for the accentual phrase "unagiyani" (encoded as a network input in the sketch below):
Position of accentual phrase within utterance:    2
No. of morae in accentual phrase:                 5
Accent type of accentual phrase:                  0
No. of words in accentual phrase:                 2
POS, conjugation type/category of first word:     noun/0
POS, conjugation type/category of last word:      particle/0
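A possible encoding of this example as a network input vector is sketched below: each feature is divided by its maximum value from the table above. The normalization scheme and the numeric codes for "noun" and "particle" are assumptions for illustration; the slides do not state them.

POS_NOUN, POS_PARTICLE = 1, 2                     # placeholder category codes
raw = [2, 5, 0, 2, POS_NOUN, 0, POS_PARTICLE, 0]  # "unagiyani" feature values
maxima = [18, 15, 9, 8, 21, 7, 21, 7]             # maxima from the input-feature table
x = [v / m for v, m in zip(raw, maxima)]          # normalized input vector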
Output Features
[Figure: waveform of an accentual phrase with the accent nucleus marked and mora boundaries tA, tB, tC, tD, aligned with the F0-model commands: a phrase command of magnitude Ap at time t0 and an accent command of amplitude Aa from t1 to t2.]

Output Feature
Phrase command magnitude (Ap)
Accent command amplitude (Aa)
Phrase command delay (t0off = tA - t0)
Delay of accent command onset (t1off = tA - t1 or tB - t1)
Delay of accent command reset (t2off = tC - t2)
Phrase command flag
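The timing features are offsets relative to mora boundaries, so the absolute command times follow directly from the definitions above. A minimal sketch (the boundary and offset values are made up):

tA, tB, tC = 0.80, 0.95, 1.30               # mora boundaries [s], hypothetical
t0_off, t1_off, t2_off = 0.05, 0.02, 0.03   # predicted delays [s], hypothetical

t0 = tA - t0_off                            # phrase command time
t1 = tA - t1_off                            # accent command onset (or tB - t1_off)
t2 = tC - t2_off                            # accent command reset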
Parameter Prediction Using
Binary Regression Trees
• Background
– Neural networks provide no additional insight into the modeling
– Binary regression trees are human-interpretable
– The knowledge obtained from binary regression trees can be fed back into other kinds of modeling
• Outline
– Input and output features identical to the neural-network case
– Tree-growing stop criterion: a minimum number of examples per leaf node (see the sketch below)
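A minimal sketch of such a tree, assuming scikit-learn (not used in the paper): min_samples_leaf plays the role of the stop criterion, and the data here are random placeholders standing in for the per-phrase linguistic features and one F0-model command.

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

X = np.random.randint(0, 20, size=(500, 8))   # placeholder linguistic features
y = np.random.rand(500)                       # placeholder accent command amplitudes (Aa)

# Stop criterion: minimum number of training examples per leaf node
# (30 gives the best result in the experiments reported later)
tree = DecisionTreeRegressor(min_samples_leaf=30)
tree.fit(X, y)

# Unlike the neural networks, the learned splits are human-readable
print(export_text(tree, feature_names=[
    "phrase_position", "n_morae", "accent_type", "n_words",
    "pos_first", "conj_first", "pos_last", "conj_last"]))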
Neural network example
[Figure: prediction example with the neural network for utterance mhtsdj02 (0-3 s): waveform, phoneme labels, F0 contour (frequency axis, 40-800 Hz), and predicted prosodic commands.]
Binary regression tree example
[Figure: prediction example with the binary regression tree for utterance mhtsdj02 (0-3 s): waveform, phoneme labels, F0 contour (frequency axis, 40-800 Hz), and predicted prosodic commands.]
Experimental Results (1): MSE Error for Neural Networks

Neural net configuration   #Elements in hidden layer   Mean square error
MLP                        10                          0.218
MLP                        20                          0.217
Jordan                     10                          0.220
Jordan                     20                          0.215
Elman                      10                          0.214
Elman                      20                          0.232

\[
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \bigl[ \log F_{0i} - \log F'_{0i} \bigr]^{2}
\]

Best configuration: elman-10
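The same measure can be computed in a few lines; the sketch below assumes numpy and uses made-up F0 values sampled at matching instants.

import numpy as np

f0_true = np.array([120.0, 150.0, 180.0, 160.0, 130.0])   # hypothetical target F0 [Hz]
f0_pred = np.array([118.0, 155.0, 170.0, 165.0, 128.0])   # hypothetical predicted F0 [Hz]
mse = np.mean((np.log(f0_true) - np.log(f0_pred)) ** 2)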
Experimental Results (2): MSE Error for Binary Regression Trees

Stop criterion   Mean square error
10               0.215
20               0.222
30               0.210
40               0.220
50               0.217
50               0.220

\[
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \bigl[ \log F_{0i} - \log F'_{0i} \bigr]^{2}
\]

Best configuration: stop-30
Experimental Results (3):
Comparison with Rule-Based
Parameter Prediction
Method                             MSE
Neural network (elman-10)          0.214
Binary regression tree (stop-30)   0.210
Rule set I                         0.221
Rule set II                        0.193

Rule set I: phrase and accent commands derived from rules (including the phrase command flag)
Rule set II: phrase and accent commands derived from rules (excluding the phrase command flag)
Experimental Results (4):
Listening Tests
Number of listeners: 8

Method                    Preference (number of sentences)
Neural network            28
Binary regression trees   39
Rule-based                13
Conclusions
• Advantages of data-driven intonation modeling:
– No need for ad-hoc expertise
– Fast and straightforward learning
• Difficulties:
– Prediction errors
– Difficulty in finding cause-effect relations for prediction errors
• Future work:
– Explore other learning methods
– Address the data scarcity problem