Generation of F0 Contours Using a Model-Constrained Data-Driven Method

Atsuhiro Sakurai (Texas Instruments Japan, Tsukuba R&D Center)
Nobuaki Minematsu (Dep. of Comm. Eng., The Univ. of Tokyo, Japan)
Keikichi Hirose (Dep. of Frontier Eng., The Univ. of Tokyo, Japan)

Corpus-Based Intonation Modeling
• Rule-based approach: ad-hoc rules derived from experience
  – Human-dependent and labor-intensive
• Corpus-based approach: a mapping from linguistic to prosodic features, derived statistically from a database
  – Automatic, with the potential to improve as larger corpora become available
  – The F0 model: a parametric model that reduces the degrees of freedom and improves learning efficiency

F0 Model

$$\ln F_0(t) = \ln F_{\min} + \sum_{i=1}^{I} A_{pi}\, G_{pi}(t - T_{0i}) + \sum_{j=1}^{J} A_{aj} \left\{ G_{aj}(t - T_{1j}) - G_{aj}(t - T_{2j}) \right\}$$

$$G_{pi}(t) = \begin{cases} \alpha_i^2\, t \exp(-\alpha_i t) & (t \ge 0) \\ 0 & (t < 0) \end{cases}$$

$$G_{aj}(t) = \begin{cases} \min\left[ 1 - (1 + \beta_j t) \exp(-\beta_j t),\ \gamma \right] & (t \ge 0) \\ 0 & (t < 0) \end{cases}$$

• Characteristics of the F0 model:
  – Direct representation of physical F0 contours
  – Relatively good correspondence with syntactic structure
  – Ability to express an F0 contour with a small number of parameters → better training efficiency through reduced degrees of freedom
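The superposition model above lends itself to a compact implementation. The following is a minimal sketch, not code from the paper; the parameter values (Fmin, alpha, beta, gamma, and the command timings) are illustrative defaults chosen only to produce a plausible contour.

```python
import numpy as np

def phrase_response(t, alpha=3.0):
    """Phrase command impulse response: Gp(t) = alpha^2 * t * exp(-alpha * t) for t >= 0."""
    tt = np.maximum(t, 0.0)  # tt = 0 for t < 0, so the product vanishes there
    return alpha**2 * tt * np.exp(-alpha * tt)

def accent_response(t, beta=20.0, gamma=0.9):
    """Accent command step response: Ga(t) = min[1 - (1 + beta*t) exp(-beta*t), gamma] for t >= 0."""
    tt = np.maximum(t, 0.0)  # likewise zero for t < 0
    return np.minimum(1.0 - (1.0 + beta * tt) * np.exp(-beta * tt), gamma)

def f0_contour(t, f_min, phrase_cmds, accent_cmds):
    """ln F0(t) = ln Fmin + sum_i Ap_i Gp(t - T0_i) + sum_j Aa_j [Ga(t - T1_j) - Ga(t - T2_j)]."""
    ln_f0 = np.full_like(t, np.log(f_min))
    for a_p, t0 in phrase_cmds:          # (Ap, T0) pairs
        ln_f0 += a_p * phrase_response(t - t0)
    for a_a, t1, t2 in accent_cmds:      # (Aa, T1, T2) triples
        ln_f0 += a_a * (accent_response(t - t1) - accent_response(t - t2))
    return np.exp(ln_f0)

# Illustrative contour: one phrase command and one accent command over 3 s.
t = np.linspace(0.0, 3.0, 300)
f0 = f0_contour(t, f_min=80.0,
                phrase_cmds=[(0.5, 0.0)],
                accent_cmds=[(0.4, 0.3, 0.8)])
```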
Training/Generation Mechanism
1) Training phase: prosodic database (linguistic features + F0 model parameters) → training module → intonation model
2) Generation phase: linguistic features → intonation model → F0 model parameters

Parameter Prediction Using a Neural Network
• Neural networks are well suited to non-linear mappings
• The generalization ability of neural networks can cope with imperfect or inconsistent databases (prosodic databases labeled by hand)
• Feedback loops can capture the relation between accentual phrases (partial recurrent networks)

Neural Network Structure
(a) Elman network: input layer → hidden layer → output layer, with the hidden layer fed back through a context layer
(b) Jordan network: input layer → hidden layer → output layer, with the output fed back through a state layer
(c) Multi-layer perceptron (MLP): input layer → hidden layer → output layer

Input Features (with maximum values)
• Position of accentual phrase within utterance (max. 18)
• Number of morae in accentual phrase (max. 15)
• Accent type of accentual phrase (max. 9)
• Number of words in accentual phrase (max. 8)
• Part-of-speech of first word (max. 21)
• Conjugation form of first word (max. 7)
• POS category of last word (max. 21)
• Conjugation form of last word (max. 7)

Input Features – Example
Chiisana unagiyani nekkinoyoona monoga minagiru
(小さな うなぎ屋に 熱気のような ものが みなぎる; "Something like fervor fills the small eel restaurant")
For the accentual phrase "unagiyani":
• Position of accentual phrase within utterance: 2
• Number of morae in accentual phrase: 5
• Accent type of accentual phrase: 0
• Number of words in accentual phrase: 2
• POS / conjugation of first word: noun / 0
• POS / conjugation of last word: particle / 0

Output Features
[Figure: speech waveform with the accent nucleus marked; tA, tB, tC, tD are mora boundaries, and t0, t1, t2 are F0 model command timings, shown on aligned phrase/accent command plots with magnitudes Ap and Aa]
• Phrase command magnitude (Ap)
• Accent command amplitude (Aa)
• Phrase command delay (t0off = tA − t0)
• Delay of accent command onset (t1off = tA − t1 or tB − t1)
• Delay of accent command reset (t2off = tC − t2)
• Phrase command flag

Parameter Prediction Using Binary Regression Trees
• Background
  – Neural networks provide no additional insight into the modeling
  – Binary regression trees are human-interpretable
  – The knowledge obtained from binary regression trees could be fed back into other kinds of modeling
• Outline (a minimal sketch of a tree-based predictor appears after the conclusions)
  – Input and output features identical to the neural network case
  – Tree-growing stop criterion: minimum number of examples per leaf node

Neural network example
[Figure: waveform, phoneme labels, F0 contour (40–800 Hz), and predicted prosodic commands over 0–3 s for utterance mhtsdj02]

Binary regression tree example
[Figure: waveform, phoneme labels, F0 contour (40–800 Hz), and predicted prosodic commands over 0–3 s for utterance mhtsdj02]

Experimental Results (1): MSE for Neural Networks
• MLP, 10 hidden elements: 0.218
• MLP, 20 hidden elements: 0.217
• Jordan, 10 hidden elements: 0.220
• Jordan, 20 hidden elements: 0.215
• Elman, 10 hidden elements: 0.214 (best configuration: elman-10)
• Elman, 20 hidden elements: 0.232

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left[ \log(F_{0i}) - \log(F'_{0i}) \right]^2$$

Experimental Results (2): MSE for Binary Regression Trees (same MSE definition as above)
• Stop criterion 10: 0.215
• Stop criterion 20: 0.222
• Stop criterion 30: 0.210 (best configuration: stop-30)
• Stop criterion 40: 0.220
• Stop criterion 50: 0.217
• Stop criterion 50: 0.220

Experimental Results (3): Comparison with Rule-Based Parameter Prediction
• Neural network (elman-10): MSE 0.214
• Binary regression tree (stop-30): MSE 0.210
• Rule set I: MSE 0.221
• Rule set II: MSE 0.193
Rule set I: phrase and accent commands derived from rules (including the phrase command flag)
Rule set II: phrase and accent commands derived from rules (excluding the phrase command flag)

Experimental Results (4): Listening Tests
Number of listeners: 8
Number of sentences preferred:
• Neural network: 28
• Binary regression trees: 39
• Rule-based: 13

Conclusions
• Advantages of data-driven intonation modeling:
  – No need for ad-hoc expertise
  – Fast and straightforward learning
• Difficulties:
  – Prediction errors
  – Difficulty in finding cause-effect relations for prediction errors
• Future work:
  – Explore other learning methods
  – Address the data scarcity problem
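As referenced in the binary regression tree slide, the following is a minimal sketch of a tree-based parameter predictor together with the log-F0 error metric defined above. It is not the paper's implementation: the data are synthetic, the feature layout merely mirrors the input-feature list, and scikit-learn's DecisionTreeRegressor stands in for the paper's binary regression trees, with min_samples_leaf playing the role of the stop criterion (e.g. stop-30).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: one row per accentual phrase, 8 columns in the
# order of the input-feature list (position, morae, accent type, #words,
# POS/conjugation of first word, POS category/conjugation of last word).
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, 8)).astype(float)
# Fake accent command amplitudes (Aa), loosely tied to accent type and length.
aa = 0.05 * X[:, 2] + 0.02 * X[:, 1] + rng.normal(0.0, 0.05, 200)

# Tree-growing stop criterion: at least 30 examples per leaf node.
tree = DecisionTreeRegressor(min_samples_leaf=30)
tree.fit(X, aa)
aa_pred = tree.predict(X)

def log_f0_mse(f0_ref, f0_pred):
    """MSE = (1/N) * sum_i [log(F0_i) - log(F0'_i)]^2, as defined above.

    In the paper this is computed on reference vs. generated F0 contours;
    here it is just the metric, applicable to any pair of F0 arrays.
    """
    f0_ref, f0_pred = np.asarray(f0_ref), np.asarray(f0_pred)
    return np.mean((np.log(f0_ref) - np.log(f0_pred)) ** 2)
```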