Modeling and Generation of Accentual Phrase F0 Contours Based on Discrete HMMs Synchronized at Mora-Unit Transitions

Atsuhiro Sakurai (Texas Instruments Japan, Tsukuba R&D Center)
Koji Iwano (currently with Tokyo Institute of Technology, Japan)
Keikichi Hirose (Dept. of Frontier Eng., The University of Tokyo, Japan)

Introduction to Corpus-Based Intonation Modeling
• Traditional approach: rules derived from linguistic expertise
  – Human-dependent (complicated and not fully satisfactory, because the phenomena involved are not completely understood)
• Corpus-based approach: modeling derived from statistical analysis of speech corpora
  – Automatic (potential to improve as better speech corpora become available)

Background
• HMMs are widely used in speech recognition, and fast learning algorithms exist
• Macroscopic discrete HMMs associated with accentual phrases can store information such as accent type and prosodic structure
• Morae are essential for describing Japanese intonation: sequences of high and low morae characterize the accent types

Overview of the Method
• Definition of HMM and alphabet:
  – Accent types modeled by discrete HMMs
  – 2-code mora F0 contour alphabet used as output symbols
  – State transitions synchronized with mora transitions
• Classification of HMMs and training:
  – HMMs classified according to linguistic attributes
  – Training by the usual forward-backward algorithm
• Generation of F0 contours:
  – Best sequence of symbols generated by a modified Viterbi algorithm

The Mora-F0 Alphabet
• Two codes: stylized mora F0 contours and mora-to-mora F0 values, 34 symbols each
• Obtained by LBG clustering from a 500-sentence database (ATR continuous speech database, speaker MHT)
• The entire database is labeled with the 2-code symbols

The Accentual Phrase HMM
• Accentual phrases are classified according to:
  – Accent type
  – Position of the accentual phrase in the sentence
  – (Optional: number of morae, part of speech, syntactic structure)
• HMM state transitions are synchronized with mora transitions within the accentual phrase; each mora emits a label pair (shape_i, F0_i)
• Example: "Karewa Tookyookara kuru." (He comes from Tokyo) is split into three accentual phrases:
  – M1 (karewa): label sequence of 3 morae, accent type 1, position 1
  – M2 (tookyookara): label sequence of 6 morae, accent type 0, position 2
  – M3 (kuru): label sequence of 2 morae, accent type 1, position 3

HMM Topologies
• (a) Accent types 0 and 1
• (b) Other accent types
[Figure: HMM topologies for the two cases]

Training Database
• ATR Continuous Speech Database (500 sentences, speaker MHT)
• Segmented into morae and accentual phrases
• Mora labels use the mora-F0 alphabet: shape (stylized F0 contour) and mora F0
• Accentual phrase labels: number of morae, position in the sentence
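The mora-F0 alphabet is obtained by LBG clustering, as described above. The sketch below is only an illustration of that step, not the authors' code: the feature layout (each mora represented by a fixed-length resampled log F0 vector), the perturbation constant, and the handling of empty cells are all assumptions.

import numpy as np

def lbg_codebook(features, size=34, eps=0.01, iters=20):
    # Grow a codebook by LBG splitting, refining each stage with
    # k-means-style updates. `features` has shape (n_morae, dim).
    codebook = features.mean(axis=0, keepdims=True)
    while codebook.shape[0] < size:
        # Split every centroid into two slightly perturbed copies.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(iters):
            # Assign each mora to its nearest centroid (Euclidean distance).
            dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
            labels = dists.argmin(axis=1)
            # Update centroids; keep the old one when a cell is empty.
            for k in range(codebook.shape[0]):
                members = features[labels == k]
                if members.shape[0] > 0:
                    codebook[k] = members.mean(axis=0)
    return codebook[:size]  # 34 is not a power of two, so trim the excess

def label_morae(features, codebook):
    # Label each mora with the index of its nearest codeword.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

In this reading, one codebook would be trained for the stylized contour shapes and a second for the mora-to-mora F0 values, giving the two 34-symbol codes used to label the whole database.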
Output Code Generation
How can the HMM be used for synthesis?
• A) Recognition: one output sequence is given → compute its likelihood and the best state path
• B) Synthesis: find the best output sequence together with the best state path

Intonation Modeling Using HMM
Viterbi search for the recognition problem:

for t = 2, 3, ..., T
  for i_t = 1, 2, ..., S
    D_min(t, i_t) = min over i_(t-1) { D_min(t-1, i_(t-1)) + [-log a(i_t | i_(t-1))] + [-log b(y(t) | i_t)] }
    ψ(t, i_t) = argmin over i_(t-1) { D_min(t-1, i_(t-1)) + [-log a(i_t | i_(t-1))] + [-log b(y(t) | i_t)] }
  next i_t
next t

Intonation Modeling Using HMM
Modified Viterbi search for the synthesis problem (the observed symbol y(t) is replaced by y_max(t), the most probable output symbol of the state):

for t = 2, 3, ..., T
  for i_t = 1, 2, ..., S
    D_min(t, i_t) = min over i_(t-1) { D_min(t-1, i_(t-1)) + [-log a(i_t | i_(t-1))] + [-log b(y_max(t) | i_t)] }
    ψ(t, i_t) = argmin over i_(t-1) { D_min(t-1, i_(t-1)) + [-log a(i_t | i_(t-1))] + [-log b(y_max(t) | i_t)] }
  next i_t
next t

Use of Bigram Probabilities

for t = 2, 3, ..., T
  for i_t = 1, 2, ..., S
    D_min(t, i_t) = min over i_(t-1) { D_min(t-1, i_(t-1)) + [-log a(i_t | i_(t-1))] + max over k=1,...,K { -log b(y_k(t) | y(t-1), i_t) } }
    ψ(t, i_t) = argmin over i_(t-1) { D_min(t-1, i_(t-1)) + [-log a(i_t | i_(t-1))] + max over k=1,...,K { -log b(y_k(t) | y(t-1), i_t) } }
  next i_t
next t
(K: dimension of y)
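As a rough illustration of the synthesis-mode search above (a re-implementation under assumptions, not the authors' code): since no observation sequence is given, the likeliest symbol of each state can be precomputed and the usual Viterbi recursion run over those per-state maxima. The sketch uses the max-log form, which is equivalent to minimizing the negative-log costs, assumes a uniform initial-state term, and ignores the bigram conditioning.

import numpy as np

def synthesis_viterbi(log_a, log_b, T):
    # Modified Viterbi for synthesis.
    #   log_a[i, j]: log transition probability from state i to state j
    #   log_b[j, k]: log probability of emitting symbol k in state j
    # Returns (best state path of length T, output symbol at each mora).
    S = log_a.shape[0]
    best_sym = log_b.argmax(axis=1)      # y_max for each state
    b_max = log_b.max(axis=1)            # log b(y_max | state)
    D = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    D[0] = b_max                         # uniform initial-state assumption
    for t in range(1, T):
        for j in range(S):
            scores = D[t - 1] + log_a[:, j]
            back[t, j] = int(scores.argmax())
            D[t, j] = scores.max() + b_max[j]
    # Backtrack the best state sequence, then read off its symbols.
    path = [int(D[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return path, [int(best_sym[s]) for s in path]

With bigram probabilities the emission term depends on the symbol chosen at t-1, so the symbol choice can no longer be precomputed per state; it has to be carried inside the recursion, which is what the bigram formulation above addresses.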
Accent Type Modeling Using HMM
[Figure: generated F0 contours, log(Hz) versus mora number, for accent types 0–3]

Phrase Boundary Level Modeling Using HMM
[Figure: generated F0 contours, log(Hz) versus mora number, for boundary levels 1–3]
Boundary levels:
  – Level 1: J-ToBI break index 3, with pause
  – Level 2: J-ToBI break index 3, no pause
  – Level 3: J-ToBI break index 2, no pause

The Effect of Bigrams
[Figure: generated log F0 contours (log F0 in Hz versus time in ms) for phrases PH1_0, PH1_1 and PH1_2, comparing the original contours with contours generated using bigram probabilities]

Comments
• We presented a novel approach to intonation modeling for TTS synthesis based on discrete mora-synchronous HMMs.
• From now on, more features should be included in the HMM modeling (phonetic context, part of speech, etc.), and the approach should be compared to rule-based methods.
• Training data scarcity is a major problem to overcome (by feature clustering, an F0 contour generation model, etc.)

Step 1: Database Creation
• Use the ATR continuous speech database (500 sentences, speaker MHT)
• Segment into mora units
• Assign mora labels
• Extract the F0 contours
• Cluster them with the LBG algorithm
• Assign the cluster classes to the entire database

Introduction of Bigrams
(Same formulation as "Use of Bigram Probabilities" above.)

Discussion and Future Work
• Training data is scarce
• Further refinements are needed before integration into a TTS system:
  – Take other linguistic information into account (phonemes, number of morae, part of speech, etc.)
  – Devise ways to overcome the data shortage (clustering, etc.)
  – Study how to concatenate the models

Hidden Markov Models (HMM)
A Hidden Markov Model (HMM) is a finite state automaton in which both the state transitions and the outputs are stochastic. At each time period it moves to a new state, generating a new output symbol according to the output distribution of that state.
[Figure: 4-state left-to-right HMM with self-transitions a11–a44, forward transitions a12, a23, a34, a skip transition a13, and per-state output distributions b(1|i)–b(K|i) over the symbols 1, 2, ..., K]
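As a toy illustration of such a discrete HMM (the probability values are invented for the example, not estimated from the corpus), the sketch below encodes the 4-state left-to-right topology from the figure and samples one output sequence from it.

import numpy as np

rng = np.random.default_rng(0)

# Transition matrix for the 4-state topology: self-loops a11..a44,
# forward steps a12, a23, a34, and the skip transition a13.
A = np.array([
    [0.5, 0.3, 0.2, 0.0],   # a11, a12, a13
    [0.0, 0.6, 0.4, 0.0],   # a22, a23
    [0.0, 0.0, 0.6, 0.4],   # a33, a34
    [0.0, 0.0, 0.0, 1.0],   # a44 (final state loops on itself)
])

K = 34                      # size of the discrete output alphabet
B = rng.dirichlet(np.ones(K), size=4)   # one output distribution b(k|i) per state

def sample(A, B, T=10, start=0):
    # Emit T symbols: at each step draw a symbol from the current state's
    # output distribution, then draw the next state from its transition row.
    state, symbols = start, []
    for _ in range(T):
        symbols.append(int(rng.choice(len(B[state]), p=B[state])))
        state = int(rng.choice(len(A), p=A[state]))
    return symbols

print(sample(A, B))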