Topics in Acoustic Phonetics (音響音声学)
Linguistics, Department of Basic Culture, Graduate School of Humanities and Sociology
Lecturer: Nobuaki Minematsu (峯松信明), Department of Electrical Engineering, Graduate School of Engineering

Speech corpus of World Englishes: the Speech Accent Archive (SAA) [Weinberger'13]
- A common paragraph read by more than 1,800 international speakers.
- The paragraph is designed to achieve high phonemic coverage of American English.
- Speech samples and their narrow IPA transcripts are provided.
- The paragraph: "Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station."

An interesting video: "24 Accents in English" (from YouTube)
https://www.youtube.com/watch?v=dABo_DCIdpM

Pronunciation Structure Analysis (PSA)
One utterance is represented as one distance matrix.
- Waveform to feature instances: the waveform becomes a spectrogram (a vector sequence), i.e., a trajectory in a vector space.
- Feature instances to feature relations (structure): the vector sequence is converted into a distribution sequence, and a distance is calculated between every pair of distributions.
- For "Please call Stella. Ask her ...", this yields a 221 x 221 distance matrix, since the paragraph contains 221 phonemes according to the CMU pronouncing dictionary.
- This representation can remove age and gender differences effectively [Minematsu'06].

Detailed procedure of extracting the pronunciation structure
1. Phoneme HMMs are trained on the SAA and the WSJ corpus, and a paragraph UBM-HMM is built from them.
2. MAP adaptation: the UBM-HMM is adapted to each speaker's reading ("Please call Stella ...") to obtain an adapted paragraph HMM.
3. Structure calculation: distances between the adapted distributions give the 221 x 221 distance matrix.

Pronunciation distance calculation using structures
- Each reading of the common paragraph is converted into its pronunciation structure (221 events, c1 ... c221).
- IPA-based distances between the narrow transcripts serve as reference distances.
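The distance-matrix construction described above can be sketched compactly. The snippet below is a minimal illustration, not the actual PSA implementation: it assumes diagonal-covariance Gaussians as the distribution sequence and uses 5 toy events in a 2-dimensional space instead of the 221 phoneme distributions of the real paragraph.

```python
import numpy as np

def bhattacharyya(mu1, var1, mu2, var2):
    # Bhattacharyya distance between two diagonal-covariance Gaussians
    v = 0.5 * (var1 + var2)
    term_mean = 0.125 * np.sum((mu1 - mu2) ** 2 / v)
    term_cov = 0.5 * np.sum(np.log(v) - 0.5 * (np.log(var1) + np.log(var2)))
    return term_mean + term_cov

def structure_matrix(means, variances):
    # full distance matrix over a sequence of distributions (the "structure")
    n = len(means)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = bhattacharyya(means[i], variances[i],
                                              means[j], variances[j])
    return D

rng = np.random.default_rng(0)
means = rng.normal(size=(5, 2))        # 5 toy sound events in a 2-D space
variances = np.full((5, 2), 0.5)
D = structure_matrix(means, variances)
```

For the actual paragraph, `means` and `variances` would come from the 221 adapted HMM distributions, yielding the 221 x 221 matrix.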
Experimental conditions
- Differential features between two speakers: #elements = 221 x 220 / 2 = 24,310 (about 24K).
- Prediction mechanism: Support Vector Regression (SVR) using the radial basis function as kernel.
- Two modes of preparing training data and testing data:
  - Speaker-open mode: openness is guaranteed for the speakers used in training and testing.
  - Speaker-pair-open mode: openness is guaranteed only for the speaker pairs used in training and testing.
- #speakers = 370, #speaker pairs = 370 x 369 / 2 = 68,265.

Experimental results (1/2)
Correlation between IPA-based distances and predicted ones [Kasahara et al.'14]:
  mode: speaker-open / speaker-pair-open
  correlation: 0.547 / 0.903
(Scatter plot: predicted distance vs. reference distance for the proposed method, K = 220, in speaker-pair-open mode; Corr. = 0.903.)

Experimental results (2/2)
Use of phonemic transcripts to calculate pronunciation distances:
- The IPA transcripts of the SAA are converted into phonemic ones.
- Monophone HMMs are trained using the SAA and TIMIT.
- A network grammar carefully designed for the SAA paragraph is used; phoneme recognition accuracy = 73.5%.
- DTW between two phonemic transcripts is done with the HMMs.
Correlation between IPA distances and phonemic distances:
  mode: ASR / ground truth
  correlation: 0.458 / 0.829
For comparison, the structure-based prediction:
  mode: speaker-open / speaker-pair-open
  correlation: 0.547 / 0.903

Application of speaker-pair-open prediction
A TED-talks browser from your own viewpoint:
- If TED talkers provide their SAA readings, and these readings are transcribed by phoneticians, the pronunciation diversity of the talkers can be visualized from any listener's viewpoint [Kawase et al.'14].
(Figure: talkers plotted around individual listeners, e.g., young and old speakers.)
Y. Kawase, et al., "Visualization of pronunciation diversity of World Englishes from a speaker's self-centered viewpoint"

Possible applications of speaker-open prediction
- An individual-based and truly global map of World Englishes pronunciation.
- Can be used as a communication facilitator for World Englishes: easy access to speakers with pronunciation similar to yours, at a restaurant, at a hotel, at a ticket office.
- Adaptation of others' accents to a listener's own accent: an "accent optical adapter" carrying the accent information of its owner.

Outline of this talk (本発表の流れ)
- Physical diversity of stimuli and their cognitive invariance: the diversity of appearance, color, and pitch, and the solutions devised by nature and evolution.
- Physical diversity of speech and its cognitive invariance: the diversity of timbre and the solutions devised by engineers.
- Structural representation of speech and various considerations: trying to resolve a sense of incongruity by overturning common sense.
- Structural representation of speech, its mathematical formulation, and its technical implementation: what representation of speech waveforms and spectra is invariant to body size and gender?
- Speech applications using the structural representation: speech recognition, speech synthesis, pronunciation analysis, etc.
- Linguistic validity of the structural representation: a new technique or an old one? And the origin of language.

Claims found in classical phonology
Roman Jakobson (1896-1982), The Sound Shape of Language (1949):
"Physiologically identical sounds may possess different values in conformity with the whole sound system, i.e. with their relations to the other sounds. We have to put aside the accidental properties of individual sounds and substitute a general expression that is the common denominator of these variables."

Nikolay Trubetskoy (1890-1938), Principles of Phonology (1939):
"The phonemes should not be considered as building blocks out of which individual words are assembled. Each word is a phonic entity, a Gestalt, and is also recognized as such by the hearer. As a Gestalt, each word contains something more than the sum of its constituents (phonemes), namely, the principle of unity that holds the phoneme sequence together and lends individuality to a word."

Ferdinand de Saussure (1857-1913), father of modern linguistics, Course in General Linguistics (1916):
"What defines a linguistic element, conceptual or phonic, is the relation in which it stands to the other elements in the linguistic system."
"The important thing in the word is not the sound alone but the phonic differences that make it possible to distinguish this word from the others. Language is a system of only conceptual differences and phonic differences."

What is /a/? (「あ」って何だろう?)
Two views: element first, or system first?
From An Introduction to Descriptive Linguistics by H. A. Gleason (1961):
- "A phoneme is a class of sounds that: (1) are phonetically similar and (2) show certain characteristic patterns of distribution in the language or dialect under consideration."
- "A phoneme is one element in the sound system of a language having a characteristic set of interrelations with each of the other elements in that system."
- "The phoneme cannot, therefore, be acoustically defined. The phoneme is a feature of language structure. That is, it is an abstraction from the psychological and acoustical patterns that enables a linguist to describe the observed repetitions of things that seem to function within the system as identical in spite of obvious differences. The phonemes of a language are a set of abstractions."
(Figure: tokens of the five Japanese vowels あ, い, う, え, お scattered in an acoustic space and grouped into categories.)

Jakobson's distinctive features (ヤコブソンの弁別素性)
Description of sound differences and sound systems using distinctive features, e.g., /k/ and /a/: compact; /p/ and /u/: grave and diffuse; /t/ and /i/: acute.
(From "Natural Classes: Distinctive Features," L201.5 Introduction to Linguistic Theory, Ji-yung Kim, Spring 2001.)

Description of phonemes using distinctive features
Table 1. Distinctive Features of American English Consonants: each consonant (p, b, m, f, v, θ, ð, t, d, n, s, z, l, ɾ, ʃ, ʒ, tʃ, dʒ, j, ɹ, k, g, ŋ, w, ʔ, h) is specified by +/- values of the features Back, High, Coronal, Anterior, Labial, Continuant, Lateral, Nasal, Sonorant, Strident, and Voiced. (The full +/- matrix is omitted here.)
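A feature table like Table 1 can be modeled as a small lookup structure and queried for natural classes. The values below are a simplified, textbook-style subset chosen for illustration, not a reproduction of the slide's full matrix:

```python
# A toy distinctive-feature table (simplified, textbook-style values)
FEATURES = {
    "p": {"voiced": False, "labial": True,  "nasal": False, "continuant": False},
    "b": {"voiced": True,  "labial": True,  "nasal": False, "continuant": False},
    "m": {"voiced": True,  "labial": True,  "nasal": True,  "continuant": False},
    "t": {"voiced": False, "labial": False, "nasal": False, "continuant": False},
    "d": {"voiced": True,  "labial": False, "nasal": False, "continuant": False},
    "s": {"voiced": False, "labial": False, "nasal": False, "continuant": True},
}

def natural_class(**wanted):
    # phonemes matching a feature specification, e.g. the voiced non-continuants
    return sorted(p for p, f in FEATURES.items()
                  if all(f[k] == v for k, v in wanted.items()))

print(natural_class(voiced=True, continuant=False))  # ['b', 'd', 'm']
```

This makes the "element first" view literal: a phoneme is nothing but its feature bundle, exactly the move whose legitimacy is questioned later in this talk.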
Two structuralisms and their evolution (二つの構造主義とその変遷)
A timeline from 1800 to 2000:
- Ernst Mach (1838-1916): physics, acoustics, psychology.
- Ferdinand de Saussure (1857-1913): general linguistics, leading to European structural linguistics and structural phonology.
- Christian von Ehrenfels (1859-1932): Gestalt psychology.
- Roman Jakobson (1896-1982): American structural linguistics.
- Claude Lévi-Strauss (1905-): structural anthropology and mythology, leading to structuralist thought.
- James Gibson (1904-1979): ecological epistemology and affordances, leading to connectionism and complex systems.

Table 2. Distinctive Features of American English Vowels: each vowel (i, ɪ, e, ɛ, æ, u, ʊ, o, ɔ, ɑ, ʌ, ə) is specified by +/- values of the features high, low, back, rounded, and ATR. (The full +/- matrix is omitted here.)

Key claims:
- Saussure: "What language contains is neither ideas nor sounds that exist prior to the linguistic system, but only the conceptual differences and phonic differences that arise from this system."
- Jakobson: the difference between two phonemes is expressed with a set of distinctive features, giving a geometric structure of the phoneme inventory. He eventually defined the phoneme itself as a bundle of features; was that a step too far?
- Lévi-Strauss: found invariant mathematical structures in the kinship systems of pre-modern societies, and, through abstraction, invariant structures across myths.
- How does the world of meanings get partitioned in each language? How does the world of sounds get partitioned by each speaker?
- Structure = the properties and relations that remain unchanged under a given transformation.
Structure and transformation are two sides of the same coin: the definition of a structure depends on which transformation one focuses on. The definition of an element becomes possible only through its differences from the others.

Two structuralisms and their evolution (continued)
- Ernst Mach (1838-1916): "The mass of a body is determined by its relations to all the bodies around it; in a space containing nothing else, the mass of a body would have no meaning." "We must begin with the study of individual elements, but nature did not begin with elements. It is certainly fortunate that we can occasionally turn our eyes away from the overwhelming whole and concentrate on individual points; but we must not neglect to later supplement and correct what was provisionally ignored" (a warning against naive reductionism). All natural phenomena are complexes based on the functional interdependence of sensory elements; every observation is nothing but a human sensory quantity, and the senses of space and time are human sensations too. In this sense, non-Euclidean space is more essential than Euclidean space.
- Christian von Ehrenfels (1859-1932): "Perception does not arise by integrating the individual stimuli of an object; recognition occurs beforehand, within a holistic framework." A Gestalt, i.e., a whole, is something more than "the simple sum of its parts." Transposability and invariance: even when a Gestalt undergoes certain transformations, the Gestalt remains invariant (e.g., a melody and its transpositions).
- James Gibson (1904-1979): "Vision is not 'elementary stimulus sensations plus their integration'; its essence lies in the invariant properties found in the motion (transformation) of objects." The perception of surfaces is based on the structure of the differences made by light (the invariants formed by differences of light). Information is not created inside the human head; it exists in the environment itself.

In short: not collecting elementary sensations, but integrating the relations and differences among elements, and finding the structure that is invariant under the transformations one focuses on.

Origin and evolution of language
"A Modulation-Demodulation Model for Speech Communication and its Emergence," Nobuaki Minematsu, Graduate School of Info. Sci. and Tech., The University of Tokyo, Japan, [email protected]
Perceptual invariance against large acoustic variability in speech has been a long-discussed question in speech science and engineering (Perkell & Klatt, 2002), and it is still an open question (Newman, 2008; Furui, 2009). Recently, we proposed a candidate answer based on mathematically-guaranteed relational invariance (Minematsu et al., 2010; Qiao & Minematsu, 2010).
Here, transform-invariant features, f-divergences, are extracted from the speech dynamics in an utterance to form an invariant topological shape which characterizes and represents the linguistic message conveyed in that utterance. In this paper, this representation is interpreted from the viewpoints of telecommunications, linguistics, and evolutionary anthropology. Speech production is often regarded as a process of modulating the baseline timbre of a speaker's voice by manipulating the vocal organs, i.e., spectrum modulation. Extraction of the linguistic message from an utterance can then be viewed as a process of spectrum demodulation. This modulation-demodulation model of speech communication has a strong link to known morphological and cognitive differences between humans and apes.

(Figure 1. The tallest and shortest adults. Figure 2. Frequency modulation and demodulation, from Wikipedia.)

Modulation used in telecommunication
A musician modulates the tone from a musical instrument by varying its volume, timing and pitch. The three key parameters of a carrier sine wave are its amplitude ("volume"), its phase ("timing") and its frequency ("pitch"), all of which can be modified in accordance with a content signal to obtain the modulated carrier. We can say that a melody contour is a pitch-modulated (frequency-modulated) version of a carrier wave, where the carrier corresponds to the baseline pitch. We speak using our instruments, i.e., vocal organs, by varying not only the above parameters but also the most important parameter, called the timbre or spectrum envelope. From this viewpoint, it can be said that an utterance is generated by spectrum modulation (Scott, 2007). The default shape and length of a vocal tube determine the speaker-dependent voice quality, and by changing the shape, i.e., modulating the spectrum envelope, an utterance is produced and transmitted. In a large number of previous studies in automatic speech recognition (ASR), much attention was paid to the dynamic aspects of speech in order to bridge the gap between ASR performance and the performance of human speech recognition (HSR).

A way of characterizing speech production
Speech production as spectrum modulation:
- Modulation in frequency (FM), amplitude (AM), and phase (PM) = modulation in pitch, volume, and timing (from Wikipedia) = pitch contour, intensity contour, and rhythm (= prosodic features).
- What about a fourth parameter, the spectrum (timbre)? Modulation in spectrum (timbre) [Scott'07] = another prosodic feature?
(Figure: vowel charts of beat, bit, bet, bat, boot, put, but, bird, bought, pot, and about, arranged by Front/Central/Back and High/Mid/Low over time. The tongue is the modulator; the schwa, the most lax and most frequent vowel, is the home position, i.e., the speaker-specific baseline timbre.)

Modulation spectrum
Critical-band based temporal dynamics of speech: "in pursuit of an invariant representation" (Greenberg'97).
- RASTA (= RelAtive SpecTrA) [Hermansky'94]: speech, squared FFT magnitude, critical-band FIR filter bank with a lowpass cutoff of 28 Hz, normalization by the long-term average, then limiting to a 30-dB peak range with bilinear interpolation (Greenberg'97).
- However, there is no mathematical proof of invariance: the direction of a trajectory is rotated by vocal tract length differences (Saito'08).
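The RASTA idea of keeping only slow temporal modulations can be illustrated on a synthetic trajectory. This is a crude FFT-domain lowpass sketch with made-up signal frequencies, not Hermansky's actual RASTA filter; only the 28-Hz cutoff is taken from the slide.

```python
import numpy as np

fs = 100.0                                   # frame rate in Hz (10-ms frames)
t = np.arange(0, 2.0, 1 / fs)
message = np.sin(2 * np.pi * 4.0 * t)        # 4-Hz, syllable-rate modulation
noise = 0.5 * np.sin(2 * np.pi * 45.0 * t)   # fast fluctuation to be removed
x = message + noise                          # toy log-energy trajectory of one band

# keep only modulations below 28 Hz, as in the slide's filter-bank setting
X = np.fft.rfft(x)
freqs = np.fft.rfftfreq(len(x), 1 / fs)
X[freqs > 28.0] = 0.0
x_filtered = np.fft.irfft(X, n=len(x))

err = np.max(np.abs(x_filtered - message))   # residual after removing the 45-Hz part
```

The slow "message" survives the filter essentially unchanged, while the fast component is removed; this is the sense in which slow spectral modulations are treated as the linguistically relevant part.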
Invariant speech structure
Utterance-to-structure conversion using f-divergence [Minematsu'06]:
- spectrogram (spectrum slice sequence), then cepstrum vector sequence, then distribution sequence, then a Bhattacharyya-distance-based distance matrix.
- Structuralization by interrelating temporally-distant events.
- An event (distribution) has to be much smaller than a phoneme.

Demodulation used in telecommunication
Demodulation in frequency, amplitude, and phase:
- Demodulation = a process of extracting a message intact by removing the carrier component from the modulated-carrier signal.
- Not by an extensive collection of samples of modulated carriers (not by hiding the carrier component behind an extensive collection).

Spectrum demodulation
Speech recognition = spectrum (timbre) demodulation:
- Demodulation = extracting the message intact by removing the carrier component, i.e., by removing the speaker-specific baseline spectrum characteristics.
- Not by an extensive collection of samples of modulated carriers.

Two questions
- Q1: Does the ape have a good modulator? Does the tongue of the ape work as a good modulator?
- Q2: Does the ape have a good demodulator? Does the ear (brain) of the ape extract the message intact?

Structural differences in the mouth and the nose
(Figures: the pharynx, larynx, lungs, and stomach of the chimpanzee vs. the human.)

Flexibility of tongue motion
- The chimp's tongue is much stiffer than the human's: "Morphological analyses and 3D modeling of the tongue musculature of the chimpanzee" (Takemoto'08).
- Less capability of manipulating the shape of the tongue.
- (Cf. Planet of the Apes, old and new. 新旧「猿の惑星」)

Q1: Does the ape have a good modulator?
Morphological characteristics of the ape's tongue:
- Two (almost) independent tracts [Hayama'99]: one from the nose to the lungs for breathing, and the other from the mouth to the stomach for eating.
- Much lower ability to deform the tongue shape [Takemoto'08]: the chimp's tongue is stiffer than the human's.

Nature's solution for static bias?
How old is invariant perception in evolution? [Hauser'03] At least, frequency (pitch) demodulation seems difficult for animals.

Language acquisition through vocal imitation
- VI = children's active imitation of their parents' utterances; language acquisition is based on vocal imitation [Jusczyk'00].
- VI is very rare in animals: no other primate does VI [Gruhn'06]; only small birds, whales, and dolphins do VI [Okanoya'08].
- An animal's VI is acoustic imitation, but a human's VI is not acoustic imitation; then, what is imitated?
- Acoustic imitation performed by myna birds [Miyamoto'95]: they imitate the sounds of cars, doors, dogs, and cats as well as human voices. Hearing a very good myna bird say something, one can guess its owner.
- Beyond-scale imitation of utterances performed by children: no one can guess a parent by hearing the voice of his or her child. This is a very weird imitation from the viewpoint of animal science [Okanoya'08].
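The demodulation scheme discussed above, extracting the message intact by stripping off the carrier, can be made concrete with a toy frequency-modulation example. All the numbers below are invented for illustration:

```python
import numpy as np

fs = 8000.0
t = np.arange(0, 0.5, 1 / fs)
fc = 1000.0                                    # carrier ("baseline pitch")
message = 50.0 * np.sin(2 * np.pi * 3.0 * t)   # slow frequency deviation in Hz

# frequency modulation: instantaneous frequency = fc + message
phase = 2 * np.pi * np.cumsum(fc + message) / fs
s = np.exp(1j * phase)                         # complex modulated carrier

# demodulation: differentiate the unwrapped phase, then remove the carrier
inst_freq = np.diff(np.unwrap(np.angle(s))) * fs / (2 * np.pi)
recovered = inst_freq - fc                     # the message, carrier stripped off

err = np.max(np.abs(recovered - message[1:]))
```

The key step is the last subtraction: the message is recovered by removing the carrier component, not by collecting many examples of modulated carriers.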
Q2: Does the ape have a good demodulator?
Cognitive differences between the ape and the human:
- Humans can extract the messages embedded in the modulated carrier.
- Animals seem to treat the modulated carrier as it is. What can they know from it? Apes can identify individuals by hearing their voices: lower/higher formant frequencies indicate larger/smaller apes.

Function of the voice timbre
What is the original function of the voice timbre?
- For apes: the voice timbre is an acoustic correlate of the identity of individuals.
- For speech scientists and engineers: research started by correlating the voice timbre with the messages conveyed by the speech stream, such as words and phonemes. Formant frequencies were treated as acoustic correlates of vowels. "Speech recognition" started first; "speaker recognition" followed.
- Resonances of uniform tubes of lengths l_1 and l_2: f_n = nc/(2 l_1) and f_n = nc/(2 l_2); for a two-tube model, the Helmholtz resonance is f = (c/2π) (A_2/(A_1 l_1 l_2))^(1/2).
- But the voice timbre can be changed easily.
- Speaker-independent acoustic model for word recognition: P(o|w) = Σ_s P(o,s|w) = Σ_s P(o|w,s) P(s|w) ≈ Σ_s P(o|w,s) P(s).
- Speaker-adaptive acoustic model for word recognition: HMMs are constantly modified and adapted to users.
- Neither method removes the speaker components in speech.

Invariant speech structure
Utterance-to-structure conversion using f-divergence
[Minematsu'06]: spectrogram (spectrum slice sequence), then cepstrum vector sequence, then distribution sequence, then a Bhattacharyya-distance-based distance matrix. Structuralization by interrelating temporally-distant events; an event (distribution) has to be much smaller than a phoneme.

The symbol grounding problem (シンボルグラウンディング問題)
How were symbols born?
- Why did symbols come to carry "meaning"?
- Why did symbols come to be bound to "certain objects"?
- Why did symbols come to be bound to "certain memories"?

Contributed article, special feature: "Delivering Words: The Power of the Voice" (寄稿 特集 ことばをとどける「声の力」)
"What are the voice and language? Thoughts through speech research," by Nobuaki Minematsu, Professor, Graduate School of Engineering, The University of Tokyo.
What is the voice, and what is language? This fundamental question is addressed by Professor Nobuaki Minematsu, a leading figure in speech engineering. What has come into view at the opposite pole of his research on making machines recognize and synthesize speech? A mysterious human capacity: language and memory. Humans manipulate language while, in fact, having their memories manipulated by language. The article approaches this mystery through the eyes of science.
Profile (プロフィール): Nobuaki Minematsu graduated from the Faculty of Engineering, The University of Tokyo in 1990 and received a Ph.D. in engineering from its Graduate School of Engineering in 1995. He worked at Toyohashi University of Technology from 1995, returned to The University of Tokyo in 2000, and is now a professor in the Department of Electrical Engineering, Graduate School of Engineering. He studies speech communication from a wide range of viewpoints spanning speech science and speech engineering, with particular expertise in language education using speech technology, and has led the development of OJAD since 2009.

Laboratory experiment for B3 students
"Development of a moving robot controlled wirelessly by voice commands," and some questions raised by students.
Questions from a specific group of students:
- "Why do we have to extract spectral envelopes or cepstrums?"
- "Why don't we use waveforms as input features for ASR?"
- "Why don't we apply machine learning and/or statistical modeling to waveforms?"
A common thing about these students:
- They did not take the classes "Fundamentals of Signal Analysis" (信号解析基礎) and "Signal Processing" (信号処理工学).
- They did not understand well which parts of a signal are irrelevant to the linguistic message conveyed in a speech stream: phase and pitch are irrelevant to the phonemic information in speech.
What's missing?
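The invariance claim behind the speech structure above can be checked numerically: f-divergences such as the Bhattacharyya distance are unchanged by any invertible transform of the feature space. The sketch below verifies this for an invertible affine map applied to two full-covariance Gaussians; the random parameters are toy stand-ins for a "speaker difference," not real speech statistics.

```python
import numpy as np

def bhattacharyya_full(m1, S1, m2, S2):
    # Bhattacharyya distance between two full-covariance Gaussians
    S = 0.5 * (S1 + S2)
    d = m1 - m2
    term_mean = 0.125 * d @ np.linalg.solve(S, d)
    ld = np.linalg.slogdet(S)[1]
    ld1 = np.linalg.slogdet(S1)[1]
    ld2 = np.linalg.slogdet(S2)[1]
    return term_mean + 0.5 * (ld - 0.5 * (ld1 + ld2))

rng = np.random.default_rng(2)
m1, m2 = rng.normal(size=3), rng.normal(size=3)
R1, R2 = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
S1, S2 = R1 @ R1.T + np.eye(3), R2 @ R2.T + np.eye(3)

# an arbitrary invertible affine map x -> A x + b (the "speaker transform")
A = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)
b = rng.normal(size=3)

d_before = bhattacharyya_full(m1, S1, m2, S2)
d_after = bhattacharyya_full(A @ m1 + b, A @ S1 @ A.T,
                             A @ m2 + b, A @ S2 @ A.T)
gap = abs(d_before - d_after)    # zero up to floating-point error
```

Because every pairwise distance is preserved under such maps, the whole distance matrix, i.e., the structure, is invariant to them as well.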
Focus on the shape of the vocal tract
Two steps of information separation:
1. Separation of phase characteristics from amplitude characteristics (insensitivity to phase differences).
2. Separation of source characteristics from filter characteristics.
(Figures: the speech waveform "A_a_512", its log power spectrum, and the smoothed spectral envelope.)

Two acoustic models for speech/speaker recognition
- Speaker-independent acoustic model for word recognition: P(o|w) = Σ_s P(o,s|w) = Σ_s P(o|w,s) P(s|w) ≈ Σ_s P(o|w,s) P(s).
- Text-independent acoustic model for speaker recognition: P(o|s) = Σ_w P(o,w|s) = Σ_w P(o|w,s) P(w|s) ≈ Σ_w P(o|w,s) P(w).
- Both require intensive collection of samples or constant adaptation. Is a decomposition o = o_w + o_s possible?

Why are so many samples not needed?
Statistical independence or physical independence:
- Statistically speaker-independent acoustic model: P(o|w) ≈ Σ_s P(o|w,s) P(s), where o = spectrum envelope.
- Fully statistically-independent acoustic model for word recognition: P(o|w) ≈ Σ_{s,h,p} P(o|w,s,h,p) P(s) P(h) P(p), where o = waveform, h = harmonics, p = phase.
(Figure: repeated plots of the waveform "A_a_512".)
- Fully physically-independent acoustic model for word recognition: o = o_w + o_s, where o_w is a physically speaker-independent word feature.
(Figure: the smoothed log-spectral envelope of "A_a_512".)

What is the goal of speech engineering?
The horse that could "calculate" (計算できる馬): what can we learn from Clever Hans (賢馬ハンス)?
What's missing? Two axes:
- Development: how do we acquire language as we grow up?
- Evolution: how did we come to acquire language in the course of evolution?
Unless we develop technology while squarely facing these two axes, the result may be a system that merely appears to command language.
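The speaker-marginalization formula P(o|w) ≈ Σ_s P(o|w,s) P(s) can be checked with a two-speaker toy example; all the probabilities below are invented for illustration:

```python
import numpy as np

P_s = np.array([0.6, 0.4])                # speaker prior P(s), made-up values
P_o_given_ws = np.array([0.9, 0.2])       # P(o|w,s) for each speaker, made-up
P_o_given_w = float(P_o_given_ws @ P_s)   # marginalize the speaker out
print(P_o_given_w)  # 0.62
```

Models of this form hide the speaker behind the sum rather than removing the speaker component, which is exactly the objection raised on the slides above.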