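Slide Materials - The University of Tokyo

Graduate School of Humanities and Sociology, Division of General Culture, Department of Linguistics
Acoustic Phonetics
(Topics in Acoustic Phonetics)
Nobuaki Minematsu
Graduate School of Engineering, Department of Electrical Engineering and Information Systems
The medium that carries that information: acoustic features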
Feature extraction based on two-stage separation
Stage 1: speech waveforms → phase characteristics / amplitude characteristics (insensitivity to phase differences)
Stage 2: amplitude characteristics → source characteristics / filter characteristics (independence bet. phonemes and pitch)
[Figure: waveform plot "A_a_512" and spectrum-envelope plot "A_a_512_hamming_logspec_idft_env"; axis ticks omitted.]
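What does the spectrum envelope (o) carry? Linguistic, paralinguistic, and non-linguistic information
P(o|w) and P(o|s): two acoustic models (w = word, s = speaker)
Speaker-independent word acoustic model:
$$P(o|w) = \sum_{s} P(o,s|w) = \sum_{s} P(o|w,s)P(s|w) \approx \sum_{s} P(o|w,s)P(s)$$
Text-independent speaker acoustic model:
$$P(o|s) = \sum_{w} P(o,w|s) = \sum_{w} P(o|w,s)P(w|s) \approx \sum_{w} P(o|w,s)P(w)$$
Once enough data has been collected, the "definition of probability" hides what we would rather not see.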
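As a minimal sketch of the marginalization used in the speaker-independent word model, the toy example below sums P(o|w,s)P(s) over speakers. The speaker names, the word, the observation symbols, and every probability value are made-up illustration values, not taken from the lecture.

```python
# Toy discrete model; every name and number here is a made-up illustration value.
P_S = {"spk_A": 0.5, "spk_B": 0.5}   # P(s), assumed independent of the word w

# P(o | w, s) for one word and two possible observations o1, o2.
P_O_GIVEN_WS = {
    ("yes", "spk_A"): {"o1": 0.9, "o2": 0.1},
    ("yes", "spk_B"): {"o1": 0.3, "o2": 0.7},
}


def speaker_independent_likelihood(o, w):
    """Approximate P(o|w) as sum_s P(o|w,s) P(s)."""
    return sum(P_O_GIVEN_WS[(w, s)][o] * p_s for s, p_s in P_S.items())


p_o1 = speaker_independent_likelihood("o1", "yes")
p_o2 = speaker_independent_likelihood("o2", "yes")
print(p_o1, p_o2)
```

The speaker variable is averaged out rather than removed: the result still depends on which speakers, with which weights, went into the sum.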
Speech representation as a set of inter-distribution distances
Cepstrum sequence → distribution sequence → distance matrix
[Figure: spectrogram (spectrum slice sequence) → cepstrum vector sequence (c1, c2, c3, c4, ..., cD) → distribution sequence → Bhattacharyya distance (BD)-based distance matrix.]
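The pipeline from cepstrum sequence to distance matrix can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say how the distributions are estimated, so the fixed-length segmentation, the diagonal-covariance Gaussians, the variance floor, and the synthetic two-dimensional "cepstrum" data are all hypothetical choices. The Bhattacharyya distance formula for Gaussians is the standard closed form.

```python
import math


def bhattacharyya_diag(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    bd = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = 0.5 * (v1 + v2)
        bd += 0.125 * (m1 - m2) ** 2 / v + 0.5 * math.log(v / math.sqrt(v1 * v2))
    return bd


def fit_diag_gaussian(frames):
    """Mean and (floored) variance per dimension for a list of vectors."""
    n, dim = len(frames), len(frames[0])
    mu = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mu[d]) ** 2 for f in frames) / n + 1e-6 for d in range(dim)]
    return mu, var


def structure_matrix(cepstra, frames_per_event):
    """Cepstrum sequence -> distribution sequence -> BD-based distance matrix."""
    events = [cepstra[i:i + frames_per_event]
              for i in range(0, len(cepstra) - frames_per_event + 1, frames_per_event)]
    dists = [fit_diag_gaussian(e) for e in events]
    return [[bhattacharyya_diag(*a, *b) for b in dists] for a in dists]


# Synthetic stand-in for a cepstrum vector sequence (12 frames, 2 dimensions).
cepstra = [[math.sin(0.3 * t), math.cos(0.2 * t)] for t in range(12)]
matrix = structure_matrix(cepstra, frames_per_event=4)
for row in matrix:
    print(["%.3f" % d for d in row])
```

The matrix is symmetric with a zero diagonal, so only its upper triangle carries information about the utterance.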
Really speaker-independent features
Deep neural network [Hinton+’06, ’12]
Deeply stacked artificial neural networks
Results in a huge number of weights
Unsupervised pre-training and supervised fine-tuning
Findings in DNN-based ASR [Mohamed+’12]
The first several layers seem to work as extractors of invariant features
or speaker-normalized features.
Still difficult to interpret structure and weights of DNN physically.
Interpretable DNNs are becoming one of the hot topics [Sim’15].
A simple question asked in tutorial talks of DNN
“What are really speaker-independent features?”
Asked by N. Morgan at APSIPA2013 and ASRU2013
DNN as posterior estimator
General framework for training DNN
Unsupervised pre-training and supervised training
In the latter training, speaker-adapted HMMs are used to prepare
posteriors (=labels) for each frame of the training data.
DNN is trained so that it can extract speaker-invariant features and
estimate posteriors in a speaker-independent way.
Output of DNN = posteriors (phoneme state posteriors in ASR)
Posteriors = normalized similarities
Posteriors of {
}
Can be interpreted as normalized similarity scores biased by priors.
Output of DNN = normalized similarity scores to a definite set of
speaker-adapted acoustic “anchors” of { }.
[Figure: anchors 1, 2, 3, ..., N, shown once as speaker-dependent and once as speaker-independent (invariant).]
Similarity scores can be converted to distances to “anchors”.
Either a similarity matrix or a distance matrix can be used for clustering.
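One way to read "posteriors = normalized similarities" and their conversion to distances is sketched below. The negative-log mapping is an assumption chosen for illustration (the slides do not specify the conversion), and the activation values for the three anchors are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into posteriors that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def posteriors_to_distances(posteriors):
    """One possible conversion (an assumption): distance = -log(posterior)."""
    return [-math.log(p) for p in posteriors]


frame_scores = [2.0, 0.5, -1.0]   # hypothetical output-layer activations for 3 anchors
posteriors = softmax(frame_scores)
distances = posteriors_to_distances(posteriors)
print(posteriors, distances)
```

Under this mapping the anchor with the highest posterior is the nearest one, and every distance is non-negative.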
[Figure: "Distances to anchors" compared with "Speech structure extracted from an utterance": spectrogram (spectrum slice sequence) → cepstrum vector sequence → distribution sequence → structuralization by interrelating temporally-distant events. Legend: speaker-dependent / speaker-independent (invariant).]
Invariant contrasts
DNN as speaker-invariant contrast estimation
Use of spk-dependent HMMs to prepare posterior labels
A huge amount of data is needed to train the DNN to guarantee spk-invariance
Str. extraction as speaker-invariant contrast detection
Use of within-utterance acoustic events only
Spk-invariance is guaranteed by invariant properties of f-div.
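The invariance claim can be checked numerically in a simple case. The sketch below uses the Bhattacharyya distance (one f-divergence-based measure) between two one-dimensional Gaussians and applies the same affine map to both. Treating a speaker difference as an affine map, and the specific means, variances, and map parameters, are assumptions made only for this illustration.

```python
import math


def bd_1d(m1, v1, m2, v2):
    """Bhattacharyya distance between N(m1, v1) and N(m2, v2)."""
    v = 0.5 * (v1 + v2)
    return 0.125 * (m1 - m2) ** 2 / v + 0.5 * math.log(v / math.sqrt(v1 * v2))


def affine(m, v, a, b):
    """Mean and variance of a*x + b when x ~ N(m, v)."""
    return a * m + b, a * a * v


event_1 = (0.0, 1.0)    # hypothetical (mean, variance) of one acoustic event
event_2 = (2.0, 0.5)    # hypothetical (mean, variance) of another event
a, b = 1.7, -0.4        # hypothetical "speaker" transform applied to both events

bd_orig = bd_1d(*event_1, *event_2)
bd_trans = bd_1d(*affine(*event_1, a, b), *affine(*event_2, a, b))

mean_gap_orig = abs(event_1[0] - event_2[0])
mean_gap_trans = abs(affine(*event_1, a, b)[0] - affine(*event_2, a, b)[0])
print(bd_orig, bd_trans, mean_gap_orig, mean_gap_trans)
```

The distance between the means changes with the transform, while the contrast measured by the Bhattacharyya distance does not, using only events from within the same utterance.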
Origin and evolution of language
A MODULATION-DEMODULATION MODEL FOR SPEECH
COMMUNICATION AND ITS EMERGENCE
NOBUAKI MINEMATSU
Graduate School of Info. Sci. and Tech., The University of Tokyo, Japan,
[email protected]
Perceptual invariance against large acoustic variability in speech has been a long-discussed
question in speech science and engineering (Perkell & Klatt, 2002), and it is still an open
question (Newman, 2008; Furui, 2009). Recently, we proposed a candidate answer based on
mathematically-guaranteed relational invariance (Minematsu et al., 2010; Qiao & Minematsu,
2010). Here, transform-invariant features, f-divergences, are extracted from the speech dynamics in an utterance to form an invariant topological shape which characterizes and represents the
linguistic message conveyed in that utterance. In this paper, this representation is interpreted
from a viewpoint of telecommunications, linguistics, and evolutionary anthropology. Speech
production is often regarded as a process of modulating the baseline timbre of a speaker’s voice
by manipulating the vocal organs, i.e., spectrum modulation. Then, extraction of the linguistic message from an utterance can be viewed as a process of spectrum demodulation. This
modulation-demodulation model of speech communication has a strong link to known morphological and cognitive differences between humans and apes.
Modulation used in telecommunication
[Figure (from Wikipedia): carrier + message → (modulation) → modulated carrier → (demodulation) → message.]
Figure 1. The tallest and shortest adults
Figure 2. Frequency modulation and demodulation
A musician modulates the tone from a musical instrument by varying its volume, timing and pitch. The three key parameters of a carrier sine wave are its amplitude (“volume”), its phase (“timing”) and its frequency (“pitch”), all of which can be modified in accordance with a content signal to obtain the modulated carrier. We can say that a melody contour is a pitch-modulated (frequency-modulated) version of a carrier wave, where the carrier corresponds to the baseline pitch.
We speak using our instruments, i.e., vocal organs, by varying not only the above parameters, but also the most important parameter, called the timbre or spectrum envelope. From this viewpoint, it can be said that an utterance is generated by spectrum modulation (Scott, 2007). The default shape and length of a vocal tube determines speaker-dependent voice quality and, by changing the shape or modulating the spectrum envelope, an utterance is produced and transmitted.
In a large number of previous studies in automatic speech recognition (ASR), to bridge a gap between the ASR performance and the performance of human speech recognition (HSR), much attention was paid to the dynamic aspects of
A way of characterizing speech production
Speech production as spectrum modulation
Modulation in frequency (FM), amplitude (AM), and phase (PM)
= Modulation in pitch, volume, and timing (from Wikipedia)
= Pitch contour, intensity contour, and rhythm (= prosodic features)
What about a fourth parameter, which is spectrum (timbre)?
= Modulation in spectrum (timbre) [Scott’07]
= Another prosodic feature?
[Figure: vowel chart with axes Front / Central / Back and High / Mid / Low, showing the vowels of beat, bit, bet, bat, boot, put, bird, about, bought, but, and pot.]
Tongue = modulator
Schwa = most lax = most frequent = home position = spk.-specific baseline timbre
Modulation spectrum
Critical-band based temporal dynamics of speech
“In pursuit of an invariant representation” (Greenberg’97)
RASTA (=RelAtive SpecTrA, Hermansky’94)
[Figure (Greenberg’97): block diagram with the components speech, critical-band FIR filter bank, lowpass (cutoff = 28 Hz), 100X, normalize by long-term avg., FFT, limiting to peak 30 dB and bilinear interpolation, image.]
No mathematical proof for invariance
Direction of a trajectory is rotated by VTL difference (Saito’08)
Invariant speech structure
Utterance to structure conversion using f-div. [Minematsu’06]
[Figure: spectrogram (spectrum slice sequence) → cepstrum vector sequence (c1, c2, c3, c4, ..., cD) → distribution sequence → Bhattacharyya distance (BD)-based distance matrix.]
Structuralization by interrelating temporally-distant events
An event (distribution) has to be much smaller than a phoneme.
Demodulation used in telecommunication
Demodulation in frequency, amplitude, and phase
Demodulation = a process of extracting a message intact by
removing the carrier component from the modulated carrier signal.
Not by extensive collection of samples of modulated carriers
(Not by hiding the carrier component by extensive collection)
[Figure: carrier + message → (modulation) → modulated carrier → (demodulation) → message.]
Spectrum demodulation
Speech recognition = spectrum (timbre) demodulation
Demodulation = a process of extracting a message intact by
removing the carrier component from the modulated carrier signal.
By removing speaker-specific baseline spectrum characteristics
Not by extensive collection of samples of modulated carriers
(Not by hiding the carrier component by extensive collection)
[Figure: carrier + message → (modulation) → modulated carrier → (demodulation) → message.]
Two questions
Q1: Does the ape have a good modulator?
Does the tongue of the ape work as a good modulator?
Q2: Does the ape have a good demodulator?
Does the ear (brain) of the ape extract the message intact?
[Figure: carrier + message → (modulation) → modulated carrier → (demodulation) → message.]
Structural diff. in the mouth and the nose
[Figure: two anatomical diagrams, each labeled pharynx, larynx, lung, and stomach.]
Flexibility of tongue motion
The chimp’s tongue is much stiffer than the human’s.
“Morphological analyses and 3D modeling of the tongue
musculature of the chimpanzee” (Takemoto’08)
Less capability of manipulating the shape of the tongue.
The old and new "Planet of the Apes"
Q1: Does the ape have a good modulator?
Morphological characteristics of the ape’s tongue
Two (almost) independent tracts [Hayama’99]
One is from the nose to the lung for breathing.
The other is from the mouth to the stomach for eating.
Much lower ability of deforming the tongue shape [Takemoto’08]
The chimp’s tongue is stiffer than the human’s.
[Figure: modulation of a carrier by a message.]
Nature’s solution for static bias?
How old is the invariant perception in evolution? [Hauser’03]
[Figure: two stimuli, 1 and 2, with the relation 1 = 2.]
At least, frequency (pitch) demodulation seems difficult.
Language acquisition through vocal imitation
VI = children’s active imitation of parents’ utterances
Language acquisition is based on vocal imitation [Jusczyk’00].
VI is very rare in animals. No other primate does VI [Gruhn’06].
Only small birds, whales, and dolphins do VI [Okanoya’08].
A’s VI = acoustic imitation but H’s VI = acoustic = ??
Acoustic imitation performed by myna birds [Miyamoto’95]
They imitate the sounds of cars, doors, dogs, cats as well as human voices.
Hearing a very good myna bird say something, one can guess its owner.
Beyond-scale imitation of utterances performed by children
No one can guess a parent by hearing the voices of his/her child.
Very weird imitation from a viewpoint of animal science [Okanoya’08].
?
Q2: Does the ape have a good demodulator?
Cognitive difference bet. the ape and the human
Humans can extract embedded messages in the modulated carrier.
It seems that animals treat the modulated carrier as it is.
From the modulated carrier, what can they know?
The apes can identify individuals by hearing their voices.
Lower/higher formant frequencies = larger/smaller apes
[Figure: carrier + message → (modulation) → modulated carrier → (demodulation) → message.]
Function of the voice timbre
What is the original function of the voice timbre?
For apes
The voice timbre is an acoustic correlate of the identity of apes.
For speech scientists and engineers
They started their research by correlating the voice timbre with messages
conveyed by the speech stream, such as words and phonemes.
Formant frequencies are treated as acoustic correlates of vowels.
“Speech recognition” started first; “speaker recognition” followed.
$$f_n = \frac{c}{2 l_1} n, \qquad f_n = \frac{c}{2 l_2} n, \qquad f = \frac{c}{2\pi} \left( \frac{A_2}{A_1 l_1 l_2} \right)^{1/2}$$
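The three resonance formulas can be evaluated numerically to see why a longer vocal tube gives lower resonance frequencies. This is a rough sketch: reading tube 1 (length l1, area A1) as the back cavity and tube 2 (length l2, area A2) as the narrower front section is an assumption, and the dimensions and the speed of sound value are hypothetical round numbers.

```python
import math

C = 35000.0   # speed of sound in cm/s (approximate, assumed)


def tube_resonances(length_cm, n_max=3):
    """f_n = c * n / (2 * l) for n = 1..n_max."""
    return [C * n / (2.0 * length_cm) for n in range(1, n_max + 1)]


def helmholtz_resonance(a1, l1, a2, l2):
    """f = (c / (2*pi)) * sqrt(A2 / (A1 * l1 * l2))."""
    return C / (2.0 * math.pi) * math.sqrt(a2 / (a1 * l1 * l2))


# Hypothetical two-tube dimensions (cm and cm^2).
A1, L1, A2, L2 = 8.0, 9.0, 1.0, 6.0
SCALE = 1.2   # a speaker whose tube lengths are 20% longer (areas kept the same)

small = tube_resonances(L1) + tube_resonances(L2) + [helmholtz_resonance(A1, L1, A2, L2)]
large = (tube_resonances(L1 * SCALE) + tube_resonances(L2 * SCALE)
         + [helmholtz_resonance(A1, L1 * SCALE, A2, L2 * SCALE)])
print(small)
print(large)
```

With only the lengths scaled, every resonance moves down by the same factor, which is consistent with lower formant frequencies indicating a larger individual.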
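The symbol grounding problem
How did symbols come into being?
Why did symbols come to have "meaning"?
Why did symbols come to be linked to "a certain object"?
Why did symbols come to be linked to "a certain memory"?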
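Contributed article
Special feature: Delivering Words
"The Power of the Voice"
What is voice, and what is language?
Thoughts arising from speech research
Professor, Graduate School of Engineering, The University of Tokyo
Nobuaki Minematsu
What is voice? What is language? Professor Nobuaki Minematsu, a leading figure in speech engineering, takes up this fundamental theme. What has come into view, at the opposite extreme, through his research on making machines recognize and synthesize speech? It is a mysterious human ability: language and memory. While humans manipulate language, their memories are in fact being manipulated by language. He approaches this mystery with the eyes of science.
Profile: Nobuaki Minematsu
Graduated from the Faculty of Engineering, The University of Tokyo, in 1990, and received a Ph.D. in engineering from the Graduate School of Engineering, The University of Tokyo, in 1995. He joined Toyohashi University of Technology in 1995 and returned to The University of Tokyo in 2000. He is currently a professor in the Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo. He conducts research on speech communication from a broad perspective ranging from speech science to speech engineering. He is particularly well versed in language education using speech technology and has been developing OJAD since 2009.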
What is the goal of speech engineering?
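A horse that can calculate
What we can learn from Clever Hans
What is missing?
Two axes
Development
How do we acquire language as we grow up?
Evolution
How did we acquire language in the course of evolution?
Unless technology is developed with these two axes squarely in view ...
the result may be a system that merely appears to handle language.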
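At an open campus for high school students
What kind of computer is a computer that understands language?
A simple question for high school students from an engineering faculty member who studies language at the University of Tokyo
■ Siri, Shabette Concier, IBM Watson: are they computers that "understand language"?
"What time is it now in New York?" "It is 10 p.m. on August 6."
"How high is the stage of Kiyomizu-dera?" "About 13 meters."
"What is the game in which, when the soda bottle stops spinning, the person in front of the mouth of the bottle puckers up their lips?" "Spin-the-bottle."
They appear to understand what was spoken or written, examine it, and reply.
So do they really "understand language", or do they only "appear to understand language"?
This poster was made to get high school students to think a little more deeply about what it means to "understand language". It introduces how earlier thinkers approached the question above. Perhaps the one who ends up building a computer that truly understands language will be you, a few years or a few decades from now.
■ Do you know the "Turing test"?
A test devised by the mathematician Alan Turing to judge "whether a machine is intelligent".
A human judge C converses in ordinary language with isolated partners A and B. One of A and B is a machine and the other is a human. After the conversation, C guesses which is the human and which is the machine. If the distinction is difficult, the machine passes the test, that is, it is judged to be intelligent.
It is still a criterion often used in "artificial intelligence" research today.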
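■ Do you know the "Chinese room"?
A sharp rejoinder (thought experiment) that the philosopher John Searle posed against the Turing test.
A person who understands only the alphabet is shut in a small room. The room has one hole for exchanging slips of paper with the outside. A slip of paper is passed to this person through the hole. Something is written on it in Chinese characters, but to him it is merely a string of symbols. His job is to add a new string of symbols to this string and return it to the outside. Which string to add is written in a manual. For example: if it says "★△◎▽☆□", add "■＠◎▽" and pass it out.
A person observing the slips of paper from outside the room would think that "there is someone inside who understands Chinese", even though the person in the room cannot understand Chinese characters at all.
■ There may be quite a lot of examples of things that only appear to do XX ...
Planetarium: it basically moves the stars according to the geocentric model, since the seats do not move. But if the aim is to reproduce the apparent motion of the stars, the geocentric and heliocentric models give almost the same result.
Clever Hans: a horse that "could calculate" and became famous in Germany in the early 20th century. The trick was later uncovered by scientific methods.
DaiGo: a mentalist who has been stirring up the Japanese television industry in the early 21st century. In his case, though, he states himself that "there is a trick".
Should one craft the appearance skillfully, or insist on getting the inner mechanism right?
■ In the end, it is hard to define what a computer must be able to do to count as a computer that "understands language".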
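It is hard to define necessary and sufficient conditions for realizing a computer that "understands language". Perhaps all we can do is identify necessary conditions. Which necessary conditions to focus on and implement as technology then becomes each researcher's particular commitment and shows up in their research strategy. So, if you were to build a computer that "understands language", what kind of computer would you build? Please try to find your own answer in this room.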
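Winter semester report
Academic year Heisei 26 (2014), Acoustic Phonetics, winter semester report assignment
Deadline: Monday, February 2
Submission deadline: 1/18 (Mon)
Submit to: by e-mail to [email protected]
Question 1. In speech recognition, cepstrum coefficients are often used rather than the spectrum envelope. First state what cepstrum coefficients are, and then explain why the cepstrum is used (rather than the spectrum envelope).
Question 2. Describe what you know about the technique called Dynamic Time Warping.
Question 3. Describe what you know about the technique called the hidden Markov model (HMM).
Question 4. HMMs are used as acoustic models both in speech recognition and in speech synthesis. The underlying technique is the same, but there are various differences in how they are used and under what conditions. Describe what you know about these differences.
Question 5. Speech recognition is based on acoustic comparison (acoustic matching) between the input speech and multiple templates held by the system. In the past, DTW was the mainstream method of acoustic matching. Later, it became standard to perform acoustic matching by comparing the input speech (a cepstrum sequence) with HMMs. Explain what advantages the latter has over the former.
Question 6. In addition to acoustic models such as HMMs, speech recognition needs the help of a language model that predicts in advance which word will come next. Two main ways of building a language model (how to predict the next word, how to narrow down the candidate next words) were explained in class. Explain each of them.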
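Question 7. It is common practice to transform a conditional probability using Bayes' theorem as shown below. First, fill in the blanks (□) in the first equation. Note that there is a vertical bar between the second and third blanks. Next, explain why the second equation holds, using the first equation. Finally, explain what the second equation means in the context of speech recognition, that is, with o = the observed (input) cepstrum sequence and w = a word sequence.
$$P(w|o) = \frac{P(o|w)P(w)}{P(o)} = \frac{P(o|w)P(w)}{\sum_{w'} P(o, \square)} = \frac{P(o|w)P(w)}{\sum_{w'} P(\square|\square)P(\square)}$$
$$\operatorname*{argmax}_{w} P(w|o) = \operatorname*{argmax}_{w} P(o|w)P(w)$$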
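The argmax relation can be tried out on a toy example. The sketch below scores two candidate word sequences by P(o|w)P(w) and shows that dividing every score by the same positive constant does not change the winner. The candidate strings and all probability values are made-up illustration values, and the constant is treated simply as an arbitrary positive number.

```python
# Hypothetical acoustic likelihoods P(o|w) and language-model priors P(w).
likelihood = {"recognize speech": 0.020, "wreck a nice beach": 0.025}
prior = {"recognize speech": 0.7, "wreck a nice beach": 0.3}


def decode(constant=1.0):
    """Return the w maximizing P(o|w) P(w) / constant, for any constant > 0."""
    scores = {w: likelihood[w] * prior[w] / constant for w in likelihood}
    return max(scores, key=scores.get)


best = decode()
best_rescaled = decode(constant=0.0215)
best_acoustic_only = max(likelihood, key=likelihood.get)
print(best, best_rescaled, best_acoustic_only)
```

In this toy case the acoustic score alone prefers one candidate, while the product with the prior prefers the other, and rescaling by a constant that does not depend on w leaves the decision unchanged.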
Question 8. Furthermore, letting s = speaker, explain what the following two equations mean.
$$P(o|w) = \sum_{s} P(o,s|w) = \sum_{s} P(o|s,w)P(s|w) \approx \sum_{s} P(o|s,w)P(s)$$
$$P(o|s) = \sum_{w} P(o,w|s) = \sum_{w} P(o|s,w)P(w|s) \approx \sum_{w} P(o|s,w)P(w)$$
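Question 9. In HMM-based speech synthesis, the cepstrum sequence with maximum likelihood can be obtained from a sentence HMM (a very long HMM). With a naive implementation, however, the mean cepstrum of state i (distribution i) is output for several frames, then the mean vector of state i + 1 is output for several frames, and so on, so that a staircase-like cepstrum time series is generated. Natural synthetic speech cannot be obtained this way. Describe what you know about the techniques for obtaining a smooth spectral pattern (cepstrum time-series pattern). (Use of equations is not required.)
Question 10. The last several lectures covered the structural representation of speech: a) its theoretical background (with the axes of development and evolution in view), b) its mathematical formulation, c) examples of application to practical apps and systems, and d) speculation about the origin of language. Read the following three essays (available on the web) and comment freely on the structural representation of speech. Note that each essay says something different, so it will be obvious from your comments which ones you have not read.
「音声に含まれる言語的情報を非言語的情報から音響的に分離して抽出する手法の提案」 (A proposal of a method for acoustically separating and extracting the linguistic information in speech from the non-linguistic information)
「声とは言葉とは何か ∼音声研究を通して考えること∼」 (What is voice, what is language? Thoughts arising from speech research)
“A MODULATION-DEMODULATION MODEL FOR SPEECH COMMUNICATION AND ITS EMERGENCE”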
