Acoustic Phonetics - The University of Tokyo

Graduate School of Humanities and Sociology, Department of Basic Culture Studies, Linguistics
Acoustic Phonetics
(Topics in Acoustic Phonetics)
Nobuaki Minematsu (峯松 信明)
Graduate School of Engineering, Department of Electrical Engineering
Speech corpus of World Englishes
Speech Accent Archive (SAA) [Weinberger’13]
A common paragraph read by >1.8K international speakers
The paragraph is designed to achieve high phonemic coverage of AE.
Speech samples and their narrow IPA transcripts are provided.
Please call Stella. Ask her to bring these things with her from
the store: Six spoons of fresh snow peas, five thick slabs of blue
cheese, and maybe a snack for her brother Bob. We also need a
small plastic snake and a big toy frog for the kids. She can
scoop these things into three red bags, and we will go meet her
Wednesday at the train station.
An interesting video
24 Accents in English (from YouTube)
https://www.youtube.com/watch?v=dABo_DCIdpM
Pronunciation Structure Analysis (PSA)
One utterance is represented as one distance matrix.
[Figure: from waveform to feature instances, the waveform is converted into a spectrogram (a vector sequence), i.e., a trajectory in a vector space (c1, c2, ..., cD); from feature instances to feature relations (structure), the vector sequence becomes a distribution sequence, pairwise distances are calculated, and the reading of "Please call Stella. Ask her ...." yields a 221 x 221 distance matrix.]
The paragraph contains 221 phonemes based on the CMU pron. dic.
Can remove age and gender differences effectively [Minematsu’06].
Detailed procedure of extracting the PS
UBM-HMM and MAP adaptation for structure calculation
[Diagram: phoneme HMMs are trained on SAA + WSJ and combined into a paragraph UBM-HMM; the UBM-HMM is MAP-adapted to each speaker's reading of "Please call Stella. ..."; structure calculation on the adapted paragraph HMM yields the 221 x 221 distance matrix (zero diagonal). A minimal sketch of the adaptation and distance steps follows below.]
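The adaptation and distance steps can be sketched in a few lines. This is a minimal illustration only: it assumes each of the 221 phoneme segments is modeled by a single diagonal-covariance Gaussian, and the function names, the prior weight tau, and the single-Gaussian simplification are placeholders of mine, not the UBM-HMM toolkit setup actually used in the study.

```python
import numpy as np

def map_adapt_mean(ubm_mean, ubm_var, frames, tau=10.0):
    """Mean-only MAP adaptation of one Gaussian toward a speaker's frames.
    tau is the prior weight; a larger tau keeps the mean closer to the UBM."""
    n = len(frames)
    if n == 0:
        return ubm_mean
    sample_mean = frames.mean(axis=0)
    return (tau * ubm_mean + n * sample_mean) / (tau + n)

def bhattacharyya(m1, v1, m2, v2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    v = 0.5 * (v1 + v2)
    term1 = 0.125 * np.sum((m1 - m2) ** 2 / v)
    term2 = 0.5 * np.sum(np.log(v / np.sqrt(v1 * v2)))
    return term1 + term2

def pronunciation_structure(means, variances):
    """221 adapted Gaussians -> 221 x 221 distance matrix (the structure)."""
    K = len(means)
    D = np.zeros((K, K))
    for i in range(K):
        for j in range(i + 1, K):
            D[i, j] = D[j, i] = bhattacharyya(means[i], variances[i],
                                              means[j], variances[j])
    return D
```

With K = 221 adapted Gaussians, the resulting matrix is the pronunciation structure that is compared across speakers on the following slides.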
Pron. distance calculation using structure
A common paragraph to pron. structure
[Figure: each speaker's reading of the common paragraph ("Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack ....") is converted into a 221 x 221 pronunciation structure (c1, ..., cD), which is then related to the IPA-based distance between speakers; the label "1.5 billions" appears in the original figure.]
Experimental conditions
Differential features between two speakers
#elements = 221 x 220 / 2 = 24K
Prediction mechanism
Support Vector Regression (SVR) using the radial basis function as kernel (a minimal sketch follows after this list)
Two modes of preparing training data and testing data
Speaker-open mode
Openness is guaranteed in speakers used for training and testing.
Speaker-pair-open mode
Openness is guaranteed only in speaker pairs used for training and testing.
#speakers = 370, #speaker pairs = 370 x 369 / 2 = 68,265
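As a rough illustration of the prediction mechanism above, here is a minimal sketch using scikit-learn's SVR with an RBF kernel. The arrays X and y are toy placeholders: in the study, each row of X would hold the 221 x 220 / 2 = 24,310 differential structure features of a speaker pair and y the corresponding IPA-based reference distance; the hyperparameters are illustrative, not those of [Kasahara et al.,'14].

```python
import numpy as np
from sklearn.svm import SVR

# Toy stand-ins: one row per speaker pair (24,310 differential structure
# features), one reference pronunciation distance per pair.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 24310))
y = rng.uniform(0.0, 1.8, size=200)

model = SVR(kernel="rbf", C=1.0, epsilon=0.05)   # RBF kernel, as on the slide
model.fit(X, y)
predicted = model.predict(X[:5])                 # predicted pron. distances
```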
Experimental results (1/2)
Corr. bet. IPA distances and predicted ones [Kasahara et al.,’14]
mode         speaker-open   speaker-pair-open
correlation  0.547          0.903
What we can do and cannot do now
[Scatter plot: predicted distance vs. reference distance (0 to 1.8) for the proposed method (K=220) in the speaker-pair-open mode, Corr. = 0.903; the speaker-open mode is still marked with "?".]
Experimental results (2/2)
Use of phonemic transcripts to calculate pron. distance
The IPA transcripts of the SAA are converted into phonemic ones.
Monophone HMMs are trained using the SAA and the TIMIT.
Use of network grammar carefully designed for the SAA paragraph.
DTW between two phonemic transcripts is done with the HMMs (a DTW sketch follows after the tables below).
Corr. bet. IPA distances and phonemic distances
Phoneme recognition accuracy = 73.5 %
mode         ASR            ground truth
correlation  0.458          0.829

mode         speaker-open   speaker-pair-open
correlation  0.547          0.903
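To make the DTW step above concrete, here is a minimal sketch of aligning two phonemic transcripts. It is a toy under stated assumptions: the real system derives substitution costs from the monophone HMMs trained on the SAA and TIMIT, whereas here `cost` is any user-supplied table (a 0/1 cost in the usage lines).

```python
import numpy as np

def dtw_distance(seq_a, seq_b, cost):
    """DTW alignment cost between two phoneme sequences.
    cost(p, q) is the substitution cost between phonemes p and q."""
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(seq_a[i - 1], seq_b[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy usage with a 0/1 substitution cost (placeholder for HMM-derived costs):
a = ["p", "l", "i", "z", "k", "O", "l"]
b = ["p", "l", "i", "z", "k", "o", "r"]
d = dtw_distance(a, b, lambda p, q: 0.0 if p == q else 1.0)
```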
Application of speaker-pair-open prediction
TED talks browser from your viewpoint
If TED talkers provide their SAA readings....
If these readings are transcribed by phoneticians....
Visualization of pronunciation diversity [Kawase et al.,’14]
[Figure: a self-centered visualization of pronunciation diversity, with the surrounding speakers labeled "young" and "old".]
Y. Kawase, et al., “Visualization of pronunciation diversity of World Englishes
from a speaker’s self-centered viewpoint”
Possible application of spk-open prediction
Individual-based and really global map of WE pron.
Can be used as a WE communication facilitator
Easy access to speakers with pronunciation similar to yours
at a restaurant
at a hotel
at a ticket office
Possible application of spk-open prediction
Adaptation of others’ accents to a listener’s own accent
[Figure: an "accent adapter", analogous to an optical adapter, that stores the accent info. of its owner and converts other speakers' accents into the owner's own accent.]
Outline of this talk
Physical diversity of stimuli and its cognitive invariance
Diversity in visual appearance, color, and pitch, and the solutions devised by nature and evolution
Physical diversity of speech and its cognitive invariance
Diversity in timbre and the solutions devised by engineers
Structural representation of speech and various related considerations
Trying to resolve the sense of strangeness by overturning common assumptions
Structural representation of speech: mathematical formulation and technical implementation
What representation of speech waveforms and spectra is invariant to body size and gender?
Speech applications using the structural representation of speech
Speech recognition, speech synthesis, pronunciation analysis, etc.
Linguistic validity of the structural representation of speech
A new technique or an old one? And on the origin of language
Claims found in classical phonology
Roman Jakobson (1896-1982)
The sound shape of language (1949)
Physiologically identical sounds may possess different
values in conformity with the whole sound system, i.e.
with their relations to the other sounds.
We have to put aside the accidental properties of
individual sounds and substitute a general expression
that is the common denominator of these variables.
Claims found in classical phonology
Nikolay Trubetskoy (1890-1938)
Principles of Phonology (1939)
The phonemes should not be considered as building blocks out of
which individual words are assembled. Each word is a phonic entity, a
Gestalt, and is also recognized as such by the hearer.
As a Gestalt, each word contains something more than the sum of its
constituents (phonemes), namely, the principle of unity that holds the
phoneme sequence together and lends individuality to a word.
Claims found in classical phonology
Ferdinand de Saussure (1857-1913)
Father of modern linguistics
Course in General Linguistics (1916)
What defines a linguistic element, conceptual or phonic, is the relation
in which it stands to the other elements in the linguistic system.
The important thing in the word is not the sound alone but the phonic
differences that make it possible to distinguish this word from the
others.
Language is a system of only conceptual differences and phonic
differences.
What is the vowel /a/, really?
Two views: element first or system first?
An Introduction to Descriptive Linguistics
Gleason, H. A., 1961
A phoneme is a class of sounds that: (1) are phonetically similar and (2) show
certain characteristic patterns of distribution in the language or dialect under
consideration.
A phoneme is one element in the sound system of a language having a
characteristic set of interrelations with each of the other elements in that system.
The phoneme cannot, therefore, be acoustically defined. The phoneme is a
feature of language structure. That is, it is an abstraction from the psychological
and acoustical patterns that enables a linguist to describe the observed
repetitions of things that seem to function within the system as identical in spite
of obvious differences. The phonemes of a language are a set of abstractions.
[Figure: the five Japanese vowels /a/, /i/, /u/, /e/, /o/ drawn as vowel systems of several different speakers.]
Jakobson's distinctive features
Describing sound differences and sound systems with distinctive features
[Figure: the oppositions compact vs. diffuse and grave vs. acute illustrated with /k/, /a/, /p/, /t/, /u/, /i/. From "Natural Classes: Distinctive Features," L201.5 Introduction to Linguistic Theory, Ji-yung Kim, Spring 2001 (T, 03/06/01).]
Describing phonemes with distinctive features
Table 1. Distinctive Features of American English Consonants
[Table: +/- values for the consonants p, b, m, f, v, T, D, t, d, n, s, z, l, R, S, Z, tS, dZ, j, ɹ, k, g, N, w, ʔ, h over the features Back, High, Coronal, Anterior, Labial, Continuant, Lateral, Nasal, Sonorant, Strident, Voiced; the individual cell values are not reproduced here.]
Two structuralisms and their transition
[Figure: a timeline from 1800 to 2000 of structuralist thought: Ernst Mach (1838-1916; physics, acoustics, psychology), Christian von Ehrenfels (1859-1932; Gestalt psychology), Ferdinand de Saussure (1857-1913; general linguistics, European structural linguistics), Roman Jakobson (1896-1982; structural phonology, American structural linguistics), Claude Lévi-Strauss (1905-; structural anthropology, mythology), James Gibson (1904-1979; ecological epistemology, affordance), connectionism, and complex systems. Overlaid are Table 1 (Distinctive Features of American English Consonants) and Table 2 (Distinctive Features of American English Vowels: i, I, e, E, æ, u, U, o and others over features such as high, low, back, rounded, ATR); the individual cell values are not reproduced here.]
Two structuralisms and their transition (continued)
Ferdinand de Saussure (1857-1913): "What language contains is neither ideas nor sounds that exist prior to the linguistic system, but only the conceptual differences and the phonic differences that arise from that system." How is the world of meaning partitioned into regions in each language?
Roman Jakobson (1896-1982): expressed the difference between two phonemes with a small set of distinctive features, i.e., a geometric structure over the set of phonemes. He eventually defined the phoneme itself as a bundle of features; was that a step too far? How is the world of sounds partitioned into regions for each speaker?
Claude Lévi-Strauss (1905-): found invariant mathematical structures in the kinship systems of pre-modern societies and, through abstraction, invariant structures across a variety of myths.
Structure = the properties and relations that remain unchanged under a given transformation operation.
Structure and transformation are two sides of the same coin: the definition of a structure depends on which transformation one focuses on.
The definition of an element becomes possible only through its differences from the others.
[Timeline labels: Ernst Mach (1838-1916; physics, acoustics, psychology), Christian von Ehrenfels (1859-1932; Gestalt psychology), James Gibson (1904-1979; ecological epistemology, affordance), connectionism, complex systems.]
Two structuralisms and their transition
Ernst Mach (1838-1916): "We must begin with the study of individual elements, even though nature did not begin with elements. It is certainly fortunate that we can at times turn our eyes away from the overwhelming whole and concentrate on individual points, but we must not neglect to go back and study, repair, and correct what was provisionally ignored." (a warning against element reductionism)
All natural phenomena are complexes built on the functional interdependence of sensory elements. Every observation is nothing but a human sensory quantity; the senses of space and time are human sensations too, and in that sense non-Euclidean space is more fundamental than Euclidean space.
"The mass of a body is determined by its relations to all the bodies around it; in an otherwise empty space, the mass of a body has no meaning whatsoever."
Christian von Ehrenfels (1859-1932): "Perception does not arise by integrating the individual stimuli of an object; recognition occurs beforehand, within a holistic framework."
Gestalt = the whole = something more than "the simple sum of the parts".
Transposability and invariance: even when certain deformations are applied to a Gestalt, the Gestalt remains invariant (e.g., a melody under key transposition).
James Gibson (1904-1979): "Vision is not 'elementary stimulus sensations plus their integration'; its essence lies in the invariant properties seen in the motion (deformation) of objects."
The perception of surfaces is based on the structure of the differences created by light (the invariants formed by differences in light). Information is not created inside the human head; it exists in the environment itself.
Deformation and invariance, transposability and invariance, transformation and structure.
[Timeline labels: structuralist thought, Claude Lévi-Strauss (1905-; structural anthropology, mythology), Roman Jakobson (1896-1982; structural phonology, American structural linguistics), Ferdinand de Saussure (1857-1913; general linguistics, European structural linguistics), connectionism, complex systems; 1800, 1900, 2000.]
Two structuralisms and their transition
Not by collecting elementary sensations, but by integrating the relations and differences among elements, we find the structure that stays invariant under the transformation or deformation operations we choose to focus on.
[Timeline: Ernst Mach (1838-1916; physics, acoustics, psychology), Christian von Ehrenfels (1859-1932; Gestalt psychology), Ferdinand de Saussure (1857-1913; general linguistics, European structural linguistics), Roman Jakobson (1896-1982; structural phonology, American structural linguistics), Claude Lévi-Strauss (1905-; structural anthropology, mythology, structuralist thought), James Gibson (1904-1979; ecological epistemology, affordance), connectionism, complex systems; 1800, 1900, 2000.]
Origin and evolution of language
A MODULATION-DEMODULATION MODEL FOR SPEECH
COMMUNICATION AND ITS EMERGENCE
NOBUAKI MINEMATSU
Graduate School of Info. Sci. and Tech., The University of Tokyo, Japan,
[email protected]
Perceptual invariance against large acoustic variability in speech has been a long-discussed
question in speech science and engineering (Perkell & Klatt, 2002), and it is still an open
question (Newman, 2008; Furui, 2009). Recently, we proposed a candidate answer based on
mathematically-guaranteed relational invariance (Minematsu et al., 2010; Qiao & Minematsu,
2010). Here, transform-invariant features, f-divergences, are extracted from the speech dynamics in an utterance to form an invariant topological shape which characterizes and represents the
linguistic message conveyed in that utterance. In this paper, this representation is interpreted
from a viewpoint of telecommunications, linguistics, and evolutionary anthropology. Speech
production is often regarded as a process of modulating the baseline timbre of a speaker’s voice
by manipulating the vocal organs, i.e., spectrum modulation. Then, extraction of the linguistic message from an utterance can be viewed as a process of spectrum demodulation. This
modulation-demodulation model of speech communication has a strong link to known morphological and cognitive differences between humans and apes.
Modulation used in telecommunication
[Figure 1: the tallest and shortest adults. Figure 2: frequency modulation and demodulation (from Wikipedia).]
A musician modulates the tone from a musical instrument by varying
its volume, timing and pitch. The three key parameters of a carrier
sine wave are its amplitude (“volume”), its phase (“timing”) and its
frequency (“pitch”), all of which can be modified in accordance with
a content signal to obtain the modulated carrier.
We can say that a melody contour is a pitch-modulated (frequency-modulated) version of a carrier wave, where the carrier corresponds to the baseline pitch.
We speak using our instruments, i.e., vocal organs, by varying not only the above parameters but also the most important parameter, called the timbre or spectrum envelope. From this viewpoint, it can be said that an utterance is generated by spectrum modulation (Scott, 2007). The default shape and length of a vocal tube determines speaker-dependent voice quality and, by changing the shape or modulating the spectrum envelope, an utterance is produced and transmitted.
In a large number of previous studies in automatic speech recognition (ASR), to bridge the gap between ASR performance and the performance of human speech recognition (HSR), much attention was paid to the dynamic aspects of ...
[Diagram: message -> modulation -> modulated carrier (= carrier + message) -> demodulation -> message.]
A way of characterizing speech production
Speech production as spectrum modulation
Modulation in frequency (FM), amplitude (AM), and phase (PM)
= Modulation in pitch, volume, and timing (from Wikipedia)
= Pitch contour, intensity contour, and rhythm (= prosodic features)
What about a fourth parameter, which is spectrum (timbre)?
= Modulation in spectrum (timbre) [Scott’07]
= Another prosodic feature?
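To ground the analogy above, here is a minimal numeric sketch of frequency modulation and demodulation of a sine carrier, in the spirit of the "melody as a pitch-modulated carrier" picture. All values (carrier frequency, deviation, the 3 Hz toy message) are illustrative assumptions; demodulation here uses the instantaneous frequency of the analytic signal.

```python
import numpy as np
from scipy.signal import hilbert

fs = 16000
t = np.arange(fs) / fs                       # 1 second of samples
message = 2.0 * np.sin(2 * np.pi * 3 * t)    # slow, melody-like message
fc = 440.0                                   # carrier = baseline pitch (Hz)
kf = 50.0                                    # frequency deviation per unit message

# Frequency modulation: instantaneous frequency = fc + kf * message
phase = 2 * np.pi * np.cumsum(fc + kf * message) / fs
modulated = np.cos(phase)

# Demodulation: recover the instantaneous frequency from the analytic signal,
# then remove the carrier component (fc) to get the message back.
inst_phase = np.unwrap(np.angle(hilbert(modulated)))
inst_freq = np.diff(inst_phase) * fs / (2 * np.pi)
recovered = (inst_freq - fc) / kf            # approximates `message`
```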
[Figure: the American English vowel chart (Front / Central / Back by High / Mid / Low) with the words beat, bit, bet, bat, boot, put, bird, bought, pot, but, about, shown changing over time t. The tongue is the modulator; the schwa is the most lax and most frequent vowel, the tongue's home position, and the speaker-specific baseline timbre.]
Modulation spectrum
Critical-band based temporal dynamics of speech
“In pursuit of an invariant representation” (Greenberg’97)
RASTA (=RelAtive SpecTrA, Hermansky’94)
[Diagram (Greenberg'97): speech -> critical-band FIR filter bank -> squaring -> lowpass filtering (cutoff = 28 Hz) -> 100x downsampling -> normalization by the long-term average -> |FFT|^2, with limiting to the top 30 dB and bilinear interpolation for the displayed image.]
No mathematical proof for invariance
Direction of a trajectory is rotated by VTL difference (Saito’08)
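A minimal sketch of the per-band processing chain in the diagram above, assuming SciPy; the filter order, the exact downsampling factor, and the helper name are placeholders of mine rather than the original RASTA/Greenberg implementation.

```python
import numpy as np
from scipy.signal import butter, filtfilt, resample_poly

def modulation_spectrum(band_signal, fs=16000, cutoff=28.0, decim=100):
    """Modulation spectrum of one critical-band signal, roughly following the
    slide's pipeline: squaring -> 28 Hz lowpass -> heavy downsampling ->
    normalization by the long-term average -> FFT magnitude."""
    envelope = band_signal ** 2                       # instantaneous power
    b, a = butter(4, cutoff / (fs / 2))               # 28 Hz lowpass
    smooth = filtfilt(b, a, envelope)
    slow = resample_poly(smooth, up=1, down=decim)    # ~100x downsampling
    slow = slow / (np.mean(slow) + 1e-12)             # long-term normalization
    spectrum = np.abs(np.fft.rfft(slow)) ** 2
    mod_freqs = np.fft.rfftfreq(len(slow), d=decim / fs)
    return mod_freqs, spectrum
```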
Invariant speech structure
Utterance to structure conversion using f-div. [Minematsu'06]
[Figure: the spectrogram (a sequence of spectrum slices) is converted into a sequence of cepstrum vectors, then into a sequence of distributions (c1, c2, ..., cD), and the Bhattacharyya distances between all pairs of distributions form a BD-based distance matrix.]
Structuralization by interrelating temporally-distant events
An event (distribution) has to be much smaller than a phoneme.
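The reason f-divergences are used here is their transform invariance: the Bhattacharyya distance between two distributions does not change when the whole feature space is warped by an invertible map. The sketch below checks this numerically for two Gaussians under an arbitrary affine map standing in for a speaker transform; the particular matrices are illustrative only.

```python
import numpy as np

def bhattacharyya_full(m1, S1, m2, S2):
    """Bhattacharyya distance between two full-covariance Gaussians."""
    S = 0.5 * (S1 + S2)
    d = m1 - m2
    term1 = 0.125 * d @ np.linalg.solve(S, d)
    term2 = 0.5 * np.log(np.linalg.det(S) /
                         np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term1 + term2

rng = np.random.default_rng(1)
m1, m2 = rng.normal(size=2), rng.normal(size=2)
S1 = np.diag([0.5, 1.0]); S2 = np.diag([1.5, 0.8])

# Any invertible affine map of the feature space (a stand-in speaker transform)
A = np.array([[1.2, 0.3], [0.0, 0.9]]); b = np.array([2.0, -1.0])

d_before = bhattacharyya_full(m1, S1, m2, S2)
d_after  = bhattacharyya_full(A @ m1 + b, A @ S1 @ A.T,
                              A @ m2 + b, A @ S2 @ A.T)
# d_before and d_after agree up to numerical error: the distance matrix
# (the structure) is invariant as long as speaker differences act as
# invertible transforms of the feature space.
```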
Demodulation used in telecommunication
Demodulation in frequency, amplitude, and phase
Demodulation = a process of extracting the message intact by removing the carrier component from the modulated carrier signal.
Not by extensive collection of samples of modulated carriers
(Not by hiding the carrier component by extensive collection)
[Diagram: message -> modulation -> modulated carrier (= carrier + message) -> demodulation -> message.]
Spectrum demodulation
Speech recognition = spectrum (timbre) demodulation
Demodulation = a process of extracting the message intact by removing the carrier component from the modulated carrier signal.
By removing speaker-specific baseline spectrum characteristics
Not by extensive collection of samples of modulated carriers
(Not by hiding the carrier component by extensive collection)
[Diagram: message -> modulation -> modulated carrier (= carrier + message) -> demodulation -> message.]
Two questions
Q1: Does the ape have a good modulator?
Does the tongue of the ape work as a good modulator?
Q2: Does the ape have a good demodulator?
Does the ear (brain) of the ape extract the message intactly?
[Diagram: message -> modulation -> modulated carrier (= carrier + message) -> demodulation -> message.]
Structural diff. in the mouth and the nose
[Figure: human vs. chimpanzee vocal anatomy, with the pharynx, larynx, lung, and stomach labeled.]
Flexibility of tongue motion
The chimp’s tongue is much stiffer than the human’s.
“Morphological analyses and 3D modeling of the tongue
musculature of the chimpanzee” (Takemoto’08)
Less capability of manipulating the shape of the tongue.
The old and the new "Planet of the Apes"
Q1: Does the ape have a good modulator?
Morphological characteristics of the ape’s tongue
Two (almost) independent tracts [Hayama’99]
One is from the nose to the lung for breathing.
The other is from the mouth to the stomach for eating.
Much lower ability of deforming the tongue shape [Takemoto’08]
The chimp’s tongue is stiffer than the human’s.
[Diagram: message + carrier -> modulation -> modulated carrier.]
Nature's solution for static bias?
How old is the invariant perception in evolution? [Hauser’03]
At least, frequency (pitch) demodulation seems difficult.
Language acquisition through vocal imitation
VI = children’s active imitation of parents’ utterances
Language acquisition is based on vocal imitation [Jusczyk’00].
VI is very rare in animals. No other primate does VI [Gruhn’06].
Only small birds, whales, and dolphins do VI [Okanoya’08].
Animals' VI = acoustic imitation, but humans' VI is not acoustic imitation. Then what is it?
Acoustic imitation performed by myna birds [Miyamoto’95]
They imitate the sounds of cars, doors, dogs, cats as well as human voices.
Hearing a very good myna bird say something, one can guess its owner.
Beyond-scale imitation of utterances performed by children
No one can guess a parent by hearing the voices of his/her child.
Very weird imitation from a viewpoint of animal science [Okanoya’08].
Q1: Does the ape have a good modulator?
Morphological characteristics of the ape’s tongue
Two (almost) independent tracts [Hayama’99]
One is from the nose to the lung for breathing.
The other is from the mouth to the stomach for eating.
Much lower ability of deforming the tongue shape [Takemoto’08]
The chimp’s tongue is stiffer than the human’s.
[Diagram: message + carrier -> modulation -> modulated carrier.]
Q2: Does the ape have a good demodulator?
Cognitive difference bet. the ape and the human
Humans can extract embedded messages in the modulated carrier.
It seems that animals treat the modulated carrier as it is.
From the modulated carrier, what can they know?
The apes can identify individuals by hearing their voices.
Lower/higher formant frequencies = larger/smaller apes
[Diagram: message -> modulation -> modulated carrier (= carrier + message) -> demodulation -> message.]
Function of the voice timbre
What is the original function of the voice timbre?
For apes
The voice timbre is an acoustic correlate with the identity of apes.
For speech scientists and engineers
They had started research by correlating the voice timbre with messages
conveyed by speech stream such as words and phonemes.
Formant frequencies are treated as acoustic correlates with vowels.
“Speech recognition” started first, then, “speaker recognition” followed.
Resonance frequencies of a uniform tube of length $l_1$ or $l_2$, and of a Helmholtz-type resonator:
$f_n = \frac{nc}{2 l_1}, \qquad f_n = \frac{nc}{2 l_2}, \qquad f = \frac{c}{2\pi}\sqrt{\frac{A_2}{A_1 l_1 l_2}}$
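As a rough numerical illustration of why a longer vocal tube means a lower baseline timbre, the sketch below evaluates the formulas above for two tube lengths. The speed of sound, the tube lengths, and the areas are example values of mine, and the uniform-tube idealization is only a sketch of the real vocal tract.

```python
from math import pi, sqrt

c = 340.0          # speed of sound (m/s)

def resonances(l, n_max=3):
    """Resonance frequencies f_n = n*c/(2*l) of a uniform tube of length l."""
    return [n * c / (2.0 * l) for n in range(1, n_max + 1)]

def helmholtz(A1, l1, A2, l2):
    """Helmholtz-type resonance f = (c / (2*pi)) * sqrt(A2 / (A1 * l1 * l2))."""
    return (c / (2.0 * pi)) * sqrt(A2 / (A1 * l1 * l2))

print(resonances(0.17))                    # longer tube: lower resonances
print(resonances(0.12))                    # shorter tube: higher resonances
print(helmholtz(3e-4, 0.10, 1e-4, 0.02))   # arbitrary example areas/lengths
```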
Function of the voice timbre
What is the original function of the voice timbre?
For apes
The voice timbre is an acoustic correlate with the identity of apes.
For speech scientists and engineers
They had started research by correlating the voice timbre with messages
conveyed by speech stream such as words and phonemes.
Formant frequencies are treated as acoustic correlates with vowels.
“Speech recognition” started first, then, “speaker recognition” followed.
But the voice timbre can be changed easily.
Speaker-independent acoustic model for word recognition
$P(o|w) = \sum_s P(o,s|w) = \sum_s P(o|w,s)P(s|w) \approx \sum_s P(o|w,s)P(s)$
Speaker-adaptive acoustic model for word recognition
HMMs are always modified and adapted to users.
These methods don’t remove speaker components in speech.
Invariant speech structure
Utterance to structure conversion using f-div. [Minematsu'06]
[Figure: the spectrogram (a sequence of spectrum slices) is converted into a sequence of cepstrum vectors, then into a sequence of distributions (c1, c2, ..., cD), and the Bhattacharyya distances between all pairs of distributions form a BD-based distance matrix.]
Structuralization by interrelating temporally-distant events
An event (distribution) has to be much smaller than a phoneme.
The symbol grounding problem
How did symbols come into being?
Why did symbols come to carry "meaning"?
Why did symbols come to be bound to "certain objects"?
Why did symbols come to be bound to "certain memories"?
Contributed article
Special feature: Delivering words, "The power of the voice"
What are voice and language? Reflections through speech research
Professor, Graduate School of Engineering, The University of Tokyo
Nobuaki Minematsu
What is voice, and what is language? Answering this fundamental question is Professor Nobuaki Minematsu, a leading figure in speech engineering. What has come into view, at the opposite pole, through his research on making machines recognize and synthesize speech? It is a mysterious human faculty: words and memory. While we manipulate words, our memories are in fact manipulated by words. The article approaches that riddle with the eyes of science.
Profile / Nobuaki Minematsu: graduated from the Faculty of Engineering, The University of Tokyo in 1990 and received a Ph.D. in Engineering from its Graduate School of Engineering in 1995. He worked at Toyohashi University of Technology from 1995 and returned to The University of Tokyo in 2000. He is currently a professor in the Department of Electrical Engineering, Graduate School of Engineering, The University of Tokyo, conducting research on spoken communication from a broad range of perspectives, from speech science to speech engineering. He has particular expertise in language education using speech technology and has led the development of OJAD since 2009.
Laboratory experiment for B3 students
“Development of a moving robot controlled wirelessly
by voice commands”
Some questions raised by students.
Questions from a specific group of students
“Why do we have to extract spectral envelopes or cepstrums?”
“Why don’t we use waveform as input features for ASR?”
“Why don’t we apply machine learning and/or statistical modeling
to waveforms?”
A common thing about these students
They did not take the classes "Fundamentals of signal analysis (信号解析基礎)" and "Signal processing (信号処理工学)".
They did not understand well which parts of the signal are irrelevant to the linguistic messages conveyed in a speech stream.
Phase and pitch are irrelevant to the phonemic information in speech.
What's missing?
Focus on the shape of the vocal tract
Two steps of information separation
[Figure: the speech waveform o is separated in two steps: (1) the phase characteristics are discarded (insensitivity to phase differences), leaving the amplitude characteristics; (2) the log spectrum is split into source characteristics and filter characteristics, suggesting a decomposition o = o_w + o_s. A cepstrum-based sketch of both steps follows below.]
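The two separation steps can be sketched with plain NumPy: taking the magnitude spectrum removes the phase, and keeping only the low-quefrency part of the cepstrum removes the source (pitch harmonics) and leaves the spectral envelope. The frame length, FFT size, and lifter cutoff below are illustrative choices, not the settings used in the course experiment.

```python
import numpy as np

def spectral_envelope(frame, n_fft=512, n_lifter=30):
    """Two-step separation sketched on the slide:
    (1) drop the phase by taking the log magnitude spectrum;
    (2) keep only the low-quefrency cepstral coefficients (the 'filter' part),
        discarding the high-quefrency 'source' part (pitch harmonics)."""
    windowed = frame * np.hamming(len(frame))
    log_spec = np.log(np.abs(np.fft.rfft(windowed, n_fft)) + 1e-10)  # phase gone
    cepstrum = np.fft.irfft(log_spec)                # quefrency domain
    lifter = np.zeros_like(cepstrum)
    lifter[:n_lifter] = 1.0
    lifter[-n_lifter + 1:] = 1.0                     # keep symmetric low quefrencies
    envelope = np.fft.rfft(cepstrum * lifter).real   # smoothed log-spectral envelope
    return envelope

# Example: a 400-sample frame of any 16 kHz signal (random stand-in here)
frame = np.random.default_rng(0).normal(size=400)
env = spectral_envelope(frame)
```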
Two acoustic models for speech/speaker recognition
Speaker-independent acoustic model for word recognition:
$P(o|w) = \sum_s P(o,s|w) = \sum_s P(o|w,s)P(s|w) \approx \sum_s P(o|w,s)P(s)$
Text-independent acoustic model for speaker recognition:
$P(o|s) = \sum_w P(o,w|s) = \sum_w P(o|w,s)P(w|s) \approx \sum_w P(o|w,s)P(w)$
Both require intensive collection or constant adaptation.
Is a decomposition $o = o_w + o_s$ possible?
Why are so many samples not needed?
Statistical independence or physical independence
Statistically speaker-independent acoustic model of $P(o|w)$, with o = spectrum envelope:
$P(o|w) = \sum_s P(o,s|w) = \sum_s P(o|w,s)P(s|w) \approx \sum_s P(o|w,s)P(s)$
Fully statistically-independent acoustic model for word recognition, with o = waveform, h = harmonics, p = phase:
$P(o|w) \approx \sum_{s,h,p} P(o|w,s,h,p)\,P(s)\,P(h)\,P(p)$
[Figure: many overlaid speech waveform plots ("A_a_512") illustrating the variability in phase and harmonics that such a model would have to cover.]
Fully physically-independent acoustic model for word recognition
$o_w$ = physically speaker-independent word feature; is $o = o_w + o_s$ possible?
[Figure: the log spectrum ("A_a_512_hamming_logspec_idft_env") decomposed into the spectral envelope $o_w$ and the remaining source component $o_s$.]
Function of the voice timbre
What is the original function of the voice timbre?
For apes
The voice timbre is an acoustic correlate with the identity of apes.
For speech scientists and engineers
They had started research by correlating the voice timbre with messages
conveyed by speech stream such as words and phonemes.
Formant frequencies are treated as acoustic correlates with vowels.
“Speech recognition” started first, then, “speaker recognition” followed.
But the voice timbre can be changed easily.
Speaker-independent acoustic model for word recognition
$P(o|w) = \sum_s P(o,s|w) = \sum_s P(o|w,s)P(s|w) \approx \sum_s P(o|w,s)P(s)$
Speaker-adaptive acoustic model for word recognition
HMMs are always modified and adapted to users.
These methods don’t remove speaker components in speech.
What is the goal of speech engineering?
A horse that can do arithmetic
What we can learn from Clever Hans
What is missing?
Two axes
Development: how do we come to acquire language as we grow up?
Evolution: how did we come to acquire language over the course of evolution?
Unless technology is developed while looking squarely at these two axes...
...won't the result be a system that merely appears to command language?