ANALYSIS OF NONVERBAL SPEECH SOUNDS
Thesis submitted in partial fulfillment
of the requirements for the degree of
DOCTOR OF PHILOSOPHY
in
ELECTRONICS AND COMMUNICATION ENGINEERING
by
VINAY KUMAR MITTAL
Roll Number: 201033001
[email protected]
SPEECH AND VISION LABORATORY
International Institute of Information Technology
Hyderabad - 500 032, INDIA
NOVEMBER 2014
© Vinay Kumar Mittal, 2014
All Rights Reserved
International Institute of Information Technology
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Analysis of Nonverbal Speech Sounds” by
VINAY KUMAR MITTAL, has been carried out under my supervision and is not submitted elsewhere
for a degree.
Date
Adviser: Prof. B. YEGNANARAYANA
To,
my parents
Shri Anil Kumar Mittal and Smt. Padamlata Mittal
Acknowledgments
“You see things twice, first in the mind and then in reality,” said Walt Disney. My first debt of gratitude is to my mother, Mrs. Padamlata Mittal, who saw me completing a Ph.D. in her mind years ago, even before I started believing in this dream myself. My deepest gratitude goes next to Shri R. N. Saxena, father of my friend Sanjay and one of my life coaches, who convinced me of the need for a Ph.D. degree and of the feasibility of completing it even in late middle age. (Yes, I am 47 years young at this juncture.) Of course, the dream could not have been actualised without my wife, Mrs. Ambika, and our sons Yash and Shubham, who not only stood by me in all my career decisions and supported me in all my pursuits, but also endured the relative financial hardship caused by this change of profession. I feel truly blessed to have such wonderful people around me, all the time.
This goal probably would not have been achieved if Prof. B. Yegnanarayana, with whom I have been associated since my M.Tech. days at IIT Madras (1996-98), had not advised me to join as a full-time researcher instead of part-time. That meant leaving the corporate world. It was his wise counsel, and his magnanimity in accepting me as his student, that enabled me to leave behind the lure and comfort of a corporation like Microsoft IDC and take the plunge. In the pursuit of this goal, the kind-heartedness of Prof. Rajeev Sangal, then director of IIIT Hyderabad, will remain etched in my mind for a long time. He offered me both a teaching position and Ph.D. admission in the very first meeting I had with him, which made my decision easier. At IIIT Hyderabad, I got the opportunity to interact closely with some of its leading faculty members, from whom I learnt many things, not only inside the classroom but also at a personal level. I take this opportunity to express my gratitude to Prof. P. J. Narayanan, Prof. Peri Bhaskararao, Prof. Jayanthi Sivaswamy, Prof. C. V. Jawahar and Dr. Kishore S. Prahalad, who have been great teachers to me here, while also treating me as a colleague. Hats off to their balancing act and depth as human beings!
The Speech and Vision Lab at IIIT Hyderabad carried its legacy of ‘technical depth, discipline and sharing’ from IIT Madras, mainly because of Prof. Yegnanarayana. But credit also goes to its sincere and dedicated researchers, then students at the lab, like Guru (Dr. Guruprasad Sheshadri), Dhanu (Dr. N. Dhananjaya), Anand (Mr. Anand M. Joseph) and a few others. When I joined the lab at IIIT Hyderabad in 2010, these three became my friends and hand-held me through my initial days of struggle in the lab, as I understood concepts, learnt tools and brushed up things forgotten. I offer them sincere thanks and wish them good luck in their current pursuits. Gangamohan, a sincere research student in the lab with whom I worked on a sponsored project, has been of great help in several ways, and we published a couple of joint papers on ‘emotions’, my initial topic of research. I wish him success ahead. At this juncture, I thank all the lab members who have directly or indirectly contributed to whatever little I could accomplish over the last four-plus years. I have kept these words and this place reserved for Dr. Suryakanth V. Gangashetty, who as colleague, friend and philosopher has been constantly by my side, even while I faced some ups and downs.
A roller-coaster journey like this is a discovery within. But any journey also involves many supporting players, some of whom remain behind the curtain most of the time. My deepest thanks to all those who have been part of the near completion of this phase of my journey as a researcher. Lastly, I seek pardon from those whose names I may have missed; their role is nevertheless acknowledged and much appreciated. Best wishes.
Vinay Kumar Mittal
Abstract
Nonverbal speech sounds such as emotional speech, paralinguistic sounds and expressive voices produced by human beings differ from normal speech. Whereas normal speech conveys a linguistic message and has a clear articulatory description, these nonverbal sounds carry mostly nonlinguistic information and lack any clear description of articulation. These sounds are also mostly unusual, irregular, spontaneous and nonsustainable. Examples of emotional speech are shouted, happy, angry and sad speech, and examples of paralinguistic sounds are laughter, cry and cough. Expressive voices such as the Noh voice or opera singing are trained voices used to convey intense emotions. Emotional speech, paralinguistic sounds and expressive voices differ in the degree of pitch changes. Another categorisation, based upon voluntary control versus involuntary changes in the speech production mechanism, is also possible.
Production of nonverbal sounds occurs in short bursts of time and involves significant changes in the glottal source of excitation. Hence, the production characteristics of these sounds differ from those of normal speech, mostly in the vibration characteristics of the vocal folds. Associated changes in the characteristics of the vocal tract system are also possible. In some cases of normal speech, such as trills, or of emotional speech, such as shouts, the glottal vibration characteristics are also affected by the acoustic loading of the vocal tract system and by source-system coupling. Hence, the characteristics of these nonverbal sounds need to be studied from the speech production and perception points of view, to understand better their differences from normal speech.
The representation of the excitation source component of the speech signal as a sequence of excitation impulses has been of considerable interest in speech research over the past three decades. The presence of secondary impulses within a pitch period was also observed in some studies. This impulse-sequence representation was mainly aimed at achieving low bit rates in speech coding and higher voice quality in synthesized speech. But its advantages and role in the analysis of nonverbal speech sounds have not been explored much. The differences in the locations of these excitation impulse-like pulses in the sequence and in their relative amplitudes possibly cause the differences among various categories of acoustic sounds. In the case of nonverbal speech sounds, these impulse-like pulses also occur at rapidly changing or nearly random intervals, along with rapid or sudden changes in their amplitudes. Aperiodicity in the excitation component may be considered an important feature of expressive voices like the ‘Noh voice’. Characterizing changes in pitch perception, which can be rapid in expressive voices, and extracting F0 especially in the regions of aperiodicity, are major challenges that need to be investigated in detail.
In this research work, the production characteristics of nonverbal speech sounds are examined from both the electroglottograph (EGG) and acoustic signals. These sounds are examined in four categories, which differ in the periodicity (or aperiodicity) of the glottal excitation and the rapidity of changes in pitch perception. These categories are: (a) normal speech in modal voicing, which includes the study of trill, lateral, fricative and nasal sounds, (b) emotional speech, which includes four loudness-level variations, namely soft, normal, loud and shouted speech, (c) paralinguistic sounds like laughter in speech, and (d) expressive voices like the Noh singing voice. The effects of source-system coupling and of acoustic loading of the vocal tract system on the glottal excitation are also examined.
Signal processing methods such as zero-frequency filtering, zero-time liftering, the Hilbert transform and the group delay function are used for feature extraction. Existing methods such as linear prediction (LP) coefficients, Mel-frequency cepstral coefficients and the short-time Fourier spectrum are also used. New signal processing methods are proposed in this work, namely modified zero-frequency filtering (modZFF), computation of dominant frequencies (FD) using the LP spectrum or the group delay function, and computation of saliency (a measure of pitch perception). A time-domain impulse-sequence representation of the excitation source is proposed, which also takes into account the pitch perception and aperiodicity in expressive voices. Using this representation, a method is proposed for extracting F0 even in the regions of subharmonics and aperiodicity, which is otherwise a challenging task. Validation of results is carried out using spectrograms, the saliency measure, perceptual studies and synthesis.
The efficacy of the signal processing methods proposed in this work, and of the features and parameters derived, is also demonstrated through some applications. Three prototype systems are developed for the automatic detection of trills, shouts and laughter in continuous speech. These systems use features extracted to capture the unique characteristics of each of the respective sound categories examined. Parameters derived from the features help in distinguishing these sounds from normal speech. Specific databases with ground truth are collected in each case for developing and testing these systems. Performance evaluation is carried out using some of the proposed measures, such as saliency, perceptual studies and synthesis. The encouraging results indicate the feasibility of developing these prototypes into complete systems for diverse practical purposes and a range of real-life applications.
Contents

1 Issues in Analysis of Nonverbal Speech Sounds
   1.1 Verbal and nonverbal sounds
   1.2 Signal processing and other issues in the analysis of nonverbal sounds
   1.3 Objective of the studies in this thesis
   1.4 Analysis tools, methodology and scope
   1.5 Organization of thesis

2 Review of Methods for Analysis of Nonverbal Speech Sounds
   2.1 Overview
   2.2 Speech production
   2.3 Analytic signal and parametric representation of speech signal
   2.4 Review of studies on source-system interaction and few special sounds
      2.4.1 Studies on special sounds such as trills
      2.4.2 Studies on source-system interaction and acoustic loading
   2.5 Review of studies on analysis of emotional speech and shouted speech
      2.5.1 Studies on emotional speech
      2.5.2 Studies on shouted speech
   2.6 Review of studies on analysis of paralinguistic sounds and laughter
      2.6.1 Need for studying paralinguistic sounds like laughter
      2.6.2 Different types and classifications of laughter
      2.6.3 Studies on acoustic analysis of laughter and research issues
   2.7 Review of studies on analysis of expressive voices and Noh voice
      2.7.1 Need for studying expressive voices
      2.7.2 Studies on representation of source characteristics and pitch-perception
      2.7.3 Studies on aperiodicity in expressive voices such as Noh singing and F0 extraction
   2.8 Review of studies for spotting the acoustic events in continuous speech
      2.8.1 Studies towards trill detection
      2.8.2 Studies on shout detection
      2.8.3 Studies on laughter detection
   2.9 Summary

3 Signal Processing Methods for Feature Extraction
   3.1 Overview
   3.2 Impulse-sequence representation of excitation in speech coding
   3.3 All-pole model of excitation in LPC vocoders
   3.4 Methods to estimate the excitation impulse sequence representation
      3.4.1 MPE-LPC model of the excitation
      3.4.2 Methods for estimating the amplitudes of pulses
      3.4.3 Methods for estimating the positions of pulses
      3.4.4 Variations in MPE model of the excitation
   3.5 Zero-frequency filtering method
   3.6 Zero-time liftering method
   3.7 Methods to compute dominant frequencies
      3.7.1 Computing dominant frequency from LP spectrum
      3.7.2 Computing dominant frequency using group delay method and LP analysis
   3.8 Challenges in the existing methods and need for new approaches
   3.9 Summary

4 Analysis of Source-System Interaction in Normal Speech
   4.1 Overview
   4.2 Role of source-system coupling in the production of trills
      4.2.1 Production of apical trills
      4.2.2 Impulse-sequence representation of excitation source
      4.2.3 Analysis by synthesis of trill and approximant sounds
      4.2.4 Perceptual evaluation of the relative role
      4.2.5 Discussion on the results
   4.3 Effects of acoustic loading on glottal vibration
      4.3.1 What is acoustic loading?
      4.3.2 Speech data for analysis
      4.3.3 Features for the analysis
      4.3.4 Observations from EGG signal
      4.3.5 Discussion on acoustic loading through EGG and speech signals
      4.3.6 Quantitative assessment of the effects of acoustic loading
      4.3.7 Discussion on the results
   4.4 Summary

5 Analysis of Shouted Speech
   5.1 Overview
   5.2 Different loudness levels in emotional speech
   5.3 Features for analysis of shouted speech
   5.4 Data for analysis
   5.5 Production characteristics of shout from EGG signal
   5.6 Analysis of shout from speech signal
      5.6.1 Analysis from spectral characteristics
      5.6.2 Analysis from excitation source characteristics
      5.6.3 Analysis using dominant frequency feature
   5.7 Discussion on the results
   5.8 Summary

6 Analysis of Laughter Sounds
   6.1 Overview
   6.2 Data for analysis
   6.3 Analysis of laughter from EGG signal
      6.3.1 Analysis using the closed phase quotient (α)
      6.3.2 Analysis using F0 derived from the EGG signal
      6.3.3 Inter-call changes in α and F0
   6.4 Modified zero-frequency filtering method for the analysis of laughter
   6.5 Analysis of source and system characteristics from acoustic signal
      6.5.1 Analysis using F0 derived by modZFF method
      6.5.2 Analysis using the density of excitation impulses (dI)
      6.5.3 Analysis using the strength of excitation (SoE)
      6.5.4 Analysis of vocal tract system characteristics of laughter
      6.5.5 Analysis of other production characteristics of laughter
   6.6 Discussion on the results
   6.7 Summary

7 Analysis of Noh Voices
   7.1 Overview
   7.2 Issues in representing excitation source in expressive voices
   7.3 Approach adopted in this study
   7.4 Method to compute saliency of expressive voices
   7.5 Modified zero-frequency filtering method for analysis of Noh voices
      7.5.1 Need for modifying the ZFF method
      7.5.2 Key steps in the modZFF method
      7.5.3 Impulse sequence representation of source using modZFF method
   7.6 Analysis of aperiodicity in Noh voice
      7.6.1 Aperiodicity in source characteristics
      7.6.2 Presence of subharmonics and aperiodicity
      7.6.3 Decomposition of signal into source and system characteristics
      7.6.4 Analysis of aperiodicity using saliency
   7.7 Significance of aperiodicity in expressive voices
      7.7.1 Synthesis using impulse sequence excitation
      7.7.2 F0 extraction in regions of aperiodicity
   7.8 Summary

8 Automatic Detection of Acoustic Events in Continuous Speech
   8.1 Overview
   8.2 Automatic Trills Detection System
   8.3 Shout Detection System
      8.3.1 Production features for shout detection
      8.3.2 Parameters for shout decision logic
      8.3.3 Decision logic for automatic shout detection system
      8.3.4 Performance evaluation
   8.4 Laughter Detection System
   8.5 Summary

9 Summary and Conclusions
   9.1 Summary of the work
   9.2 Major contributions of this work
   9.3 Research issues raised and directions for future work

Bibliography
List of Figures

2.1 Schematic diagram of human speech production mechanism (figure taken from [153])
2.2 Cross-sectional expanded view of vocal folds (figure taken from [157])
2.3 Illustration of waveforms of (a) speech signal, (b) EGG signal, (c) LP residual and (d) glottal pulse information derived using LP residual
2.4 Schematic view of vibration of vocal folds for different cases: (a) open, (b) open at back side, (c) open at front side and (d) closed state (figure taken from [157])
2.5 Schematic views of glottal configurations for various phonation types: (a) glottal stop, (b) creak, (c) creaky voice, (d) modal voice, (e) breathy voice, (f) whisper, (g) voicelessness. Parts marked in (g): 1. glottis, 2. arytenoid cartilage, 3. vocal fold and 4. epiglottis. (Figure is taken from [60])
3.1 Schematic block diagram of the ZFF method
3.2 Results of the ZFF method for different window lengths for trend removal for a segment of Noh singing voice. Epoch locations are indicated by inverted arrows.
3.3 Schematic block diagram of the ZTL method
3.4 HNGD plots through ZTL analysis. (a) 3D HNGD spectrum (perspective view). (b) 3D HNGD spectrum plotted at epoch locations (mesh form). The speech segment is for the word ‘stop’.
3.5 Illustration of LP spectrum for a frame of speech signal.
4.1 Illustration of stricture for (a) an apical trill, (b) theoretical approximant and (c) an approximant in reality. The relative closure/opening positions of the tongue tip (lower articulator) with respect to upper articulator are shown.
4.2 (Color online) Illustration of waveforms of (a) input speech and (b) ZFF output signals, and contours of features (c) F0, (d) SoE, (e) FD1 (“•”) and FD2 (“◦”) derived from the acoustic signal for the vowel context [a].
4.3 (a) Signal waveform, (b) F0 contour and (c) SoE contour of excitation sequence, and (d) synthesized speech waveform (x13), for a sustained apical trill-approximant pair (trill: 0-2 sec, approximant: 3.5-5.5 sec). Source information is changed (system only retained) in synthesized speech.
4.4 (a) Signal waveform, (b) F0 contour and (c) SoE contour of excitation sequence, and (d) synthesized speech waveform (x11), for the sustained apical trill-approximant pair (trill: 0-2 sec, approximant: 3.5-5.5 sec). Both system and source information of the original speech are retained in synthesized speech.
4.5 Illustration of strictures for voiced sounds: (a) stop, (b) trill, (c) fricative and (d) approximant. Relative difference in the stricture size between upper articulator (teeth or alveolar/palatal/velar regions of palate) and lower articulator (different areas of tongue) is shown schematically, for each case. Arrows indicate the direction of air flow passing through the vocal tract.
4.6 Illustration of open/closed phase durations, using (a) EGG signal and (b) differenced EGG signal for the vowel [a].
4.7 Illustration of waveforms of (a) input speech signal, (b) EGG signal and (c) dEGG signal, and the α contour for geminated occurrence of apical trill ([r]) in vowel context [a]. The sound is for [arra], produced in female voice.
4.8 Illustration of waveforms of (a) input speech signal, (b) EGG signal and (c) dEGG signal, and the α contour for geminated occurrence of apical lateral approximant ([l]) in vowel context [a]. The sound is for [alla], produced in female voice.
4.9 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal, (c) dEGG signal and (d) ZFF output, and features (e) F0, (f) SoE, (g) FD1 (“•”) and FD2 (“◦”) for geminated occurrence of apical trill ([r]) in vowel context [a]. The sound is for [arra], produced in male voice.
4.10 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal, (c) dEGG signal and (d) ZFF output, and features (e) F0, (f) SoE, (g) FD1 (“•”) and FD2 (“◦”) for geminated occurrence of alveolar fricative ([z]) in vowel context [a]. The sound is for [azza], produced in male voice.
4.11 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal, (c) dEGG signal and (d) ZFF output, and features (e) F0, (f) SoE, (g) FD1 (“•”) and FD2 (“◦”) for geminated occurrence of velar fricative ([È]) in vowel context [a]. The sound is for [aÈÈa], produced in male voice.
4.12 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal, (c) dEGG signal and (d) ZFF output, and features (e) F0, (f) SoE, (g) FD1 (“•”) and FD2 (“◦”) for geminated occurrence of apical lateral approximant ([l]) in vowel context [a]. The sound is for [alla], produced in male voice.
4.13 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal, (c) dEGG signal and (d) ZFF output, and features (e) F0, (f) SoE, (g) FD1 (“•”) and FD2 (“◦”) for geminated occurrence of alveolar nasal ([n]) in vowel context [a]. The sound is for [anna], produced in male voice.
4.14 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal, (c) dEGG signal and (d) ZFF output, and features (e) F0, (f) SoE, (g) FD1 (“•”) and FD2 (“◦”) for geminated occurrence of velar nasal ([N]) in vowel context [a]. The sound is for [aNNa], produced in male voice.
5.1 (a) Signal waveform (xin[n]), (b) EGG signal (ex[n]), (c) LP residual (rx[n]) and (d) glottal pulse obtained from LP residual (grx[n]) for a segment of normal speech. Note that all plots are normalized to the range of -1 to +1.
5.2 (a) Signal waveform (xin[n]), (b) EGG signal (ex[n]), (c) LP residual (rx[n]) and (d) glottal pulse obtained from LP residual (grx[n]) for a segment of shouted speech. Note that all plots are normalized to the range of -1 to +1.
5.3 HNGD spectra along with signal waveforms for a segment of (a) soft, (b) normal, (c) loud and (d) shouted speech. The segment is for the word ‘help’ in the utterance of the text ‘Please help!’. Arrows point to the low frequency regions.
5.4 Energy of HNGD spectrum in low frequency (0-400 Hz) region (LFSE) for a segment of (a) soft, (b) normal, (c) loud and (d) shouted speech. The segment is for the word ‘help’ in utterances of text ‘Please help!’. The vowel regions (V) are marked in these figures.
5.5 Energy of HNGD spectrum in high frequency (800-5000 Hz) region (HFSE) for a segment of (a) soft, (b) normal, (c) loud and (d) shouted speech. The segment is for the word ‘help’ in utterances of text ‘Please help!’. The vowel regions (V) are marked in these figures.
5.6 Distribution of high frequency spectral energy (HFSE) vs low frequency spectral energy (LFSE) of HNGD spectral energy computed in 4 different vowel contexts for the 4 loudness levels. The 4 vowel region contexts are: (a) vowel /e/ in word ‘help’, (b) vowel /6/ in word ‘stop’, (c) vowel /u/ in word ‘you’ and (d) vowel /o:/ in word ‘go’. The segments are taken from the utterances by same speaker (S4). (Color online)
5.7 Relative spread of low frequency spectral energy (LFSE) of ‘HNGD spectra’ computed over a vowel region segment (LFSE(vowel)), for the 4 loudness levels, i.e., soft, normal, loud and shout. The segment is for the vowel /e/ in the word ‘help’ in the utterance of the text ‘Please help!’.
5.8 Relative spread of low frequency spectral energy (LFSE) of ‘Short-time spectra’ computed over a vowel region segment (LFSE(vowel)), for the 4 loudness levels, i.e., soft, normal, loud and shout. The segment is for the vowel /e/ in the word ‘help’ in the utterance of the text ‘Please help!’.
5.9 Illustration of (a) signal waveform, (b) F0 contour, (c) SoE contour and (d) FD contour for a segment “you” of normal speech in male voice.
5.10 Illustration of (a) signal waveform, (b) F0 contour, (c) SoE contour and (d) FD contour for a segment “you” of shouted speech in male voice.
6.1 Illustration of (a) speech signal, and (b) α, (c) F0 and (d) SoE contours for three calls in a nonspeech-laugh bout after utterance of text “it is a good joke”, by a female speaker.
6.2 Illustration of (a) speech signal, and (b) α, (c) F0 and (d) SoE contours for a laughed-speech segment of the utterance of text “it is a good joke”, by a female speaker.
6.3 Illustration of (a) signal waveform (xin[n]), and (b) EGG signal ex[n], (c) dEGG signal dex[n] and (d) αdex contours along with V/NV regions (dashed lines). The segment consists of calls in a nonspeech-laugh bout (marked 1-4 in (d)) and a laughed-speech bout (marked 5-8 in (d)) for the text “it is really funny”, produced by a male speaker.
6.4 (Color online) Illustration of inter-calls changes in the average values of ratio α and F0, for 4 calls each in a nonspeech-laugh bout (solid line) and a laughed-speech bout (dashed line), produced by a male speaker and by a female speaker: (a) αave for NSL/LS male calls, (b) αave for NSL/LS female calls, (c) F0ave for NSL/LS male calls, and (d) F0ave for NSL/LS female calls.
6.5 Illustration of (a) acoustic signal waveform (xin[n]), (b) the output (y2[n]) of cascaded pairs of ZFRs, (c) modified ZFF (modZFF) output (zx[n]), and (d) voiced/nonvoiced (V/NV) regions (voiced marked as ‘V’), for calls in a nonspeech-laugh bout of a female speaker.
6.6 Illustration of (a) signal (xin[n]), and (b) modZFF output (zx[n]), (c) Hilbert envelope of modZFF output (hz[n]), and (d) EGG signal (ex[n]) for a nonspeech-laugh call, by a female speaker.
6.7 Illustration of (a) signal (xin[n]), and contours of (b) F0, (c) SoE (ψ), and (d) FD1 (“•”) and FD2 (“◦”) with V/NV regions (dashed lines), for calls in a nonspeech-laugh bout of a male speaker.
6.8 Illustration of few glottal cycles of (a) acoustic signal (xin[n]), (b) EGG signal ex[n] and (c) modified ZFF output signal zx[n], for a nonspeech-laugh call produced by a male speaker.
6.9 Illustration of (a) acoustic signal (xin[n]), and spectrograms of (b) signal, (c) epoch sequence (using the modified ZFF method) and (d) sequence of impulses at all (negative to positive going) zero-crossings of zx[n] signal, for few nonspeech-laugh calls produced by a male speaker.
6.10 Illustration of changes in the temporal measure for dI, i.e., φ, for NSL and LS calls. (a) Acoustic signal (xin[n]). (b) φ for NSL and LS calls, i.e., regions 1-4 and 5-8, respectively. The signal segment is for the text “it is really funny” produced by a male speaker.
6.11 (Color online) Illustration of distribution of FD2 vs FD1 for nonspeech-laugh (“•”) and laughed-speech (“◦”) bouts of a male speaker. The points are taken at GCIs in respective calls.
6.12 Illustration of (a) input acoustic signal (xin[n]) and few (b) peaks of Hilbert envelope of LP residual (hp) for 3 cases: (i) normal speech, (ii) laughed-speech and (iii) nonspeech-laugh.
7.1 Results of the ZFF method for different window lengths for trend removal for a segment of voiced speech. Epoch locations are indicated by inverted arrows.
7.2 (a) Saliency plot of the AM pulse train and (b) the synthetic AM sequence.
7.3 Saliency plots ((a),(c),(e),(g)) of the synthetic AM pulse train and the epoch sequences ((b),(d),(f),(h)) derived by using different window lengths for trend removal: 7 ms ((a),(b)), 3 ms ((c),(d)) and 1 ms ((e),(f)). In (g) and (h) are the saliency plot for AM sequence and the cleaned SoE sequence for 1 ms window length, respectively.
7.4 (a) Saliency plot of the FM pulse train and (b) the synthetic FM sequence.
7.5 Saliency plots ((a),(c),(e),(g)) of the synthetic FM pulse train and the epoch sequences ((b),(d),(f),(h)) derived by using different window lengths for trend removal: 7 ms ((a),(b)), 3 ms ((c),(d)) and 1 ms ((e),(f)). In (g) and (h) are the saliency plot for FM sequence and the cleaned SoE sequence for 1 ms window length, respectively.
7.6 Illustration of waveforms of (a) input speech signal, (b) LP residual, (c) Hilbert envelope of LP residual and (d) modZFF output, and (e) the SoE impulse sequence derived using the modZFF method. The speech signal is a segment of Noh singing voice used in Fig. 3 in [55].
7.7 Illustration of waveforms of (a) input speech signal, (b) preprocessed signal and (c) modZFF output, and (d) the SoE impulse sequence derived using the modZFF method. The speech signal is a segment of Noh singing voice used in Fig. 3 in [55].
7.8 Illustration of waveforms of speech signal (in (a)), modZFF outputs (in (b),(d),(f),(h)) and SoE impulse sequences (in (c),(e),(g),(i)), for the choice of last window lengths as 2.5 ms, 2.0 ms, 1.5 ms and 1.0 ms. The speech signal is a segment of Noh voice used in Fig. 3 in [55].
7.9 Selection of last window length: Difference (∆Nimps) (%) in the number of impulses obtained with/without preprocessing vs choice of last window length (wlast) (ms), for 3 different segments of Noh singing voice [55]. [Solid line: segment 1, dashed line: segment 2, dotted line: segment 3]
7.10 (a) Saliency plot and (b) the SoE impulse sequence derived using modZFF method (last window length = 1 ms), for the input (synthetic) AM sequence.
7.11 (a) Saliency plot and (b) the SoE impulse sequence derived using modZFF method (last window length = 1 ms), for the input (synthetic) FM sequence.
7.12 (a) Signal waveform and spectrograms of (b) signal, its (c) LP residual and (d) SoE impulse sequence obtained using the modZFF method, for a Noh voice segment corresponding to Fig. 1 in [55].
7.13 (a) Signal waveform and spectrograms of (b) signal, its (c) LP residual and (d) SoE impulse sequence obtained using the modZFF method, for a Noh voice segment corresponding to Fig. 2 in [55].
7.14 (a) Signal waveform and spectrograms of (b) signal, its (c) LP residual and (d) SoE impulse sequence obtained using the modZFF method, for a Noh voice segment corresponding to Fig. 3 in [55].
7.15 Expanded (a) signal waveform, and spectrograms of (b) signal and its (e) SoE impulse sequence obtained using the modZFF method, for Noh voice segment corresponding to Fig. 3 in [55].
7.16 (a) Signal waveform, and spectrograms of (b) signal, and its decomposition into (c) source characteristics and (d) system characteristics, for a Noh voice segment corresponding to Fig. 3 in [55].
7.17 Saliency plots computed with LP residual (in (a),(d),(g)), using XSX method (copied from [55]) (in (b),(e),(h)), and computed with SoE impulse sequence derived using the modZFF method (in (c),(f),(i)). The signal segments S1, S2 and S3 correspond, respectively, to the vowel regions [o:] (Fig. 6 in [55]), [i] (Fig. 7 in [55]), and [o] (with pitch rise) (Fig. 8 in [55]) in Noh singing voice [55].
7.18 (a) FM sequence excitation, (b) synthesized output and (c) derived SoE impulse sequence.
7.19 (a) AM sequence excitation, (b) synthesized output and (c) derived SoE impulse sequence.
7.20 Illustration of (a) speech signal waveform, (b) SoE impulse sequence derived using the modZFF method and (c) F0 contour extracted using the saliency information. The voice signal corresponds to the vowel region [o] (with pitch rise) in Noh singing voice (Ref. Fig. 8 in [55]).
8.1 Schematic block diagram of prototype shout detection system
8.2 Schematic diagram for decision criteria (d5, d6, d7) using the direction of change in gradients of (a) F0, (b) SoE and (c) FD contours, for the decision of (d) shout candidate segments 1 & 2 (d5), 3 & 4 (d6), and 5 & 6 (d7).
List of Tables

4.1 Criterion for similarity score for perceptual evaluation of two trill sounds (synthesized and original speech)
4.2 Experiment 1: Results of perceptual evaluation. Average similarity scores between synthesized speech files (x11, x12, x13 and x14) and original speech file (x10) are displayed.
4.3 Experiment 2: Results of perceptual evaluation. Average similarity scores between each place of articulation in synthesized speech files (x21, x22, x23 and x24), and corresponding sound in original speech file (x20) are displayed.
4.4 Comparison between sound types based on stricture differences for geminated occurrences. Abbreviations: alfric: alveolar fricative [z], vefric: velar fricative [È], approx/appx: approximant [l], frics: fricatives ([z], [È]), alnasal: alveolar nasal [n], venas: velar nasal [N], stric: stricture, H/L indicates relative degree of low stricture.
4.5 Changes in glottal source features F0 and SoE (ψ) for 6 categories of sounds (in male voice). Column (a) is F0 (Hz) for vowel [a], (b) and (c) are F0min and F0max for the specific sound, and (d) is ∆F0/F0[a] (%). Column (e) is SoE (i.e., ψ) for vowel [a], (f) and (g) are ψmin and ψmax for the specific sound, and (h) is ∆ψ/ψ[a] (%). Sl.# are the 6 categories of sounds. Suffixes a, b and c in the first column indicate single, geminated or prolonged occurrences, respectively. Note: ‘alfric’/‘vefric’ denotes alveolar/velar fricative and ‘alnasal’/‘venasal’ denotes alveolar/velar nasal.
4.6 Changes in vocal tract system features FD1 and FD2 for 6 categories of sounds (in male voice). Column (a) is FD1 (Hz) for vowel [a], (b) and (c) are FD1min and FD1max for the specific sound, and (d) is ∆FD1/FD1[a] (%). Column (e) is FD2 (Hz) for vowel [a], (f) and (g) are FD2min and FD2max for the specific sound, and (h) is ∆FD2/FD2[a] (%). Sl.# are the 6 categories of sounds. Suffixes a, b and c in the first column indicate single, geminated or prolonged occurrences, respectively. Note: ‘alfric’/‘vefric’ denotes alveolar/velar fricative and ‘alnasal’/‘venasal’ denotes alveolar/velar nasal.
4.7 Changes in features due to effect of acoustic loading of the vocal tract system on the glottal vibration, for 6 categories of sounds (in male voice). Columns (a)-(d) show percentage changes in F0, SoE, FD1 and FD2, respectively. The direction of change in a feature in comparison to that for vowel [a] is marked with +/- sign. Note: ‘alfric’/‘vefric’ denotes alveolar/velar fricative and ‘alnasal’/‘venasal’ denotes alveolar/velar nasal.
4.8 Changes in features due to effect of acoustic loading of the vocal tract system on the glottal vibration, for 6 categories of sounds (in female voice). Columns (a)-(d) show percentage changes in F0, SoE, FD1 and FD2, respectively. The direction of change in a feature in comparison to that for vowel [a] is marked with +/- sign. Note: ‘alfric’/‘vefric’ denotes alveolar/velar fricative and ‘alnasal’/‘venasal’ denotes alveolar/velar nasal.
5.1 The percentage change (∆α) in the average values of α for soft, loud and shout with respect to that of normal speech is given in columns (a), (b) and (c), respectively. Note: Si below means speaker number i (i = 1 to 17).
5.2 The ratio (β) of the average levels of LFSE and HFSE computed over a vowel segment of a speech utterance in (a) soft, (b) normal, (c) loud and (d) shout modes. Note: The numbers given in columns (a) to (d) are β values multiplied by 100 for ease of comparison.
5.3 Average values of standard deviation (σ), capturing temporal fluctuations in LFSE, computed over a vowel segment of a speech utterance in (a) soft, (b) normal, (c) loud and (d) shout modes. Note: The numbers given in columns (a) to (d) are σ values multiplied by 1000 for ease of comparison.
5.4 The percentage change (∆F0) in the average F0 for soft, loud and shout with respect to that of normal speech is given in columns (a), (b) and (c), respectively. Note: Si below means speaker number i (i = 1 to 17).
5.5 The average values of the ratio (α) of the closed phase to the glottal period for (a) normal, (b) raised pitch (non-shout) and (c) shouted speech, respectively. Columns (d), (e) and (f) are the corresponding average fundamental frequency (F0) values in Hz. The values are averaged over 3 utterances (for 3 texts) for each speaker. Note: Si below means speaker number i (i = 1 to 5). F0 values are rounded to nearest integer value.
5.6 Results to show changes in the average F0 and SoE values for normal and shouted speech, for 5 different vowel contexts. Notations: Nm indicates Normal, Sh indicates Shout, S# indicates Speaker number, T# indicates Text number and M/F indicates Male/Female. Note: IPA symbols for the vowels in English phonetics are shown for the vowels used in this study.
5.7 Results to show changes in the Dominant frequency (FD) values for normal and shouted speech, for 5 different vowel contexts. Notations: Nm indicates Normal, Sh indicates Shout, S# indicates Speaker number, T# indicates Text number and M/F indicates Male/Female. Note: IPA symbols for the vowels in English phonetics are shown for the vowels used in this study.
6.1 Changes in α and F0EGG for laughed-speech and nonspeech-laugh, with reference to normal speech. Columns (a)-(c) are average α, (d)-(f) are σα, (g)-(i) are average βα and (l)-(n) are average F0EGG (Hz) for NS, LS and NSL. Columns (j), (k) are ∆βα (%) and (o), (p) are ∆F0EGG (%) for LS and NSL cases. Note: Si indicates speaker number (i = 1 to 11) and M/F male/female.
6.2 Changes in F0ZFF and temporal measure for F0 (i.e., θ) for laughed-speech and nonspeech-laugh, with reference to normal speech. Columns (a)-(c) are average F0ZFF (Hz), (d)-(f) are σF0 (Hz), (g)-(i) are average γ1 and (l)-(n) are average θ values for NS, LS and NSL. Columns (j), (k) are ∆γ1 (%) and (o), (p) are ∆θ (%) for LS and NSL cases. Note: Si indicates speaker number (i = 1 to 11) and M/F male/female.
6.3 Changes in dI and temporal measure for dI (i.e., φ) for laughed-speech and nonspeech-laugh, with reference to normal speech. Columns (a)-(c) are average dI (Imps/sec), (d)-(f) are σdI (Imps/sec), (g)-(i) are average γ2 and (l)-(n) are average φ values for NS, LS and NSL. Columns (j), (k) are ∆γ2 (%) and (o), (p) are ∆φ (%) for LS and NSL cases. Note: Si indicates speaker number (i = 1 to 11) and M/F male/female.
6.4 Changes in SoE (i.e., ψ) and temporal measure for SoE (i.e., ρ) for laughed-speech and nonspeech-laugh, with reference to normal speech. Columns (a)-(c) are average ψ, (d)-(f) are σψ, (g)-(i) are average γ3 and (l)-(n) are average ρ values for NS, LS and NSL. Columns (j), (k) are ∆γ3 (%) and (o), (p) are ∆ρ (%) for LS and NSL cases. Note: Si indicates speaker number (i = 1 to 11) and M/F male/female.
6.5 Changes in FD1ave and σFD1 for laughed-speech (LS) and non-speech laugh (NSL) in comparison to those for normal speech (NS). Columns (a),(b),(c) are FD1ave (Hz) and (d),(e),(f) are σFD1 (Hz) for the three cases NS, LS and NSL. Columns (g),(h),(i) are the average ν1 values computed for NS, LS and NSL, respectively. Columns (j) and (k) are ∆ν1 (%) for LS and NSL, respectively. Note: Si below means speaker number i (i = 1 to 11), and M/F indicates male/female.
6.6 Changes in average η and ση for laughed speech (LS) and nonspeech laugh (NSL) with reference to normal speech (NS). Columns (a)-(c) are average η, (d)-(f) are ση and (g)-(i) are average ξ values for NS, LS and NSL. Columns (j) and (k) are ∆ξ (%) for LS and NSL, respectively. Note: Si indicates speaker number (i = 1 to 11) and M/F male/female.
7.1 Effect of preprocessing on number of impulses: (a) Last window length (wlast) (ms), #impulses obtained (b) without preprocessing (Norig), (c) with preprocessing (Nwpp), and (d) difference (∆Nimps = (Norig − Nwpp)/Norig %). The 3 Noh voice segments correspond to Figures 6, 7 and 8 in [55].
8.1 Results of performance evaluation of shout detection: number of speech regions (a) as per ground truth (GT), (b) detected correctly (TD), (c) (shout) missed detection (MD) and (d) wrongly detected as shouts (WD), and rates of (e) true detection (TDR), (f) missed detection (MDR) and (g) false alarm (FAR). Note: CS is concatenated, NCS is natural continuous and MixS is mixed speech.
Abbreviations

AM - Amplitude modulation
CELP - Code-excited linear predictive coding
Codec - Coder-decoder
dEGG - Differenced electroglottograph
DFT - Discrete Fourier transform
EGG - Electroglottograph
F0 - Instantaneous fundamental frequency
FAR - False alarm rate
FM - Frequency modulation
GCI - Glottal closure instant
GMM - Gaussian mixture model
HE - Hilbert envelope
HFSE - High-frequency band spectral energy
HMM - Hidden Markov model
HNGD - Hilbert envelope of double differenced NGD spectrum
IF - Instantaneous frequency
IDFT - Inverse discrete Fourier transform
LFSE - Low-frequency band spectral energy
LP - Linear prediction
LPCs - Linear prediction coefficients
MDR - Missed detection rate
modZFF - Modified zero-frequency filtering
MPE - Multi-pulse excitation
NGD - Numerator of group delay function
PDF - Probability density function
RPE - Regular-pulse excitation
SPE - Single-pulse excitation
SPE-CELP - Single-pulse excitation for CELP coding
Std dev - Standard deviation
STFT - Short-time Fourier transform
SVM - Support vector machine
TDR - True detection rate
ZFF - Zero-frequency filtering
ZFR - Zero-frequency resonator
ZTL - Zero-time liftering
Chapter 1
Issues in Analysis of Nonverbal Speech Sounds
1.1 Verbal and nonverbal sounds
Human speech can be broadly categorized into verbal and nonverbal speech. Verbal speech is associated with a language and conveys some message. Since a clear description of articulation exists for these sounds, they have reproducible characteristics. Some verbal sounds have complex articulation, making analysis of their production characteristics from the speech signal both interesting and challenging. For example, trills and some consonant clusters (e.g., fricatives, nasals) are interesting sounds for analysis. Nonverbal speech sounds convey mostly nonlinguistic information, although in some cases they may also carry a linguistic message. Production of these sounds is mostly involuntary and spontaneous. Hence they do not have any clear description of articulation. These sounds can be divided into three categories: emotional speech, paralinguistic sounds and expressive voices. Emotional speech communicates the linguistic message along with emotions like anger (shout), happiness, sadness, fear and surprise. Analysis of emotional speech involves characterization of the emotion and extraction of the linguistic message. Paralinguistic sounds refer to sounds produced due to laughter, cry, cough, sneeze and yawn. These sounds are produced involuntarily and hence are not easy to describe in terms of their production characteristics. They occur mostly in isolation, but may sometimes be interspersed with normal speech, and significant deviations from normal speech take place in their production. Expressive voices are sounds produced by specially trained voices, as in opera singing or the Noh voice. Being trained, these voices are mostly artistic. Their production is controlled voluntarily, and involves careful control mainly of the excitation component of the speech production mechanism.
It is indeed a challenging task to analyse the production characteristics of some types of verbal sounds and almost all types of nonverbal sounds. This is because it is difficult to separate the excitation source and the vocal tract system components of the speech production process from the speech signal. Also, the variations in production are due to transient phenomena; the production characteristics are therefore time varying and hence nonstationary. Moreover, complex interaction between the source and the system may occur during the production of these sounds. In some cases, the vibration characteristics at the glottis may be affected by the acoustic loading of the vocal tract system. Similarly, the resonance characteristics of the vocal tract system may be affected by the vibration characteristics at the glottis. It is observed that significant changes from normal speech occur in the production characteristics of nonverbal sounds, especially in the excitation source. Hence, determining the features characterizing nonverbal speech sounds is a challenging task. But these features are needed in practice, especially for spotting nonverbal sounds in continuous speech in particular, and in audio data in general.
Sounds from the following four categories are considered for a detailed comparative analysis of production characteristics from speech signals: normal speech, emotional speech, paralinguistic sounds and expressive voices. Production of normal (voiced) speech involves quasi-periodic vibration of the vocal folds. Coupling of the excitation source and the vocal tract system plays an important role in the production of specific normal speech sounds like trills. Tongue-tip trilling affects the glottal vibration through changes in the pressure difference across the glottis. The effects of acoustic loading on the glottal vibration due to stricture in the vocal tract are also examined for a few other sounds, to compare the effects of the type and extent of stricture. The voiced sounds considered for this study are: apical trill, apical lateral approximant, alveolar fricative, velar fricative, alveolar nasal and velar nasal. Distinctive features of the production of trills are explored for detecting trill regions in continuous speech.
In the production of emotional speech, the characteristics of normal speech sounds are modified, sometimes to a significant extent. This leads to the challenge of isolating the features corresponding to the linguistic message part and the emotion part. Among the several emotions and affective states, the case of shouted speech is considered for detailed study. Shouted speech is expected to affect both the excitation source component and the vocal tract system component. These effects are studied for four different levels of loudness, to examine the characteristics of shouted speech in relation to soft, normal and loud voices. The production characteristics of shouted speech are also examined to determine methods to extract the relevant features from the speech signal. These features not only help in characterizing the deviation of shouted speech from normal speech, but may also help in identifying the regions of shouted speech in continuous speech. This study may help in examining speech mixed with other types of emotions in a similar manner.
Production of paralinguistic sounds involves significant changes in the glottal vibration, with accompanying changes in the vocal tract system characteristics. Laughter sounds are considered for detailed study in this thesis. In laughter, significant changes occur mainly in the excitation, due to an involuntary burst of activity. The production characteristics of laughter (nonspeech-laugh) are studied in relation to laughed-speech and normal speech. Analysis of laughter sounds helps in determining the unique features of the production of laughter, which in turn help in spotting regions of laughter in continuous speech or in audio data. Synthesis of laughter helps in understanding the significance of the different features in the production of laughter.
Expressive voices like opera singing or Noh voice convey intense emotions, but these voices have
little linguistic content. Excitation source seems to play a major role in the production of these sounds.
Noh voice is chosen for detailed study of the characteristics of the excitation source. The rapid voluntary changes in the glottal vibration result in aperiodicity in the excitation. These changes also result
in the perception of subharmonic components in these artistic voices. Modeling the aperiodicities in
the excitation and extracting these characteristics from the signal is a challenging signal processing problem, as the resulting model should work satisfactorily for all cases of nonverbal speech
sounds. Hence, two synthetic impulse sequences modeling the amplitude modulation and frequency
modulation effects are used to examine changes in the excitation source characteristics.
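For illustration, a minimal Python sketch of such synthetic impulse sequences is given below; all parameter values (8 kHz sampling rate, 100 Hz nominal F0, 3 Hz modulation rate, modulation depths) are assumptions made for this example, not values taken from the studies described later.

import numpy as np

fs = 8000          # sampling rate (Hz), assumed
f0 = 100.0         # nominal fundamental frequency (Hz), assumed
dur = 1.0          # duration (s)
n = int(fs * dur)

# (a) Amplitude-modulated impulse sequence: fixed intervals, slowly varying strengths
am = np.zeros(n)
period = int(fs / f0)
locs = np.arange(0, n, period)
am[locs] = 1.0 + 0.5 * np.sin(2 * np.pi * 3.0 * locs / fs)     # 3 Hz strength modulation

# (b) Frequency-modulated impulse sequence: fixed strengths, varying intervals
fm = np.zeros(n)
t = 0.0
while t < dur:
    inst_f0 = f0 * (1.0 + 0.2 * np.sin(2 * np.pi * 3.0 * t))   # 3 Hz interval modulation
    fm[int(t * fs)] = 1.0
    t += 1.0 / inst_f0

Sequence (a) varies only the impulse amplitudes, while (b) varies only the intervals between impulses, which corresponds to the distinction made above between amplitude and frequency modulation of the excitation.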
1.2 Signal processing and other issues in the analysis of nonverbal sounds
Various research issues and challenges unique to the analysis of nonverbal speech sounds can be
grouped as related to: (a) production-specific nature, (b) databases and ground truth, (c) spotting and
classification, and (d) signal processing issues. Production-specific challenges are related to differences
in spontaneity, control over production and extent of the pre-meditation required before producing these
nonverbal sounds. Database-related issues pertain to their continuum nature (i.e., no clear boundaries separating these), the quality of emoting (relevant for data collection), and the absence of ground truth (reference) in most cases. Classification and spotting related issues are: (i) how to discriminate between normal and nonverbal speech, (ii) how to spot nonverbal sounds in continuous speech, (iii) how to identify their type/category, and (iv) what degree of confidence can be placed in classifying these emotions.
Signal processing issues are mainly related to extracting the characteristics of the excitation source from
the acoustic signal, and representing it in terms of a time domain sequence of impulse-like pulses.
Differences between nonverbal and normal speech sounds are reflected in their production characteristics. Production of nonverbal sounds occurs in short bursts of time and involves significant changes in the glottal source of excitation, though changes in the characteristics of the vocal tract system are also likely. Hence, differences in their production characteristics need to be studied in detail. Like normal
speech, the production characteristics of these nonverbal sounds also can be derived from the acoustic
signal. Their analysis involves examining changes in the glottal vibration characteristics, some of which
can be examined better from the EGG signal. The problem is complex, because the signal processing
methods that work well for normal speech in modal voicing, exhibit limitations in the case of these
nonverbal sounds. Thus, the first challenge is - (i) how to derive the excitation source characteristics
from the acoustic signal, especially for nonverbal speech sounds?
Impulse-sequence representation of the excitation source component of the speech signal has been of considerable interest in speech coding research over the past three decades. Different speech coding methods, such as waveform coders, linear prediction based source coders (vocoders) and analysis-by-synthesis based hybrid codecs, attempted it in different ways. Representation of the source information was attempted using multi-pulse [8, 6, 174, 26], regular-pulse [94], or stochastically generated code-excited linear predictive [165] excitation impulse sequences. Presence of secondary impulses within
a glottal cycle was also indicated in some studies [8, 6, 26, 165]. This representation was aimed at
achieving low bit-rates in coding and synthesis of natural-sounding speech in speech coders. But, its
role and possible merits in the analysis of nonverbal speech sounds have not been explored yet. Hence,
the important question is - (ii) can we represent the excitation source information in terms of a time-domain sequence of excitation impulse-like pulses, for these nonverbal sounds also?
The differences in various categories of sounds are possibly caused by differences in the locations of
these excitation impulse-like pulses in the sequence and their relative amplitudes. For example, in the
case of fricative sounds, these equivalent impulses may occur at random intervals with amplitudes of low
strength. In the case of vowels or vowel-like regions in modal voicing, these impulse-like pulses occur
at nearly regular intervals, with smooth changes in their amplitudes. But, in the case of nonverbal speech
sounds, these impulse-like pulses are likely to occur at rapidly changing or nearly random intervals, with
sudden changes in their amplitudes. It is manifested as rapid changes in the pitch-perception. In the
case of expressive voices like Noh, the aperiodicity in the excitation component is an important feature,
which may be attributed to unequal intervals between successive impulses and unequal strengths of
excitation around these. A measure of aperiodicity in the excitation component could possibly be based on the weighted mean-squared error of the reconstructed signal, perceptual impressions, or the saliency of pitch perception. Hence, in order to characterize and measure changes in the perception of pitch,
that could be rapid in the case of expressive voices, the important question is - (iii) how to determine
the locations and amplitudes of the equivalent impulse-like pulses that characterise and represent the
excitation source information, in accordance with some measure of the pitch perception, especially
for nonverbal sounds? Related issues here are ‘extraction of F0 in the regions of aperiodicity’, and
‘obtaining the sequence of impulses from the information of pitch-perception’, for expressive voices.
In the production of different sounds by humans, the amplitude modulation (AM) and frequency
modulation (FM) play a major role in the voluntary control of pitch of the sound or involuntary changes
in it, for example in singing or for trill sounds, respectively. These are obviously related to changes in the
characteristics of both the excitation source and the vocal tract system, due to source-system coupling.
Their effect on the pitch-perception could be significant in the case of these nonverbal sounds. Hence,
the effect of differences in the locations of excitation impulses and their relative amplitudes needs to be
studied in the AM/FM context for these sounds. Representation of the excitation source characteristics
in terms of an impulse-sequence would also facilitate its manipulation for diverse applications such
as synthesis of emotional speech, spotting of these sounds in continuous speech and classification of
the emotion categories. The key question here is - (iv) how to extract production features/parameters
from the acoustic signal that help in spotting these sounds in continuous speech? A related issue is the
significance of these features, both from perception and synthesis points of view.
The four key challenges in the context of nonverbal speech sounds can be summarised as: how
to (a) derive the excitation source characteristics from the acoustic signal, (b) represent this source
information in terms of a time-domain sequence of impulse-like pulses, (c) characterize these sounds,
and (d) extract features for spotting these in continuous speech, along with the significance of these
features from both synthesis and perception points of view? In this research work, the objective is to
investigate these four key questions posed and possibly answer some of these.
1.3 Objective of the studies in this thesis
Human speech sounds can be broadly divided into voiced and unvoiced sounds [102, 99, 37]. Voicing involves excitation of the vocal tract system by the vibration of vocal folds at the glottis [48, 45].
Unvoiced sounds have either insignificant, random or no excitation at all by the glottal source. Voiced
sounds are further classified into verbal and nonverbal. Verbal speech sounds have a higher degree of periodicity of glottal vibration, in modal voicing [99, 48]. On the other hand, nonverbal sounds are better characterised by aperiodicity and rapid changes in F0 and its harmonics/subharmonics, which are manifested as rapidly changing pitch perception. It is difficult to measure aperiodicity and pitch perception. Measuring the rapidness of changes in F0 is an even more difficult problem. In this work, the aim
is to examine these nonverbal speech sounds in terms of their production characteristics, predominantly
the excitation source characteristics, and examine the role of aperiodicity in their pitch-perception.
Emotional speech may be produced spontaneously or in a controlled manner (e.g., mimicry). But the production of paralinguistic sounds is mostly spontaneous and uncontrolled. Whereas a human producer of emotions is mostly conscious of their production, the production of paralinguistic sounds is rarely pre-meditated. Thus, production of emotional speech or expressive voices involves voluntary control over the production mechanism, and possibly speaker training as well. But the production of paralinguistic sounds like laughter involves only involuntary changes in the production mechanism. The producer of paralinguistic sounds may sometimes not even be conscious of their production. Also, the source characteristics apparently play a larger role in the production of paralinguistic sounds than in emotional speech. Associated changes in the vocal tract system characteristics may also play a role, to different degrees in both
the cases. Hence, selecting or developing appropriate signal processing methods that can accommodate
this wider range of changes in the speech production mechanism is a major signal processing challenge
here. Another major challenge in the study of these sounds is nonavailability or limited availability
of appropriate databases with ground truth. Most of the existing databases are application-specific or
designed for a specific purpose. Collection of natural data is yet another research issue. Hence, this
thesis is also aimed at addressing these databases related and signal processing challenges.
Production of expressive voices like Noh involves voluntarily controlled changes in the speech production mechanism, which occur in producing the trained voice that the artist has achieved through years of practice. But the production characteristics of paralinguistic sounds like natural laughter may involve involuntary changes. Hence, the role and effect of voluntary control in the speech production
mechanism needs to be studied, first for the normal speech and then possibly for expressive voices.
The role of involuntary changes in the production characteristics of paralinguistic sounds like laughter
needs to be examined in detail. The effect of source-system coupling also needs to be studied for some
dynamic sounds like trills in normal speech, and shouted (emotional) speech. The understanding of
shouted speech is expected to help in better understanding of emotions like anger in speech sounds.
There are some characteristics that are common to all types of nonverbal speech sounds. All are:
(i) nonsustainable, i.e., occur for short bursts of time, (ii) nonnormal, i.e., are deviations from normal,
(iii) form a continuum, i.e., are nondiscrete, and (iv) indicate humaneness, i.e., help in distinguishing between a human and a humanoid. Hence, it is possible to distinguish these sounds from normal speech with a better understanding of their production characteristics, which may also help in discriminating between a human and a machine. Characterising the nonverbal speech sounds may also possibly lead to
discovery of some speaker-specific identity or human-specific signature. Better understanding of the
production characteristics of these sounds would help in a wide range of applications, such as their
spotting in continuous speech, event detection, classification, speaker identification, man/machine discrimination and speech synthesis etc. [161, 127, 63, 83, 101].
1.4 Analysis tools, methodology and scope
Significant changes occur in the excitation source characteristics during the production of nonverbal
sounds, for which direct measurement is not feasible. Hence, analysis of these changes is required first,
in order to characterise these sounds. Changes in the excitation source characteristics during production
of these sounds may be reflected in the glottal pulse shape characteristics such as the open/closed phases
in a glottal cycle, rate of vibration of vocal folds and their relative rate of opening/closing. Representing
the source characteristics in terms of a time-domain sequence of impulses, whose relative amplitudes
and locations capture the excitation information, has also been a research issue. Extracting some of
these features from the acoustic signal may not be adequate and not feasible sometimes. But, changes in
these excitation characteristics are indeed perceived by human being in the acoustic signal of the sounds
produced. Hence, there is need to explore and analyse these sounds using other signals as well, for
example electroglottograph (EGG) signal, throat microphone signal, magnetic resonance imaging etc.
Extracting the excitation information from these signals poses another set of challenges. Focus in the
analysis of nonverbal speech sounds is mostly on the changes in the glottal vibration, which is the main
source of excitation of the vocal tract system. While some characteristics of the glottal vibration can be
studied through auxiliary measurements such as EGG and high-speed video, the main characteristics as
produced are present only in the acoustic output signal of the production system.
Standard signal processing methods assume quasi-stationarity of the speech signal and extract the
characteristics of excitation source and the vocal tract system using segmental block-processing methods, like discrete Fourier transform (DFT) and linear prediction (LP) analysis. These analysis tools
can help in determining the source as either voiced or unvoiced, and the periodicity in the case of
voiced sounds. They also help in describing the response of the vocal tract system in terms of either
smoothed short-time spectral envelope or in terms of resonances (or formants) of vocal tract. Those
block-processing methods have severe time-frequency resolution limitations. Moreover, they give only averaged characteristics of the source and the vocal tract system within the analysis segment, although it is well known that both the excitation and the vocal tract system vary continuously, mainly
because of the opening and closing of vocal folds at the glottis in each glottal cycle.
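For concreteness, a minimal Python sketch of such segmental block processing is given below (the frame and hop sizes are assumed values); each magnitude spectrum is necessarily an average over its whole analysis frame.

import numpy as np

def short_time_spectra(x, fs, frame_ms=25.0, hop_ms=10.0):
    """Frame-based (block-processing) DFT analysis of a one-dimensional signal x."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    win = np.hamming(frame)
    spectra = []
    for start in range(0, len(x) - frame + 1, hop):
        seg = x[start:start + frame] * win           # quasi-stationarity assumed within the frame
        spectra.append(np.abs(np.fft.rfft(seg)))     # magnitude spectrum averaged over the frame
    return np.array(spectra)

# Example: spectra of a 1 s, 100 Hz square-wave-like pulse train sampled at 8 kHz (toy signal)
fs = 8000
t = np.arange(fs) / fs
x = np.sign(np.sin(2 * np.pi * 100 * t))
S = short_time_spectra(x, fs)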
In order to study the changes in glottal vibration, it is necessary to know the open and closed regions
in each glottal cycle, besides the many secondary excitations due to complexity in glottal vibrations. Moreover, there are also significant variations in the strengths of major and minor excitations within each glottal cycle as well as across cycles. This results in significant aperiodicities in the glottal vibrations, which are due to either involuntary changes or voluntarily controlled changes in the excitation. In all such cases, deconvolution or inverse filtering of the vocal tract system response is not possible for
extracting the source characteristics. Methods are required to extract the excitation source information
directly from the signal, without prior estimation of the vocal tract system for inverse filtering.
Several studies were conducted over about the last decade to analyse and characterise emotional
speech, paralinguistic sounds and expressive voices. Characteristics of basic emotions were analysed
from speech signal in [19, 20, 209, 169, 179, 149, 89, 154]. Emotion recognition and classification was
carried out in [127, 167, 168, 120, 44, 156, 88]. Spotting and identification of emotions in human speech
was attempted in [103, 207, 14, 208]. Characteristics of shouted speech and scream were examined
in [146, 217, 219, 115]. Automatic detection of shout/scream in continuous speech was also attempted
in [70, 199, 160, 132, 131, 202]. Characteristics of paralinguistic sounds like laughter were examined
in [161, 83, 194, 195, 139]. Acoustic analysis of laughter was carried out in [11, 16, 134, 187, 111].
Developing systems for automatic detection of laughter in continuous speech were attempted in [83,
194, 195, 198, 84]. Expressive voices like Noh were analysed in [55, 214, 80, 78]. Analysis of Noh
voice using a tool named as TANDEM-STRAIGHT was carried out in [81, 79, 77]. Other phonetic
variations of modal voicing, like trills etc. were also analysed in [40, 116, 98, 178, 155, 41]. The effects
of source-system coupling and acoustic loading were studied in [189, 158, 191, 193, 108, 43].
But in the studies carried out so far on nonverbal speech sounds, the focus has mostly been on spectral characteristics [183, 199, 132, 131]. Characteristics of the vocal tract system are examined mostly using spectral features like Mel-frequency cepstral coefficients (MFCCs), linear prediction coefficients (LPCs), perceptual linear prediction (PLP) and short-time spectra derived using the discrete Fourier transform (DFT) [199, 131, 183, 217, 160, 101, 195]. However, in the production of nonverbal sounds, intuitively more changes appear to take place in the characteristics of the glottal source of excitation than in the vocal tract system characteristics, and these need to be examined in more detail. Representation of the excitation source information in the form of an impulse sequence is also convenient for manipulation and control of its features for different purposes. Limitations of the currently available signal processing methods and tools become evident in the case of acoustic signals that have rapid variations in pitch. Supra-segmental and
sub-segmental level studies also indicate the need of specialized tools and signal processing methods
for analysing paralinguistic sounds like laughter and expressive voices like Noh voice.
Some recent signal processing methods, like zero-frequency filtering (ZFF) and zero-time liftering (ZTL), have been proposed to overcome some of the limitations of block-processing and time-frequency resolution, and also to handle the rapidly time-varying characteristics of the excitation and the vocal tract system. But even these techniques are found to be inadequate for the analysis of nonverbal sounds, especially laughter and expressive voices. Hence, new approaches are needed, which may involve refinement of existing methods like DFT, LP analysis, ZFF and ZTL, or entirely different methods which bring out some specific characteristics of excitation.
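The ZFF method itself is described in Chapter 3. Purely as a rough illustration of the usual recipe (difference the signal, pass it through a cascade of zero-frequency resonators, and then remove the local trend over roughly one average pitch period), a minimal Python sketch is given below; the window length, the number of trend-removal passes and the sampling rate are assumptions of this example.

import numpy as np

def zff_epochs(x, fs, avg_pitch_ms=10.0):
    """Rough zero-frequency-filtering sketch: positive-going zero crossings of the
    returned filtered signal approximate the instants of glottal closure."""
    d = np.diff(x, prepend=x[0])                  # difference the signal to remove any DC offset
    y = d.astype(float)
    for _ in range(4):                            # cascade of two zero-frequency resonators,
        y = np.cumsum(y)                          # each equivalent to a double integration
    w = max(3, int(fs * avg_pitch_ms / 1000))     # trend-removal window of about one pitch period
    kernel = np.ones(w) / w
    for _ in range(3):                            # repeated local-mean subtraction removes the trend
        y = y - np.convolve(y, kernel, mode='same')
    epochs = np.where((y[:-1] < 0) & (y[1:] >= 0))[0]   # positive zero crossings
    return y, epochs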
In this research work, the production characteristics of nonverbal speech sounds are examined from
both the EGG and acoustic signals, using some relatively new signal processing methods. These sounds
along with normal speech sounds are examined in four categories, which differ in the degree of periodicity (or aperiodicity) of glottal excitation and rapidness of changes in the instantaneous fundamental
frequency (F0 ). These categories are: (a) normal speech in modal voicing that includes study of effects
of source-system interaction for a few sounds like trills and fricatives, (b) emotional speech that
includes loudness level variations in speech, namely, soft, normal, loud and shouted speech, (c) paralinguistic sounds like laughter and (d) expressive voices like Noh, that convey intense emotions.
The four categories of sounds examined are in decreasing order of periodicity of glottal excitation,
with increasingly more rapid and wider range of changes in the F0 (and hence pitch). Therefore, different signal processing techniques are required for deriving the distinguishing features in each case. In
a few cases, the production characteristics can be derived from the EGG signal [53, 50, 121, 47] that
can distinguish these sounds from normal speech well, and help in classifying these [123, 124]. But, for
the applications such as automatic spotting in continuous speech and classification etc., deriving these
production characteristics from the acoustic signal is preferable, which is explored in detail in this work.
The recently proposed signal processing methods like zero-frequency filtering [130, 216], zero-time liftering [38, 213], Hilbert transform [170, 137] and group delay function [128, 129] are used for feature
extraction. Existing methods based upon LPCs [112, 114], MFCCs [114] and DFT spectrum [137] are
also used. New signal processing methods such as modified zero-frequency filtering (mZFF), dominant frequencies computation using LP spectrum/group delay function and saliency computation, i.e., a
measure for pitch perception in expressive voices, are proposed in this work.
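As a small illustration of one such tool (a generic formulation, not necessarily the exact procedure followed in this work), the group delay of a windowed frame can be computed without explicit phase unwrapping, using the DFTs of x[n] and n x[n]:

import numpy as np

def group_delay(frame, nfft=512, eps=1e-10):
    """Group delay (in samples) of one windowed frame, via the n*x[n] identity."""
    frame = np.asarray(frame, dtype=float)
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, nfft)
    Y = np.fft.rfft(n * frame, nfft)
    denom = np.abs(X) ** 2 + eps                 # small eps guards against spectral nulls
    return (X.real * Y.real + X.imag * Y.imag) / denom

Peaks of this function indicate dominant resonances; the specific dominant-frequency computation used in this work is described in Chapter 3.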
A representation of the excitation source information in terms of a time-domain sequence of impulses is proposed, which is also related to pitch perception of aperiodicity in expressive voices. Using
this representation and saliency of pitch perception, a method is proposed for extracting the instantaneous fundamental frequency (F0 ) of expressive voices, even in the regions of harmonics/subharmonics
and aperiodicity, which otherwise is a challenging task. Validation of results is carried out using spectrograms, saliency measure, perceptual studies and synthesis. Several novel production features are
proposed that capture the unique characteristics of each of the four sound categories examined. Parameters capturing the degree of changes and temporal changes in these features are derived, that help
in distinguishing these sounds from normal speech. Three prototype systems, namely, automatic trill
detection system, automatic shout detection system and laughter detection system are developed, for
spotting in continuous speech the regions of trill, shout and laughter, respectively. Specific databases
with ground truth are collected in each case for developing and testing these systems. Performance
evaluation is carried out using saliency measure, perceptual studies and synthesis.
This work is aimed at examining the feasibility of representing the excitation source characteristics
in terms of a time-domain sequence of impulses, mainly for nonverbal speech sounds. The differences
in locations and respective amplitudes of impulses relate to the aperiodicity and subharmonics present
in highly expressive voices like Noh singing. Only representative sounds of emotional speech and
paralinguistic sounds, namely shouted speech and laughter, respectively, are analysed. Other often-considered emotions like happiness or anger, or paralinguistic sounds like cry or cough, are not considered in this work.
The objective of the features and parameters is to derive the characteristics that distinguish these sounds from normal speech. Hence, only prototypes are developed for automatic detection of trill, shout
and laughter in continuous speech, in order to validate the efficacy of these parameters. Performance
evaluation of these prototype systems is carried out on limited databases collected especially for this
study, due to nonavailability of any standard databases with EGG signals and ground truth. These
systems can be further developed into online fully automated systems for real-life practical applications,
that would require testing on large databases and in real-life practical scenarios.
1.5 Organization of thesis
The organization of the thesis is as follows:
Chapter 2 reviews the methods for analysis of nonverbal speech sounds. Basics of human speech
production with glottal source processing are revisited. Analytical signal and parametric representation
of speech signal are discussed in brief, in the context of notion of frequency. Earlier studies related to the
analysis of each of the four sound categories are reviewed briefly. Studies on source-system interaction
in normal speech, for a few special sounds such as trills, and effects of acoustic loading on the glottal
vibration are reviewed. Following this, the studies analysing the shouted speech, laughter and Noh voice
are reviewed in brief. A few studies that attempted detection of acoustic events such as trills, shouts and laughter are also discussed. Limitations of these past studies and related issues are discussed.
In Chapter 3, some standard and a few recently proposed signal processing methods are described
that are used, modified and refined later in this thesis work. Standard signal processing techniques used
in speech coding methods for representing the excitation source information in terms of an impulse sequence are reviewed briefly. Mathematical background of the need for new signal processing techniques
is discussed. Then recently proposed signal processing techniques, mainly the zero-frequency filtering
and zero-time liftering methods are discussed, which are used for extracting impulse-sequence representation of the source and the spectral characteristics of the acoustic signal, respectively. Methods for
computing dominant frequencies of the vocal tract system are also discussed.
In Chapter 4, normal speech in modal voicing is examined for dynamic sounds like trills, to study the
relative role of excitation source and vocal tract system, in its production. The effect of source-system
interaction, and the effect of acoustic loading on the excitation source characteristics, are examined for
a few consonants. Production characteristics of apical trill, apical lateral approximant, voiced fricatives
and nasal sounds are examined. Both EGG and speech signals are studied in each case. Representation
of the excitation source characteristics in terms of a time-domain impulse sequence is used for deriving
features for the analysis.
In Chapter 5, emotional speech is examined to distinguish between shouted and normal speech.
The production characteristics of speech produced at four different loudness levels, namely, soft, normal, loud and shout are analysed. The effect of glottal dynamics is examined in the production of
shouted speech, in particular. A set of distinguishing features and parameters are derived from the
impulse-sequence representation of the excitation source characteristics, and from the vocal tract system characteristics.
In Chapter 6, the production characteristics of paralinguistic sounds like laughter are studied. Three
cases are considered, namely, normal speech, laughed-speech and nonspeech-laugh. A modified zero-frequency filtering method is proposed, to derive the excitation source characteristics from the speech
signal consisting of laughter. A set of discriminating features are extracted and parameters are derived,
representing the production characteristics, which help in distinguishing these three cases.
In Chapter 7, the excitation source characteristics of expressive voices, Noh voice in particular, are
examined. Significance of aperiodic component of the excitation that contributes to the voice quality of
expressive voices is analysed. The role of amplitude/frequency modulation is examined using synthetic
AM/FM sequences. Signal processing methods are proposed for deriving a time-domain representation
of the excitation source information, computation of saliency of pitch perception and F0 extraction in
the regions of aperiodicity. Spectrograms, saliency measure and signal synthesis are used for testing
and validating the results.
In Chapter 8, three prototype systems are developed for the automatic detection of trills, shout and
laughter in continuous speech. These prototype systems are helpful in testing the efficacy of features
extracted and parameters derived, in distinguishing between nonverbal speech and normal speech. Limited testing and performance evaluation is carried out on different datasets with ground truth, collected
especially for the study.
Lastly in Chapter 9, contributions of this research work are summarized. Issues arising out of the
research work are discussed, along with scope of further work in this research domain.
Chapter 2
Review of Methods for Analysis of Nonverbal Speech Sounds
2.1 Overview
In the production of nonverbal speech sounds, a significant role is played by the glottal source of excitation. Hence, the basics of speech production and the significance of glottal vibration are revisited in this
chapter. Changes in the instantaneous fundamental frequency (F0 ) are expected to be increasingly more
rapid in the four categories of sounds considered for analysis (in order) in this thesis. Hence, the notion of frequency is also revisited, with a brief discussion on the analytic signal and parametric representation of the speech signal. Earlier studies related to analyses of the four sound categories are reviewed briefly. First,
studies on the nature of production of some special sounds that involve source-system coupling effects,
such as trills, are reviewed. Then studies highlighting the significance of source-system interaction and
acoustic loading effects in the production of some specific sounds are reviewed. Earlier studies on emotional speech, paralinguistic sounds and expressive voices in general, and on shouted speech, laughter
and Noh voice in particular, are reviewed next. Studies on the role of aperiodicity in expressive voices
and F0 extraction are also reviewed. Analysis of the production characteristics of nonverbal speech
sounds has helped in identifying a few of their unique characteristics. Exploiting these, distinctive features and parameters can be derived to discriminate these sounds from normal speech. Hence, studies
towards detecting the acoustic events such as trills, shouts and laughter in continuous speech are also
reviewed. Limitations of these past studies and related issues are also discussed.
The chapter is organized as follows. Human speech production, with the role of glottal vibration, is discussed briefly in Section 2.2. In Section 2.3, the analytic signal and parametric
representation of speech signal are discussed while revisiting the basic concepts on frequency. Earlier
studies on the production of trills are reviewed in Section 2.4. Effects of source-system interaction and
acoustic loading of the vocal tract system on glottal vibration are also discussed. Studies on emotional
speech sounds and shouted speech are reviewed in Section 2.5. Section 2.6 reviews earlier studies on the need for analysing laughter, its different types, and the acoustic analysis of laughter production. Studies on
motivation for studying expressive voices, representation of the excitation source component in terms
of time domain sequence of impulses and issues involved are reviewed in Section 2.7. In Section 2.8,
studies on the feasibility of developing systems for automatic detection of trill, shout and laughter regions in continuous speech are reviewed in brief. A summary of this chapter is given in Section 2.9.
Figure 2.1 Schematic diagram of human speech production mechanism (figure taken from [153])
Figure 2.2 Cross-sectional expanded view of vocal folds (figure taken from [157])
2.2 Speech production
Both human speech and other non-normal sounds are produced by the speech production mechanism
(Fig. 2.1) studied broadly in two components, the excitation source and the vocal tract system [153].
Various parts in the vocal tract system, referred to as the system, can be grouped as lungs, larynx and
the vocal tract. When the airflow from lungs is pushed through the larynx, this airflow is modulated by
the vibration of vocal folds, also termed as vocal cords or the source. This is called excitation of the
system. The primary source of excitation is due to vibration of the vocal folds, shown in a cross-sectional
expanded view in Fig. 2.2 [157]. Since the volume velocity waveform that provides excitation to the system is termed the glottal source, this excitation is also referred to as glottal vibration. The regular opening/closing of the vocal
folds, and the corresponding changes in airflow in the vocal tract result in production of different speech
sounds.
[Figure 2.3 panels: (a) input speech signal xin[n], (b) EGG signal ex[n], (c) LP residual rx[n], (d) glottal pulses grx[n] derived from the LP residual; amplitudes normalised to the range -1 to 1, time axis 0 to 50 ms.]
Figure 2.3 Illustration of waveforms of (a) speech signal, (b) EGG signal, (c) LP residual and (d) glottal
pulse information derived using LP residual
Shape of the vocal tract changes dynamically due to movement of upper and lower articulators (lips,
teeth, hard/soft palate, tongue and velum), thus also changing the stricture formed along the vocal tract.
Vocal folds vibration converts the airflow passing through the vocal tract into acoustic pulses, thus
providing an excitation signal to the vocal tract. The vocal tract consists of oral, nasal and pharyngeal
cavities, resonances of which are related to the shape (contour) of short-time spectrum of the modulated
airflow signal. Changes in the shape of the vocal tract also change resonances in it. Hence, it may be
inferred that speech sound is produced by the time-varying excitation of the time-varying vocal tract
system.
The quasi-periodic vibration (opening/closing) of the vocal folds in the case of normal speech causes the pitch perception. The closure of the vocal folds is relatively abrupt (i.e., faster) as compared to their
gradual opening. Hence, there is significant excitation around these time-instants, called glottal closure instants (GCIs). The glottal closure instants can be seen better in the electroglottograph (EGG)
signal shown in Fig. 2.3(b), as compared to the corresponding speech signal shown in Fig. 2.3(a). The
interval between successive GCIs, which is nearly periodic during the production of normal speech in
modal voicing (e.g., vowels), is termed the glottal cycle period (T0). The inverse of T0 gives the instantaneous fundamental frequency (F0), which is a characteristic of each human being. The F0 is related to pitch perception and is different for different sounds. Since excitation is time-varying, the F0 (i.e., pitch)
also varies dynamically with time for each sound produced. Changes in pitch (F0 ) are less for normal
speech, particularly for vowels. But, changes in F0 and pitch are quite rapid in the case of nonverbal
speech. There is secondary excitation also present in the production of some particular sounds, which
is due to stricture in the vocal tract. For example, production of fricative sounds ([s] or [h]) having
higher frequency (1-5 kHz) content is related with this secondary excitation. The production of nonverbal speech sounds may also involve secondary excitation, which may cause significant changes in
the source characteristics such as F0 (pitch).
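As a simple numerical illustration of the T0-F0 relation (assuming GCI locations are already available, for example from the EGG signal), successive GCIs can be converted into an instantaneous F0 contour:

import numpy as np

def f0_from_gcis(gci_samples, fs):
    """Instantaneous F0 (Hz) from glottal closure instants given as sample indices."""
    gci = np.asarray(gci_samples, dtype=float)
    t0 = np.diff(gci) / fs        # glottal cycle periods T0 (s) between successive GCIs
    return 1.0 / t0               # F0 = 1/T0 for each glottal cycle

# Example: GCIs every 80 samples at fs = 8000 Hz correspond to F0 = 100 Hz
print(f0_from_gcis([0, 80, 160, 240], fs=8000))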
Figure 2.4 Schematic view of vibration of vocal folds for different cases: (a) open, (b) open at back
side, (c) open at front side and (d) closed state (figure taken from [157])
Production characteristics of normal (verbal) speech can be analysed using standard signal processing techniques such as short-time Fourier spectrum, linear prediction analysis, and spectral measures
such as linear prediction cepstral coefficients (LPCCs) and Mel-frequency cepstral coefficients (MFCCs)
etc. The instants of significant excitation (GCIs) can be identified using linear prediction (LP) residual
(Fig. 2.3(c)), EGG signal or using a representation in terms of a time-domain sequence of impulses.
The information of the excitation source is carried in the impulse-sequence, in relative amplitudes of
impulses and their locations that may correspond to GCIs. This impulse-sequence representation of the excitation information has, however, been a research challenge. But if the same can be achieved for nonverbal speech sounds, it would be immensely useful in speech signal analysis, speech coding, speech synthesis and many other applications.
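As an illustration of how an LP residual such as the one in Fig. 2.3(c) can be obtained (a generic textbook recipe, not necessarily the exact implementation used here), the following Python sketch estimates the LP coefficients of a frame by the autocorrelation method and then inverse filters the frame:

import numpy as np

def lp_coefficients(frame, order=10):
    """LP coefficients a[1..p] by the autocorrelation method (normal equations)."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:len(frame) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def lp_residual(frame, order=10):
    """Inverse filter the frame with its LP coefficients to obtain the residual."""
    frame = np.asarray(frame, dtype=float)
    a = lp_coefficients(frame, order)
    pred = np.zeros_like(frame)
    for k in range(1, order + 1):
        pred[k:] += a[k - 1] * frame[:-k]        # prediction from the past p samples
    return frame - pred                          # residual = actual - predicted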
Production of these nonverbal speech sounds involves changes in the characteristics of the vocal
tract system, studied in most of the acoustic analyses of these sounds. But, in production of these
sounds, significant changes seem to occur in the excitation source characteristics, some of which may be
reflected in different characteristics of glottal vibration shown schematically in Fig. 2.4 [157]. Changes
could be in the glottal cycle periods T0 (hence F0 ), relative durations of the open/closed phases in a
glottal cycle, and the rate of opening/closing of vocal folds etc.
In the production of voiced sounds, the regular opening/closing (i.e., vibration) of the vocal folds periodically interrupts the airflow from the lungs passing through the vocal tract, which causes changes in the
air pressure [200, 92]. Vocal folds are usually open, when no sound is being phonated. In the production of unvoiced sounds, the vocal folds are held apart. Thus airflow is allowed to pass freely and
noise excitation is generated due to turbulence of airflow. During production of voiced sounds, the vocal
folds close-open-close regularly. Closing of the vocal folds is by the control of adductor muscles, that
bring the vocal folds together and provide resistance to the air flow from the lungs. The air pressure
built-up below the closed vocal folds (i.e., subglottal air pressure) forces the vocal folds to open and
allow airflow to pass through the glottis into the vocal tract. The vocal folds are closed again by two
factors: (i) elasticity of the vocal folds tissue, that forces it to regain its original configuration (closed
position) and (ii) aerodynamic forces described by the Bernoulli effect, which cause a drop of pressure in the glottis when the airflow velocity increases. After the vocal folds are closed, the subglottal air pressure builds up again, forcing the vocal folds to open, and thus the cycles continue. The period of this cycle is the glottal cycle period T0, and its frequency F0 (= 1/T0) is referred to as the fundamental frequency.
Figure 2.5 Schematic views of glottal configurations for various phonation types: (a) glottal stop, (b) creak, (c) creaky voice, (d) modal voice, (e) breathy voice, (f) whisper, (g) voicelessness. Parts marked in (g): 1. glottis, 2. arytenoid cartilage, 3. vocal fold and 4. epiglottis. (Figure is taken from [60])
Changes in the opening/closing direction (front/rear) of the vocal folds and the rate of their opening/closing are related to three voicing types: (i) modal voicing, (ii) breathy voice and (iii) creaky
voice. Unvoiced whisper, voicelessness and glottal stop are also possible. Different phonation types in
normal speech are schematically shown in Fig.2.5 [60]. Vibration of the vocal folds can be observed
only for creaky, modal and breathy voices, shown by the corrugated line between the vocal folds in
Fig. 2.5(c), (d) and (e), respectively. Only the modal voicing phonation type is considered in this study,
because it represents a neutral phonation with little variation in the period (T0) over successive glottal cycles. Analysing the changes in these characteristics of glottal vibration from the acoustic signal is still a challenge. The glottal pulse shape characteristics may be derived from the LP residual of the speech signal, as shown in Fig. 2.3(d), but only for a few cases. Therefore, the production characteristics of these
sounds are analysed in this thesis, from both EGG and acoustic signals. The research issues involved in
analysing these sounds are discussed further in Section 2.4.
2.3 Analytic signal and parametric representation of speech signal
(A) Notion of frequency for stationary signal
(i) Notion of frequency in mechanics:
Frequency (f ) of vibratory motion is defined in mechanics as the number of oscillations per unit time.
Here, oscillation is a complete cycle of to-and-fro motion, starting from the equilibrium position to one
end, then to the other end, and then back to the initial position. Harmonic motion is a special type of vibratory
motion, in which the acceleration is proportional to the displacement and is always directed towards
the equilibrium (or centre) position. For example, if a body is in circular motion then the projection of
this motion on a diameter is in harmonic motion. The displacement, velocity and acceleration of the
harmonic motion of this projection at an instant t can be given by (2.1), (2.2) and (2.3), respectively [18]:

x(t) = a_0 \cos(\omega t)    (2.1)

x'(t) = -a_0\, \omega \sin(\omega t)    (2.2)

x''(t) = -a_0\, \omega^2 \cos(\omega t) = -\omega^2 x(t)    (2.3)

where a_0 is the radius (or maximum displacement), and \omega (= 2\pi f) is the uniform angular speed. The frequency f is obtained by solving the differential equation (2.3). The solution is given by

y(t) = \alpha\, e^{j 2\pi f t}    (2.4)

where \alpha is an arbitrary constant, and the frequency f is given by f = \omega/2\pi.
(ii) Notion of frequency for signals:
A signal conveys information about the state or behaviour of a physical system. The variable representing the signal could be continuous (denoted by ‘( )’) or discrete (denoted by ‘[ ]’). Discrete-time signals are represented as sequences of numbers [138]. A signal s(t) could be electric, acoustic, wave motion or harmonic motion etc. A stationary signal is usually defined for a causal linear time-invariant (LTI) stable system. A linear system is defined by the principle of superposition [138]. The system is linear if and only if

T\{x_1[n] + x_2[n]\} = T\{x_1[n]\} + T\{x_2[n]\}    (2.5)

T\{a\, x[n]\} = a\, y[n]    (2.6)

where x_1[n] and x_2[n] are inputs to the system, y_1[n] (= T\{x_1[n]\}), y_2[n] (= T\{x_2[n]\}) and y[n] (= T\{x[n]\}) are the responses (outputs) of the system, T denotes the transformation by the system, and a is an arbitrary constant. The two properties of the superposition principle represented by (2.5) and (2.6) are the additive property and homogeneity (scaling property), respectively.
A time-invariant (or shift-invariant) system is one for which a delay or shift n_0 of the input sequence x[n] causes a corresponding shift in the output sequence y[n] [138], i.e.,

\text{if } x_1[n] = x[n - n_0], \text{ then } y_1[n] = y[n - n_0], \quad \forall\, n_0    (2.7)
A system is causal (nonanticipative) if, for every choice of n0 , the output sequence (y[n]) value at
n = n0 depends only on the input sequence (x[n]) values for n ≤ n0 . A stable system has bounded
output for bounded input, i.e., every bounded input sequence (x[n]) produces a bounded output sequence
(y[n]) [138]. If hk [n] represents the response of the system to an input impulse δ[n − k] occurring at
time n = k, then linearity is expressed as [138]
y[n] = T\left\{ \sum_{k=-\infty}^{\infty} x[k]\, \delta[n-k] \right\}    (2.8)

From the principle of superposition, (2.8) gives

y[n] = \sum_{k=-\infty}^{\infty} x[k]\, T\{\delta[n-k]\} = \sum_{k=-\infty}^{\infty} x[k]\, h_k[n]    (2.9)
From the property of time-invariance, if h[n] is the response to δ[n] then the response to δ[n − k] is
h[n − k], i.e., hk [n] = h[n − k]. Hence, for a linear time-invariant (LTI) system, the response (output)
is given by [138]
y[n] = \sum_{k=-\infty}^{\infty} x[k]\, h[n-k]    (2.10)

The LTI system response y[n] is also called the convolution sum, which can be represented as the convolution of the input sequence x[n] with the impulse response h[n] (the response of the system to the impulse input \delta[n]) [138], as

y[n] = x[n] * h[n]    (2.11)
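A direct numerical check of the convolution sum (2.10)-(2.11), given here purely for illustration:

import numpy as np

x = np.array([1.0, 2.0, 3.0])        # input sequence x[n]
h = np.array([1.0, -1.0, 0.5])       # impulse response h[n] of an LTI system

# Convolution sum y[n] = sum_k x[k] h[n-k], evaluated directly
y_direct = np.array([sum(x[k] * h[n - k]
                         for k in range(len(x)) if 0 <= n - k < len(h))
                     for n in range(len(x) + len(h) - 1)])

print(np.allclose(y_direct, np.convolve(x, h)))   # True: matches the library convolution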
(iii) Spectral decomposition and reconstruction of the signal:
For a linear time-invariant causal and stable system, a signal (s(t)) can be represented as a weighted
sum of harmonic vibrations [18, 138]. The spectral decomposition of the signal s(t) can be obtained
using the Fourier transform (FT) [138] of the signal, which is defined as
S(f) = \int_{-\infty}^{\infty} s(t)\, e^{-j2\pi f t}\, dt    (2.12)

It is called the analysis equation [18]. The signal s(t) can be reconstructed from the spectral decomposition S(f), which completely characterises the signal s(t). The reconstructed signal s(t) can be obtained using the inverse Fourier transform (IFT) [138], which is given by

s(t) = \int_{-\infty}^{\infty} S(f)\, e^{j2\pi f t}\, df    (2.13)

It is called the synthesis equation [18]. Using (2.12) and (2.13), “any stationary signal can be represented
as the weighted sum of sine and cosine waves with particular frequencies, amplitudes and phases” [18].
The digital equivalent representation of the spectral decomposition (X[k]) and the reconstructed signal (x[n]) for a periodic input sequence (x[n]), for one period, is expressed as digital Fourier transform
(DFT) pair [9], given by
X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j2\pi k n/N}, \quad 0 \le k \le N-1, \;\; \text{and } X[k]=0 \text{ otherwise}    (2.14)

x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, e^{j2\pi k n/N}, \quad 0 \le n \le N-1, \;\; \text{and } x[n]=0 \text{ otherwise}    (2.15)
where N is the number of sample points in one period; the unique frequency information lies in the range 0 to fs/2, where fs is the sampling frequency. Here,
X[k] (i.e., discrete Fourier transform (DFT)) represents a frequency-domain sequence, and x[n] (i.e.,
inverse discrete Fourier transform (IDFT)) represents a time-domain sequence.
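A numerical check of the DFT pair (2.14)-(2.15), given for illustration (numpy's FFT follows the same convention, with the 1/N factor in the inverse transform):

import numpy as np

N = 8
n = np.arange(N)
x = np.cos(2 * np.pi * n / N) + 0.5 * np.sin(4 * np.pi * n / N)   # one period of a toy sequence

# Analysis (2.14) and synthesis (2.15), written out explicitly
X = np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N)) for k in range(N)])
x_rec = np.array([np.sum(X * np.exp(2j * np.pi * np.arange(N) * m / N)) / N for m in range(N)])

print(np.allclose(x, x_rec.real))        # True: the sequence is reconstructed exactly
print(np.allclose(X, np.fft.fft(x)))     # True: matches the library DFT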
(B) Notion of frequency for nonstationary signals
The frequency can be defined unambiguously for the stationary signals, but not for nonstationary
signals. Attempts were made to define frequency of nonstationary signals using the term instantaneous
frequency (IF). But, “there is an apparent paradox in associating the words instantaneous and frequency.
For this reason, the definition of IF is controversial, application-related and empirically assessed” [18].
A summarized view of these different approaches, reviewed by B. Boashash in [18], is as follows:
(i) FM based definition of IF (Carson and Fry, 1937) [25]:
In the context of electric circuit theory, a frequency modulated (FM) wave is defined as
\omega(t) = \exp\left( j\left( \omega_0 t + \lambda \int_0^t m(t)\, dt \right) \right)    (2.16)

where m(t) is a low-frequency signal (|m(t)| \le 1), \omega_0 (= 2\pi f_c) is a constant carrier frequency and \lambda is the modulation index. Using (2.16), the instantaneous angular frequency \Omega(t) and the instantaneous cyclic frequency f_i(t) are given by (2.17) and (2.18), respectively, as

\Omega(t) = \omega_0 + \lambda\, m(t)    (2.17)

f_i(t) = f_0 + \frac{\lambda}{2\pi}\, m(t)    (2.18)
where Ω(t) = 2πfi (t) and ω0 = 2πf0 .
(ii) AM, PM based definition of IF (Van der Pol, 1946) [201]:
For a simple harmonic motion, IF can be defined by analysing the expression
s(t) = a \cos(2\pi f t + \theta)    (2.19)

where a is the amplitude, f is the frequency of oscillation, the phase \phi(t) (= 2\pi f t + \theta) is the argument of the cosine function, and the phase constant \theta is the constant part of the phase \phi(t). The amplitude modulation (AM) and phase modulation (PM) are defined by (2.20) and (2.21), respectively, as

a(t) = a_0 (1 + \mu g(t))    (2.20)

\theta(t) = \theta_0 (1 + \mu g(t))    (2.21)

where g(t) is the modulating signal and \mu is the amplitude/phase modulation index. Using (2.20) and (2.21), equation (2.19) can be re-written for a nonstationary signal as

s(t) = a \cos\left( \int_0^t 2\pi f_i(t)\, dt + \theta \right)    (2.22)

where the phase \phi(t) = \int_0^t 2\pi f_i(t)\, dt + \theta. Hence, IF is given by

f_i(t) = \frac{1}{2\pi} \frac{d\phi(t)}{dt}    (2.23)
The IF can thus be defined by the derivative of the phase angle, i.e., IF is the rate of change of the phase angle at time t.
(iii) Analytic signal based definition of frequency (D. Gabor, 1946) [56]:
Gabor defined the complex analytic signal using the Hilbert transform, as

z(t) = s(t) + j\, y(t) = a(t)\, e^{j\phi(t)}    (2.24)

where y(t) = H(s(t)) represents the Hilbert transform (HT) of s(t). It is defined as

H(s(t)) = \mathrm{p.v.} \int_{-\infty}^{\infty} \frac{s(t-\tau)}{\pi\tau}\, d\tau    (2.25)

where p.v. denotes the Cauchy principal value of the integral. It satisfies the following properties [18]:

y(t) = H(s(t)), \quad s(t) = -H(y(t)), \quad s(t) = -H^2(s(t)),    (2.26)

and s(t), y(t) contain the same spectral components.
The analytic signal in the time domain, i.e., z(\tau), is defined as an analytic function of the complex variable \tau = t + j u in the upper half plane, i.e., \mathrm{Im}(\tau) \ge 0. This means that (2.26) satisfies the Cauchy-Riemann conditions. The real part of z(\tau) equals s(t) on the real axis. The imaginary part of z(\tau), which takes on the value y(t), i.e., H(s(t)), is called the quadrature signal, because s(t) and H(s(t)) are out of phase by \pi/2. Hence,

z(\tau) = s(t,u) + j\, y(t,u)    (2.27)

When u \to 0, i.e., \tau \to t, we get the following, as in (2.24):

z(t) = s(t) + j\, y(t) = a(t)\, e^{j\phi(t)}    (2.28)
The analytic signal in the frequency domain, i.e., Z(f), can also be defined [18] as the complex function Z(f) of the real variable f. It is defined for f \ge 0 such that z(\tau) is the inverse Fourier transform (IFT) of Z(f), i.e.,

z(\tau) = \int_{0}^{\infty} Z(f)\, e^{j2\pi f \tau}\, df    (2.29)
where the function z(τ ) of complex variable τ = t + j u is defined for Im(τ ) ≥ 0. Hence, analytic
signal has spectrum limited to positive frequencies only, and sampling frequency can be equal to half of
the Nyquist rate. The central moments of frequency of the signal are given by

\langle f^n \rangle = \frac{\int_{-\infty}^{\infty} f^n\, |Z(f)|^2\, df}{\int_{-\infty}^{\infty} |Z(f)|^2\, df}    (2.30)
where Z(f) is the spectrum of the complex signal. It may be noted that if the spectrum of the real signal were used in (2.30), then all odd moments would be zero, since |S(f)|^2 is even, which would not be in line with physical reality.
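As a numerical illustration of (2.24)-(2.29), the analytic signal of a real sequence can be formed by retaining only the non-negative-frequency half of its spectrum (doubling the strictly positive bins) and taking the inverse transform; the imaginary part is then the Hilbert transform of the input. The sketch below is generic and not specific to this thesis.

import numpy as np

def analytic_signal(x):
    """Analytic signal z[n] of a real sequence x[n], via a one-sided spectrum (cf. (2.29))."""
    N = len(x)
    X = np.fft.fft(x)
    H = np.zeros(N)
    H[0] = 1.0                        # keep the DC bin
    if N % 2 == 0:
        H[N // 2] = 1.0               # keep the Nyquist bin for even N
        H[1:N // 2] = 2.0             # double the positive-frequency bins
    else:
        H[1:(N + 1) // 2] = 2.0
    return np.fft.ifft(X * H)         # real part = x[n], imaginary part = Hilbert transform of x[n]

# Example: for a cosine, the imaginary part is the corresponding sine
n = np.arange(256)
x = np.cos(2 * np.pi * 8 * n / 256)
z = analytic_signal(x)
print(np.allclose(z.imag, np.sin(2 * np.pi * 8 * n / 256)))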
(iv) Unification of analytic signal and IF (Ville, 1948) [204]:
Ville unified the analytic signal defined by Gabor [56] with the notion of IF given by Carson and
Fry [25]. For a signal expressed by s(t) = a(t) cos(φ(t)), he defined IF as
f_i(t) = \frac{1}{2\pi} \frac{d}{dt}\left( \arg z(t) \right)    (2.31)

where z(t) is the analytic signal given by (2.28). Ville noted that since the IF is time-varying, some instantaneous spectrum should be associated with it. The mean value of the frequencies in this instantaneous spectrum is equal to the time average of the IF [204, 18], i.e.,

\langle f \rangle = \langle f_i \rangle    (2.32)

where

\langle f \rangle = \frac{\int_{-\infty}^{\infty} f\, |Z(f)|^2\, df}{\int_{-\infty}^{\infty} |Z(f)|^2\, df}    (2.33)

\langle f_i \rangle = \frac{\int_{-\infty}^{\infty} f_i(t)\, |z(t)|^2\, dt}{\int_{-\infty}^{\infty} |z(t)|^2\, dt}    (2.34)
Here, (2.33) is averaging over frequency and (2.34) is averaging over time. This led to the Wigner-Ville distribution (WVD) [204], i.e., a distribution of the signal in time and frequency, expressed as

W(t,f) = \int_{-\infty}^{\infty} z(t+\tau/2)\, z^{*}(t-\tau/2)\, e^{-j2\pi f \tau}\, d\tau    (2.35)

Hence, W(t,f) is the FT of the product z(t+\tau/2)\, z^{*}(t-\tau/2) w.r.t. \tau. The IF is obtained as the first moment of the WVD w.r.t. frequency, and is given by

f_i(t) = \frac{\int_{-\infty}^{\infty} f\, W(t,f)\, df}{\int_{-\infty}^{\infty} W(t,f)\, df}    (2.36)
(v) Interpretation of instantaneous frequency [18]:
For an analytic signal z(t) = a(t)\, e^{j\phi(t)}, as defined in (2.28), its spectrum Z(f) is given by

Z(f) = \int_{-\infty}^{\infty} z(t)\, e^{-j2\pi f t}\, dt = \int_{-\infty}^{\infty} a(t)\, e^{j(\phi(t) - 2\pi f t)}\, dt    (2.37)

The largest value of this integral is at the frequency f_s for which the phase is stationary. From the stationary phase principle, f_s is such that, at this value,

\frac{d}{dt}\left( \phi(t) - 2\pi f_s t \right) = 0    (2.38)

f_s = \frac{1}{2\pi} \frac{d\phi(t)}{dt}    (2.39)

where f_s(t) (= f_i(t)) is the IF of the signal at time t, i.e., it is a measure of the frequency-domain signal
energy concentration as a function of time.
It is important to note that this interpretation of IF for an analytic signal is not a unique function of time. Several variations of the interpretation (as in (2.39)) were proposed by different researchers [18]. These variations are related to (i) variations in amplitude (as m(t)\, e^{j2\pi f t}) due to amplitude modulation, (ii) a bi-component real signal (s(t) = s_1(t) + s_2(t)) that can be expressed as z(t) = a_1 e^{j(\omega_0 - \Delta\omega/2)t} + a_2 e^{j(\omega_0 + \Delta\omega/2)t}, or (iii) a time-varying amplitude of a nonstationary signal that can be represented as y(t) = A_1(t)\, e^{-t^2/\alpha^2} \cos(2\pi f_0 t + \phi_0), whose FT would have two Gaussian functions centred at f_0 and -f_0 (i.e., basic orthogonal functions). Also, as per Ville’s approach [204], “the frequency is always defined as the first derivative of the phase”, regardless of stationarity [18], as expressed in (2.31) for monocomponent signals.
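As a concrete illustration of (2.31) (a generic recipe, not the method proposed later in this thesis), the IF of a sampled signal can be estimated from the unwrapped phase of its analytic signal:

import numpy as np
from scipy.signal import hilbert      # analytic signal, same construction as sketched above

fs = 8000
t = np.arange(fs) / fs
x = np.cos(2 * np.pi * (100 * t + 100 * t ** 2))   # chirp whose IF is 100 + 200 t Hz

z = hilbert(x)                           # analytic signal z(t) = a(t) exp(j phi(t))
phi = np.unwrap(np.angle(z))             # unwrapped instantaneous phase phi(t)
fi = np.diff(phi) * fs / (2 * np.pi)     # IF: (1/2pi) d(phi)/dt, discrete approximation

t_mid = (t[:-1] + t[1:]) / 2             # diff estimates the IF midway between samples
print(np.allclose(fi[100:-100], 100 + 200 * t_mid[100:-100], rtol=0.05))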
(vi) Generalized stationary models for nonstationary signals and IF [18]:
There are many ways of generalizing the stationary models to nonstationary signals. A model for the FM class of signals s(t) is given by

s(t) = \sum_{k=1}^{N} s_k(t) + n(t)    (2.40)

where n(t) is noise (or an undesirable component) and s_k(t) are N single-component nonstationary signals. These are described by the envelopes a_k(t) and the instantaneous frequencies (IF) f_{ik}(t) at instant t, such that the analytic signal z_k(t) (associated with signal s_k(t)) is given by

z_k(t) = a_k(t)\, e^{j\phi_k(t)}    (2.41)

where

\phi_k(t) = 2\pi \int_{-\infty}^{t} f_{ik}(\tau)\, d\tau    (2.42)

Here, if N = 1 the signal is called a monocomponent signal, else (for N \ge 2) a multicomponent signal.
2.4 Review of studies on source-system interaction and few special sounds
The significance of the changing vocal tract system and the associated changes in the glottal excitation source characteristics in the production of some special sounds in normal speech, such as trills, are examined in this study from a perceptual point of view. The effect of acoustic loading of the vocal tract system
on the glottal vibration is also studied. The speech signal along with the electroglottograph (EGG)
signal [53, 50] are analyzed for a selected few consonant sounds.
2.4.1 Studies on special sounds such as trills
Speech sounds are produced by the excitation of the time-varying vocal tract system. Excitation of
acoustic resonators is possible through different sources including the glottal vibration, i.e., laryngeal
source [183]. Excitation source can also be extraglottal, such as strictural source that is involved in
the production of sounds such as the voiceless fricative [s]. The major source of acoustic energy is the
quasi-periodic vibration of the vocal folds at the glottis [48]. This is referred to as voicing. All the
speech sounds that can be generated by the human speech production mechanism are either voiced or
voiceless. Languages make use of different types of voicing, which are called phonation types [102, 99].
Out of the possible types of phonation, modal voice is considered to be the primary source for voicing
in a majority of languages [99]. The production of modal voice involves excitation of the vocal tract
system by the vibration of vocal folds at the glottis. Changes in the mode of glottal vibration and in the
shape of the vocal tract, both contribute to the production of different sounds [49]. In the production of
sounds involving specific manners of articulation, such as trills, changes in the vocal tract system affect
the glottal vibration significantly [40].
Trilling is a phenomenon where the shape of the vocal tract changes rapidly, with a trilling rate of about 30 cycles per second. Analysis of trill sounds has largely been limited to the study of production and to acoustic characterization in terms of spectral features. For example, the production mechanism of tongue-tip trills was described and modeled in [100, 98, 116] from an aerodynamic point of view. Trill cycle and trilling rate were studied in [100, 116]. Acoustic correlates of phonemic trill
production are reported in [66]. A recent analysis of trill sounds [40] indicated that changes in the
vocal tract system affect the vibration of the vocal folds in the production of tongue tip trilling. In
that study, acoustics of trill sounds was analysed using recently proposed signal processing methods
like zero-frequency filtering (ZFF) [130, 216] and Hilbert envelope of differenced numerator of group
delay (HNGD) method [40, 123]. These studies indicated the possibility of source-system coupling in
the production of apical trills, due to interaction between aerodynamic (airflow and air-pressure) and
articulatory (upper/lower articulators in mouth) components.
The significance of the changing vocal tract system and the associated glottal excitation source characteristics due to trilling are studied in this thesis from a perception point of view. These studies are made by generating speech signals that retain either the features of the vocal tract system or those of the glottal excitation source of trill sounds. Experiments are conducted to understand the perceptual significance
of the excitation source characteristics on production of different trill sounds. Speech sounds of sustained trill and approximant pair, and apical trills produced by four different places of articulation are
considered. Details of this study are discussed in Chapter 4.
2.4.2 Studies on source-system interaction and acoustic loading
In the production of sounds involving specific manner of articulation, such as trills, the changes in
the vocal tract system also affect the glottal vibration. This is due to acoustic loading of the vocal tract
system, causing pressure difference across the glottis, i.e., difference in air pressure in supraglottal and
subglottal regions. Thus the source of excitation is affected due to the source-tract interaction.
The interaction of glottal source and vocal tract system, also referred to as source-system coupling,
has been a subject of study among researchers over the past several years [49, 32, 189, 192, 108].
Mathematical modeling and physical modeling of this interaction were attempted. The source-system
interaction has been explored in a significant way for vowel sounds [42, 43, 171, 136]. It was observed
that involuntary changes in the vocal fold vibrations due to influence of vocal tract resonances occur
during changes in the ‘intrinsic pitch’ (fundamental frequency F0 ) of some high vowels [43, 171, 136].
However, source-system coupling has not been explored in a significant way from the angle of the production characteristics of speech sounds other than vowels. In this study, we aim to examine the effect
of acoustic loading of the vocal tract system on the glottal vibration in the production characteristics of
some consonant sounds, using signal processing methods.
Mode of glottal vibration can be controlled voluntarily for producing different phonation types such
as modal, breathy and creaky voice. Similarly, the rate of glottal vibration (F0 ) can also be controlled,
giving rise to changes in F0 and pitch. On the other hand, glottal vibration could also be affected by
loading of the vocal tract system while producing some types of sounds. This happens due to coupling
of the vocal tract system with the glottis. This change in glottal vibration may be viewed as involuntary
change. Such involuntary changes occur during changes in ‘intrinsic pitch’ (F0 ) of some high vowels [43], i.e., when the resonances of the vocal tract influence the nature of vocal cord vibration and the
way F0 may be varied [136]. The effect could be due to “acoustic coupling between the vocal tract and
the vocal cords similar to that which happens between the resonances of the bugle and the bugler’s lips”,
that occurs when the first formant (F1 ) is near the F0 [136]. Or, it could be due to tongue-pull [43], i.e.,
when “tongue, in producing high vowels, pulls on the larynx and thus increases the tension of the vocal
cords and thus the pitch of voice” [136].
The effects of acoustic coupling between the oral and subglottal cavities were examined for vowel
formants [144, 176, 32]. Discontinuity in the second formant frequency and attenuation in diphthongs
were observed near the second subglottal resonance (SubF2) in the range of 1280-1620 Hz, due to
subglottal coupling [32]. Subglottal coupling effects were found to be generally stronger during the
open phase of glottal cycle than in the closed phase [32]. Studies on the source-system interaction were
also carried out for other categories of speech sounds, such as fricatives and stops [180]. Subglottal
resonances were measured in the case of nasalization [184].
A study of acoustic interaction between the glottal source and the vocal tract, suggests the presence
of nonlinear source-filter coupling [158]. Computer simulation experiments were carried out, along
with analytical studies, to study the effects of source-filter coupling [191, 30]. The acoustic interaction
between the sound source in the larynx and the vocal tract airways, can take place either in linear or
in nonlinear way [189]. In linear source-filter coupling the source frequencies are reproduced without
being affected by the pressure in the vocal tract airways [189]. In nonlinear source-filter coupling the
pressure in the vocal tract contributes to the production of different frequencies at the source [189].
The theory of source-tract interaction suggests that the degree of coupling is controlled by the cross-sectional area of the laryngeal vestibule (epilarynx), which raises the inertive reactance of the supraglottal vocal tract [192]. Co-occurrence of acoustically compliant (negative) subglottal reactance and inertive
(positive) supraglottal reactance was found to be most favourable for the vocal fold vibration in modal
register. Both subglottal and supraglottal reactances increase the driving pressures of the vocal folds and
the glottal flow of air, which increases the energy level at the source [192]. The theory also mentions
that the source of instabilities in vibration modes is due to harmonics passing through formants during
pitch or vowel change. It was hypothesized that vocal fold vibration is maximally destabilized when
major changes take place in the acoustic load, and that occurs when F0 crosses over F1 [189].
Other studies on the source-system interaction have focused mainly on the physical aspects, such
as nonlinear phenomena and the source-system coupling [181, 193, 64, 221]. The physics of laryngeal
behaviour and larynx modes, and physical models of the acoustic interaction of voice source with subglottal vocal tract system were studied in [190, 193, 64]. The effect of glottal opening on the acoustic
response of the vocal tract system was studied in [49, 12, 13, 163]. It was observed that the first and
second formant frequencies increase with increasing glottal width [13]. The effects of source-system
coupling were studied in [158, 30, 189]. The nonlinear phenomenon due to this coupling, that is related
with air flow across glottis during phonation, was also studied in [221, 192, 108]. The source-system
interactions were observed to induce, under certain circumstances, some complex voice instabilities
such as sudden frequency jumps, subharmonic generation and random changes in frequency, especially
during F0 and F1 crossovers [64, 189, 192].
Studies on source-system coupling were also carried out for other categories of speech sounds, such
as fricative and stop consonants [180]. Subglottal resonances were measured in the case of nasalization [184]. The effect of acoustic coupling between the oral and subglottal cavities was examined for
vowel formants [32]. In that study, the discontinuity in second formant frequency and the attenuation
in diphthongs were observed near the second subglottal resonance (SubF2) due to subglottal coupling.
The range of SubF2 is 1280-1620 Hz, depending on the speaker. Subglottal coupling effects were found
to be generally stronger for open phase than for closed phase of glottal opening [32]. In another study,
a dynamic mechanical model of the vocal folds and the tract was used to study the fluid flow in it [12].
An updated version of this mechanical model with more realistically shaped laryngeal section was used
to study the effect of glottal opening on the acoustic response of the vocal tract system [13]. It was
observed that the first and second formant frequencies increased with increasing glottal width. The influence of acoustic waveguides on the self-sustained oscillations was studied by means of mechanical
replicas [163]. The replicas were used to simulate oscillations, and gather data of parameters such as
subglottal pressure, glottal aperture and oscillation frequency [163].
The role of glottal open quotient in relation with laryngeal mechanism, vocal effort and fundamental
frequency was studied for the case of singing [65]. In that study, the need for controlling the laryngeal
configuration and lung pressure was also highlighted. In another study, a glottal source estimation
method was developed using joint source-filter optimization technique, which was claimed to be robust
to shimmer and jitter in the glottal flow [57]. The technique uses estimation of parameters for the
Liljencrants-Fant (LF) model of glottal source and amplitudes of the glottal flow in each period, and
the coefficients of the vocal-tract filter [57]. In another study, the contribution to voiced speech by
the secondary sources within the glottis (i.e., glottis-interior sources) was investigated [69]. The study
analyzed the effects on the acoustic waveforms by the second order ‘sources’ such as, volume velocity
of air at the glottis, pressure of air from lungs, unsteady reactions due to radiating sound and vorticity
in the air flow from the glottis.
In a recent study [108], the effect of source-tract acoustical coupling on the onset of oscillations
of the vocal folds was studied using a mechanical replica of the vocal folds and a mathematical model.
The model is based on lumped description of tissue mechanisms, quasi-steady flow and one-dimensional
acoustics. This study proposed that changes in the vocal tract length and cross section induce fluctuations in the threshold values of both subglottal pressure and oscillation frequency. The level of acoustical
coupling was observed to be inversely proportional to the cross-sectional area of vocal tract [108]. It
was also shown that the transition from a low to high frequency oscillation may occur in two ways,
either by frequency jump or with a smooth variation of frequency [108].
A recent analysis of trill sounds [40] indicated that changes in the vocal tract system due to tongue
tip trilling affect the vibration of the vocal folds. In that study, acoustics of trill sounds was analysed
using signal processing methods like zero-frequency filtering (ZFF) [130, 216] and Hilbert envelope of
differenced numerator of group delay (HNGD) method [40, 123]. In another recent study [122], the
significance of changing vocal tract system and the associated changes in the glottal excitation source
characteristics due to tongue tip trilling were studied from perception (hearing) point of view. Both these
studies indicate the possibility of source-system interaction in the production of apical trills. However,
the effect of acoustic loading of the system on the glottal source due to source-system interaction has
not been explicitly studied in terms of the changes in the characteristics of the speech signal.
In the current study, we examine the effect of acoustic loading of the vocal tract system on the
glottal vibration for a selected set of six categories of voiced consonant sounds. These categories are
distinguished based upon the stricture size, and the manner and place of articulation. The voiced sounds
considered are: apical trill, alveolar fricative, velar fricative, apical lateral approximant, alveolar nasal
and velar nasal. The sounds differ in the size, type and location of the stricture in the vocal tract. These
consonant sounds are considered in the context of vowel [a]. Three types of occurrences, namely, single,
geminated and prolonged are examined for each of the six categories of sounds. The speech signal along
with the electroglottograph (EGG) signal [53, 50] are used for analysis of these sounds. Details of the
study are discussed in Chapter 4.
2.5 Review of studies on analysis of emotional speech and shouted speech
2.5.1 Studies on emotional speech
Emotion is an outburst or expression of a state of mind, reflected in behaviour of an individual that differs from his/her normal behaviour. It is caused by the individual's reaction to external events or to conflicts and feelings that develop internally in one's own mind. The emotional state also depends on the mental and physical health and on the social living conditions of
the individual. Thus emotional state is highly specific to an individual. Different types of emotions
arise in different situations, especially in communication with others. The characteristics of emotion are
manifested in many ways, including in the language usage, besides facial expressions, visual gestures,
speech and nonverbal audio gestures. Note that emotions are natural reflections of an individual, and
are easily perceived by other human beings. As long as the generation and perception of
emotions are confined to human communication, there is no need to understand the characteristics of
emotion in detail. But to exploit the developments of information technology for various applications,
it is necessary to incorporate a machine (computer) in the human communication chain. In such a case,
there is a need to understand the characteristics of emotions, in order to develop automatic methods
to detect and categorize emotions reflected in the sensory data such as audio and video signals. It is
also equally important to generate signals from a machine with desired emotion characteristics, which
humans perceive as natural emotions.
Characterization and generation of emotions is a technical challenge, as it is very difficult to identify
and extract features from sensory data to characterize a particular emotion. Another important issue is
that emotions cannot be categorized into distinct non-overlapping classes. An emotional state is a point
in a continuous space of feature vectors, and usually the feature space may be a very high dimensional
one. Emotion is characterized by multimodal features involving video and audio data. It is more useful
in many applications to characterize emotions using audio data, as it is easily available or captured
in most communication situations. The expression of emotion through audio is spontaneous, and is also
perceived effortlessly by human listeners. The characteristics of emotion are reflected either in the audio
gestures consisting of short burst like ah, oh, or in nonverbal sounds like laughter, cry, grin, etc.
Several studies were made to extract features characterizing different emotions from speech signals [21, 119, 35] and from audio-visual signals [82, 205, 206]. These studies hypothesized distinct
categories of emotions, such as happiness, sadness, fear, anger, disgust and surprise, or they hypothesized distinct affective states such as interest, distress, frustration and pain [142, 172, 141, 22]. All
these studies focused on classifying emotions into one of the distinct categories using acoustic features [73, 143, 210, 71]. The acoustic features derived from the signal were mostly the standard features
used to represent speech for various applications such as Mel-frequency cepstral coefficients (MFCC)
or linear prediction cepstral coefficients (LPCCs). Most of these features represent the characteristics
of the vocal tract system. These were supplemented with additional features representing the waveform
characteristics and some source characteristics, such as zero crossing rate (ZCR), harmonic to noise
ratio (HNR), etc. [95, 68, 177, 203] and [119, 218, 196, 17]. Attempts were made to derive emotion
characteristics from audio signals collected in different application scenarios such as cell phone conversation, meeting room speech, interviews, etc. [126, 31, 148, 211] and [76, 117, 220, 33, 90]. What
was strikingly missing in these past studies on emotion is an understanding of emotions from the human speech
production point of view. Also missing is the realization of the continuum nature of emotional states
and the individual-specific nature of emotion characteristics.
It may therefore be summarized that emotion is an expression of a state of mind, which is normally reflected in a burst of activity. Emotion characteristics usually cannot be sustained for long periods, as sustaining them is not a normal human activity. Emotion detection is easy for human beings due to their multimodal pattern recognition capability. Emotion characteristics are reflected strongly in nonverbal paralinguistic sounds (discussed in a separate chapter), and represent points in a continuous feature space. In this research work, the study of emotion is audio-focused, involving analysis of the audio signal. The importance of source information in production and perception is highlighted.
2.5.2 Studies on shouted speech
Shout and scream are both deviations from normal human speech. Shout contains linguistic information and may have phonated speech content, whereas scream contains neither. Both are associated
with some degree of urgency. Shout or shouted speech can also be considered as an indicator of verbal aggression, or of a potentially hazardous situation in an otherwise normally peaceful environment.
Naturally, there is a growing demand for techniques that can help in the automatic detection of shout and scream. A range of applications, including health care, elderly care, crime detection, and studies of social behaviour and psychology, provides sufficient motivation for researchers to study the detection of such audio events.
Since the shouted speech signal is produced by the same speech production mechanism, it can be analysed in terms of the excitation source and the vocal tract system features as
in the case of any speech signal analysis. The excitation source characteristics are significant mostly
due to voicing in the shouted speech. Hence, these are investigated by studying the changes in the frequency of the vocal fold vibrations. Thus, source characteristics are studied in terms of changes in the
instantaneous fundamental frequency (F0 ) of the vocal fold vibration at the glottis. The acoustic signal
is generally analysed in terms of the characteristics of the short-time spectral envelope. The deviation
of the features of the spectral envelope for shouted speech in comparison with those for normal speech
are used to characterize the shout.
Shouted speech is perceived to have increased signal intensity, which is characterized by the instantaneous signal energy [70, 199, 160] and power [131, 132]. Change in F0 as a characteristic feature
of shout has been used in different studies such as [131, 132, 202]. Mel-frequency cepstral coefficients (MFCCs)
are normally used to represent the short-time spectral features of shouted speech signal [199, 131]. The
MFCCs with some variations and the wavelet coefficients are used in some shout and scream detection applications [146, 70]. In automatic speech recognition (ASR) MFCCs are used for studying the
performance of the system for shouted speech recognition [132]. In some cases, the finer variations
of spectral harmonics are superimposed on the spectral envelope features to include changes in F0 in
the spectral representation [146]. Properties of pitch, like its presence, salience and height, along with
signal to noise ratio (SNR) and some spectral distortion measures, are used for detection of verbal aggression [202]. Spectral features such as spectral centroid, spectral spread and spectral flatness are used
along with MFCCs for classifying non-speech sounds including scream [104]. Wavelet transformations
based on Gabor functions are investigated for detection of emergency screams [115]. MFCCs with linear prediction coefficients (LPCs) and perceptual linear prediction coefficients (PLPs) are examined for
shout event detection in public transport vehicle scenario [160]. In a recent study, the impact of vocal
effort variability on the performance of an isolated-word recognizer is studied for the five vocal modes
(i.e., the loudness levels) [217].
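As a minimal, self-contained sketch of the two signal-level cues cited most often above (signal intensity and F0), the following Python code computes a short-time energy and a simple autocorrelation-based F0 estimate per frame. The frame size, hop size and F0 search range are illustrative assumptions, and this is not the specific analysis procedure used later in the thesis.

```python
import numpy as np

def frame_energy_and_f0(s, fs, frame_ms=30.0, hop_ms=10.0,
                        f0_min=80.0, f0_max=600.0):
    """Short-time energy and a crude autocorrelation-based F0 per frame.
    Frame/hop sizes and the F0 search range are illustrative choices."""
    frame = int(frame_ms * 1e-3 * fs)
    hop = int(hop_ms * 1e-3 * fs)
    energies, f0s = [], []
    for start in range(0, len(s) - frame, hop):
        x = s[start:start + frame] * np.hamming(frame)
        energies.append(float(np.sum(x ** 2)))
        # Autocorrelation for non-negative lags; search within the F0 range.
        r = np.correlate(x, x, mode='full')[frame - 1:]
        lo, hi = int(fs / f0_max), int(fs / f0_min)
        lag = lo + int(np.argmax(r[lo:hi]))
        # Crude voicing check: keep the estimate only for a clear peak.
        f0s.append(fs / lag if r[lag] > 0.3 * r[0] else 0.0)
    return np.array(energies), np.array(f0s)
```

Comparing the distributions of these per-frame values for shouted versus normal utterances of the same speaker would illustrate the kind of deviation that the studies above exploit.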
Analyses of shout signals have focused mostly on representing the information in the audio signal
using spectral features. There have been few attempts to study the characteristics of the excitation component in shouted speech. It is useful to study the excitation source component, especially the nature
of vibration of the vocal folds at the glottis, during the production of shout. It is also useful to study
the changes in the spectral features caused by the excitation in the case of shouted speech in comparison with normal speech. Ideally, it is preferable to derive the changes in the excitation component of
the speech production from the speech signal. One way to do this is to derive the glottal pulse shape
using inverse filtering of speech [58, 2]. This cannot be done accurately in practice due to difficulty in
modeling the response of the vocal tract system for deriving the inverse filter. The difficulty is compounded when the speech signal is degraded as in distant speech. It may be noted that human perceptual
mechanism can easily discriminate shouted speech from normal speech, even when the speech signal is
somewhat degraded.
In this study, changes in the vocal fold vibration in the shouted speech in relation to the normal
speech are examined. Comparison of normal speech is also made with soft and loud speech. It should
be noted that although four distinct loudness levels are considered in this study, they form a continuum,
and hence it is difficult to mark clear boundaries among them. Electroglottograph (EGG) signals are
collected along with the close speaking speech signals for the four different loudness levels, namely, soft,
normal, loud and shout. Differences between shouted speech and normal speech can be seen clearly in
the EGG signals. Since collecting the EGG signal along with the speech signal is not always possible in
practice, there is need to explore features of shout, which can be derived directly from the speech signal.
For this purpose, signal processing methods [40] that can represent the fine temporal variations of the
spectral features are explored in this thesis. It is shown that these temporal variations do indeed capture
the features of glottal excitation that can discriminate shout vs normal speech. The effect of coupling
between the excitation source and the vocal tract system during the production of shout is examined in
different vowel contexts. Details of this study are discussed in Chapter 5.
2.6 Review of studies on analysis of paralinguistic sounds and laughter
Humans use nonverbal (nonlinguistic) communication to convey representational messages like
emotions or intentions [139]. Some nonlinguistic signals are associated with specific singular emotions, intentions or external referents [139]. Laughter is a paralinguistic sound, like sigh and scream,
used to communicate specific emotions [36, 135]. Detection of paralinguistic events can help classification of the speaker's emotional state [194]. The speaker's emotional state enhances the naturalness of
human-machine interaction [194].
2.6.1 Need for studying paralinguistic sounds like laughter
Laughter, a paralinguistic event, is one of the most variable acoustic expressions of a human being [135, 159]. Laughter is a special nonlinguistic vocalization because it induces a positive affective
state in the listeners, thereby affecting their behaviour [139, 118]. Detection of paralinguistic events
like laughter can help in classification of emotional states of a speaker [194]. Hence, researchers have
been attracted in the last few years towards finding the distinguishing features of laughter and developing
systems for detecting laughter in speech [83, 198, 84, 107, 23, 24, 86].
Laughter plays an ubiquitous role in human vocal communication [11] and occurs in diverse social
situations [62, 151, 85]. Laughter has a variety of functions, such as indicating affection (which could be species-specific), aggressive behaviour (laughing in someone's face), bonding behaviour (in early infancy), play behaviour (interactive playing) or appeasement behaviour (in situations of dominance/subordination) [159]. Laughter can also help improve the expressive quality of synthesized speech [63,
209, 186].
Laughter is a vocal-expressive communicative signal [161] with variable acoustic features [159, 85].
It is a special human vocalization, mainly because it induces a positive affective state on listeners [118].
Nonlinguistic vocalizations like laughter influence the affective states of listeners, thereby affecting
their behaviour also [139]. Finding distinct features of laughter for automatic detection in speech has
been attracting researchers’ attention [107, 23, 186, 24, 195, 86, 84]. Applications like an ‘audio visual
laughing machine’ have also been attempted [198]. Diverse functions and applications of laughter
continuously motivate researchers to strive for better understanding of the characteristics of laughter,
from different perspectives. Hence, there is a need to study the production characteristics of paralinguistic
sounds like laughter in detail.
2.6.2 Different types and classifications of laughter
Laughter characteristics are analyzed at episode, bout, call and segment levels. An episode consists of
two or more laughter bouts, separated by inspirations [161]. A laughter bout is an acoustic event [161],
produced during one exhalation (or inhalation sometimes) [11]. Each laugh bout may consist of one
or more calls. The period of laughter vocalization contains several laugh-cycles or laugh-pulses [125],
called calls, interspersed with pauses [161]. Calls are also referred to as notes or laugh-syllables [11].
Segments reflect a change in the production mode in the spectrogram components that occurs within
a call [11]. A laughter bout may be subdivided into three parts: (i) onset, in which laughter is very short
and steep, (ii) apex, the period where vocalization or forced exhalation occurs, and (iii) offset, the post
vocalization part, where a long-lasting smile smoothly fades out [161]. Laughter with one or two calls
is called exclamation laughter or chuckle [134]. The upper limit on the number of calls in a laugh bout
(3-8) is limited by the lung volume [152, 161]. A typical laugh bout consists of up to four calls [152, 159].
Laughter at the bout and call levels is analysed in this study.
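The episode-bout-call hierarchy described above can be summarized as a simple nested data structure. The sketch below is only an illustrative container for annotations (class and field names are hypothetical), not a format used in the cited studies.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Call:
    """One laugh-cycle ('note' or laugh-syllable); times in seconds."""
    start: float
    end: float
    voiced: bool = True

@dataclass
class Bout:
    """Laughter produced in one exhalation; typically up to four calls."""
    calls: List[Call] = field(default_factory=list)

@dataclass
class Episode:
    """Two or more bouts separated by inspirations."""
    bouts: List[Bout] = field(default_factory=list)
```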
Laughter sounds have been categorized in different ways in several studies. Laughter
was categorized in [161] into three classes: (i) spontaneous laughter, in which there is an urge to laugh
without restraining its expression, (ii) voluntary laughter, a kind of faked laughter to produce a sound
pattern like that of natural laughter, and (iii) speaking or singing laughter, in which phonation is not based on forced breathing but on a well-dosed air supply that results in less resonance in the trachea, breathiness
and aspiration. Three types of laugh bouts were discussed in [11]: (i) song-like laugh, which involves
voiced sounds with pitch (F0 ) modulation, sounding like giggle or chuckle, (ii) snort-like laugh, the
unvoiced call with salient turbulence in the nasal cavity, and (iii) unvoiced grunt-like laugh, which includes
breathy pants and harsher cackles. The three classes of vowel quality (‘ha’, ‘he’ and ‘ho’ sounds) seem
to have marked variation among laugh bouts [152, 150].
Laughter was also categorized as voiced laughter and unvoiced laughter [139]. Voiced laughter occurs when the energy source is the regular vocal fold vibration as in voiced speech, and includes melodic,
song-like bouts, chuckles and giggles [139]. Unvoiced laughter bouts lack the tonality associated with
stable or quasi-periodic vocal fold vibration, and include open-mouth, breathy pant-like sounds, closed-mouth grunts and nasal-snorts [139]. The continuum from speech to laugh was divided into three categories: speech, speech-laugh and laugh [135, 118]. In speech-laugh, the duration of vocalization was
observed to increase more likely with changes in one or more features of vowel elongation, syllabic
pulsation, breathiness and pitch [135]. It was shown that voiced laughter induces significantly more
positive emotional responses in listeners in comparison with unvoiced laugh [10]. Laughter in dialogic
interaction was also categorized as speech-smile, speech-laugh and laughter [87], in which speech-smile
included lip-spreading and palatalization superimposed on speech events. Four phonetic types of laughter, namely, voiced, chuckle, breathy (ingression) and nasal-grunt were studied [24, 187]. In this study,
we examine voiced speech-laughter continuum in three categories: (i) normal speech (NS), (ii) laughed
speech (LS) and (iii) nonspeech laugh (NSL). Laughed speech is assumed to have linguistic vocalization
interspersed with nonlinguistic laugh content [135, 118]. Only voiced laugh [139], produced spontaneously, and speaking laughter [161] are considered.
2.6.3 Studies on acoustic analysis of laughter and research issues
Acoustic analysis of laughter was carried out [16], in which features such as fundamental frequency,
root mean square amplitude, time duration and formant structure were used to distinguish laughter from
speech. It was observed that laughs have significantly longer unvoiced regions than voiced regions. The
mean fundamental frequency (F0 ) of laughter sounds was reported as 472 Hz for (Italian/German) females, with an F0 range of 246-1007 Hz [159]. The average F0 of normal speech was reported as 214 Hz
and 124 Hz, for females and males, respectively. Acoustic features such as F0 , number of calls per
bout (3-8), spectrograms and formant clusters (F2 vs F1) were used [11] to analyse temporal features
of laughter, their production modes and source-filter related effects. That study proposed a sub-classification of F0 contours in each laugh call as flat, rising, falling, arched and sinusoidal. Two acoustic
features of laughter series, the specific rhythms (duration) and changes in fundamental frequency were
investigated [85] for their role in evaluation of laughter. The acoustic features of laugh-speech continuum such as formant space, pitch range and voice-quality were studied [118]. Combinations of features
such as pitch and energy features, global pitch and voicing features, perceptual linear prediction (PLP)
features and modulation spectrum features were used [194] to model laughter and speech.
Acoustic analysis of laughter produced by congenitally deaf and normal hearing college students
was carried out [111]. That study focused on features such as degree of voicing, mouth position, air
flow direction, relative amplitude, temporal features, fundamental frequency and formant frequencies.
Differences in the degree of variation in the fundamental frequency, intensity and durational patterning
(consisting of onset, main part, pause and offset) were studied [101] to assess the degree of naturalness of synthesized laughter speech. The acoustic features (mostly perceptual and spectral) have been
studied for different purposes and diverse applications such as: distinguishing the laughter types [24],
speech/laughter classification in audio of meetings [84], development of ‘hot-spotter’ system [23], detection of laughter events in meetings [83], automatic laughter detection [107, 194, 86], automatic synthesis of human-like laughter [186] and ‘AV Laughter Cycle’ project [198].
Production characteristics of the speech signal with laughter can be expected to differ from those of normal
speech. Production of laughter was studied from the respiratory dynamics point of view [51]. In that study,
laugh calls were characterized by sudden occurrence of repetitive expiratory effort, drop in functional
residual capacity of lung volume in all respiratory compartments and dynamic compression of airways.
“Laughter generally takes place when the lung volume is low” (p442) [109]. Production characteristics
of laughter can be analyzed from the speech signal in terms of the excitation source and the vocal
tract system characteristics, like for normal speech. In the production of laughter, significant changes
appear to take place in the characteristics of the excitation source. The acoustic analyses of laughter
have mostly been carried out using spectral and perceptual features [16, 11, 195, 111]. The voice
source characteristics were investigated [118] using features such as glottal open quotient along with
spectral tilt, which were derived (approximately) from the differences in the amplitudes of harmonics
in the Discrete Fourier Transform (DFT) spectrum. Source features such as instantaneous pitch period,
strength of excitation at epochs, and their slopes and ratio were used for analysis of laughter in [185].
In this thesis, we examine changes in the vibration characteristics of the glottal excitation source and
associated changes in the vocal tract system characteristics, during production of laughter, from EGG
and acoustic signals. Details of the study are discussed in Chapter 6.
2.7 Review of studies on analysis of expressive voices and Noh voice
2.7.1 Need for studying expressive voices
Expressive voices are special sounds produced by strong interactions between the vocal tract resonances and the vocal fold vibrations. One of the main features of these sounds is the aperiodicity in the
glottal vibration. Decomposition of speech signals into excitation component and vocal tract system
component was proposed in [214] to derive the aperiodic component of the glottal vibration. In this
study, the characteristics of the aperiodic component of the excitation are examined in detail in the context of artistic voices such as in singing and in Noh (a traditional performing art in Japan [55]), to show
the importance of timing information in the impulse-like sequence in the production of these voices.
Normal voiced sounds are produced due to quasi-periodic vibration [183] of the vocal folds, which
can be approximated to a sequence of impulse-like excitation. The impulse-like excitation in each cycle
is due to relatively sharp closure of the glottis, which takes place after the pressure from lungs is released
by the opening of the closed glottis. In expressive voices the glottal vibrations may be approximated to
a sequence of impulse-like excitation, where the impulses need not be spaced at near equal intervals as
in the case of modal voicing. Moreover, the strengths of successive impulses also need not be the same. In
addition, the coupling of the source and system produces different responses at successive impulses in
the sequence.
Thus the aperiodicity of the signal produced in expressive voices may be attributed to (a) unequal intervals between successive impulses in the excitation, (b) unequal strengths of excitation around the successive impulses, and (c) differences in the responses of the vocal tract system for successive impulse-like excitations. Besides the differences in the excitation, the differences in the characteristics of the
vocal tract system due to coupling of the source and system may also contribute to the quality of expressive voices. But the aperiodicity in the excitation may be considered as the important feature of the
expressive voices, and hence the need to study the characteristics of the excitation source component of
expressive voice signals.
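Purely as an illustration of factors (a)-(c) listed above, the toy sketch below synthesises an "expressive" excitation with jittered impulse intervals, varying impulse strengths, and a slightly different resonator response for each impulse. All function names and parameter values are hypothetical choices for illustration; the example makes no claim about real singing or Noh voices.

```python
import numpy as np
from scipy.signal import lfilter

def toy_aperiodic_voice(fs=16000, dur=0.5, f0=150.0,
                        jitter=0.05, shimmer=0.3,
                        formant=800.0, bandwidth=100.0):
    """Toy signal showing (a) jittered impulse intervals, (b) varying impulse
    strengths, and (c) a slightly different resonance for each impulse."""
    rng = np.random.default_rng(0)
    n = int(dur * fs)
    out = np.zeros(n)
    t = 0.0
    while True:
        idx = int(t * fs)
        if idx >= n:
            break
        amp = 1.0 + shimmer * rng.standard_normal()             # (b)
        fmt = formant * (1.0 + 0.05 * rng.standard_normal())    # (c)
        r = np.exp(-np.pi * bandwidth / fs)
        theta = 2.0 * np.pi * fmt / fs
        imp = np.zeros(n - idx)
        imp[0] = amp
        # Response of a single-formant resonator to this impulse.
        out[idx:] += lfilter([1.0], [1.0, -2.0 * r * np.cos(theta), r * r], imp)
        t += (1.0 / f0) * (1.0 + jitter * rng.standard_normal())  # (a)
    return out
```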
2.7.2 Studies on representation of source characteristics and pitch-perception
Representation of the excitation source component through a multi-pulse excitation sequence was proposed [8, 6, 174] for the purpose of speech synthesis. Linear prediction coding (LPC) based methods
were used to determine the locations and magnitudes of the pulses in successive stages, by considering
one pulse at a time [8] or by jointly optimizing the magnitudes of all pulses located up to that stage [173].
The role of multi-pulse excitation in the synthesis of voiced speech was also examined [26]. In that
study, it was observed that “even for periodic voiced speech, the secondary pulses in the multi-pulse excitation do not vary systematically from one pitch period to another” [26]. Multi-pulse coding of speech
through regular-pulse excitation was also proposed [94], using linear prediction analysis. In that study,
an attempt was made to reduce the perceptual error between the original and reconstructed signal.
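To make the "one pulse at a time" idea concrete, here is a simplified greedy sketch of multi-pulse pulse placement against the impulse response h of the LP synthesis filter. It omits the perceptual weighting filter used in the published procedures and is only an illustration of the analysis-by-synthesis idea, not the exact algorithm of [8] or [94].

```python
import numpy as np

def multipulse_excitation(target, h, n_pulses=8):
    """Greedy one-pulse-at-a-time multi-pulse search: at each stage, choose
    the pulse position and amplitude that best reduce the error between the
    target frame and the signal synthesised so far."""
    L = len(target)
    h = np.asarray(h, dtype=float)[:L]
    err = np.asarray(target, dtype=float).copy()
    # Energy of the (truncated) filter response starting at each position.
    e = np.array([np.dot(h[:L - n], h[:L - n]) for n in range(L)]) + 1e-12
    locs, amps = [], []
    for _ in range(n_pulses):
        # Correlation of the current error with each shifted filter response.
        c = np.array([np.dot(err[n:], h[:L - n]) for n in range(L)])
        n_star = int(np.argmax(c ** 2 / e))      # position with best error reduction
        g = c[n_star] / e[n_star]                # optimal amplitude for this pulse
        locs.append(n_star)
        amps.append(g)
        # Subtract the contribution of this pulse from the error signal.
        err[n_star:] -= g * h[:L - n_star]
    return np.array(locs), np.array(amps)
```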
In another study, the effect of pitch perception in the case of expressive voices like Noh singing voice
was examined [55, 81]. In that study, a measure of pitch perception information termed as saliency
was proposed [55]. Subsequently, an approach for F0 extraction and aperiodicity estimation using the
saliency information was also proposed [80]. Recently, a method for extracting the epoch intervals
and representing the excitation source information through a sequence of impulses having amplitudes
corresponding to the strength of excitation was proposed using zero-frequency filtering method [130,
216], for modal voicing. However, to the best of our knowledge, representing the excitation source
information through a sequence of impulses which can be related to the perception of pitch in expressive
voices has not yet been explored.
The research efforts towards characterising the excitation source information and aperiodicity in expressive voices can be categorised according to four questions.
(i) How to characterise and represent the excitation source component in terms of a sequence of impulse-like pulses? (ii) How to characterise changes in the perception of pitch, which could be rapid in the case
of expressive voices? (iii) How to measure the instantaneous fundamental frequency (F0 ) that is also
guided by the perception of pitch, especially in the regions of aperiodicity? (iv) Can we obtain the
sequence of excitation impulses that is related to the perception of pitch? Answers to the first question
were attempted through multi-pulse and regular-pulse excitation [8, 6, 94]. A measure of pitch perception, i.e., saliency, and F0 extraction based upon it were proposed [55, 81, 80] as answers to the second and third questions, respectively. But the answer to the last question, along with the inter-linking relations amongst the answers to the first three questions, has not been explored yet. In this study, we aim to examine afresh
each of these four questions, using signal processing techniques, and possibly address the last and most
important question by exploring ‘representation of excitation source information through a sequence of
impulse-like pulses that is related to the perception of pitch in expressive voices’.
Normally the excitation source characteristics of speech are derived by removing or compensating
for the response of the vocal tract system [78, 128, 72]. Short-time spectrum analysis is performed, and
the envelope information corresponding to the vocal tract system characteristics is extracted [79, 78].
The finer fluctuations in the short-time spectrum are used to derive the characteristics of the excitation
source. The spectral envelope information can be obtained either by cepstrum smoothing or by the linear
prediction (LP) analysis [164, 114, 112]. More recently, the spectral envelope information is derived
effectively by using a TANDEM-STRAIGHT method [81]. Note that in all these cases an analysis
segment of two pitch periods or more is used. Moreover, the residual spectrum due to excitation is
derived by first determining the vocal tract system characteristics in the form of spectral envelope.
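As a small illustration of this conventional route, the sketch below derives an LP residual for one windowed frame by the autocorrelation method and inverse filtering. The LP order of 12 is an arbitrary choice, and this is not the specific procedure used later in the thesis.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lp_envelope_and_residual(frame, order=12):
    """Split one windowed speech frame into an LP model of the spectral
    envelope (the vocal tract system) and a residual (the excitation)."""
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation of the frame for lags 0..order.
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    # Normal equations of the autocorrelation method: R a = r.
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    # Inverse filter A(z) = 1 - sum_k a_k z^{-k}; its output is the residual,
    # which mainly carries the excitation source information.
    A = np.concatenate(([1.0], -a))
    residual = lfilter(A, [1.0], frame)
    return A, residual
```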
The excitation component is generated by the vocal fold vibration in the case of voiced speech. The
vocal tract system is then excited by this component, assuming that the interaction between the source
and the system is negligible. It appears reasonable to extract the characteristics of the excitation source
first, and then use the knowledge of the source to derive the vocal tract system characteristics. To study
the characteristics of the source, it is preferable to derive these characteristics without referring to the
system. The characteristics of the sequence of impulse-like excitation, i.e., the locations (epochs) of the
impulses and their relative strengths, can be extracted using a modification, proposed in this work, of the
zero-frequency filtering (ZFF) method [130, 216]. The sequence of impulses and their relative strengths
are useful to study the characteristics of the aperiodic source in expressive voices, provided they can be
extracted from the acoustic signal.
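As a rough illustration of how such an impulse-sequence representation can be obtained directly from the signal, the sketch below implements the basic ZFF idea. The 10 ms trend-removal window and the three trend-removal passes are illustrative assumptions, not the exact settings of [130, 216] or of the modification proposed in this thesis.

```python
import numpy as np

def zff_epochs(s, fs, trend_win_ms=10.0):
    """Sketch of zero-frequency filtering (ZFF) for epoch extraction.
    Returns epoch sample indices and the local slope of the zero-frequency
    filtered signal (a proxy for the strength of excitation)."""
    # Difference the signal to remove any slowly varying DC offset.
    x = np.diff(np.asarray(s, dtype=float), prepend=float(s[0]))

    # Cascade of two zero-frequency resonators, each equivalent to a double
    # integrator, realised here as four successive cumulative sums.
    y = x.copy()
    for _ in range(4):
        y = np.cumsum(y)

    # Remove the polynomial trend by repeatedly subtracting a local mean over
    # a window of the order of the average pitch period (assumed ~10 ms).
    N = int(round(trend_win_ms * 1e-3 * fs))
    win = np.ones(2 * N + 1) / (2 * N + 1)
    for _ in range(3):
        y = y - np.convolve(y, win, mode='same')

    # Epochs: negative-to-positive zero crossings of the filtered signal.
    sign = np.sign(y)
    epochs = np.where((sign[:-1] < 0) & (sign[1:] >= 0))[0]
    strengths = y[epochs + 1] - y[epochs]
    return epochs, strengths
```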
2.7.3 Studies on aperiodicity in expressive voices such as Noh singing and F0 extraction
In this study, a particular type of expressive voice, called Noh, is considered for studying the characteristics of the aperiodic source [55]. “In Noh, a highly theatrical performance art in Japan, a remarkably
emotional message is largely conveyed by special voice quality and rhythmic patterns that the site uses
in singing”. “The term site is the name used for the role that is taken by the principal player in a Noh performance” [55]. The characteristics of Noh voice quality are described in [55]. That paper also discusses
the importance of analysing the aperiodicity in voice signals to describe the Noh voice quality. The aperiodicity characteristics were examined using TANDEM-STRAIGHT method to derive the vocal tract
system characteristics, and the XSX (excitation structure extraction) method for deriving the fundamental frequency, period by period [55, 81, 80]. “The algorithm tries to identify approximate repetitions
in the signal as a function of time. By finding multiple candidate patterns for repetition, the method
proposes multiple hypotheses for the fundamental period. Using each candidate time interval as the
analysis window, the method tries to compare many different fundamental frequencies (F0 ) to evaluate
their consistency for exploring locally acceptable periodicity by finding the best candidate” [55]. “The
candidate value is associated with an estimate of ‘salience’, which may be interpreted as the likelihood
of the particular F0 value to represent the effective pitch of the voice signal in the given context” [55].
Saliency, indicating the most prominent candidate for instantaneous F0 at an instant of time, was
proposed [55, 80] as a measure of effective pitch perceived in Noh singing voice. Saliency computation
was proposed using the TANDEM-STRAIGHT method [55, 81, 79]. In simple terms, the saliency of an
F0 candidate is approximately the value of the peak at T0 = 1/F0 in the autocorrelation function derived
from the significant part of the spectrum of the excitation component. The significant part is assumed to
be in the low frequency range below about 800 Hz, and is derived by low pass (cut off frequency < 800
Hz) filtering of the spectrum of the excitation component [55, 79, 81, 80]. The saliency measure can thus be a
means to characterize the changes in perception of pitch in expressive voices.
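To make the notion concrete, the following sketch computes a crude saliency value for a single F0 candidate as described above: the normalised autocorrelation value at lag T0 = 1/F0 of the low-pass filtered excitation component. The 800 Hz cut-off and the fourth-order Butterworth filter are assumptions for illustration; this is not the TANDEM-STRAIGHT implementation of [55, 81].

```python
import numpy as np
from scipy.signal import butter, filtfilt

def saliency_of_f0(excitation, fs, f0_candidate, lp_cutoff=800.0):
    """Crude saliency of one F0 candidate: the normalised autocorrelation of
    the low-pass filtered excitation component at lag T0 = 1/F0."""
    # Keep only the 'significant' low-frequency part (cut-off assumed 800 Hz).
    b, a = butter(4, lp_cutoff / (fs / 2.0), btype='low')
    e = filtfilt(b, a, np.asarray(excitation, dtype=float))
    e = e - np.mean(e)

    # Autocorrelation for non-negative lags, normalised so that r[0] = 1.
    r = np.correlate(e, e, mode='full')[len(e) - 1:]
    r = r / (r[0] + 1e-12)

    lag = int(round(fs / f0_candidate))   # candidate pitch period in samples
    return r[lag] if lag < len(r) else 0.0
```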
Expressive voices like Noh singing carry information of voice quality characteristics, its production properties and pathological conditions of phonation [55]. The emotional content in expressive
voices is conveyed by two constituents: (a) prosody variation of the excitation source characteristics
and (b) changes in voice quality [55, 81]. Prosody changes are reflected in the F0 contour, phonological variables (tone, stress and rhythm) and tonal variables. Voice quality changes are reflected in
the fluctuations of signal parameters related with resonance characteristics (spectral envelope), rise in
pitch (F0 ) and stability of laryngeal configuration (amplitude/intensity) [55]. Hence, the emotional
content in expressive voices can be characterised by (i) coarticulation (sequencing of quasi-stationary
signals), (ii) source-system interaction and (iii) aperiodicity (voice fluctuations) [55, 81].
Interestingly, the human auditory system perceives these expressive sounds in terms of excitation source
and resonant filter characteristics [214, 55, 80]. The source characteristics determine the prosodic properties of phonetic signals that also carry the spectral timbre information [55]. Perception of voice
depends on sequential phonetic properties (distinctions among words) and suprasegmental properties
(tones, accents or time-varying signal properties). The source-filter theory helps distinguish between
periodic signals and the deviation from periodicity. The periodic speech signals are usually voiced
sounds, having quasi-stationary portions of vowel-like acoustic segments [183] and F0 as a physical
attribute of voice quality. The deviations from periodicity can be due to random/chaotic changes in linear acoustic system (due to air turbulence), nonlinearity of vocal fold tissue and temporal variability of
voiced signals. The temporal variability of signal can be due to air-flow modulation. It is related to the
changes in F0 , amplitude and spectral envelope of signal waveform, within each glottal cycle [55, 79].
In a strict sense, aperiodicity should not be viewed as merely the deviation from periodicity.
Aperiodicity of speech signal reflects emotional content in expressive voices. Aperiodicity is due
to sudden/gradual introduction of subharmonics, their abrupt appearance/disappearance and variability
in harmonic frequencies [55, 79]. The nature of aperiodicity is expected to be different in natural
conversation, in highly emotional speech and in Noh voice [55]. The distinguishing factors are: changes
in (i) F0 (frequency modulation), (ii) amplitude (amplitude modulation), (iii) cycle to cycle fluctuation in
the energy of the excitation source signal, and (iv) vocal tract transfer characteristics (due to articulatory
movement) [81, 80]. The factors for deviations from periodicity, i.e., those contributing to aperiodicity,
can be grouped as: (a) F0 dependent factors and (b) residual fluctuations (having random effect [81]).
Periodicity estimation involves primarily the F0 extraction. The F0 in speech may change from one
glottal cycle to another, and is hence termed the instantaneous fundamental frequency. Accordingly, the perception of pitch also changes. The intercyclic changes in F0 are very small in the case of modal voicing, for which the temporal resolution, equal to T0 (i.e., one pitch period), is nearly uniform [80]. However, in the case of expressive voices, there are rapid changes in F0 in some segments, which makes F0
extraction a challenging task. In the last two decades, several methods for F0 extraction have been
developed. The major methods can be broadly categorized as: (i) instantaneous frequency (IF)
based [188, 1, 18], (ii) fixed-point analysis based [77], (iii) integrated method (involving IF and autocorrelation) [78], (iv) TANDEM-STRAIGHT (involving excitation structure extractor (XSX) analysis
and IF) [79, 81, 80], (v) group delay [129], (vi) DYPSA [91, 133], (vii) inverse-filtering [2, 4, 3] and
(viii) zero-frequency filtering [130, 216] methods. The challenge, however, still lies in extracting F0 in the
regions of harmonics/subharmonics and aperiodicity. The fluctuation spectrum estimation is involved
when either F0 is not constant in time (i.e., T0 is not uniform) or no a priori information about F0 is
available. The TANDEM-STRAIGHT based method involves multiple F0 hypotheses and saliency based
estimation of F0 [55, 81, 80]. It is apparently the sole attempt so far to incorporate the information of
pitch perception in estimating F0 for expressive voices.
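For completeness, the simplest cycle-by-cycle F0 estimate follows directly from epoch locations (for example, the output of the ZFF sketch shown earlier). This covers only the uniform-resolution case and does not address the harmonic/subharmonic and aperiodic regions mentioned above.

```python
import numpy as np

def instantaneous_f0(epochs, fs):
    """Cycle-by-cycle F0 from successive epoch locations (in samples):
    F0_i = fs / (epoch_{i+1} - epoch_i)."""
    t0 = np.diff(np.asarray(epochs)) / float(fs)   # pitch periods in seconds
    return 1.0 / t0                                # one F0 value per glottal cycle
```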
In this study, we analyse the characteristics of the aperiodic components in the Noh voice signal.
The excitation source characteristics are represented in terms of an impulse-sequence in time-domain.
The irregular intervals between epochs, i.e., instants of significant excitation, along with (varying) relative amplitudes of impulses are used to characterize the unique excitation characteristics of Noh voice.
The instantaneous fundamental frequency (F0 ) for expressive voices is obtained from the saliency information, i.e., a measure of pitch-perception, that is derived using a few new signal processing methods
proposed in this thesis. Details of this study are discussed in Chapter 7.
2.8 Review of studies for spotting the acoustic events in continuous speech
2.8.1 Studies towards trill detection
Trill sounds are common in some languages around the world, for example Spanish or Toda (an Indian language) [66, 178]. Analysis of trills can help in understanding the differences among dialects of
a language [106, 34]. Study of trills can also have sociolinguistic implications [41, 66]. It is envisaged
that automatic detection of trills in continuous speech would have a wide range of applications. Depending upon the places of the articulators, trills are termed as bilabial, dental, alveolar, post-alveolar
and uvular trills [96, 122]. Among these types, dental and alveolar apical trills are more common [116].
Production of apical trills involves satisfying several constraints, categorised as aerodynamic and articulatory. Aerodynamic constraints are related to those factors which are essential for the initiation and
sustenance of apical trill vibrations [175]. Articulatory constraints are related to the aspects like lingual
and vocal tract configuration [155, 100, 40]. The typical rate of trilling of the tongue is in the range
of 20 to 40 Hz, as measured from the acoustic waveform or spectrogram [98, 116, 122]. Two to three
cycles of apical trills are commonly produced in continuous speech [105, 100, 66]. More than three
cycles of apical trills can be produced in isolation as a steady sustained sound.
The phonological aspects of trills were reported [110, 162, 100]. Production mechanism of tongue
tip trills was modelled and described from an aerodynamic point of view [98, 116, 100]. The tongue tip
vibrations from the articulatory point of view were described [28, 100, 178]. Articulatory mechanics of
tongue tip trilling was modelled [116]. The aerodynamic characteristics and the phonological pattern
of trills across languages were studied [175]. Trill cycle and trilling rate were studied [100, 116, 105].
A mean trilling rate of 25 Hz (18-33 Hz) was reported [105]. Acoustic correlates of phonemic trill production [34] and acoustic characterisation of trills [66] were reported for the Spanish language. The
characteristics of the voiced apical trill [r] are studied in the context of three different vowels [a], [i]
and [u]. It is observed that the period of the glottal vibration changes in each trill cycle due to tongue tip
trilling [40]. Recently, the effect of changing vocal tract shape and the associated changing excitation
source characteristics due to trilling are studied from perception point of view [122]. It is observed that
there is a coupling between both the excitation source and the vocal tract system. Both contribute to
the production and perception of trills, though the role of the excitation source appears to be relatively
more dominant [122]. To the best of our knowledge, not much attention has been paid so far to the automatic detection of trills in continuous speech using the characteristics of trill sounds. Hence, it is worth
investigating methods for automatic detection of trills, details of which are discussed in Chapter 8.
2.8.2 Studies on shout detection
Automatic detection of shout or shouted speech regions in continuous speech has applications in
the domains ranging from security, sociology, behaviour studies and health-care access to crime detection/investigation [132, 70, 160, 202, 199]. Hence, research in acoustic cues to facilitate detection of
shout has been gaining increased attention in recent times. In this thesis, we aim to exploit the changes in the
production characteristics of shouted speech as compared to normal speech, for developing an automatic
system for detection of shout regions in continuous speech.
Shouted speech consists of linguistic content and voicing in the excitation. The production characteristics of shout, in particular the excitation source characteristics, are likely to deviate from those of
normal speech, especially in the regions of voicing. Associated changes also take place in the characteristics of the vocal tract system. In general, changes in the excitation source characteristics are
examined by studying the changes in the frequency of vocal fold vibrations, i.e., instantaneous fundamental frequency (F0 ) [132, 160, 202, 146]. Changes in the vocal tract system characteristics are usually
examined in terms of changes in the spectral characteristics of speech, such as Mel-frequency cepstral
coefficients (MFCCs) [70, 199, 146, 217].
Studies on the analysis of shout or scream signals mostly used features like F0 , MFCCs and signal energy [132]-[146]. Features such as formant frequencies, F0 and signal power were studied
in [132, 202, 146]. Applications of these features included automatic speech recognition (ASR) for
shouted speech [132]. The MFCCs, frame energy and auto-correlation based pitch (F0 ) features were
studied in [70, 160, 199]. Applications of these features included scream detection using support vector
machine (SVM) classifier [70]. The MFCCs with weighted linear prediction (WLP) features were studied for detection of shout in a noisy environment [146]. Spectral tilt and linear predictive coding (LPC)
spectrum based features were used in [217] for studying the impact of vocal effort variability on the performance of an isolated word recognizer. The MFCCs and spectral fine structure (F0 and its harmonics)
were used recently, in a Gaussian mixture model (GMM) based approach for shout detection [147].
In this study, we develop an experimental automatic shout detection system (ASDS) to detect regions
of shout in continuous speech. The major challenge in developing an automatic shout detection system
is the vast variability in shouted/normal speech, which could be speaker, language and application specific. Hence, a rule-based approach is used in the ASDS. The design details of the ASDS and the performance
evaluation results are discussed in Chapter 8.
2.8.3 Studies on laughter detection
Detection of paralinguistic events like laughter has potentially diverse applications, such as indexing and search of audio-visual databases, healthcare and biometrics. Detection of laughter regions in
continuous speech can also help in classification of the emotional states of a speaker [194]. Hence, researchers have been attracted in the last few years towards finding the distinguishing features of laughter, to
develop systems for detecting laughter regions in continuous speech [107, 23, 83, 24, 86, 84, 198].
The acoustic spectral and perceptual features have been used in applications such as detection of
laughter events in meetings [83], distinguishing the four (phonetic) types of laughter [24], and speech/
laughter classification in meeting audio [84], etc. Other diverse applications include 'hot-spotter' [23],
automatic laughter detection [194, 86] and ‘AVLaughterCycle’ project [198].
Automatic laughter detection in continuous speech, which exploits changes in the production characteristics (mainly the source) during laughter production, is attempted in this thesis. Its details and performance
evaluation results are discussed in Chapter 8.
2.9 Summary
In this chapter, we have revisited the basic concepts of speech production, the significance of glottal vibration and the notion of fundamental frequency (F0 ), along with the analytic signal and parametric representations of the acoustic signal. Earlier studies related to the analysis of each of the four sound categories are reviewed
briefly. Studies on the nature of sounds involving source-system coupling effects like trill, and those
highlighting the significance of source-system interaction and acoustic loading effects are reviewed.
Earlier studies on emotional speech, paralinguistic sounds and expressive voices are also reviewed, in
particular on shouted speech, laughter and Noh voice, respectively. Studies on aperiodicity in Noh voice
and F0 extraction are also reviewed.
The feasibility of automatic detection of trills in continuous speech is examined. Since such detection can be based on the production characteristics of apical trills, earlier studies on trill analysis are reviewed in
brief. Automatic detection of shout in continuous speech in real-life practical scenarios is important and
is a challenging task. Hence, earlier studies aimed towards shout detection are reviewed in brief, prior to
developing an experimental automatic shout detection system based on changes in the production features. The feasibility of automatic detection of laughter (nonspeech-laugh or laughed-speech) in continuous
speech is explored and hence related earlier studies are reviewed briefly.
The reviews carried out in this chapter on the analysis of the four categories of sounds, and the attempts towards detecting a few acoustic events, highlight the research issues and challenges involved. They also highlight the limitations of the signal processing methods used in these studies, and the need for new approaches. In subsequent chapters of this thesis, each of the four sound categories is analysed separately,
while exploring changes in the production characteristics, primarily the source characteristics. The
impulse-sequence representation of excitation information in speech coding methods, and the recently
proposed signal processing methods that are used in these analyses, are discussed in the next Chapter.
Using these, the three prototype automatic detection systems developed for spotting trills, shout and laughter are also discussed in a later chapter.
Chapter 3
Signal Processing Methods for Feature Extraction
3.1 Overview
Analysis of normal (verbal) speech is carried out using standard signal processing techniques. But the variations in it, such as those related to source-system interaction, can be analysed better using some recently proposed signal processing methods, which are discussed in this chapter. Mainly, the zero-frequency filtering and zero-time liftering methods are discussed, which are used for deriving the excitation source characteristics in terms of the impulse-sequence representation and the spectral characteristics, respectively. Methods for deriving the resonance characteristics of the vocal tract system in terms of the first two dominant frequencies are also discussed. In general, the tendency is to use the standard or recently proposed signal processing methods for the analysis of nonverbal speech sounds as well. But the analysis of emotional speech, paralinguistic sounds and expressive voices requires further specialized signal processing methods, due to the peculiarities and range of feature variations involved in the production of these sounds. Hence, there is a need for some modifications and refinements in these methods, along with some new approaches, which are discussed in later chapters in their respective contexts.
Impulse-sequence representation of the excitation source component of acoustic signal has been of
considerable interest in speech coding research, mainly for synthesis of natural-sounding speech. But,
this representation can also help in the analysis of nonverbal speech sounds. Different speech coding
methods focused primarily on achieving low bit-rate and good voice quality of synthesized speech.
These methods can be broadly classified into three categories, namely, waveform coders, vocoders and
hybrid codecs. Waveform coders aimed at mimicking the speech waveform to the best possible extent. Vocoders, mainly linear prediction (LP) coding or residual-excited LP source coders, led to the development of stochastically generated codebook-excited LP (CELP) codecs. Hybrid or analysis-by-synthesis codecs, such as multi-pulse/regular-pulse excited and CELP codecs, aimed at achieving intelligible speech at bit-rates ≤ 4 Kbps. Analysis approaches in these methods differed in estimating
the pulse position, amplitude or phase. Presence of secondary pulses within a glottal-cycle was also
indicated in some studies. Impulse-sequence representation of the excitation source component is used
in characterizing the nonverbal speech sounds, in this thesis.
In this chapter, methods of estimating the impulse-sequence representation of the excitation information used in various speech coding methods are discussed first. The aim is to gain insight into the mathematical basis of the underlying research issues, some of which are attempted in this thesis. The chapter is organised as follows. Section 3.2 discusses speech coding methods and approaches for representing the
excitation source information in terms of an impulse-sequence. The standard all-pole model of excitation used in LPC vocoders, is discussed in Section 3.3. Methods of estimating the impulse-sequence
representation of the excitation information used in speech coding, are discussed in Section 3.4. In
Section 3.5, the zero-frequency filtering method for extracting the excitation source characteristics is
discussed. Section 3.6 discusses the zero-time liftering method for extracting the spectral characteristics. Methods for computing the first two dominant frequencies of the vocal tract system resonances are
discussed in Section 3.7. Research issues and challenges involved in this impulse-sequence representation are discussed in Section 3.8. The chapter is summarized in Section 3.9.
3.2 Impulse-sequence representation of excitation in speech coding
Impulse-sequence representation of the excitation source information in the acoustic signal has been attempted in various speech coding methods that have evolved over roughly the last three decades. Different speech coding methods have been proposed for diverse applications in mobile communication systems, voice response systems, high-speed digital communication networks, etc. [8]. Most of these speech coding methods focused on one of two contrasting objectives, i.e., (i) producing high quality speech that sounds as close as possible to natural speech, and (ii) coding at lower bit-rates. Some methods also attempted to achieve both objectives at the same time, though with some compromise. Hence, synthesis of natural-sounding speech was aimed at low bit-rates of coding, i.e., below 4.8 Kbits/sec.
Speech coding methods can be categorised into three classes, based on the stages of their evolution and the approaches adopted. The earlier (a) waveform coders led to the development of (b) LPC-based vocoders, which in turn led to the development of (c) analysis-by-synthesis, i.e., hybrid, codecs.
(A) Waveform coders
The waveform coders [74] focused on reproducing (by mimicking) the speech signal waveform as faithfully as possible, with minimum distortion and error [8]. Most waveform coders were either pulse-code modulation systems or their differential generalizations [74], adaptive predictive coders [5] and transform coders [113]. Pulse-code modulation (PCM), differential pulse-code modulation (DPCM) and delta modulation (DM) were also attempted [74]. Waveform coders were capable of producing high-quality speech, but the problem was that the coding bit-rate was relatively high, i.e., above 16 Kbits/sec.
(B) Vocoders
LPC vocoders, i.e., source coders [140], were aimed at reducing the coding bit-rate while achieving intelligible speech. Using a parametric model of speech production (i.e., the source-filter model), vocoders synthesize intelligible speech at bit-rates even below 2.4 Kbits/sec [113], but it is not natural-sounding speech. Hence, the cost paid in vocoders is voice quality. In the parametric (source-filter) model [8]: (i) a linear filter models the characteristics and spectral shaping of the vocal tract [8], and (ii) the excitation source provides the excitation to this linear filter. The model assumes classification of the speech signal into two classes, voiced and unvoiced speech. The excitation for voiced speech is modeled by a quasi-periodic train of delta-function type pulses located at pitch-period intervals, and for unvoiced speech by white noise [8]. This model has limitations in producing high-quality speech even at high bit-rates, because of the inflexible way in which the excitation is generated.
Applications of low bit-rate coding (≤ 4.8 Kbits/sec) in mobile communication networks, etc., led to several approaches in vocoders. (i) Linear predictive coding (LPC) techniques [113] for speech analysis and synthesis provided an alternative way of representing the spectral information, by all-pole filter parameters. But these used an excitation similar to that in channel vocoders [52]. (ii) Voice-excited vocoders [166] involved excitation of the vocal tract system in two modes, pulse-sequence (voiced) and noise (unvoiced), both estimated over short segments of the speech waveform. But the problem is that "there are regions where it is not clear whether the signal is voiced or unvoiced, and what the pitch period is..." [8]. (iii) Residual-excited linear predictive (RELP) vocoders [197] used the LP residual for the excitation. But these vocoders had problems in representing the excitation signal. (iv) Code-excited LPC (CELP) vocoders [165] represented the excitation sequence by a stochastically generated codebook at around 4.8 Kbits/sec. The problem in CELP vocoders was the large amount of computation required to choose the optimum code from the stored codebook. As a solution to these limitations, the multi-pulse excited (MPE) speech coding method and a few of its variations were proposed [8, 140]; these are discussed next.
(C) Hybrid codecs
Hybrid codecs [8, 140, 15, 61] are 'analysis-by-synthesis' (AbS) time-domain coder-decoders (i.e., codecs) that aimed at achieving the dual objectives of speech coding, i.e., good voice quality (intelligibility) of synthesized speech and coding at low bit-rates (≤ 4.8 Kbits/sec). These codecs use the same linear-filter model of the vocal tract system that was used in LPC vocoders. But, instead of a simple two-state voiced/unvoiced model as input to this filter, the excitation signal is dynamically chosen to match the reconstructed speech signal waveform as closely as possible to the original speech.
The 'analysis-by-synthesis' (AbS) codecs have two parts, the encoder and the decoder [140]. The input speech to be coded is divided into frames of about 20 ms. Then, for each frame, the parameters of a synthesis filter and the excitation to this filter are determined in the encoder. The encoder comprises a synthesis filter and a feedback path, which consists of error-weighting and error-minimisation blocks. For each frame of the speech signal s(n), the synthesis filter gives the reconstructed signal $\hat{s}(n)$ as output for the excitation signal u(n) as input. The encoder analyses the input speech by synthesizing multiple approximations to it. For each frame, it transmits to the decoder two pieces of information: the parameters of the synthesis filter and the excitation sequence. The decoder then generates the reconstructed signal $\hat{s}(n)$ by passing the given excitation u(n) through the synthesis filter. The
objective is to determine the appropriate excitation signal u(n) for which the error e(n) between the input signal s(n) and the reconstructed signal $\hat{s}(n)$, after weighting, i.e., $e_w(n)$, is minimised.
The synthesis filter [8, 140] used in the encoder/decoder of hybrid (AbS) codecs is usually an all-pole short-term linear filter $H(z) = \frac{1}{A(z)}$, where $A(z) = 1 - \sum_{i=1}^{p} a_i z^{-i}$ is the prediction error filter
that models the correlation introduced into speech, by the action of the vocal tract. Here, the order of LP
analysis p is chosen usually as 10. This filter minimizes the energy of the residual signal, while passing
an original signal frame through it. The long-term periodicities present in the voiced speech (due to
glottal cycles) are modeled using either a pitch-extractor in the synthesis filter, or adaptive code-book in
the excitation generator. Thus, the excitation signal is Gu(n − n0 ), where n0 is estimated pitch-period.
The error weighting block shapes the error signal spectrum, to reduce the subjective loudness of error
between the reconstructed signal and original speech. Minimising the weighted error thus concentrates
the energy of error signal in the frequency regions of speech having high energy. The objective is to
enhance the subjective quality of the reconstructed speech, by emphasizing the noise in the frequency
regions where the speech content is low. The selection of excitation waveform u(n) is important in
hybrid codecs. That excitation is chosen, which gives the minimum weighted error between the original
speech and the reconstructed signal. However, this closed-loop determination of the excitation sequence
to produce good quality speech at low bit-rates, involves computationally intensive operations.
Various hybrid codecs were developed, which differ in the way the excitation signal u(n) is represented and used.
(A) Multi-pulse excited (MPE) codecs [8] model the ideal excitation, for each frame of speech, by a fixed number of non-zero samples (pulses) separated by relatively long runs of zero-valued samples.
Typically 8 pulses for every 10 ms were considered sufficient to generate different kind of speech sounds
including voiced and unvoiced speech, with little audible distortion [8]. Sub-optimal methods were used
to determine the positions of these non-zero pulses within a frame, and their amplitudes. Good quality
reconstructed speech at a bit rate of around 10 Kbits/sec was achieved. The advantage in MPE codecs
was that no a priori knowledge is required either of voiced/unvoiced decision or of the pitch-period for
synthesized speech. But, the bit-rate achieved was still high in these codecs.
(B) Regular-pulse excited (RPE) codecs [94] use a number of non-zero pulses to give the excitation
sequence u(n), like in MPE codecs. But, here the pulses are regularly spaced at fixed interval, so the
encoder needs to determine only the position of the first pulse and amplitudes of all pulses. Since, less
information needs to be transmitted about pulse position, the RPE codecs can use more non-zero pulses
(around 10 pulses per 5 ms), for a given bit-rate (say 10 Kbits/sec). Hence, RPE codecs give slightly
better quality of reconstructed speech than MPE codecs, and are used in GSM mobile telephone systems
in Europe [94]. However, the computational complexity of RPE codecs is higher than that of MPE codecs.
(C) Code-excited linear prediction (CELP) codecs [165] are different from the MPE and RPE codecs,
both of which provide good quality speech only at bit-rates ≥ 10 Kbits/sec. Both the MPE and RPE codecs need to transmit information about both the positions and the amplitudes of the excitation pulses. But in CELP codecs the excitation signal is vector-quantized, i.e., the excitation is given by an entry from a vector-quantized codebook, and a gain term is used to control its power. Since the index of a 1024-entry codebook can be represented by 10 bits and the gain can be coded with about 5 bits, only about 15 bits are now required, against the 47 bits required in the GSM RPE codec, so the bit-rate is greatly reduced. However, the disadvantage of CELP codecs is their high complexity for real-time implementation. As a solution, CELP codecs can be used at bit-rates ≤ 4.8 Kbits/sec by classifying the speech into voiced, unvoiced and transition frames, and then coding each type of speech segment in a different way.
(D) Multi-band excitation (MBE) codecs declare some regions in the frequency domain as voiced or unvoiced, and for each frame transmit the pitch period, spectral magnitudes, phase and voiced/unvoiced decisions for the harmonics of F0.
(E) Prototype waveform interpolation (PWI) codecs use a single pitch-period of information for every 20-30 ms, and then use interpolation to reproduce a smoothly varying quasi-periodic waveform for voiced speech segments. Thus, good quality speech at bit rates below 4 Kbits/sec is obtained by these variations on CELP codecs.
3.3 All-pole model of excitation in LPC vocoders
(A) Generic pole-zero model [112]: A continuous-time signal s(t) sampled at sampling interval
T , can be represented as a discrete-time signal in the form of a time-series s[nT ] or s[n] (denoted as
sn ) [112], where n is a discrete variable and the sampling frequency is fs = 1/T . A parametric model
of the system can be given by a generic pole-zero model, in which the signal sn , i.e., output of the
system is predicted by a linear combination of past outputs, and past and present inputs (hence, called
linear prediction) [112]. The predicted output of the system, i.e. s[n] (or sn ), is given by
$$s[n] = -\sum_{k=1}^{p} a_k\, s[n-k] + G \sum_{l=0}^{q} b_l\, u[n-l], \qquad b_0 = 1 \tag{3.1}$$
where $a_k$, $1 \le k \le p$, $b_l$, $1 \le l \le q$, and $G$ (i.e., the gain) are the parameters of the system. Here, $u[n]$ is the unknown input signal (i.e., a time-sequence). By taking the z-transform on both sides of (3.1), we get
$$S(z) = -\sum_{k=1}^{p} a_k z^{-k}\, S(z) + G \sum_{l=0}^{q} b_l z^{-l}\, U(z)$$
$$\left(1 + \sum_{k=1}^{p} a_k z^{-k}\right) S(z) = G \left(1 + \sum_{l=1}^{q} b_l z^{-l}\right) U(z)$$
$$H(z) = \frac{S(z)}{U(z)} = G\, \frac{1 + \sum_{l=1}^{q} b_l z^{-l}}{1 + \sum_{k=1}^{p} a_k z^{-k}} \tag{3.2}$$
where $H(z)$ is the transfer function of the system and $U(z)$ is the z-transform of $u[n]$ (i.e., $u_n$). Also, $S(z)$ is the z-transform of $s[n]$ (i.e., $s_n$), given by $S(z) = \sum_{n=-\infty}^{\infty} s[n]\, z^{-n}$. Here, $H(z)$ in (3.2) is the general pole-zero model, in which the numerator polynomial gives the zeros and the denominator polynomial the poles. Its two special cases are the all-zero and all-pole models. If $a_k = 0$ for $1 \le k \le p$, then $H(z)$ in (3.2) is called the all-zero or moving average (MA) model. If $b_l = 0$ for $1 \le l \le q$, then $H(z)$ in (3.2) is called the all-pole or auto-regressive (AR) model. Hence, the pole-zero model is also called the auto-regressive moving average (ARMA) model [112].
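For concreteness, the difference equation in (3.1) can be simulated directly with a standard digital filtering routine. The short Python sketch below is only illustrative: the coefficient values, the gain and the white-noise input are assumptions, not values taken from any coder discussed here.

import numpy as np
from scipy.signal import lfilter

# Hypothetical pole-zero (ARMA) model of eq. (3.1)/(3.2):
# s[n] = -sum_k a_k s[n-k] + G * sum_l b_l u[n-l], with b_0 = 1.
a = np.array([1.0, -1.2, 0.7])   # denominator 1 + a_1 z^-1 + a_2 z^-2 (a_1 = -1.2, a_2 = 0.7)
b = np.array([1.0, 0.5])         # numerator  b_0 + b_1 z^-1
G = 0.8                          # gain

u = np.random.randn(200)         # unknown input, modelled here as white noise
s = lfilter(G * b, a, u)         # output of the system H(z) = G * B(z) / A(z)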
(B) All-pole model [112, 7]: Spectral information can be represented by the all-pole filter parameters,
by using linear predictive coding (LPC) techniques in speech analysis and synthesis [7]. In an all-pole
model, the signal s[n] is given by a linear combination of past output values and some input u[n], as
$$s[n] = -\sum_{k=1}^{p} a_k\, s[n-k] + G\, u[n] \tag{3.3}$$
where G is gain factor and u[n] is input, i.e., the excitation sequence (or signal). By taking z transform
on both sides of (3.3), we get
$$S(z) = -\sum_{k=1}^{p} a_k z^{-k}\, S(z) + G\, U(z)$$
$$\left(1 + \sum_{k=1}^{p} a_k z^{-k}\right) S(z) = G\, U(z)$$
$$H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 + \sum_{k=1}^{p} a_k z^{-k}} \tag{3.4}$$
where $H(z)$ is the all-pole transfer function, $S(z) = \sum_{n=-\infty}^{\infty} s[n]\, z^{-n}$ is the z-transform of $s[n]$, and $U(z)$ is the z-transform of $u[n]$. Here, the presence of some input $u[n]$ is assumed.
(C) Method of least squares [112]: In case the input $u[n]$ is completely unknown, the output $\tilde{s}[n]$ (or $\tilde{s}_n$) can be predicted from a linearly weighted sum of only the past samples of the signal $s[n]$, as
$$\tilde{s}[n] = -\sum_{k=1}^{p} a_k\, s[n-k] \tag{3.5}$$
Then, the error $e[n]$ between the actual value $s[n]$ and the predicted value $\tilde{s}[n]$ is given by
$$e[n] = s[n] - \tilde{s}[n] = s[n] + \sum_{k=1}^{p} a_k\, s[n-k] \tag{3.6}$$
Here, the error $e[n]$ is also called the residual. The parameters $\{a_k\}$ are obtained by minimization of the mean or total squared error (i.e., of $e^2[n]$) with respect to each of the parameters.
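The least-squares estimation of the predictor coefficients of (3.5)-(3.6) can be sketched as below, using the autocorrelation formulation of the normal equations. The frame length, analysis window and predictor order are illustrative assumptions, not values prescribed in the text.

import numpy as np
from scipy.linalg import solve_toeplitz

def lp_coefficients(frame, p=10):
    """Estimate {a_k} of eq. (3.5) by minimizing the total squared
    residual e[n] of eq. (3.6) (autocorrelation method)."""
    frame = frame * np.hamming(len(frame))                 # analysis window (assumed)
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    # Normal equations: R a = -r[1..p], with R the Toeplitz matrix of r[0..p-1]
    return solve_toeplitz((r[:p], r[:p]), -r[1:p + 1])

# Usage on a synthetic 20 ms frame at 8 kHz (160 samples)
s_frame = np.random.randn(160)
a_hat = lp_coefficients(s_frame, p=10)    # residual: e[n] = s[n] + sum_k a_k s[n-k]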
(D) LPC all-pole synthesis filter in LPC vocoders and its limitations [15, 140, 61, 8]: In speech synthesis, ideally the excitation should be the linear prediction residual [112, 7]. Most LPC vocoders use as excitation either (i) a train of periodic non-zero or delta-function pulses separated by the pitch period, for voiced speech, or (ii) a white noise waveform, for unvoiced sounds. But a problem is the reliable separation of speech segments into voiced and unvoiced classes [8]. Also, this rigid idealization of the excitation may contribute to the unnatural-sounding voice quality of synthesized speech [8]. Another important limitation of LPC vocoders relates to the observation in [8] that "it would be gross simplification to assume that there is only one point of excitation in the entire pitch period" [8, 67]. There exists secondary excitation, apart from the main excitation occurring at glottal closure, which occurs not only at the glottal opening and during the open phase, but also after the glottal closure [8, 67]. "These results suggest that the excitation for voiced speech should consist of several pulses in a pitch period, rather than just one pulse at the beginning of the period" [8].
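The two-state excitation model criticised above can be sketched as follows. The pitch period, filter coefficients and gain are made-up values, used only to illustrate the impulse-train (voiced) versus white-noise (unvoiced) idealization of the LPC vocoder excitation.

import numpy as np
from scipy.signal import lfilter

fs = 8000                                   # sampling rate (assumed)
T0 = 100                                    # pitch period in samples (assumed, ~80 Hz)
N = 1600                                    # 200 ms of excitation

# Two-state excitation: one delta pulse per pitch period for voiced speech,
# white noise for unvoiced speech.
voiced_exc = np.zeros(N)
voiced_exc[::T0] = 1.0
unvoiced_exc = 0.1 * np.random.randn(N)

# Hypothetical all-pole synthesis filter H(z) = G / (1 + sum a_k z^-k), as in eq. (3.4)
a = np.array([1.0, -1.3, 0.9])              # illustrative coefficients only
G = 1.0
voiced_speech = lfilter([G], a, voiced_exc)
unvoiced_speech = lfilter([G], a, unvoiced_exc)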
3.4 Methods to estimate the excitation impulse sequence representation
3.4.1 MPE-LPC model of the excitation
A solution to the problem of excitation representation in LPC vocoders was proposed by Atal and Remde in [8]. A multi-pulse excitation (MPE) model was proposed [8], in which the excitation signal is a combination of pulses in a glottal period, hence the name multi-pulse excitation. This model requires a priori knowledge of neither the voiced-unvoiced decision nor the pitch period [8]. "The purpose of the multi-pulse analysis is to either replace or model the residual signal, by a sequence of pulses" [8]. Various MPE codecs differ in the methods for determining the positions and amplitudes of the pulses in the pulse sequence, in a given interval of time. The selection of pulse positions and amplitudes must be such that
the difference between reconstructed and original speech (computed using some measure) is minimized.
(i) Determining the MPE pulse-sequence using a weighting error filter: The sequence u[n] of pulses
(or impulses, i.e., delta-function) is used as input to the LPC all-pole synthesis filter H(z), given by
(3.4). The objective is to minimize a performance measure, i.e., the weighted mean-squared error $\epsilon$, computed from the difference $e[n]$ between the original speech $s[n]$ and the reconstructed (synthesized) speech $\tilde{s}[n]$, given by (3.6) [8, 15]. The weighting here is accomplished by using a filter [8], as
$$W(z) = \frac{H(\gamma z)}{H(z)} \tag{3.7}$$
where, γ is the bandwidth expansion factor, W (z) is weighting filter, H(z) is all-pole LPC synthesis
filter, and H(γz) is bandwidth-expanded synthesis filter [15]. The weight is chosen such that the SNR is
lower in formant regions, since noise in these regions is better masked by the voiced speech signal. The
desired multi-pulse excitation d[n] (or dn ) is determined by modeling the LPC residual r[n] (or {rn }),
such that weighted error between original and synthesized speech is minimized. Here, desired signal dn
is obtained by passing the residual (rn ) through the bandwidth-expanded synthesis filter H(γz).
$$d[n] = \sum_{k=-\infty}^{\infty} r[k]\, h[n-k] \tag{3.8}$$
where, h[n] (or hn ) is causal impulse response of the bandwidth-expanded synthesis filter H(γz). Different approaches are used for determining the positions and amplitudes of impulse-like pulses.
(ii) Finding the transfer function of error-weighting filter [8]:
In multi-pulse excitation model, an all-pole LPC synthesizer filter H(z) is excited by an excitation
generator that gives a sequence of pulses located at times (positions) $t_1, t_2, \ldots, t_n, \ldots$ with amplitudes $\alpha_1, \alpha_2, \ldots, \alpha_n, \ldots$, respectively. The pulse sequence, referred to as $v_n$ (or $v[n]$), could possibly be the
desired impulse sequence representation of the excitation, denoted as dn (or d[n]) in (3.8). When excited
by v[n], the LPC synthesis all-pole filter H(z) produces the output synthesized speech samples s˜[n] (or
s˜n ). The sampled output of the all-pole filter, when passed through a low-pass filter, produces a continuous reconstructed speech waveform sˆ(t). Note that (˜) denotes discrete samples of the synthesized
speech, and (ˆ) the continuous reconstructed speech. Comparison of synthesized speech samples s˜[n]
with the corresponding speech samples s[n] produces the error signal e[n] (or en ).
The error signal e[n] needs to be modified to take into account the human ear’s perception of this
signal. Human ear’s phenomena like masking and limited frequency resolution are considered, for deemphasizing the error in the formant regions [8]. The error signal e[n] after modifying it is used for
estimating the excitation signal, i.e., locations and amplitudes of pulses in the sequence. A linear filter
is used for suppressing the error signal energy in the formant regions. Thus error signal is weighted,
squared and averaged over 5-10 ms intervals to produce the mean-squared weighted error ǫ [8]. The
locations and amplitudes of pulses are chosen such that the error ǫ is minimized. Note that e[n] is
the difference between synthesized speech samples s˜[n] and original speech samples s[n], but the ǫ is
mean-squared weighted error in frequency-domain. The frequency-weighted error is given by [8]
Z fs
ˆ )|2 W (f ) df
ǫ=
|S(f ) − S(f
(3.9)
0
ˆ ) are Fourier transform of original speech s(t) and synthesized speech sˆ(t), respecwhere S(f ) and S(f
tively. Here, fs is sampling frequency, and W (f ) is a weighting function (in frequency-domain), which
is chosen such that formant regions in the error spectrum are de-emphasized. If 1 − P (z) is LPC inverse
filter in z-domain, then short-time spectral envelope of the speech (error) signal is given by
2
K
Se (f ) = (3.10)
1 − P (e− 2jπf
fs ) where, K is the mean-squared prediction error. The inverse filter 1 − P (z) is given by
1 − P (z) = 1 −
p
X
ak z
−k
=1−
p
X
ak e
−
2jπf k
fs
(3.11)
k=1
k=1
where {ak } are coefficients of the inverse filter, and transfer function of error-weighting filter W (z) is
P
1 − pk=1 ak z −k
P
(3.12)
W (z) =
1 − pk=1 ak γ k z −k
where γ parameter controls the error weight in the formant region. The value of γ chosen between
0 and 1, decides the degree of de-emphasis of the formant regions in the error spectrum. For $\gamma = 0$, $W(z) = 1 - P(z)$, and for $\gamma = 1$, $W(z) = 1$. A typical value of $\gamma = 0.8$ is used for $f_s$ = 8 KHz. Notice that the expression in (3.12) is in line with the simplified version in (3.7).
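A minimal sketch of applying the error-weighting filter of (3.12) is given below, assuming the LP coefficients {a_k} are already available; the coefficient values and the error signal are placeholders for illustration only.

import numpy as np
from scipy.signal import lfilter

def weighting_filter_coeffs(a, gamma=0.8):
    """Numerator and denominator of W(z) in eq. (3.12):
    W(z) = (1 - sum a_k z^-k) / (1 - sum a_k gamma^k z^-k)."""
    p = len(a)
    num = np.concatenate(([1.0], -a))
    den = np.concatenate(([1.0], -a * gamma ** np.arange(1, p + 1)))
    return num, den

a = np.array([1.2, -0.6, 0.3, -0.1])      # hypothetical LP coefficients {a_k}
e = np.random.randn(400)                  # stand-in for the error between original and synthesized speech
num, den = weighting_filter_coeffs(a, gamma=0.8)
e_w = lfilter(num, den, e)                # perceptually weighted error e_w[n]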
The locations and amplitudes of the pulses in the excitation signal are determined by minimizing the
weighted error over successive 5 ms intervals (for speech data averaged over 20 ms intervals and speech
segments of 0.1 sec [8]). Different methods are used for minimizing the weighted mean-squared error.
(iii) Error minimization procedure, to get locations and amplitudes of pulses in excitation sequence:
Most of the error minimization procedures [8, 15, 140, 61] aim to minimize the weighted error,
and select appropriate locations and amplitudes of pulses in the excitation signal. Pulse locations can
be found by computing either one pulse at a time, or by computing the error for all possible pulse
locations in a given time interval and then locating the minimum error position [8]. Amplitudes of the
pulses appear as a linear factor in the error, and as a quadratic factor in the mean-squared weighted
error. Hence, pulse amplitude is obtained either as a closed-form solution by setting the derivative of
the mean-squared error to zero, or by determining in single step the amplitudes of all pulses by solving
a set of linear equations (assuming the pulse positions are known a priori). A procedure for finding the
locations and amplitudes of pulses in a given time-interval [8], is as follows:
Step 1: Generate synthetic output speech, entirely from the memory of the all-pole filter from previous
synthesis intervals, and without any excitation pulse.
Step 2: Determine the location and amplitude of a single pulse, by subtracting the contribution of past
memory from the speech signal and also minimizing the mean-squared weighted error.
Step 3: Compute a new error signal by subtracting the contribution of the pulse just determined. These steps are repeated till the mean-squared weighted error is reduced below a desired threshold level.
The advantage here is that the “multi-pulse excitation is able to follow rapid changes in speech waveform, such as those occurring in rapid transitions” [8]. Different methods of estimating the amplitudes
and locations of pulses in the excitation sequence (signal) are discussed later in this section.
(iv) Objective in MPE-LPC model [54]:
The key objective in the multi-pulse excitation (MPE) LPC model of excitation is to find (i) a pulse
sequence u[n], and (ii) a set of filter parameters {ak } used in (3.3). The u[n] and {ak } are found in such
a way that a perceptually weighted mean-squared error (MSE) $\bar{e}^2$ is minimized w.r.t. the reference signal $s[n]$ [54]. The synthesized signal $\tilde{s}[n]$ in the MPE-LPC model, similar to (3.5), is given by
$$\tilde{s}[n] = \sum_{k=1}^{p} a_k\, \tilde{s}[n-k] + u[n] \tag{3.13}$$
where $p$ is the predictor order. The filter coefficients $\{a_k\}$ and the excitation signal $u[n]$ are determined so as to minimize the error $\bar{e}^2$ (similar to (3.6)), given by
$$\bar{e}^2 = \sum_{n} \left(s[n] - \tilde{s}[n]\right)^2 \tag{3.14}$$
Finding suitable $\{a_k\}$ and $u[n]$ for minimum $\bar{e}^2$ is a highly nonlinear and difficult problem. Different approaches to solve this problem, which mostly involve first determining the LPC parameters $\{a_k\}$ and then the excitation pulse sequence $u[n]$, are discussed in the next two sub-sections.
3.4.2 Methods for estimating the amplitudes of pulses
Methods to determine the pulse amplitudes [8, 15, 140, 61] can be categorised as follows. (a) The sequential approach (no re-optimization), which uses a correlation type of analysis involving sequential pulse placement. (b) Re-optimization after all pulse positions are determined, which involves a covariance type of analysis and the block edge effect. (c) Re-optimization after each new pulse position is determined, which involves a jointly optimal approach and the Cholesky decomposition technique to avoid square-root operations.
(A) Sequential pulse placement method (no re-optimization)
In the sequential pulse placement method (involving the ‘correlation’ type analysis) [15], each analysis frame is divided into blocks, in order to determine the multi-pulse excitation. Assume that each
block-size is of N samples, and there are Np excitation pulses for each block. Further, assuming that
the desired excitation sequence is d[n] (or dn ) as mentioned in (3.8) and the first pulse is placed at
position m, then the mean-squared weighted error for the block is given by
$$\bar{e}^2 = \sum_{n} \left(d_n - A_m\, h_{n-m}\right)^2 \tag{3.15}$$
where, hn−m , i.e., h[n − m] is causal impulse response of bandwidth-expanded synthesis all-pole filter
H(γz) in (3.7) for the impulse located at m, and Am is amplitude of pulse at position m. Optimal pulse
amplitude is obtained by differentiating (3.15) w.r.t. $A_m$ and minimizing the error $\bar{e}^2$ (i.e., setting the derivative to zero) [15].
$$\bar{e}^2 = \sum_{n} \left(d_n^2 - 2 d_n A_m h_{n-m} + A_m^2 h_{n-m}^2\right) \tag{3.16}$$
By differentiating it w.r.t. Am , we get
$$2 \sum_{n} d_n\, h_{n-m} = 2 A_m \sum_{n} h_{n-i}\, h_{n-j}$$
$$A_m = \frac{\sum_{n} d_n\, h_{n-m}}{\sum_{n} h_{n-i}\, h_{n-j}} \tag{3.17}$$
In (3.17), denoting the vector of cross-correlation terms in the numerator by $\alpha_m$ and the matrix of correlation terms in the denominator by $\phi_{ij}$, we get the optimal pulse amplitude $\hat{A}_m$ [15], as follows:
$$\alpha_m = \sum_{n} d_n\, h_{n-m} \tag{3.18}$$
$$\phi_{ij} = \sum_{n} h_{n-i}\, h_{n-j} \tag{3.19}$$
$$\hat{A}_m = \frac{\alpha_m}{\phi_{mm}} \tag{3.20}$$
Now, by substituting the value for optimal amplitude Aˆm [15] at mth location from (3.20), into
expression for weighted mean-squared error e¯2 in (3.16), we get
$$\bar{e}^2 = \sum_{n} d_n^2 - 2 \frac{\alpha_m}{\phi_{mm}} \sum_{n} d_n\, h_{n-m} + \left(\frac{\alpha_m}{\phi_{mm}}\right)^2 \sum_{n} h_{n-m}^2$$
$$\quad\;\; = \sum_{n} d_n^2 - 2 \frac{\alpha_m}{\phi_{mm}}\, \alpha_m + \frac{\alpha_m^2}{\phi_{mm}^2}\, \phi_{mm} \qquad \text{(using (3.18) and (3.19))}$$
$$\bar{e}^2 = \sum_{n} d_n^2 - \frac{\alpha_m^2}{\phi_{mm}} \tag{3.21}$$
Hence, the weighted mean-squared error now depends only on the position $m$ of the pulse. The best position for a pulse is given by that value of $m$ for which $\frac{\alpha_m^2}{\phi_{mm}}$ is maximum [15].
The optimal position for the next pulse is given by the expression for the new sequence $\{d'_n\}$, as
$$d'_n = d_n - \hat{A}_m\, h_{n-m} \tag{3.22}$$
Substituting the new value $d'_n$ in (3.18), we get
$$\alpha'_m = \alpha_m - \hat{A}_{\hat{m}}\, \phi_{m\hat{m}} \tag{3.23}$$
Here, $d'_n$ and $\alpha'_m$ are the new values of $d_n$ and $\alpha_m$, respectively, computed after determining the amplitude and position of the first pulse. Likewise, positions and amplitudes can be found sequentially for all pulses. This
cross-correlation maximization approach is computationally more efficient than exhaustive search [15].
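A minimal sketch of this sequential (correlation-type) pulse placement is given below: at each stage the position maximizing $\alpha_m^2/\phi_{mm}$ is chosen, the optimal amplitude $\hat{A}_m = \alpha_m/\phi_{mm}$ is assigned, and the contribution of the pulse is subtracted from the desired signal, as in (3.18)-(3.22). The block size, filter and number of pulses are illustrative assumptions, and block-edge truncation of the impulse response is ignored for simplicity.

import numpy as np
from scipy.signal import lfilter

def sequential_mpe(d, h, n_pulses):
    """Sequential multi-pulse search, following eqs. (3.18)-(3.22):
    d : desired excitation-domain signal d[n] for one block
    h : (truncated) impulse response h[n] of the bandwidth-expanded synthesis filter
    returns pulse positions and amplitudes."""
    N = len(d)
    phi_mm = np.sum(h ** 2)        # phi_mm treated as constant here (block edges ignored)
    positions, amplitudes = [], []
    residual = d.copy()
    for _ in range(n_pulses):
        # alpha_m = sum_n d'[n] h[n-m], computed for all candidate m at once
        alpha = np.correlate(residual, h, mode='full')[len(h) - 1:len(h) - 1 + N]
        m = int(np.argmax(alpha ** 2 / phi_mm))      # best position, as per eq. (3.21)
        A = alpha[m] / phi_mm                        # optimal amplitude, eq. (3.20)
        contrib = np.zeros(N)                        # subtract this pulse, eq. (3.22)
        L = min(len(h), N - m)
        contrib[m:m + L] = A * h[:L]
        residual -= contrib
        positions.append(m)
        amplitudes.append(A)
    return positions, amplitudes

# Illustrative usage with made-up filter and target
a = np.array([1.0, -1.1, 0.64])                     # hypothetical LP denominator
h = lfilter([1.0], a, np.r_[1.0, np.zeros(63)])     # truncated impulse response
d = np.random.randn(80)                             # stand-in for the desired signal d[n]
pos, amp = sequential_mpe(d, h, n_pulses=8)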
(B) Re-optimization after determining ‘all’ pulse positions
In the re-optimization after determining ‘all’ pulse positions, a covariance type of analysis is used [15]. In the autocorrelation form of analysis, the limits of the error sum $\bar{e}^2$ in (3.15) are taken from $-\infty$ to $+\infty$. The residual $\{r_n\}$, referred to in (3.8), is assumed to be windowed such that it is zero outside the signal block of N samples. The part of the impulse response of the synthesis filter H(z) that falls outside the N-sample block is taken into account in the autocorrelation type of analysis. The autocorrelation term $\phi_{ij}$ in (3.19) is in the form of a Toeplitz matrix, in which only the first row of values needs to be determined. Here, the optimal pulse amplitude ($A_m$) does not depend upon the pulse position m. Rather, it depends on the best position (value of m), for which $|\alpha_m|$ is maximized and $\phi_{mm}$ is minimized in the error expression in (3.21).
The pulse amplitude is determined by jointly-optimal pulse amplitudes. The mean square error
(MSE) after getting the pulse positions up to mi for all np (i.e., Np ) pulses, is given by
$$\bar{e}^2 = \sum_{n} \left(d_n - \sum_{i=1}^{n_p} A_{m_i}\, h_{n-m_i}\right)^2 \tag{3.24}$$
Expanding this expression, we get
$$\bar{e}^2 = \sum_{n} d_n^2 - 2 \sum_{n} \left(d_n \sum_{i=1}^{n_p} A_{m_i}\, h_{n-m_i}\right) + \sum_{n} \left(\sum_{i=1}^{n_p} A_{m_i}\, h_{n-m_i}\right) \left(\sum_{i=1}^{n_p} A_{m_i}\, h_{n-m_i}\right) \tag{3.25}$$
Now, differentiating this expression w.r.t. all pulse amplitudes Ami , we get
$$2 \sum_{n} \left(d_n \sum_{i=1}^{n_p} h_{n-m_i}\right) = 2 \sum_{n} \left(\sum_{i=1}^{n_p} h_{n-m_i} \cdot \sum_{i=1}^{n_p} A_{m_i}\, h_{n-m_i}\right)$$
$$\sum_{n} \left(\sum_{i=1}^{n_p} h_{n-m_i} \cdot \sum_{i=1}^{n_p} h_{n-m_i}\, A_{m_i}\right) = \sum_{n} \left(d_n \sum_{i=1}^{n_p} h_{n-m_i}\right) \tag{3.26}$$
Replacing in (3.26) now the cross-correlation terms αmi like in (3.18), and correlation terms φmi mi like
in (3.19), we get the following set of simultaneous equations:
$$
\begin{bmatrix}
\phi_{m_1 m_1} & \phi_{m_1 m_2} & \cdots & \phi_{m_1 m_j} & \cdots & \phi_{m_1 m_{n_p}} \\
\phi_{m_2 m_1} & \phi_{m_2 m_2} & \cdots & \phi_{m_2 m_j} & \cdots & \phi_{m_2 m_{n_p}} \\
\vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
\phi_{m_i m_1} & \phi_{m_i m_2} & \cdots & \phi_{m_i m_j} & \cdots & \phi_{m_i m_{n_p}} \\
\vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
\phi_{m_{n_p} m_1} & \phi_{m_{n_p} m_2} & \cdots & \phi_{m_{n_p} m_j} & \cdots & \phi_{m_{n_p} m_{n_p}}
\end{bmatrix}
\begin{bmatrix}
\hat{A}_{m_1} \\ \hat{A}_{m_2} \\ \vdots \\ \hat{A}_{m_i} \\ \vdots \\ \hat{A}_{m_{n_p}}
\end{bmatrix}
=
\begin{bmatrix}
\alpha_{m_1} \\ \alpha_{m_2} \\ \vdots \\ \alpha_{m_i} \\ \vdots \\ \alpha_{m_{n_p}}
\end{bmatrix}
\tag{3.27}
$$
where $\hat{A}_{m_i}$ is the optimal amplitude at position $m_i$, and $n_p$ (i.e., $N_p$) is the number of pulses in a block of N samples. The solution to this set of simultaneous equations can be obtained by using Cholesky decomposition of the correlation matrix consisting of the elements $\phi_{ij}$. Two forms of pulse-amplitude re-optimization can be used: (i) after determining ‘all’ pulse positions, and (ii) after determining ‘each’ pulse position.
In this method [15], the optimal pulse amplitude $\hat{A}$ is determined by using $\phi_{ij}$ in (3.19) and (3.20) as a matrix of covariance terms. The effect of the part of the impulse response (of the all-pole synthesis filter) falling outside the block of N samples (on the mean-squared error $\bar{e}^2$) is ignored. For closely spaced pulses, the successive optimization of individual pulses is inaccurate. Additional pulses are required to compensate for the inaccuracies introduced [173]. Hence, re-optimization of ‘all’ $N_p$ pulses is required.
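A minimal sketch of the joint amplitude re-optimization, i.e., of solving the simultaneous equations (3.27) for known pulse positions via a Cholesky factorization, is given below. The pulse positions, filter and signal values are assumptions used purely for illustration.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def joint_amplitudes(d, h, positions):
    """Jointly re-optimize pulse amplitudes for known positions,
    by solving Phi * A = alpha, as in eq. (3.27), via Cholesky factorization."""
    N = len(d)
    # Build the shifted impulse responses h[n - m_i], truncated to the block
    H = np.zeros((len(positions), N))
    for i, m in enumerate(positions):
        L = min(len(h), N - m)
        H[i, m:m + L] = h[:L]
    Phi = H @ H.T                 # phi_{m_i m_j} = sum_n h[n-m_i] h[n-m_j], eq. (3.19)
    alpha = H @ d                 # alpha_{m_i}  = sum_n d[n] h[n-m_i],      eq. (3.18)
    c, low = cho_factor(Phi)
    return cho_solve((c, low), alpha)      # optimal amplitudes A_hat

# Illustrative usage (values assumed, not from the thesis)
h = 0.9 ** np.arange(32)          # stand-in decaying impulse response
d = np.random.randn(80)
A_hat = joint_amplitudes(d, h, positions=[5, 17, 33, 50, 66])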
(C) Re-optimization after determining ‘each’ pulse position
A variation of above method is to locate a pulse at any stage (mi ) by jointly-optimizing its amplitude,
using amplitudes of all pulses located up to that position mi [173]. In this re-optimization, all terms for
pulse amplitude, position and error are minimized after determining the location of each pulse [15, 173,
8]. In the sequential method [8], the amplitudes and locations of pulses are determined in successive
stages, re-optimizing for one pulse at a time by minimizing the weighted mean-squared error e¯2 [n]. But,
in the case of closely spaced pulses some additional pulses are required in the same pitch-period to
compensate for the inaccuracies introduced, using which the pulse amplitudes can be re-optimized at
each step of the sequential procedure instead of all Np pulses together.
In this method [173], the amplitude of a pulse at location mi is jointly optimized using amplitudes
of all pulses determined up to that stage, keeping amplitudes of all pulses optimal at every (ith ) stage.
The result of the Cholesky decomposition for $N_p$ pulses is computed by adding one row to the triangular matrix determined for $n_p - 1$ pulses, without disturbing the amplitudes (results) of previously determined pulses [15]. Hence, a new element is computed without disturbing the elements determined up
to that stage. This method reportedly gave improvements in the SNR up to 10 dB in some speech
segments [173].
3.4.3 Methods for estimating the positions of pulses
The search for the best pulse-location in the sequence of excitation pulses involves some comparisons, each of which has computational complexity or cost equal to an addition operation. Different
methods have been attempted to minimize that cost. A few major ones are discussed in this section.
(A) Pulse correlation method
In pulse correlation method [15], the best pulse-position m for a pulse in the excitation is obtained
using the correlation terms. It is the location at which the pulse amplitude Am in (3.17) is optimal,
i.e., it is Aˆm (as given by (3.20)), and the weighted mean-squared error e¯2 (as in (3.15) or (3.21)) is
minimum. The impulse response of the bandwidth-expanded all-pole synthesis filter dies-off quickly,
due to bandwidth expansion factor [15]. Hence, this part of the impulse response can be truncated.
In the autocorrelation form of analysis, the correlation term (φij ) is generated by filtering the {hn }
(i.e., $\{h[n]\}$) using the recursive form of the bandwidth-expanded all-pole synthesis filter and (3.22). In the covariance form of multi-pulse analysis, the correlation term $\phi_{ij}$ is defined recursively [15], as
$$\phi_{i-1,\, j-1} = \phi_{ij} + h_{N-i}\, h_{N-j} \tag{3.28}$$
The initial cross-correlation $\phi_{ij}$ in the recursion (3.28) can be computed either (i) by direct computation using (3.20), or (ii) by filtering $\{d_n\}$ using the all-pole model of the bandwidth-expanded synthesis filter [15]. The cross-correlation terms are updated using (3.23). The pulse optimization in (3.27) uses a modified form of Cholesky decomposition, thus avoiding square-root operations. But the computational cost here is the large memory required to store the full coefficient matrix of $N_p \times N_p$ elements.
(B) Pitch-interpolation method
In pitch-interpolation method [140], the pulse-position is obtained by interpolating the pitch-period
so as to minimize the weighted mean-squared error. In the multi-pulse excitation model [8], the excitation signal is a combination of pulses that excites a synthesis filter to produce synthesized speech s˜[n].
Hence, sequential sub-optimum pulse search method, i.e., re-optimization after determining each pulse
(discussed in section 3.4.2(C)) was considered a better solution [15, 140], than the sequential method or
re-optimization after determining all pulses. This method is based on analysis-by-synthesis approach.
Synthesis filter parameters {ak } are computed from the original speech using LPC analysis. The error
weighting filter H(γz) is used to reduce the perceptual distortion. The pulse is determined in such a way
that the weighted mean-squared error e¯2 [n] given by (3.14), is minimized. The pulse search methods
based upon this error-power minimization require maximum cross-correlation αm , as given in (3.18).
A simplified method of maximum cross-correlation αm gives the optimum location mi of the ith
pulse, which is determined by searching the maximum absolute point of amplitude Am for the pulse at
location mi . In this method [140], the pulse amplitude Am at location m is given by
$$A_{m_i} = \frac{\alpha_{hs}(m_i) - \sum_{j=1}^{i-1} A_{m_j}\, \phi_{hh}(|m_j - m_i|)}{\phi_{hh}(0)}, \qquad 1 \le m_i, m_j \le N \tag{3.29}$$
where $N$ is the number of samples in the block, $\alpha_{hs}(m_i)$ is the cross-correlation between the weighted speech ($s[n]$) and the weighted impulse response ($h[n-m]$) of the all-pole synthesis filter, $\phi_{hh}$ is the autocorrelation of the weighted impulse response ($h[n-m]$), and $A_{m_j}$ are the amplitudes of the previously determined pulses up to the $i$th location. As in (3.18) and (3.19), the correlation terms $\alpha_{hs}$ and autocorrelation terms $\phi_{hh}$ are given by
$$\alpha_{hs}(m_i) = \sum_{n} s[n]\, h[n - m_i] \tag{3.30}$$
$$\phi_{hh}(i, j) = \sum_{n} h_{n-m_i}\, h_{n-m_j} \tag{3.31}$$
Since searching exhaustively over all possible pulse locations, i.e., for $n = 1$ to $M$, would be computationally expensive, different computationally efficient methods are used.
In this method [140], the excitation signal ($u[n]$) for voiced speech segments is represented by a small number of pulses (i.e., multi-pulse) in each pitch period. Each excitation signal frame consists of several pitch periods. The original speech ($s[n]$) in a frame of size 20 ms is divided into several sub-frames with durations of successive pitch periods. The synthesis filter parameters are interpolated in a pitch-synchronous manner, to get a smooth change in the spectral characteristics of the synthesized speech [140]. Several pitch periods are searched for selecting one suitable pitch period. Using this chosen duration, the pitch interpolator reproduces the excitation signal for the other pitch periods in the frame by performing a linear interpolation. Usually 4 pulses are considered in a pitch period, for a sampling frequency ($f_s$) of 8 KHz. By exciting a synthesis filter with this excitation signal, the synthesized speech ($\tilde{s}[n]$) is produced, which approximates the original speech ($s[n]$) in the frame.
(C) SPE-CELP method
In the single-pulse excitation (SPE)-CELP method [61], a single-pulse instead of multi-pulse is used
in a pitch-period of voiced speech. The conventional CELP coding [165] does not provide appropriate
periodicity of pulses in synthesized speech, especially for bit-rates below 4 Kbits/sec. It is because, both
the small code-book size and the coarse quantization of gain factor cause large fluctuations in the spectral
characteristics between two periods. In order to achieve smoothness of spectral changes, the excitation
consisting of a single-pulse of fixed or slowly varying shape for each period was proposed [61]. First
a LP coder classifies speech into periodic and non-periodic intervals. Then, non-periodic speech is
synthesized like that in CELP [140] coding, and periodic speech using single-pulse excitation. This
coder uses an algorithm for determining the pitch markers within short blocks of periodic speech.
The excitation for the all-pole synthesis filter in CELP coding [165] is modeled as a sum of two
vectors, an adaptive codebook that contains past excitation, and a fixed stochastic codebook. Selection
of both vectors uses the criterion of minimum perceptually-weighted error between original and reconstructed speech [165, 61]. Here, repeating the past excitation is necessary for obtaining the periodic
excitation. But, the stochastic code-book significantly reduces the ability to produce a periodic excitation. Also, the stochastic code-book vectors of fixed block-size cause large fluctuations in non-smooth
spectrum of reconstructed speech, thereby giving rise to noise in this case [61]. Hence, these problems
in CELP coding are solved using a pulse-like signal with fixed or slowly time-varying shape, i.e., a pulse
defined by a single delta-impulse or a cluster of several delta-impulses is considered for representing the
excitation in each period. Since, a single-pulse excitation can be better described by the time-location of
each pulse, along with its shape and gain, the parametric representation of a single pulse also enables the
interpolation of excitation parameters. This method, which uses both (i) a fixed delta-impulse shape of single-pulse excitation to synthesize periodic speech and (ii) a CELP-like stochastic codebook to synthesize non-periodic speech, is referred to as the SPE-CELP coder [61].
Determining the pitch-markers in SPE-CELP coding method [61]:
In order to determine the pitch-markers, the speech signal is encoded in coding frames of size of
200 samples, i.e., 25 ms for sampling rate of 8 KHz. Each coding frame is subdivided into 4 sub-frames
of 50 samples. Using long-term auto-correlation of the pre-processed speech signal within a window
around a sub-frame, the periodic/non-periodic classification is made for each sub-frame. Average pitch
period p¯ is determined for each periodic sub-frame. After each non-periodic-to-periodic transition, a
sequence of up to 5 periodic sub-frames is created, to form an optimization frame. Pitch markers are
determined for each optimization frame, that define the optimal locations for excitation by single-pulse
of delta-impulse shape using an error-criterion. The error-criterion includes a cost function f (i, j, k),
and is implemented efficiently by using dynamic programming in its optimization procedure [61].
Let us assume that an LPC synthesis filter is excited with a single pulse at location $n$ with amplitude $\alpha$, for the coding frame $m$ and maximum SNR $S_{opt}(n)$. The impulse-response vector is $\vec{h}_0 = [h(0), \ldots, h(2N-1)]^T$, where $N$ is the length of a periodic sub-frame. The error between the speech vector $\vec{x} = [x(0), x(1), \ldots, x(2N-1)]^T$ and the delayed impulse-response vector $\vec{h}_n$ multiplied by the amplitude $\alpha$ is
$$\vec{e}_n = \vec{x} - \alpha\, \vec{h}_n \tag{3.32}$$
where $\vec{h}_n = [0, \ldots, 0, h(0), \ldots, h(2N-1-n)]^T$ and $n = 0, \ldots, N-1$. Minimization of the error energy w.r.t. the pulse amplitude $\alpha$ [61] gives
$$\min_{\alpha}\, \vec{e}_n^{\,T} \vec{e}_n = \vec{x}^{\,T} \vec{x} - \frac{(\vec{x}^{\,T} \vec{h}_n)^2}{\vec{h}_n^{\,T} \vec{h}_n} \tag{3.33}$$
$$\alpha_{opt}(n) = \frac{\vec{x}^{\,T} \vec{h}_n}{\vec{h}_n^{\,T} \vec{h}_n} \tag{3.34}$$
$$S_{opt}(n) = \max_{\alpha}\, SNR(n) = 10 \log_{10}\!\left[\frac{\vec{x}^{\,T} \vec{x}}{\min_{\alpha} \vec{e}_n^{\,T} \vec{e}_n}\right] \tag{3.35}$$
Here, the pulse amplitude $\alpha_{opt}(n)$ and the SNR $S_{opt}(n)$ are computed for each time instant $n$ in an optimization frame $m$. A candidate pitch marker at $n_i$ is represented by a 3-tuple $z_i = \langle n_i, \alpha_{opt}(n_i), S_{opt}(n_i) \rangle$, where $Z = \{z_i,\, i = 1, \ldots, M\}$, and $M$ is the maximum number of optimal sub-frames in the speech block. For the optimum sequence $s_{opt}$, the accumulated cost $C$ is minimized using the cost function $f(i, j, k)$. For the indices $Q = \{q_1, \ldots, q_K \mid q_t \in [1, M],\, n_{q_i} > n_{q_{i-1}}\}$, the values of $s_{opt}$, $C$ and $f(i, j, k)$ [61] are:
$$s_{opt} = \{z_{q_1}, \ldots, z_{q_K}\}, \qquad K \ge 2 \tag{3.36}$$
$$C = \min_{s} \frac{1}{K} \left[ f_{ini}(i{=}q_1) + \sum_{l=2}^{K} f(i{=}q_l,\, j{=}q_{l-1},\, k{=}q_{l-2}) \right] \tag{3.37}$$
$$f(i,j,k) = \frac{a}{S_{opt}(n_i)} + b \ln\frac{\alpha_{opt}(n_i)}{\alpha_{opt}(n_j)} + c \ln\frac{(n_i - n_j)}{(n_j - n_k)} + d \ln\frac{(n_i - n_j)}{\bar{p}(n_i)}, \qquad n_i > n_j > n_k \tag{3.38}$$
The cost function f (i, j, k) in (3.38), which gives the accumulated cost C, has 4 summation terms.
Their purpose is to penalize (i) the candidate with low Sopt (ni ), (ii) inconsistency in amplitude of two
successive pulse candidates, (iii) inconsistency in two successive pulse-intervals (ni −nj ) and (nj −nk ),
and (iv) a deviation in the pulse interval (ni − nj ) from initial estimate of average pitch period p¯(ni ).
The initial cost $f_{ini}(i{=}q_1)$ in (3.37) is computed as
$$f_{ini}(i) = \frac{a}{S_{opt}(n_i)} + d \ln\frac{n_i}{\bar{p}(n_i)} + f_{fix}, \qquad n_i > \bar{p}(n_i)$$
$$f_{ini}(i) = \frac{a}{S_{opt}(n_i)} + f_{fix}, \qquad\qquad\quad\; n_i \le \bar{p}(n_i) \tag{3.39}$$
where $f_{fix}$ is a constant and $f_{ini}(i{=}q_1) = f(i{=}q_1,\, j{=}q_0,\, k{=}q_{-1}) + f_{fix}$. The factors $a$, $b$, $c$ and $d$ in (3.38) are determined empirically, for proper weighting of all 4 summation terms. The indices $Q = \{q_1, \ldots, q_K\}$ of the best $s_{opt}$ (optimal sequence) define the locations of the pitch markers within the current frame.
In this SPE-CELP method [61], the gain factors for encoding/transmission are jointly optimized by minimizing a perceptually weighted mean-squared error criterion, as in the CELP coding method [165]. The pitch markers, used for detecting individual periods of periodic speech within coding frames, are determined using dynamic programming. But the limitation of this method is the poor naturalness of the synthesized speech, which, due to the fixed pulse shape used, still sounds buzzy for certain speakers [61].
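The per-location single-pulse optimization of eqs. (3.33)-(3.35) can be sketched as below: for each candidate location $n$, the optimal amplitude and the resulting SNR are computed from the speech vector and the delayed impulse response. The dynamic-programming pitch-marker selection of (3.36)-(3.39) is not reproduced here, and the signal and filter values are assumptions.

import numpy as np
from scipy.signal import lfilter

def single_pulse_candidates(x, h):
    """Compute alpha_opt(n) (eq. 3.34) and S_opt(n) in dB (eq. 3.35)
    for every candidate pulse location n in a sub-frame of length N."""
    N = len(x) // 2                        # x has length 2N, per the text
    alphas, snrs = [], []
    for n in range(N):
        h_n = np.zeros(len(x))
        h_n[n:] = h[:len(x) - n]           # delayed impulse-response vector h_n
        num = x @ h_n
        den = h_n @ h_n
        alpha = num / den                                    # eq. (3.34)
        err_energy = x @ x - num ** 2 / den                  # eq. (3.33)
        snrs.append(10.0 * np.log10((x @ x) / err_energy))   # eq. (3.35)
        alphas.append(alpha)
    return np.array(alphas), np.array(snrs)

# Illustrative usage with made-up data
a = np.array([1.0, -1.2, 0.8])
h = lfilter([1.0], a, np.r_[1.0, np.zeros(159)])   # impulse response h(0..2N-1), N = 80
x = np.random.randn(160)                            # stand-in speech vector of length 2N
alpha_opt, S_opt = single_pulse_candidates(x, h)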
3.4.4 Variations in MPE model of the excitation
(A) Changing pitch and duration in MPE
The use of multi-pulse excitation [8] led to significant improvement in the quality of synthetic speech, but it did not provide an appropriate degree of periodicity in the synthesized signal [61]. Coding delays were also long. The single-pulse excitation model did reduce the coding delays. However, procedures for changing the pitch were not known for multi-pulse excitation. Hence, two methods were proposed for modifying the length of individual pitch periods, which in turn changed the pitch [27]. (i) Linear scaling of the time axis of the multi-pulse excitation, which introduced more
distortion due to addition/removal of pitch periods that was required to change the pitch period duration.
(ii) Adding or deleting zeros in the excitation pulse sequence, which produced little distortion in the
synthetic speech [27]. In the cases where additional pitch periods were created, the amplitudes of the
major excitation pulse and the LPC parameters were obtained by interpolation [27].
(B) Post-filtering
Other than the speech coding methods discussed so far, approaches like adaptive sub-band coding (SBC) and adaptive transform coding (ATC), which represent frequency-domain coders, were also attempted. Post-filtering, using the auditory masking properties, was used in these coders [145].
The post-filtering scheme is based on long-term and short-term spectral information of synthesized
speech. (i) The long-term correlation represented by pitch parameters gives fine spectral information.
(ii) The short-term prediction of LPC coefficients gives global spectral shape information [93].
The optimal long-term and short-term predictors are expressed as
$$H_L(z) = 1 - \beta z^{-M} \tag{3.40}$$
$$H_S(z) = 1 - \sum_{i=1}^{M} q_i z^{-i} \tag{3.41}$$
where $H_L$ and $H_S$ are the transfer functions of the long-term and short-term predictors, respectively. The corresponding post-filters are given by
$$\frac{1}{P'(z)} = \frac{1}{P(\epsilon^{1/M} z)}, \qquad \frac{1}{H_S'(z)} = \frac{1}{H_S(z/\alpha)} \tag{3.42}$$
where $0 \le \epsilon < 1$ and $0 \le \alpha < 1$ are the parameters that vary the impulse response of the post-filter between the responses of an all-pass filter ($\alpha = \epsilon = 0$) and a low-pass filter ($\alpha = \epsilon = 1.0$). Thus, by varying the parameters $\alpha$ and $\epsilon$, the degree of noise shaping and signal distortion is changed [145]. For a suitable $\epsilon$, the post-filtering attenuates the valleys of the comb filter, but the disadvantage is a bandwidth increase by the factor $\alpha$.
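A minimal sketch of the short-term post-filter $1/H_S(z/\alpha)$ of (3.42) is given below; the long-term (pitch) post-filter is omitted, and the predictor coefficients, the value of alpha and the input are assumptions for illustration only.

import numpy as np
from scipy.signal import lfilter

def short_term_postfilter(speech, q, alpha=0.8):
    """Apply the short-term post-filter 1 / H_S(z/alpha) of eq. (3.42),
    where H_S(z) = 1 - sum_i q_i z^-i (eq. 3.41), so that
    H_S(z/alpha) = 1 - sum_i q_i alpha^i z^-i."""
    den = np.concatenate(([1.0], -q * alpha ** np.arange(1, len(q) + 1)))
    return lfilter([1.0], den, speech)

# Illustrative usage with hypothetical short-term predictor coefficients
q = np.array([1.3, -0.7, 0.2, -0.05])
decoded = np.random.randn(800)            # stand-in for synthesized/decoded speech
postfiltered = short_term_postfilter(decoded, q, alpha=0.8)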
(C) Secondary pulses and phase changes in MPE
The presence of secondary excitation pulses after the glottal closure, apart from the primary excitation pulses present due to glottal opening/open phase, was also indicated in [8]. The additional pulses
were introduced in pitch-periods, to reduce the inaccuracies introduced in the successive optimization
of individual pulses, especially in the case of closely spaced pulses [173]. But, these secondary pulses
in the multi-pulse excitation did not show periodic behaviour anywhere similar to the periodicity of
input speech [6]. Hence, repetition of a single multi-pulse pattern, selected randomly across all voiced
segments, was considered for the purpose of producing synthetic speech [6]. Other variations in the
multi-pulse excitation, by changing pitch period and duration, were also attempted [27].
But, it was observed that secondary pulses in multi-pulse excitation do not vary systematically from
one pitch period to another, even for periodic speech input [26]. The multi-pulse excitation is highly
Figure 3.1 Schematic block diagram of the ZFF method
periodic in lower frequency bands and is less periodic in higher frequency bands [26]. Hence, replacing
the original multi-pulse pattern in the excitation, with a fixed multi-pulse pattern was proposed in [26].
This fixed multi-pulse pattern was selected randomly from the multi-pulse excitation of a voiced speech
segment. Spectral and phase characteristics of pulses were changed, to introduce irregularities in the
fine structure of the excitation. This helped in removing the buzzing sound effect present in CELP or SPE-CELP coded speech [26]. Six different phase conditions were examined: zero phase, constant phase,
time-varying phase, frequency dependent group-delay characteristics based phase, time-varying phase
of the first harmonic of the LPC residual and original phase [26]. It was observed that introducing these
period-to-period irregularities is necessary to provide more naturalness to the synthesized speech [26].
3.5 Zero-frequency filtering method
The characteristics of the glottal source of excitation are derived from the acoustic signal using the
zero-frequency filtering (ZFF) method [130, 216]. In ZFF, the features of the impulse-like excitation at
the glottal source are extracted by filtering the differenced acoustic signal through a cascade of two zero-frequency resonators (ZFRs). Steps in the ZFF method are shown in the schematic block diagram in
Fig. 3.1. The key steps involved [130] are as follows:
(a) A differenced speech signal s[n] is considered. This preprocessing step removes the effect of any
slow (low frequency) variations during recording of the signal and produces a zero mean signal.
(b) The differenced signal s[n] is passed through a cascade of two ZFRs, each of which is an all-pole
system with two poles located at z = +1 in the z-plane. The output of the cascaded ZFRs is given
by
$$y_1[n] = -\sum_{k=1}^{4} a_k\, y_1[n-k] + s[n], \tag{3.43}$$
where a1 = −4, a2 = 6, a3 = −4 and a4 = 1. It is equivalent to a sequence of four successive cumulative sum (integration) operations in time-domain, which leads to a polynomial-like
growth/decay of the ZFR output signal.
Figure 3.2 Results of the ZFF method for different window lengths for trend removal, for a segment of Noh singing voice. Panels: (a) input voice signal s[n]; (b) ZF resonator output y1[n]; (c) to (f) ZFF output y2[n] for trend-removal window lengths of 9, 7, 5 and 3 msec, respectively. Epoch locations are indicated by inverted arrows.
(c) The fluctuations in the ZFR output signal can be highlighted using a trend removal operation, which
involves subtracting the local mean from the ZFR output signal at each time instant. The local mean
is computed over a moving window of size 2N + 1 samples. The window size is about 1.5 times
the average pitch period (in samples), which is computed using autocorrelation function of a 50 ms
segment of the signal. The output of the trend removal operation is given by
$$y_2[n] = y_1[n] - \frac{1}{2N+1} \sum_{m=-N}^{N} y_1[n+m], \tag{3.44}$$
where N is the number of samples in the half window. The resultant local-mean-subtracted signal is called the zero-frequency filtered (ZFF) signal. An illustration of the ZFF signal (y2[n]) is shown in Fig. 3.2(c), which is derived from the corresponding speech signal (s[n]) shown in Fig. 3.2(a).
The trend built-up is shown in Fig. 3.2(b).
(d) The positive to negative going zero-crossings correspond to the instants of glottal closure (GCIs),
which are also referred to as epochs [130]. The interval between successive epochs corresponds
to the fundamental period (T0 ), inverse of which gives the instantaneous fundamental frequency
(F0 ) [216].
(e) The slope of the ZFF signal around the epochs gives a measure of the strength of the impulse-like
excitation (SoE) [130, 216]. The SoE (denoted by ψ) at an epoch thus represents the amplitude of
impulse-like excitation around that instant of significant excitation (i.e., GCI).
Figure 3.3 Schematic block diagram of the ZTL method
As can be observed in Fig. 3.2, the effect of the window length on trend removal is not significant as long as it is chosen between 1 and 2 times the pitch period, which is adequate for normal speech. But the role of the window length becomes important when the pitch period is reduced or is changing rapidly, as is likely to be the case for sounds other than normal speech. Reducing the window length for trend removal, as in Fig. 3.2(d)-(f), may not help in such cases. The need for special signal processing methods and a modification of the ZFF method are discussed in later chapters, in the context of paralinguistic sounds and expressive voices.
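A minimal sketch of the ZFF steps (a)-(e) above is given below, assuming a mono signal already loaded as a NumPy array. For simplicity, the average pitch period used for the trend-removal window is passed in as a fixed assumption rather than computed from the autocorrelation of a 50 ms segment.

import numpy as np

def zff(signal, fs, avg_pitch_period_s=0.008):
    """Zero-frequency filtering (steps a-e): returns the ZFF signal,
    epoch locations (GCIs) and strength of excitation (SoE) at epochs."""
    s = np.diff(signal)                               # (a) differenced signal
    # (b) cascade of two zero-frequency resonators (four cumulative sums)
    y1 = s.copy()
    for _ in range(4):
        y1 = np.cumsum(y1)
    # (c) trend removal: subtract the local mean over ~1.5 pitch periods
    N = int(1.5 * avg_pitch_period_s * fs) // 2       # half window in samples
    win = 2 * N + 1
    y2 = y1 - np.convolve(y1, np.ones(win) / win, mode='same')
    # (d) epochs (GCIs): positive-to-negative zero crossings of the ZFF signal
    epochs = np.where((y2[:-1] > 0) & (y2[1:] <= 0))[0]
    # (e) strength of excitation: slope of the ZFF signal around each epoch
    soe = np.abs(y2[epochs + 1] - y2[epochs])
    return y2, epochs, soe

# Illustrative usage on a synthetic signal (replace with a real recording)
fs = 10000
t = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
y2, epochs, soe = zff(sig, fs, avg_pitch_period_s=1 / 120)
f0 = fs / np.diff(epochs)        # instantaneous F0 from epoch intervals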
3.6 Zero-time liftering method
The recently proposed zero-time liftering (ZTL) method is used to capture the spectral features of the
speech signal with improved temporal resolution [40, 38]. The method involves multiplying a segment
of the signal starting at each sampling instant with a tapering window that gives large weight to the
samples near the starting sampling instant, which we call the zero-time. The effect of the tapering window
function w1 [n] in the time domain is approximately equivalent to integration in the frequency domain.
This inference is derived from the fact that the operation of double integration in the time domain is equivalent to filtering with the function $\left(\frac{1}{1-z^{-1}}\right)^2$, which is the same as multiplying the frequency response with the frequency response of an ideal digital resonator $\left(\frac{1}{1-z^{-1}}\right)^2$ with resonance frequency at $\omega = 0$,
i.e., the zero frequency. In analogy with the zero-frequency filtering [130], the time domain signal is
multiplied with w1 [n] to produce a smoothed spectrum in the frequency domain by integration. Steps
involved in the ZTL method are shown in the schematic block diagram in Fig. 3.3. The following are the
steps involved in extracting the instantaneous spectral characteristics using the ZTL method [40, 38].
(a) Consider differenced speech signal s[n] at a sampling rate of 10000 samples per second. The
differenced signal is used to reduce the effects of slowly varying components in the signal due
to recording.
(b) Consider a 5 msec segment (i.e., M = 50 samples) at each sampling instant, and append it with N − M (= 1024 − 50) zeros. The segment is appended with a sufficient number of zeros before computing the N-point discrete Fourier transform (DFT), to get an adequate number of samples in the frequency domain for observing the spectral characteristics.
(c) Multiply the N-sample segment with a window function $w_1^2[n]$, where
$$w_1[n] = 0, \qquad\qquad\qquad\;\; n = 0,$$
$$\quad\;\;\, = \frac{1}{4 \sin^2(\pi n / 2N)}, \qquad n = 1, 2, \ldots, N-1, \tag{3.45}$$
(This gives an approximation to four-times integration in the frequency domain, as M << N. Actually, a window function of $1/n^4$ results in an integration operation in the frequency domain. But, as mentioned above, the window function $w_1[n]$ is chosen analogous to the zero-frequency filtering in the frequency domain.) Multiplying the signal with the window function $w_1[n]$ is called the zero-time liftering (ZTL) operation. This window function emphasizes the values near the beginning of
the window, i.e., near n = 0, due to its heavy tapering effect. This will help in producing a smoothed
function in the frequency domain.
(d) The truncation effect of the signal at the sampling instant M − 1 in the time domain results in ripple
in the frequency domain. This ripple is reduced by using another window $w_2[n] = 4 \cos^2(\pi n / 2M)$, $n = 0, 1, \ldots, M-1$, which is the square of a half-cosine window of length M samples. The shape of this
window is not critical to the results, except that it should have a tapering effect towards the end of
the segment.
(e) The squared magnitude of the N-point DFT of the double-windowed signal, i.e., of $x[n] = w_1^2[n]\, w_2[n]\, s[n]$, is taken. It is a smoothed spectrum due to the effect of the equivalent four-times integration in the frequency domain.
(f) In order to highlight the spectral features, the numerator of the group delay (NGD) function $g[k]$ of the windowed signal (i.e., of $x[n] = w_1^2[n]\, w_2[n]\, s[n]$) is computed [215, 38]. The group delay
function τ [k] of a signal x[n] is given by [212, 75]
τ [k] =
XR [k]YR [k] + XI [k]YI [k]
,
2 [k] + X 2 [k]
XR
I
k = 0, 1, 2, ....., N − 1
(3.46)
where XR [k] and XI [k] are the real and imaginary parts of the N -point DFT X[k] of x[n], respectively, and YR [k] and YI [k] are the real and imaginary parts of the N -point DFT Y [k] of
y[n] = nx[n], respectively. The numerator of the group delay function g[k] is given by [75]
g[k] = XR [k]YR [k] + XI [k]YI [k],
k = 0, 1, 2, ....., N − 1
(3.47)
The group delay function has higher frequency resolution than the normal spectrum due to its additive property, i.e., the group delay function of a cascade of resonators is approximately the sum of the squared frequency responses of the individual resonators [212]. Moreover, due to the smoothed (cumulative) nature of the spectrum, the numerator of the group delay function displays the resonances with even higher frequency resolution, since the spectral peaks are more pronounced [75].
Figure 3.4 HNGD plots through ZTL analysis. (a) 3D HNGD spectrum (perspective view). (b) 3D
HNGD spectrum plotted at epoch locations (mesh form). The speech segment is for the word ‘stop’.
(g) The resulting NGD function is differenced twice to highlight the spectral features, i.e., the peaks in the spectrum corresponding to the resonances (formants) of the vocal tract system. The double differencing is needed to remove the trend and to observe the spectral features. Note that the differencing operation is performed with respect to the discrete frequency variable (k), whereas the liftering operation in the time domain is equivalent to integration with respect to the continuous frequency variable (ω).

(h) Due to the effect of some spectral valleys, some of the low-amplitude peaks do not appear well in the double-differenced NGD plots. Hence, the Hilbert envelope of the double-differenced NGD spectrum is computed to represent the spectral features visually better [137, 40]. The resulting spectrum is called the HNGD spectrum.
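For concreteness, a minimal Python sketch of steps (a)-(h) is given below, computing the HNGD spectrum at one analysis instant. The function name, the zero-padding handling and the default parameter values are illustrative assumptions for this sketch, not part of the method's specification; only standard numpy and scipy routines are used.

    import numpy as np
    from scipy.signal import hilbert

    def hngd_at_instant(s, n0, fs=10000, seg_ms=5, nfft=1024):
        """Sketch of the ZTL steps (a)-(h) at one analysis instant n0 (sample index).
        Assumes s is a 1-D numpy array sampled at fs, with n0 + M within its length."""
        x = np.diff(s, prepend=s[0])                  # (a) differenced speech signal
        M = int(fs * seg_ms / 1000)                   # (b) 5 ms segment, M = 50 samples
        n = np.arange(nfft)

        w1 = np.zeros(nfft)                           # (c) heavily tapering ZTL window
        w1[1:] = 1.0 / (4.0 * np.sin(np.pi * n[1:] / (2.0 * nfft)) ** 2)
        w2 = np.zeros(nfft)                           # (d) squared half-cosine window
        w2[:M] = 4.0 * np.cos(np.pi * np.arange(M) / (2.0 * M)) ** 2

        seg = np.zeros(nfft)
        seg[:M] = x[n0:n0 + M]                        # M samples, appended with N - M zeros
        xw = (w1 ** 2) * w2 * seg                     # x[n] = w1^2[n] w2[n] s[n]

        X = np.fft.fft(xw, nfft)                      # (e) N-point DFT
        Y = np.fft.fft(n * xw, nfft)                  # DFT of y[n] = n x[n]
        g = X.real * Y.real + X.imag * Y.imag         # (f) NGD function, Eq. (3.47)
        d2g = np.diff(g, n=2)                         # (g) double differencing removes the trend
        return np.abs(hilbert(d2g))                   # (h) Hilbert envelope: the HNGD spectrum

Evaluating such a function at every sampling instant (or only at the epochs) gives 3D HNGD plots of the kind shown in Fig. 3.4.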
A 3-dimensional (3D) HNGD plot for a segment of speech signal is shown in Figure 3.4(a). Note that the HNGD plots are computed at every sampling instant. The HNGD spectrum can also be sliced temporally at every glottal closure instant (epoch), to view it in a 3D mesh form, as shown in Figure 3.4(b). It is the temporal variations in the instantaneous HNGD spectra over the open and closed phases of glottal pulses that are exploited in this study to discriminate between shouted speech and normal speech, as discussed later in Chapter 5.
3.7 Methods to compute dominant frequencies
3.7.1 Computing dominant frequency from LP spectrum
The production characteristics of speech reflect the roles of both the excitation source and the vocal tract system. The vocal tract characteristics are derived from the LP spectrum, obtained using the
Figure 3.5 Illustration of LP spectrum for a frame of speech signal.
LP analysis method [112]. The key idea in LP analysis is that each sample xn of the speech signal x[n] at time instant n is predictable as a linear combination of the previous p samples:

    x̂n = Σ_{k=1}^{p} ak xn−k                                                          (3.48)

where {xn} are the speech samples in a given frame, {ak} are the predictor coefficients and {x̂n} are the predicted samples. An all-pole filter H(z) given by these coefficients {ak} in the frequency domain is:
    H(z) = 1 / (1 − Σ_{k=1}^{p} ak z^{−k})                                             (3.49)
where p is the order of linear prediction. The energy of the prediction error signal (Ep) is:

    Ep = Σ_n ( xn − Σ_{k=1}^{p} ak xn−k )^2                                            (3.50)
This energy (Ep ) is minimized by setting the partial derivative of Ep with respect to each coefficient ak
to zero.
The LP spectrum in the frequency domain is obtained from the LPCs {ak} derived above [112]. The shape of the LP spectrum represents the resonance characteristics of the vocal tract for a frame of speech signal. An illustration of the LP spectrum for a frame of speech signal is shown in Fig. 3.5. The production characteristics of shouted speech are derived from the LP spectrum in Chapter 5.
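A minimal sketch of this computation is given below, using the autocorrelation method to solve the normal equations that result from minimizing Ep in Eq. (3.50). The function name, the Hamming window and the default LP order are illustrative assumptions, not values prescribed by the thesis; only standard numpy/scipy routines are used.

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from scipy.signal import freqz

    def lp_spectrum(frame, p=10, nfft=1024, fs=10000):
        """Sketch: LP coefficients {a_k} by the autocorrelation method and the
        corresponding LP magnitude spectrum of the all-pole filter in Eq. (3.49)."""
        frame = frame * np.hamming(len(frame))                        # taper the analysis frame
        r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]  # autocorrelation r[0..]
        a = solve_toeplitz((r[:p], r[:p]), r[1:p + 1])                # solve R a = r (normal equations)
        den = np.concatenate(([1.0], -a))                             # 1 - sum_k a_k z^-k
        w, H = freqz([1.0], den, worN=nfft, fs=fs)                    # frequency response of H(z)
        return a, w, 20.0 * np.log10(np.abs(H) + 1e-12)               # LP spectrum in dB (as in Fig. 3.5)

The dominant peak of the returned LP spectrum (marked FD in Fig. 3.5) can then be located by simple peak picking.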
3.7.2 Computing dominant frequency using group delay method and LP analysis
The effect of system-source coupling is examined using the first two dominant frequencies FD1
and FD2 of the short-time spectral envelope. The features FD1 and FD2 of the vocal tract system are
derived using the group-delay function [128] of the linear prediction (LP) spectrum [112]. The features FD1
and FD2 represent the resonances in the vocal tract system. The steps involved are as follows:
(a) The vocal tract system characteristics are derived using LP analysis [112]. Let a1, a2, ..., ap be the p LP coefficients. The corresponding all-pole filter H(z) is given by:

    H(z) = 1 / (1 − Σ_{k=1}^{p} ak z^{−k})                                             (3.51)
For a 5th-order filter, there will be at most two peaks in the LP spectrum, corresponding to two complex-conjugate pole pairs. The frequencies corresponding to these peaks are called dominant peak frequencies, and are denoted as FD1 and FD2. These may correspond to formants, but not necessarily.
(b) The group delay function τg(ω) is the negative first derivative of the phase response of the all-pole filter [128, 129], and is given by

    τg(ω) = − dθ(ω)/dω                                                                 (3.52)

where θ(ω) is the phase angle of the frequency response H(e^{jω}) of the all-pole filter. The frequency locations of the peaks in the plot of the group delay function give the dominant frequencies FD1 and FD2.
The dominant frequencies FD1 and FD2 are derived for each signal frame using pitch-synchronous LP analysis, anchored around the GCIs. These features are used mainly in the analysis of acoustic loading effects on glottal vibration in normal speech and in the analysis of laughter signals, which are discussed in later chapters.
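Continuing the sketch given for the LP spectrum, the dominant frequencies can be read off the group delay function of the all-pole filter. The sketch below uses scipy's group_delay and a simple peak picker; its parameter choices are illustrative rather than those used in the thesis experiments.

    import numpy as np
    from scipy.signal import group_delay, find_peaks

    def dominant_frequencies(a, fs=10000, nfft=1024):
        """Sketch: FD1 and FD2 as the first two peaks of the group delay function
        tau_g(w) of the all-pole LP filter (Eqs. 3.51 and 3.52). `a` is [a_1, ..., a_p]."""
        den = np.concatenate(([1.0], -np.asarray(a)))        # denominator of H(z)
        w, tau = group_delay(([1.0], den), w=nfft, fs=fs)    # tau_g(w) = -d(theta)/dw
        peaks, _ = find_peaks(tau)                           # peaks correspond to resonances
        freqs = w[peaks]
        # For a low LP order (e.g., p = 5), at most two peaks are expected.
        fd1 = freqs[0] if len(freqs) > 0 else None
        fd2 = freqs[1] if len(freqs) > 1 else None
        return fd1, fd2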
3.8 Challenges in the existing methods and need for new approaches
Representing the excitation information in terms of a sequence of impulse-like pulses was attempted in various speech coding methods, with the aim of reducing the bit-rate of speech coding or of increasing the voice quality of synthesized speech. This representation was attempted only for modal voicing in normal speech. To the best of our knowledge, the impulse-sequence representation of the excitation information has not yet been used for analysing emotional speech, paralinguistic sounds and expressive voices. Also, the presence of secondary pulses within a pitch period was indicated in a few studies on multi-pulse excitation [8, 6, 173, 27]. It would be interesting to explore whether these secondary pulses can help in distinguishing nonverbal speech sounds from normal speech.
In order to better capture the rapid changes in the production characteristics, mainly of the excitation source, some modified or new signal processing methods are required that use a time-domain impulse-sequence representation of the excitation component in the acoustic signal.
An impulse-sequence representation of the excitation source that is guided by pitch perception is yet another research challenge. Although the perceptual aspect was considered in CELP coding [8], the use of pitch perception in this representation has rarely been explored. A good measure of pitch may also need to be incorporated into the impulse-sequence representation of the excitation signal.
Several methods have evolved for estimating the instantaneous fundamental frequency (F0) and the pitch of a speech signal. A further challenge is to estimate F0 (pitch) in regions of aperiodicity, or in regions where the signal is apparently random, since this is the likely scenario in paralinguistic sounds and expressive voices. Further, extracting F0 from the excitation impulse sequence, guided by pitch perception, is another challenging problem. In this thesis, a few new methods are proposed to address some of these issues and are used for analysing the nonverbal speech sounds.
3.9 Summary
In this chapter, some standard techniques and a few recently proposed signal processing methods that are used in this thesis work are described. The zero-frequency filtering and zero-time liftering methods, used for extracting the excitation source and the spectral characteristics, respectively, are discussed. Methods for computing the first two dominant resonance frequencies of the vocal tract system are also discussed. These methods are used for the analysis of source-system interaction in normal (verbal) speech, and also for analysing shouted speech in the emotional speech category of sounds. The analysis of other nonverbal sound categories, in particular the paralinguistic sounds and expressive voices, requires further specialized signal processing methods, which are newly proposed in this research work. These methods are discussed at the relevant places in later chapters.
Chapter 4
Analysis of Source-System Interaction in Normal Speech
4.1 Overview
In this chapter, we examine the changes in the production characteristics across variations of normal (verbal) speech sounds. Mainly the effects of source-system coupling and of acoustic loading due to source-system interaction, involved in the production of some specific sounds such as trills, are examined. First, the significance of the changing vocal tract system and the associated changes in the glottal excitation source characteristics due to trilling are studied from a perception point of view. In these studies, the speech signal is generated (synthesized) by retaining either the features of the vocal tract system or those of the glottal excitation source of trill sounds. Experiments are conducted to understand the perceptual significance of the excitation source characteristics in the production of different trill sounds. Speech sounds of a sustained trill-approximant pair, and of apical trills produced at four different places of articulation, are considered. Features of the vocal tract system are extracted using linear prediction analysis, and those of the source by zero-frequency filtering. The studies indicate that the glottal excitation source apparently contributes relatively more to the perception of apical trill sounds.
Glottal vibration is the primary mode of excitation of the vocal tract system, for producing voiced
speech. Glottal vibration can be controlled voluntarily for producing different voiced sounds. Involuntary changes in glottal vibration due to source-system coupling are significant in the production of some
specific sounds. A set of six selected categories of sounds with different levels of stricture in the vocal tract is examined. The sound categories are: apical trills, apical lateral approximants, alveolar and velar variants of voiced fricatives, and alveolar and velar voiced nasals. These sounds are studied in the vowel context [a], in modal voicing. The acoustic loading effect on the glottal vibration, for each of the selected sound categories, is examined through the changes observed in the source and system characteristics. Both the speech signal and the electroglottograph signal are examined in each case. Features such as the instantaneous fundamental frequency, the strength of impulse-like excitation and the dominant resonance frequencies are extracted from the speech signal using the zero-frequency filtering method, linear prediction analysis and the group delay function. It is observed that a high stricture in the vocal tract, causing obstruction to the free flow of air, produces a significant loading effect on the glottal excitation.
Figure 4.1 Illustration of stricture for (a) an apical trill, (b) a theoretical approximant and (c) an approximant in reality. The relative closure/opening positions of the tongue tip (lower articulator) with respect to the upper articulator are shown.
The chapter is organized as follows. The relative roles of the source and the system in the production of trills are discussed in Section 4.2. The effect of source-system coupling is examined using analysis by synthesis and perceptual evaluation. The effect of acoustic loading of the vocal tract system on the excitation source characteristics is discussed in Section 4.3. Changes in the production characteristics of six consonant sound types are examined qualitatively from the waveforms, and quantitatively from the features derived from the acoustic and EGG signals. Section 4.4 summarizes the chapter, along with its key contributions.
4.2 Role of source-system coupling in the production of trills
4.2.1 Production of apical trills
Trills, a stricture type, involve vibration of an articulator (lower) against another articulator (upper)
due to aerodynamic constraints. Trills involving the lower articulator as the tip of the tongue are called
apical trills [100]. The tongue tip in apical trills vibrates against a contact point in the dental/alveolar
region. Production of an apical trill involves several aerodynamic and articulatory constraints. Aerodynamic constraints are related to tension at the tongue tip and volume velocity of air flow through the
stricture, both essential for the initiation and sustenance of the apical vibration. Articulatory constraints
are related to lingual and vocal tract configuration aspects [100, 40].
Production of apical trills can be characterized by three cyclic actions: (i) repeated breaking of the apical stricture due to the interaction between tongue tension and the volume velocity of air flow; (ii) partial falling of the tongue tip to partially release the positive pressure gradient in the oral cavity; and (iii) recoiling of the tongue tip to meet the upper articulator to form the next event of stricture. One such closure-opening-closure cycle of the stricture, shown in Fig. 4.1(a), is called a trill-cycle. The typical rate of tongue
Figure 4.2 (Color online) Illustration of waveforms of (a) input speech and (b) ZFF output signals, and
contours of features (c) F0 , (d) SoE, (e) FD1 (“•”) and FD2 (“◦”) derived from the acoustic signal for
the vowel context [a].
tip trilling, as measured from the acoustic waveform or the spectrogram, is about 20-30 Hz [98, 116]. Two to three cycles of apical trills are common in continuous speech, whereas more than three cycles may be produced in sustained production of the sound [40]. When the lower articulator (tongue tip) does not touch (or tap) the upper articulator completely, the production of the trill is as in Fig. 4.1(b). However, due to aerodynamic and articulatory constraints, the production of the trill in this case is mostly as shown in Fig. 4.1(c). This sound is called an approximant. Apical trills are common in some (Indian) languages like Telugu, Malayalam and Punjabi, whereas approximants are more common in languages like English.

The contact point of the upper articulator, against which the tongue tip vibrates, can be in different regions, such as bilabial, dental, alveolar and post-alveolar. These are referred to in this study as ‘trill sounds produced at different places of articulation’.
4.2.2 Impulse-sequence representation of excitation source
The characteristics of the glottal source of excitation are derived from the acoustic signal using the zero-frequency filtering (ZFF) method [130, 216]. In ZFF, the features of the impulse-like excitation at the glottal source are extracted by filtering the differenced acoustic signal through a cascade of two zero-frequency resonators (ZFRs). Details of the ZFF method are discussed in Section 3.5.
An illustration of the ZFF signal (zs [n]) for the vowel context [a] is shown in Fig. 4.2(b), which is
derived from the corresponding speech signal (s[n]) shown in Fig. 4.2(a). An illustration of F0 contour
Figure 4.3 (a) Signal waveform, (b) F0 contour and (c) SoE contour of excitation sequence, and (d)
Synthesized speech waveform (x13 ), for a sustained apical trill-approximant pair (trill: 0-2 sec, approximant: 3.5-5.5 sec). Source information is changed (system only retained) in synthesized speech.
and SoE impulse sequence is given in Fig. 4.2(c) and (d), respectively. The SoE impulse sequence represents the glottal source excitation, in which the location of each impulse corresponds to an epoch (GCI) and its amplitude indicates the relative strength of excitation around that epoch. The F0 contour reflects the changes in the successive epoch intervals.
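As an illustration of this impulse-sequence representation, the sketch below derives the epoch locations, F0 contour and SoE values from a ZFF output signal. It assumes the usual ZFF conventions (epochs at the negative-to-positive zero crossings of zs[n], SoE as the slope of zs[n] around each epoch); the ZFF filtering itself is described in Section 3.5, and the function name is illustrative.

    import numpy as np

    def epochs_f0_soe(zs, fs=10000):
        """Sketch: epochs, instantaneous F0 and SoE from a ZFF output signal zs[n].
        Conventions assumed: epochs at negative-to-positive zero crossings of zs,
        SoE as the magnitude of the slope of zs at each epoch."""
        negative = np.signbit(zs)
        epochs = np.where(negative[:-1] & ~negative[1:])[0] + 1   # zero-crossing indices
        soe = np.abs(zs[epochs] - zs[epochs - 1])                 # slope around each epoch
        f0 = fs / np.diff(epochs)                                 # 1 / successive epoch intervals
        return epochs, f0, soe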
4.2.3 Analysis by synthesis of trill and approximant sounds
In order to study the relative significance of the dynamic vocal tract system and the excitation source
in the perception of trill sounds, synthetic speech signals are generated by controlling the system and
source characteristics of trill sounds separately. For this the natural trill sounds are analyzed to extract
the source characteristics in terms of epochs and the strength of impulses at epochs. The dynamic
characteristics of the vocal tract shape are captured using LP analysis on a frame size of 20 msec with a
frame shift of 5 msec.
Four scenarios are considered for the synthesis of trills, retaining the characteristics of (i) both source and system, (ii) only the source, (iii) only the system, and (iv) changing both source and system. Perceptual evaluation of the synthesized speech in each of these four scenarios of selective retention is carried out.
Changes in the glottal excitation source characteristics are made by first disturbing the fundamental frequency (F0) information and then the amplitude information. For each epoch, the next epoch is located at an interval corresponding to the average of several pitch periods around this epoch. The new impulse sequence thus reflects the averaged pitch period information. The amplitude of each impulse corresponds to the average of the SoE around that epoch. This new impulse sequence is referred to in this study as the ‘changed excitation source’ information. The impulse sequence with changed source information is used as the excitation for generating speech for scenarios (iii) and (iv) of selective retention.
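The averaging described above can be sketched as follows; the averaging span of five pitch periods and the function name are assumptions made for illustration, not values taken from the thesis.

    import numpy as np

    def changed_excitation(epochs, soe, n_periods=5):
        """Sketch of the 'changed excitation source': each epoch interval and each
        impulse amplitude is replaced by a local average over several pitch periods."""
        periods = np.diff(epochs)
        new_epochs, new_soe = [int(epochs[0])], [float(soe[0])]
        for i in range(len(periods)):
            lo = max(0, i - n_periods // 2)
            hi = min(len(periods), i + n_periods // 2 + 1)
            avg_period = int(round(periods[lo:hi].mean()))   # averaged pitch period
            new_epochs.append(new_epochs[-1] + avg_period)   # next epoch at the averaged interval
            new_soe.append(float(soe[lo:hi].mean()))         # averaged strength of excitation
        return np.array(new_epochs), np.array(new_soe)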
The effect of this averaging is shown in Fig. 4.3 for a trill and for an approximant. Fig. 4.3 shows the
Figure 4.4 (a) Signal waveform, (b) F0 contour and (c) SoE contour of excitation sequence, and (d) Synthesized speech waveform (x11 ), for the sustained apical trill-approximant pair (trill: 0-2 sec, approximant: 3.5-5.5 sec). Both the system and source information of the original speech are retained in the synthesized speech.
changed source characteristics, i.e., the F0 and SoE contours of excitation sequence in Fig. 4.3(b) and
Fig. 4.3(c), respectively. These changes are more evident for the trill (first) sound as compared to the
approximant (second) sound. This can be contrasted with the corresponding contours of the F0 and SoE
of excitation sequence, for the original trill and approximant sounds, shown in Fig. 4.4.
Since the trill cycle is at 20-30 Hz, the LPCs computed using a frame size of 100 msec can be considered as the ‘changed characteristics of the vocal tract system’. The changed characteristics of the system are used for generating speech for scenarios (ii) and (iv) of selective retention. To establish the significance of the coupling between the system and source characteristics (in scenario (i)), the scenario where both the source and system information are changed (i.e., scenario (iv)) is used for comparison.
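The synthesis itself can be sketched as exciting the frame-wise all-pole LP filter with the SoE-weighted impulse sequence. The frame handling below is deliberately simplified (the filter state is simply carried across 5 ms frames), and all names and defaults are illustrative assumptions rather than the exact implementation used in the experiments.

    import numpy as np
    from scipy.signal import lfilter

    def synthesize(epochs, soe, lpc_frames, frame_shift=50):
        """Sketch: pass an SoE-weighted impulse sequence through a time-varying
        all-pole LP filter, one frame of coefficients per 5 ms (50 samples at 10 kHz)."""
        n_samples = len(lpc_frames) * frame_shift
        excitation = np.zeros(n_samples)
        valid = epochs[epochs < n_samples]
        excitation[valid] = soe[:len(valid)]                 # impulse of amplitude SoE at each epoch
        out = np.zeros(n_samples)
        zi = np.zeros(len(lpc_frames[0]))                    # filter state carried across frames
        for i, a in enumerate(lpc_frames):                   # a = [a_1, ..., a_p] for frame i
            den = np.concatenate(([1.0], -np.asarray(a)))    # 1 - sum_k a_k z^-k
            lo, hi = i * frame_shift, (i + 1) * frame_shift
            out[lo:hi], zi = lfilter([1.0], den, excitation[lo:hi], zi=zi)
        return out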
Perceptual evaluation of the 4 synthesized speech files (one for each of the 4 scenarios of selective retention), with reference to the original speech file, is carried out. The similarity score criterion given in Table 4.1 is used.
4.2.4 Perceptual evaluation of the relative role
Two experiments were conducted in this study, each with 4 scenarios of selective retention of source/system information. Since most databases of continuous speech have very limited trill data suitable for this study, the required speech sounds of a sustained trill-approximant pair and of trills with 4 different places of articulation were recorded with the help of an expert male phonetician.

Experiment 1 was conducted with a trill-approximant pair speech file (x10.wav). From this reference file (x10), the features of the glottal excitation and the vocal tract system are extracted. Using these source and system features, the four synthesized speech files (x11, x12, x13 and x14) are generated [153], one for each of the 4 scenarios of selective retention of source/system information. Fig. 4.4
Table 4.1 Criterion for similarity score for perceptual evaluation of two trill sounds (synthesized and original speech)

    Perceptual similarity               Similarity score
    both sound very much similar        5
    both sound quite similar            4
    both sound somewhat similar         3
    both sound quite different          2
    both sound very much different      1
Table 4.2 Experiment 1: Results of perceptual evaluation. Average similarity scores between synthesized speech files (x11, x12, x13 and x14) and original speech file (x10) are displayed.

    x11 vs x10           x12 vs x10       x13 vs x10       x14 vs x10
    (Source, system      (Source only     (System only     (Source, system
    retained)            retained)        retained)        changed)
    3.95                 3.48             2.82             1.75
shows the sustained apical trill-approximant pair with the F0 and SoE contours of the excitation sequence, and the synthesized speech for scenario (i) (i.e., retaining both source and system information). Fig. 4.3 shows the changed source characteristics as in scenario (iii) (i.e., retaining only the system information) for the trill region. The effect of the ‘changed source information’ can be observed in the F0 and SoE contours of the excitation sequence in Fig. 4.3, as compared to those in Fig. 4.4. Perceptual evaluation is carried out by comparing each of the synthesized speech files (x11 to x14) with the reference original speech (x10). A total of 20 subjects, all speech researchers from the Speech and Vision Lab at IIIT-Hyderabad, participated in this evaluation. The subjects were asked to give similarity scores for each of the 4 synthesized trill-approximant pairs. The averaged scores of the perceptual evaluation for Experiment 1 are given in Table 4.2.
Another experiment (Experiment 2) was conducted with a speech file (x20.wav) consisting of trill sounds corresponding to the 4 different places of articulation, namely, bilabial, dental, alveolar and post-alveolar. From this reference speech file (x20), the features of the glottal excitation and the vocal tract system are extracted. Four synthesized speech files (x21, x22, x23 and x24) are generated, one for each of the 4 scenarios of selective retention of source/system information. Perceptual evaluation was carried out by comparing each trill sound in a synthesized speech file (x21 to x24) with the corresponding original trill utterance in (x20). A similarity score for each of the 4 different places of articulation with respect to the corresponding original speech was obtained. All the 20 subjects gave similarity scores for each of the 4 synthesized speech files (x21, x22, x23 and x24), for each place of articulation. The results of the perceptual evaluation for Experiment 2 are given in Table 4.3.
In Table 4.2, the high average score in column 1 is due to the fact that both the source and system information are retained in the synthesized speech, which is therefore perceptually close to the original speech. The lower average score in column 4 confirms that when both the source and system are changed,
Table 4.3 Experiment 2: Results of perceptual evaluation. Average similarity scores between each place of articulation in synthesized speech files (x21, x22, x23 and x24), and the corresponding sound in original speech file (x20) are displayed.

    File name: synthesized    Bilabial   Dental   Alveolar   Post-alveolar   Average score
    vs reference speech       trill      trill    trill      trill           (for all 4 trills)
    x21 vs x20                3.15       3.55     3.58       3.13            3.35
    x22 vs x20                2.55       2.90     2.85       2.85            2.79
    x23 vs x20                2.25       2.45     2.30       2.30            2.33
    x24 vs x20                1.20       1.30     1.40       1.40            1.33
the resulting sound is different from the original trill utterance. The trill sound is perceptually close to an approximant in this case. The lower average score in column 4, in contrast to the high average score in column 1, indicates that the vocal tract system and the glottal excitation source information jointly contribute to the production and perception of trill sounds. The relatively high average score in column 2 (for x12), where only the source information is retained (system information changed), and the relatively low average score in column 3 (for x13), where only the system information is retained (source information changed), are interesting results. These scores indicate the relatively higher significance of the glottal excitation source information in the perception of apical trills.
In Table 4.3, the last column gives the average similarity scores across all the 4 different trill sounds. These results are in line with the results of Experiment 1 (in Table 4.2). The average scores in row 3 (for x22) in Table 4.3, where only the source information is retained, are relatively higher than the average scores in row 4 (for x23), where only the system information is retained. This pattern is consistent for each of the 4 places of articulation. It reconfirms the inference drawn from the results of Experiment 1 (in Table 4.2) that the source information contributes relatively more than the system information to the perception of tongue-tip trills.
The relatively higher average scores in the 3rd and 4th columns of Table 4.3 (for dental and alveolar trills, respectively), especially in row 2 (for x21) where both source and system are retained, also indicate a relatively better perceptual closeness of the synthesized dental and alveolar trill sounds to the corresponding natural trill sounds. The lowest average scores in the last row (for x24) of Table 4.3, when both source and system are changed, and the high average scores in row 2 (for x21), when both source and system are retained, are similar to the results of Experiment 1 (in Table 4.2). These average scores highlight the fact that there is some amount of system-source coupling in the production and perception of tongue-tip trilling. It also indicates that the production of tongue-tip (apical) trilling does affect the glottal excitation source due to coupling with the vocal tract system.
4.2.5 Discussion on the results
The effect of tongue-tip (apical) trilling on the glottal excitation source is indicated by the fact that the system information alone or the source information alone is not sufficient for the production and perception of apical trills. Both the source and the system are involved in some coupled way in the production/perception of apical trills, due to the interaction between aerodynamic and articulatory components. The glottal excitation source appears to contribute relatively more to the perception of apical trills, as indicated by the perceptual evaluation results of Experiments 1 and 2. Also, the synthesized apical dental/alveolar trills are perceptually closer to the corresponding natural trill sounds.
This study can further be useful in the automatic spotting of trills, in the synthesis/modification of trill sounds, and in trill-based discrimination of different languages and dialects. The study can also be helpful in distinguishing trill sounds from approximants from a signal processing point of view, and in understanding the production/perception of different apical trill sounds at different places of articulation.
4.3 Effects of acoustic loading on glottal vibration
Speech sounds are produced by excitation of the time-varying vocal tract system. The major source
of excitation is the quasi-periodic vibration of the vocal folds at the glottis [48], referred to as voicing.
Languages make use of different types of voicing, called phonation types [102, 99]. Among the phonation types, modal voice is considered to be the primary source of voicing in the majority of languages [99]. Both the mode of glottal vibration and the shape of the vocal tract contribute to the production of different sounds [49]. The mode of glottal vibration can be controlled voluntarily for producing different phonation types such as modal, breathy and creaky voices. Similarly, the rate of glottal vibration (F0) can also be controlled, giving rise to changes in pitch. Glottal vibration may also be affected by the coupling of the vocal tract system with the glottis. This change in the glottal vibration may be viewed as an involuntary change.
In this study, we examine the involuntary changes in the glottal vibration, due to the effect of acoustic
loading of the vocal tract system, for a selected set of six categories of voiced consonant sounds. These
categories are distinguished based upon the size, type and location of the stricture along the vocal tract,
which is influenced by the manner and place of articulation. Three types of occurrences, namely, single,
geminated and prolonged, are examined for each of the six categories of sounds, in modal voicing, in the context of the vowel [a]. The speech signal, along with the electroglottograph (EGG) signal [53, 50], is used for the analysis of these sounds. Changes in the system characteristics are analyzed using two features termed the dominant frequencies FD1 and FD2. The dominant frequencies are derived from the speech signal using linear prediction analysis [112] and the group-delay function [128]. Source features such as the
instantaneous fundamental frequency (F0 ) and strength of impulse-like excitation (SoE) are extracted
from the speech signal using the zero-frequency filtering method [130, 216].
4.3.1 What is acoustic loading?
Studies have indicated that there exists physical coupling between the glottal source and the vocal
tract system, i.e., source-system coupling. During the production of some specific speech sounds, such
Figure 4.5 Illustration of strictures for voiced sounds: (a) stop, (b) trill, (c) fricative and (d) approximant.
Relative difference in the stricture size between upper articulator (teeth or alveolar/palatal/velar regions
of palate) and lower articulator (different areas of tongue) is shown schematically, for each case. Arrows
indicate the direction of air flow passing through the vocal tract.
as ‘high vowels’ and trills, this coupling leads to acoustic loading of the vocal tract system. The air
pressure difference across the glottis (i.e., between the supraglottal and subglottal regions) affects the vibration of the vocal folds at the glottis. This results in source-system interaction, which manifests as acoustic loading, i.e., changes in the vocal tract resonances affect the glottal vibration characteristics.

According to the source-filter theory of speech production, the speech wave is the response of the vocal tract filter system to one or more sources of excitation [46]. The different vocal tract cavities, such as the pharynx, mouth and nasal cavities, are represented by a number of interconnected filter sections. In general, the first formant is associated with the resonance of the pharyngeal (back) cavity and the second formant with that of the mouth (front) cavity [46]. In this study, we focus mainly on the glottal source of excitation. During the
open phase of the glottis, the acoustic excitation at the glottis can be represented by a volume velocity
source with a relatively high acoustic impedance [182]. Glottal vibrations are related to the pressure across the glottis, the configuration of the glottis and the compliance of the vocal folds [182]. A study of the effect of constriction in the vocal tract on glottal vibration indicated that, in the case of voiced fricative or stop consonant sounds, a narrow constriction or closure at some point along the length of the vocal tract may have a substantial effect on the glottal vibration [182]. A similar effect may intuitively be possible in the production of voiced nasal sounds also. Hence, more investigation is needed into the effect of acoustic loading of the vocal tract system on the excitation source characteristics, in relation to the effect of constriction in the vocal tract, for different sounds.
In this study, we examine the effect of acoustic loading of the vocal tract system on glottal vibration characteristics for a set of voiced consonant sounds. The different sound categories selected for
this study differ in cross-sectional area of stricture, besides place of articulation (i.e., stricture location) [46, 29] during production. Following [29], differences in the stricture for stop, trill, fricative and
approximant sounds are schematically represented in Fig. 4.5(a), (b), (c) and (d), respectively. Both the location of the constriction/closure point along the length of the vocal tract, and the air flow and air pressure between the upper and lower articulators, can influence the extent of the acoustic loading effects.
In the production of apical trill ([r]) sound, the oral stricture opens and closes periodically (as shown
in Fig. 4.5(b)), at the rate of 25-50 Hz [116, 40]. This periodic closing/opening of the oral cavity
affects the acoustic loading of the vocal tract in each cycle. In recent studies on the production of apical
trills, the loading effect of the system on the source characteristics [40] and the role of source-system
coupling [122] were examined. In this study, we examine the excitation source characteristics of apical
trills ([r]) using electroglottograph (EGG) signals along with the corresponding speech signal.
Production of fricatives involves narrow constriction of the vocal tract at some point along its length
(Fig. 4.5(c)), which may cause acoustic loading of the vocal tract, and thereby affect the glottal vibration
characteristics. Different locations of the constriction point along the vocal tract may cause the glottal
vibration characteristics to change differently. Two variants of fricatives are examined, namely, alveolar
fricative ([z]) and velar fricative ([È]), which involve two different locations for the points of constriction
of the vocal tract.
In the production of the apical lateral approximant ([l]) sound, the lateral stricture is relatively wide
open for the entire steady-state duration (Fig. 4.5(d)), unlike that for [r] sound (Fig. 4.5(b)). If the glottal
vibration characteristics of the trill sounds are changed to normal modal vibration, then trills may sound
like approximants [122]. Hence, apical lateral approximant ([l]) sounds are examined to understand the
differences in their excitation characteristics from those of trills ([r]).
Nasal sounds involve closure at some location in the oral tract, while the nasal tract is kept open.
Two variants of nasal sounds are examined, namely, alveolar nasal ([n]) and velar nasal ([N]), to study
whether the high stricture (nearly closed constriction) along the vocal tract, concurrent with the open
nasal tract, has any effect on the glottal vibration.
The production of the consonant sounds ([r], [l], [z], [È], [n] and [N]) in the context of the vowel [a] is considered in this study. The sounds selected for studying the effect of acoustic loading are only representative of a few sound categories. The single, geminated and prolonged occurrences of sounds are included in each category. The analysis of acoustic loading effects is carried out using the geminated occurrence type for each of the six categories of sounds, as in this case the consonants are produced in a sustained manner. The single cases are considered as these are the cases that occur in normal speech, and the prolonged cases are studied to examine whether such prolongation causes any considerable deviation.
4.3.2 Speech data for analysis
In natural production, speech sounds are produced as part of one or more syllables of the structure
/CV/, /VCV/ or /VCCV/, consisting of vowels (/V/) and consonants (/C/). If the vowel on both sides is
in modal voicing, then it is easier to distinguish the vowel/consonant regions for analysis. Consonants
in the context of the open vowel [a] are considered in this study. Sometimes, changes in the production
characteristics may not be highlighted in a single occurrence of a consonant in the vowel context (/VCV/). Hence, sustained production of the consonants is considered. Sustained production of consonants can be either geminated (double) or prolonged (longer than geminated), i.e., in the form of /VCCV/ or /VCC...CV/ sound units, respectively. The distinctive characteristics of consonants may sometimes fade in their prolonged production. Hence, the geminated type is analysed in more detail, although data are collected and studied for each of the three occurrence types.
Data was collected for the following six categories of voiced speech sounds: (1) apical trill ([r]),
(2) alveolar fricative ([z]), (3) velar fricative ([È]), (4) apical lateral approximant ([l]), (5) alveolar
nasal ([n]) and (6) velar nasal ([N]). All these sounds are considered in the context of vowel [a] on
both sides, in modal voicing. For each category of sound, three types of occurrences are considered:
single, geminated and prolonged. Utterances of each type for each of the 6 categories were repeated 3 times. Thus the data consist of a total of 54 (= 6 × 3 × 3) utterances.
Sustained production of some sounds like velar fricative ([È]), trill ([r]) or velar nasal ([N]) is needed
for detailed analysis of the effects of acoustic loading. It is also preferable, in general, that “an international standard of phonetic pronunciation norms could be established by reference to a few selected
speakers” [48]. Hence, the data was collected in the voice of a male expert phonetician, so as to have reliable and authentic data on the production of these sounds. The data was also collected in the voice of a (less trained) female phonetics research student. Thus, the total data set has 108 (= 54 + 54) utterances.
The data was recorded in a sound treated recording room. Simultaneous recordings of the speech
signal and the electroglottograph (EGG) signal [53, 47, 50] were obtained for each utterance. The speech
signal was recorded on a digital sound recorder with a high quality condenser microphone (Zoom H4n),
kept at a distance of around 10 cm from the corner of the mouth. The EGG signal was recorded using
an EGG recording device [121]. The audio data was acquired at a sampling rate of 44100 samples/sec,
with 16 bits/sample. The data was downsampled to 10000 samples/sec before analysis. The collected
data is available for download at the link: http://speech.iiit.ac.in/svldownloads/ssidata/ .
4.3.3 Features for the analysis
(A) Glottal excitation source features
The features of the glottal source of excitation are derived from the speech signal using the zero-frequency filtering (ZFF) method [130, 216]. In ZFF, the features of the impulse-like excitation of the glottal source are extracted by filtering the differenced speech signal through a cascade of two zero-frequency resonators (ZFRs). Details of the ZFF method are discussed in Section 3.5 and Section 4.2.2.
Figure 4.6 Illustration of open/closed phase durations, using (a) EGG signal and (b) differenced EGG
signal for the vowel [a].
An illustration of the F0 contour and the SoE impulse sequence is given in Fig. 4.2(c) and (d), respectively. The SoE impulse sequence represents the glottal source excitation, in which the location of each impulse corresponds to an epoch and its amplitude indicates the relative strength of excitation around the epoch. The F0 contour reflects the changes in the successive epoch intervals.
(B) Vocal tract system features
The vocal tract system characteristics are studied using the first two dominant frequencies FD1
and FD2 of the short-time spectral envelope. The features FD1 and FD2 of the vocal tract system
are derived from the speech signal using the group-delay function [128] of the linear prediction (LP) spectrum [112]. The dominant frequencies FD1 and FD2 are derived for each signal frame using pitch-synchronous LP analysis, anchored around the GCIs. An illustration of the FD1 and FD2 contours for the vowel segment [a] is shown in Fig. 4.2(e), in which the features FD1 and FD2 can be seen to be almost steady.
(C) Closed phase quotient (α) from EGG signal
The electroglottograph (EGG) signal [53, 47] and differenced EGG (dEGG) signal [50] are used for
studying the changes in the characteristics of the glottal pulse during production of speech. Features of
the glottal pulse shape are extracted in terms of opening/closing phase durations of the glottal slit, as
shown in Fig. 4.6. The closed phase quotient, denoted as α, is computed for each closing/opening cycle
of the glottis [123], as follows:
    α = TC / (TC + TO)                                                                 (4.1)

where TC and TO are the closed and open phase durations (in ms), respectively.
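A sketch of this computation from the dEGG signal is given below. It assumes the usual convention (visible in Fig. 4.6) that the prominent positive peaks of the dEGG signal mark glottal closures and the negative peaks mark glottal openings; the peak-picking thresholds and the function name are illustrative assumptions.

    import numpy as np
    from scipy.signal import find_peaks

    def alpha_contour(degg, fs=10000):
        """Sketch: closed phase quotient alpha = TC / (TC + TO), Eq. (4.1), per glottal
        cycle, from a differenced EGG (dEGG) signal."""
        gci, _ = find_peaks(degg, height=0.3 * np.max(degg))     # closures: positive dEGG peaks
        goi, _ = find_peaks(-degg, height=0.3 * np.max(-degg))   # openings: negative dEGG peaks
        alpha = []
        for c, c_next in zip(gci[:-1], gci[1:]):
            between = goi[(goi > c) & (goi < c_next)]            # opening within this cycle
            if len(between) == 0:
                continue
            tc = (between[0] - c) / fs                           # closed phase duration
            to = (c_next - between[0]) / fs                      # open phase duration
            alpha.append(tc / (tc + to))
        return np.array(alpha)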
(D) Features used for the analysis
In this study, observations from the EGG/speech signals and the derived features are analysed for the
six categories of sounds. Both qualitative and quantitative observations are discussed, using the signals
and feature contours. First, qualitative observations are made using the waveforms of raw/processed
Figure 4.7 Illustration of waveforms of (a) input speech signal, (b) EGG signal and (c) dEGG signal, and (d) the α contour for geminated occurrence of apical trill ([r]) in vowel context [a]. The sound is for [arra], produced in female voice.
signals. Four waveforms, (i) the speech signal, (ii) the EGG signal, (iii) the dEGG signal and (iv) the ZFF output, are used for visual observations/comparisons in each case. Next, quantitative changes are measured from the features extracted from the speech signals, for the six sound categories, each in three occurrence types. In total, four features are used: (i) F0, (ii) SoE (ψ), (iii) FD1 and (iv) FD2.

It is generally observed that changes in the EGG, dEGG and F0 reflect the effect of glottal vibration, changes in FD1 and FD2 reflect the changes in the vocal tract system, and changes in the speech signal waveform, ZFF output and SoE may reflect changes in both the source and the vocal tract system characteristics.
4.3.4 Observations from EGG signal
The effect of acoustic loading of the vocal tract on the vibration characteristics during the production of apical trills [r] can be seen better from the changes in the closed phase quotient (α). An illustration of the EGG signal (e[n]), the differenced EGG (dEGG) signal (de[n]), and the α contour for the apical trill sound [r] is shown in Fig. 4.7(b), (c) and (d), respectively. The corresponding acoustic signal (s[n]) is shown in Fig. 4.7(a). Changes in the stricture between the alveolar/palatal region (upper articulator) and the apical region of the tongue (lower articulator), as shown in Fig. 4.5(b), are also reflected in the changes in the feature α, as shown in Fig. 4.7(d). These changes are difficult to observe in the EGG signal, shown in Fig. 4.7(b).

In the production of trills, the apical stricture gets periodically broken to release the pressure gradient built up in the oral cavity, due to the interaction between the tongue tension and the volume velocity of the air flow from the glottis [29]. The apical stricture is formed again by the recoiling of the apex of the tongue to meet the upper articulator, due to the Bernoulli effect. The pressure gradient across the glottis reduces during the closed phase
Figure 4.8 Illustration of waveforms of (a) input speech signal, (b) EGG signal and (c) dEGG signal, and (d) the α contour for geminated occurrence of apical lateral approximant ([l]) in vowel context [a]. The sound is for [alla], produced in female voice.
in the trill cycle, and increases during the open phase. These changes due to the trilling effect of the tongue tip are manifested as changes in the rate of vibration of the vocal folds and in the excitation strength [40]. The effect of the changes in the air flow and the transglottal pressure gradient on the rate of glottal vibration is also reflected as changes in the closed phase quotient (α), which can be seen in Fig. 4.7(d).
Changes in the closed phase quotient (α) are also helpful in understanding the difference in production characteristics of apical trill [r] and apical lateral approximant [l] sounds. An illustration of the
acoustic signal, EGG and dEGG signals, and α contour for apical lateral approximant sound [l] is shown
in Fig. 4.8(a), (b), (c) and (d), respectively. Since the stricture between the alveolar/palatal region (upper articulator) and the apical tongue region (lower articulator), as shown in Fig. 4.5(d), is wide enough to allow the air flow in the vocal tract to pass smoothly, no cyclic changes in the stricture occur in the case of the apical lateral approximant [l], unlike those for the trill [r]. This difference in strictures for the trill ([r]) and the approximant ([l]) (shown in Fig. 4.5(b) and (d)) can also be observed from the difference in the α contours in Fig. 4.7(d) and Fig. 4.8(d). The closed phase quotient (α) for [l] (in Fig. 4.8(d)) does not change like that for [r] (in Fig. 4.7(d)), because there is not much change in the rate of glottal vibration during the production of lateral approximant [l] sounds.
Changes in the α contours are useful for the analysis of trills ([r]) and for discriminating between trills ([r]) and approximants ([l]). However, the α contours are not useful for the study of the other sound categories considered ([z], [È], [n] and [N]). Hence, the waveforms of the EGG and dEGG signals themselves are considered for the analyses of these sound categories.
4.3.5 Discussion on acoustic loading through EGG and speech signals
In this section, the effects of acoustic loading in the production of the different categories of sounds are examined in terms of the observed and derived characteristics from the EGG and speech signals. Acoustic loading effects caused by the size, type and location of the stricture in the vocal tract are discussed in detail. The cross-sectional area of the opening at the stricture determines the size of the stricture, which in turn determines the extent of the stricture (high, low or none). A very narrow to closed constriction in the vocal tract corresponds to a high stricture, which occurs, for example, in trills ([r]) and alveolar fricatives ([z]), as in Fig. 4.5(b) and Fig. 4.5(c), respectively. The intermediate case of a relatively wider opening corresponds to a low stricture, which occurs in the case of the approximant ([l]) (Fig. 4.5(d)) and the velar fricative ([È]). A completely open vocal tract corresponds to no stricture, as in the case of the open vowel [a].
The two different types of strictures considered in this study are cyclic (as in [r], in Fig. 4.5(b)) and
steady ([z] and [l], in Fig. 4.5(c) and Fig. 4.5(d), respectively). Two different locations of strictures in
the vocal tract are considered, namely alveolar (as in [z]) and velar (as in [È]). In addition, the effects of a closed vocal tract (high stricture) during the production of nasal sounds are considered for any possible acoustic loading effect, in terms of the observed and derived characteristics from the EGG and speech signals. Two different locations of the stricture for nasals are considered, namely, alveolar nasal ([n]) and velar nasal ([N]). All these categories of sounds are produced in the context of the open vowel [a], where there is no stricture. Only the geminated utterances of these different categories of sounds are analysed in this section, as gemination of consonants produces a sustained steady state that facilitates the study of the effects of acoustic loading while also eliminating possible effects of the vowel-consonant transition.
Fig. 4.9 to Fig. 4.14 show the waveforms of the speech and EGG signals for the six categories of sounds chosen for analysis in this section. Each figure displays, besides the waveforms of the speech, EGG, differenced EGG (dEGG) and ZFF output signals, the contours of two source features, namely, the instantaneous fundamental frequency (F0) and the strength of excitation (SoE), and two system-related features, namely, the two dominant resonance frequencies of the vocal tract (FD1 and FD2). In the following, the acoustic loading effects caused by the strictures in each of the six categories of sounds are examined in detail.
(A) Apical trill ([r])
In the production of the apical trill ([r]) sound, the high stricture is formed due to the narrow opening between the alveolar/palatal region and the apical region of the tongue, as shown in Fig. 4.5(b). This stricture gets broken, releasing the air pressure built up in the oral cavity, and it is formed again due to the Bernoulli effect [29]. Thus this stricture is cyclic in nature, due to the opening and closing of the stricture in each trill cycle (Fig. 4.5(b)).

The cyclic high stricture affects the rate of vibration of the vocal folds and the strength of excitation [40]. These effects are reflected in the F0 and SoE contours in Fig. 4.9(e) and Fig. 4.9(f), respectively. This is because the pressure gradient across the glottis reduces during the closed phase of the
Figure 4.9 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal, (c) dEGG
signal and (d) ZFF output, and features (e) F0 , (f) SoE, (g) FD1 (“•”) and FD2 (“◦”) for geminated
occurrence of apical trill ([r]) in vowel context [a]. The sound is for [arra], produced in male voice.
trill cycle, which in turn reduces F0 . Thus the acoustic coupling effect of the system on the source is
significant in this case.
There are also changes in the resonances of the vocal tract system due to changes in the shape of the
tract during each trill cycle. These changes are seen as cyclic variations of FD1 and FD2 (Fig. 4.9(g)), where FD1 is higher during the closed phase of the trill cycle.

In the case of the trill, the effects of acoustic loading due to the dynamic vocal tract configuration can be seen in the waveforms of the EGG, dEGG and speech signals, shown in Fig. 4.9. The contrast between the steady vowel region and the trill region can be seen in all the signals and the features derived from them.
(B) Alveolar fricative ([z])
The production of the alveolar fricative ([z]) also involves a narrow opening of the constriction between the upper articulator (alveolar ridge) and the lower articulator (tongue tip), as shown in Fig. 4.5(c). The constriction is narrow enough to produce frication or turbulence. Thus [z] is produced by a high steady stricture, unlike the high cyclic stricture in the case of [r]. There is pressure build-up behind the constriction,
Figure 4.10 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal,
(c) dEGG signal and (d) ZFF output, and features (e) F0 , (f) SoE, (g) FD1 (“•”) and FD2 (“◦”) for
geminated occurrence of alveolar fricative ([z]) in vowel context [a]. The sound is for [azza], produced
in male voice.
causing a pressure differential across the glottis. Thus, in this case, the acoustic loading effect can be seen in the signal waveform, as well as in the source and system features derived from the signals.

The amplitudes of the speech signal, EGG, dEGG and ZFF output are low, relative to the adjacent vowel, in these waveforms (Fig. 4.10). The acoustic loading effect results in a lowering of the F0 and SoE values, relative to the adjacent vowel region, as can be seen in Fig. 4.10(e) and Fig. 4.10(f), respectively. Due to frication, both the dominant frequencies (FD1 and FD2) show high values, compared to those in the vowel region (Fig. 4.10(g)). The acoustic loading effects are similar to the trill case ([r]), except that in the case of the alveolar fricative ([z]) the effects are steady (not cyclic).
(C) Velar fricative ([È])
The production of the velar fricative ([È]) involves a steady but relatively lower stricture, due to a more open constriction between the upper and lower articulators than for the alveolar fricative ([z]). Since the constriction area has to be small enough to produce turbulence, this stricture may be termed a steady
Figure 4.11 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal,
(c) dEGG signal and (d) ZFF output, and features (e) F0 , (f) SoE, (g) FD1 (“•”) and FD2 (“◦”) for
geminated occurrence of velar fricative ([È]) in vowel context [a]. The sound is for [aÈÈa], produced in
male voice.
high-low stricture, and the acoustic loading effects are expected to be similar to those for the alveolar
fricatives.
While there are no significant changes in the EGG signal waveform relative to the adjacent vowel region (Fig. 4.11(b)), the changes can be seen better in the waveform of the dEGG signal (Fig. 4.11(c)). Acoustic loading effects can be seen in the derived source information, i.e., in the F0 contour (Fig. 4.11(e)) and the SoE contour (Fig. 4.11(f)). However, the changes are less evident in the ZFF signal (Fig. 4.11(d)).

The changes in the speech signal waveform for the velar fricative ([È]), relative to that for the vowel [a], can be attributed to the changes in the vocal tract characteristics. The turbulence generated at the stricture is lower in the case of [È] than in the case of [z], because of the wider constriction in the vocal tract (for [È]). As a result, FD1 is lower than for the vowel [a] (Fig. 4.11(g)). Accordingly, the frication effect is not as high as in the case of [z], and hence the behaviour of FD1 and FD2 is more vowel-like, in the
Figure 4.12 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal,
(c) dEGG signal and (d) ZFF output, and features (e) F0 , (f) SoE, (g) FD1 (“•”) and FD2 (“◦”) for
geminated occurrence of apical lateral approximant ([l]) in vowel context [a]. The sound is for [alla],
produced in male voice.
sense that they are in the same range as that for vowel [a] (Fig. 4.11(g)), unlike for [z] where both FD1
and FD2 are high (Fig. 4.10(g)).
(D) Approximant ([l])
An apical lateral approximant ([l]) is formed by a closure between the alveolar/palatal region (upper
articulator) and the apical tongue region (lower articulator), along with a simultaneous lateral stricture.
In this case, the lateral stricture is wide enough, as shown in Fig. 4.5(d), to allow the free flow of air
in the vocal tract. Thus the stricture is low, i.e., relatively more open than for the high stricture cases
considered so far ([r] and [z]), and also steady, i.e., not cyclic as in the case of the trill [r]. Since there is no significant pressure gradient build-up in this case, the acoustic loading effect is negligible in comparison with the high stricture cases. Hence, one does not notice any significant changes in the amplitudes in the waveforms of the speech signal, EGG, dEGG and ZFF output, relative to the adjacent vowel regions, as can be seen in Fig. 4.12(a) to Fig. 4.12(d), respectively.
Figure 4.13 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal,
(c) dEGG signal and (d) ZFF output, and features (e) F0 , (f) SoE, (g) FD1 (“•”) and FD2 (“◦”) for
geminated occurrence of alveolar nasal ([n]) in vowel context [a]. The sound is for [anna], produced in
male voice.
There are no major changes in the excitation features either, such as in the F0 and SoE contours
in Fig. 4.12(e) and Fig. 4.12(f), respectively. However, due to wider lateral opening, the corresponding
change in the shape of the vocal tract affects the first two dominant resonance frequencies (Fig. 4.12(g)).
The FD1 is reduced and FD2 is increased, relative to the values in the neighbouring vowel region. This
shows that if the stricture is not high, the acoustic loading effects are negligible.
(E) Alveolar nasal ([n]) and velar nasal ([N])
Nasal sounds are produced with complete closure of the vocal tract at some location in the oral cavity,
and simultaneous flow of air through the nasal tract, which is facilitated by the velic opening. Here, the
constriction along the vocal tract is like a high stricture case, but due to the coupling of the nasal tract
there is hardly any obstruction to the egressive flow of air. Hence, nasal sounds are considered in this
study, to examine whether the high stricture in the vocal tract has any acoustic loading effect on the
Figure 4.14 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal,
(c) dEGG signal and (d) ZFF output, and features (e) F0 , (f) SoE, (g) FD1 (“•”) and FD2 (“◦”) for
geminated occurrence of velar nasal ([N]) in vowel context [a]. The sound is for [aNNa], produced in
male voice.
glottal excitation. Two variants of the high stricture along the vocal tract are considered, corresponding
to two different locations, namely, alveolar nasal ([n]) and velar nasal ([N]).
Fig. 4.13 and Fig. 4.14 show the waveforms and other features for [n] and [N], respectively. Due
to absence of acoustic loading effect on the glottal vibration, there are no visible changes in EGG and
dEGG waveforms, in relation to the adjacent vowel. Also, there is hardly any significant change in the
F0 contours (Fig. 4.13(e) and Fig. 4.14(e)), indicating that the glottal vibration is not affected.
However, there is reduction in the amplitude of the waveform of the speech signal in both the cases
of [n] and [N], as shown in Fig. 4.13(a) and Fig. 4.14(a), respectively. This primarily is due to narrow
constricted (turbinated) path of the nasal tract, especially at the nares. The effect of this constriction
can also be seen in the significantly lower amplitudes of the ZFF output (Fig. 4.13(d) and Fig. 4.14(d)) and the SoE contour (Fig. 4.13(f) and Fig. 4.14(f)), as compared to the adjacent vowel [a]. As expected, the resonance frequency due to nasal tract coupling is significantly lower than that for the vowel, as can be seen in
Table 4.4 Comparison between sound types based on stricture differences for geminated occurrences. Abbreviations: alfric: alveolar fricative [z], vefric: velar fricative [È], approx/appx: approximant [l], frics: fricatives ([z], [È]), alnasal: alveolar nasal [n], venas: velar nasal [N], stric: stricture, H/L indicates relative degree of low stricture. Columns (a)-(d) are qualitative observations (using waveforms): (a) s[n], (b) e[n], (c) de[n], (d) zs[n]. Columns (e)-(h) are quantitative observations (using features): (e) F0, (f) ψ, (g) FD1, (h) FD2.

Sl.# | Categories of sounds compared | Main causes for difference in acoustic loading effects | (a) | (b) | (c) | (d) | (e) | (f) | (g) | (h)
1 | trill vs vowel ([r] vs [a]) | cyclic high stric:[r] vs steady no stric:[a] | • | • | • | • | X | X | X | •
2 | alfric vs trill ([z] vs [r]) | steady high stric:[z] vs cyclic high stric:[r] | X | • | X | X | X | X | X | •
3 | vefric vs alfric ([È] vs [z]) | steady Hlow stric:[È] vs steady high stric:[z] | • | • | X | • | • | ◦ | X | ◦
4 | approx vs vefric ([l] vs [È]) | steady low stric:[l] vs steady Hlow stric:[È] | • | ◦ | • | ◦ | • | X | • | X
5 | approx vs vowel ([l] vs [a]) | steady low stric:[l] vs steady no stric:[a] | • | ◦ | ◦ | ◦ | ◦ | ◦ | X | •
6 | trill vs approx ([r] vs [l]) | cyclic high stric:[r] vs steady low stric:[l] | • | X | X | • | X | X | X | •
7 | frics vs trill/appx ([z],[È] vs [r],[l]) | H/L: high strics:[z],[r] vs low strics:[È],[l] | • | • | X | • | • | X | X | ◦
8 | nasals vs vowel ([n],[N] vs [a]) | nasal low stric:[n],[N] vs steady no stric:[a] | X | ◦ | • | X | ◦ | X | X | •
9 | nasals vs approx ([n],[N] vs [l]) | nasal low stric:[n],[N] vs steady low stric:[l] | • | ◦ | ◦ | • | ◦ | X | • | X
10 | alnasal vs venas ([n] vs [N]) | nasal high stric:[n] vs nasal Hlow stric:[N] | ◦ | ◦ | • | ◦ | ◦ | ◦ | X | •

Legend: X: mostly evident, •: sometimes/less evident, ◦: rarely/not evident changes.
the FD1 contours in Fig. 4.13(g) and Fig. 4.14(g), for [n] and [N], respectively. In fact, FD2 is also lower in both cases, but the change is clearly visible in the case of [N] (Fig. 4.14(g)).
In summary, in the case of nasal sounds ([n] and [N]), though there is a complete closure in the
oral cavity, the high stricture in the vocal tract does not cause any acoustic loading effect on the glottal
excitation. However, there are significant changes in the speech signal, ZFF signal, SoE, FD1 and
FD2 , relative to the adjacent vowel. These changes are primarily due to narrow constriction in the nasal
tract. There are no significant changes in the source characteristics of alveolar ([n]) and velar ([N]) nasal
sounds.
In Table 4.4, comparisons among different sound categories and the vowel context ([a]) are made,
based upon the level of stricture in the vocal tract. In each case, the difference in the stricture size, type
and location in the vocal tract, that causes the difference in the acoustic loading effect, is highlighted.
The signal waveforms and the derived features that are mostly/sometimes/not affected for each sound
type, are marked to provide a comprehensive view.
Table 4.5 Changes in glottal source features F0 and SoE (ψ) for 6 categories of sounds (in male voice). Column (a) is F0 (Hz) for vowel [a], (b) and (c) are F0min and F0max for the specific sound, and (d) is ∆F0/F0[a] (%). Column (e) is SoE (i.e., ψ) for vowel [a], (f) and (g) are ψmin and ψmax for the specific sound, and (h) is ∆ψ/ψ[a] (%). Sl.# are the 6 categories of sounds. Suffixes a, b and c in the first column indicate single, geminated or prolonged occurrences, respectively. Note: 'alfric'/'vefric' denotes alveolar/velar fricative and 'alnasal'/'venasal' denotes alveolar/velar nasal.

Sl.# | Sound category | Sound symbol | (a) F0[a] (Hz) | (b) F0min | (c) F0max | (d) ∆F0/F0[a] (%) | (e) ψ[a] | (f) ψmin | (g) ψmax | (h) ∆ψ/ψ[a] (%)
1a | trill | [ara] | 112.1 | 88.5 | 117.7 | 26.00 | .820 | .237 | .987 | 91.34
1b | trill | [arra] | 111.1 | 85.8 | 118.3 | 29.26 | .665 | .119 | .753 | 95.31
1c | trill | [arr...ra] | 111.4 | 89.4 | 118.5 | 26.06 | .734 | .146 | .634 | 66.50
2a | alfric | [aza] | 111.4 | 95.9 | 117.0 | 18.91 | .617 | .074 | .751 | 109.6
2b | alfric | [azza] | 110.6 | 94.6 | 116.3 | 19.59 | .509 | .057 | .566 | 99.96
2c | alfric | [azz...za] | 111.5 | 95.6 | 117.7 | 19.85 | .641 | .075 | .740 | 103.8
3a | vefric | [aÈa] | 112.7 | 111.1 | 117.7 | 5.81 | .813 | .627 | .943 | 38.90
3b | vefric | [aÈÈa] | 112.2 | 110.9 | 119.1 | 7.34 | .608 | .393 | .735 | 56.23
3c | vefric | [aÈÈ...Èa] | 112.1 | 111.7 | 117.7 | 5.30 | .798 | .538 | .957 | 52.57
4a | lateral | [ala] | 114.9 | 117.6 | 119.1 | 1.22 | .787 | .889 | .933 | 5.67
4b | lateral | [alla] | 113.3 | 114.0 | 119.8 | 5.11 | .720 | .819 | .903 | 11.64
4c | lateral | [all...la] | 114.6 | 112.8 | 120.1 | 6.42 | .616 | .582 | .707 | 20.24
5a | alnasal | [ana] | 112.7 | 114.7 | 118.6 | 3.48 | .744 | .299 | .814 | 69.09
5b | alnasal | [anna] | 113.0 | 113.4 | 119.1 | 5.08 | .748 | .304 | .861 | 74.39
5c | alnasal | [ann...na] | 115.7 | 113.2 | 117.7 | 3.85 | .793 | .251 | .813 | 70.81
6a | venasal | [aNa] | 112.7 | 114.3 | 119.0 | 4.15 | .722 | .311 | .862 | 76.30
6b | venasal | [aNNa] | 114.8 | 115.0 | 119.8 | 4.20 | .828 | .331 | .879 | 66.14
6c | venasal | [aNN...Na] | 115.6 | 117.1 | 119.9 | 2.46 | .748 | .315 | .880 | 75.59

4.3.6 Quantitative assessment of the effects of acoustic loading
The effects of acoustic loading examined in the previous section are for geminated cases of the six
sound categories, where the production of specific sound is sustained. The cross-sectional area of a
stricture is not expected to be significantly affected by the duration of the sound, whether it is single,
geminated or prolonged occurrence. It would be interesting to observe changes in the features for single
and prolonged occurrences of these six sound categories, relative to the geminated occurrences. The
degree and nature of changes in the vibration characteristics of the vocal folds due to acoustic loading
of the vocal tract system, are examined in this section using the average values of the features for single
and prolonged occurrence types, along with geminated occurrences of the six sound categories.
Changes in the features F0 , SoE, FD1 and FD2 for the six categories of sounds in the context of the
vowel [a] are examined, in terms of the average values computed in two ways. First, the average values
of the features are computed over glottal cycles (at GCIs) in the regions of consonant or vowel context,
demarcated manually, for each of the three occurrence types (single, geminated and prolonged). These
are discussed using the average values given in Tables 4.5 and 4.6, for source and system features, respectively.
Table 4.6 Changes in vocal tract system features FD1 and FD2 for 6 categories of sounds (in male voice). Column (a) is FD1 (Hz) for vowel [a], (b) and (c) are FD1min and FD1max for the specific sound, and (d) is ∆FD1/FD1[a] (%). Column (e) is FD2 (Hz) for vowel [a], (f) and (g) are FD2min and FD2max for the specific sound, and (h) is ∆FD2/FD2[a] (%). Sl.# are the 6 categories of sounds. Suffixes a, b and c in the first column indicate single, geminated or prolonged occurrences, respectively. Note: 'alfric'/'vefric' denotes alveolar/velar fricative and 'alnasal'/'venasal' denotes alveolar/velar nasal.

Sl.# | Sound category | Sound symbol | (a) FD1[a] (Hz) | (b) FD1min | (c) FD1max | (d) ∆FD1/FD1[a] (%) | (e) FD2[a] (Hz) | (f) FD2min | (g) FD2max | (h) ∆FD2/FD2[a] (%)
1a | trill | [ara] | 761 | 525 | 1499 | 128.1 | 2022 | 1377 | 3506 | 105.3
1b | trill | [arra] | 763 | 402 | 1837 | 188.0 | 2006 | 1655 | 3933 | 113.6
1c | trill | [arr...ra] | 791 | 397 | 1882 | 187.7 | 2399 | 1863 | 3793 | 80.5
2a | alfric | [aza] | 856 | 278 | 2723 | 285.6 | 2892 | 3612 | 4451 | 29.0
2b | alfric | [azza] | 844 | 227 | 2873 | 313.6 | 3092 | 3678 | 4538 | 27.8
2c | alfric | [azz...za] | 886 | 288 | 2749 | 277.7 | 3250 | 3875 | 4536 | 20.4
3a | vefric | [aÈa] | 735 | 363 | 913 | 74.9 | 3195 | 3245 | 3752 | 15.9
3b | vefric | [aÈÈa] | 867 | 342 | 968 | 72.2 | 3131 | 3062 | 3884 | 26.3
3c | vefric | [aÈÈ...Èa] | 889 | 359 | 1031 | 75.6 | 3114 | 3182 | 3855 | 21.6
4a | lateral | [ala] | 804 | 562 | 732 | 21.3 | 2301 | 1748 | 3725 | 85.9
4b | lateral | [alla] | 862 | 480 | 652 | 20.0 | 2361 | 2349 | 4229 | 79.6
4c | lateral | [all...la] | 815 | 361 | 495 | 16.4 | 2774 | 2589 | 3746 | 41.7
5a | alnasal | [ana] | 1137 | 327 | 1410 | 95.3 | 3483 | 2454 | 4010 | 44.7
5b | alnasal | [anna] | 1103 | 241 | 1387 | 103.9 | 3440 | 2491 | 3523 | 30.0
5c | alnasal | [ann...na] | 1084 | 267 | 1240 | 89.7 | 3513 | 2842 | 3694 | 24.3
6a | venasal | [aNa] | 1177 | 261 | 1316 | 89.6 | 3335 | 2438 | 3925 | 44.6
6b | venasal | [aNNa] | 1119 | 244 | 1149 | 80.8 | 3277 | 2778 | 3650 | 26.6
6c | venasal | [aNN...Na] | 1118 | 214 | 1195 | 87.7 | 3329 | 2713 | 3689 | 29.3
Second, changes in the features are examined by computing the average values over the three
types of occurrences (single, geminated and prolonged) for each sound category. These are discussed
using the average values given in Tables 4.7 and 4.8.
Changes in F0 and SoE in comparison to those for the vowel context [a] are given in Table 4.5. The
average values of F0 for vowel [a], minimum F0 and maximum F0 (i.e., F0[a] , F0min and F0max ) for
each sound category are given (in Hz) in columns (a), (b) and (c), respectively. The values are rounded
off to a single decimal. The percentage change in F0 relative to F0[a], i.e., ∆F0/F0[a] (= (F0max − F0min)/F0[a] × 100 %), is given in column (d). Likewise, the values of the feature SoE (denoted as ψ) are given in columns (e)-(h). The features are normalized relative to those for the vowel context (i.e., F0[a] and ψ[a]), to facilitate
comparison across sound categories. Since the number of points (GCIs) at which feature values are available is small in some cases, such as single or geminated occurrences of trill ([r]) sounds, the range of deviation in a feature is computed using the minimum and maximum values, rather than the standard deviation.
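As a concrete illustration of this range measure, a minimal sketch in Python is given below (assuming NumPy, and that the feature values at the GCIs of a manually demarcated region are available as an array); the example values are hypothetical and not taken from the tables.

```python
import numpy as np

def range_change_percent(feature_at_gcis, vowel_reference):
    """Range of a feature over a demarcated region, relative to its average
    value in the adjacent vowel context: (max - min) / reference * 100,
    as used for columns (d) and (h) of Table 4.5."""
    f = np.asarray(feature_at_gcis, dtype=float)
    return 100.0 * (f.max() - f.min()) / float(vowel_reference)

# Hypothetical F0 values (Hz) at the GCIs of a trill region, with the average
# F0 of the adjacent vowel [a] as the reference value:
print(range_change_percent([89.0, 95.0, 102.0, 110.0, 116.0], 112.1))  # ~24.1 %
```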
Table 4.7 Changes in features due to effect of acoustic loading of the vocal tract system on the glottal vibration, for 6 categories of sounds (in male voice). Columns (a)-(d) show percentage changes in F0, SoE, FD1 and FD2, respectively. The direction of change in a feature in comparison to that for vowel [a] is marked with +/- sign. Note: 'alfric'/'vefric' denotes alveolar/velar fricative and 'alnasal'/'venasal' denotes alveolar/velar nasal.

Sl.# | Sound category | IPA symbol | (a) ∆F0 (%) | (b) ∆ψ (%) | (c) ∆FD1 (%) | (d) ∆FD2 (%)
1 | trill | [r] | -27.1 | -84.8 | +167.9 | +99.8
2 | alfric | [z] | -19.5 | -104.5 | +292.3 | +25.7
3 | vefric | [È] | +6.1 | -49.2 | -74.2 | +21.2
4 | lateral | [l] | +4.3 | +12.5 | -19.2 | +69.1
5 | alnasal | [n] | +4.1 | -71.4 | -96.3 | -33.0
6 | venasal | [N] | +3.6 | -72.7 | -86.0 | -33.5
Significant changes in F0 and SoE in comparison to the vowel context [a] can be observed for apical
trill ([r]) and alveolar fricative ([z]), in columns (d) and (h) in Table 4.5. A dip in F0 for alveolar
fricative ([z]), is due to acoustic loading of the nearly closed vocal tract (to produce fricative noise for
sound [z]) on the vibration of the vocal folds. The strength of excitation (ψ) is reduced for [z] due
to constriction in the vocal tract. A sharp change (a dip) in SoE for both nasals ([n] and [N]) can be
observed from column (h). This is because of the constriction in the nasal tract, and not due to the
acoustic loading on the glottis, as in the case for [r] or [z]. Absence of acoustic loading for nasals is
evident from the negligible changes in the F0 values.
In Table 4.6, the average values of FD1 for the vowel [a] (FD1[a] ), minimum FD1 ( FD1min ) and
maximum FD1 (FD1max ), for the six sound categories are given in columns (a), (b) and (c), respectively.
Percentage changes in FD1 for these sounds relative to FD1[a] (for the vowel context), i.e., ∆FD1/FD1[a] (= (FD1max − FD1min)/FD1[a] × 100 %), are given in column (d). Likewise, the average values of FD2 are given in
columns (e)-(h). The values of FD1 and FD2 are rounded off to the nearest integers, and the percentage
changes to a single decimal. Large changes can be observed in FD1 for trill [r] and alveolar fricative [z],
relative to the vowel [a]. In comparison, the changes in FD1 for velar fricative ([È]) and lateral approximant ([l]) are relatively low. The changes in FD1 for nasals ([n] and [N]) are significant, due to lowering
of the first formant of the nasal tract. Changes in FD2 are high mainly for the trill ([r]) sound, as can be
seen in column (h).
A summary of percentage changes in F0 , SoE, FD1 and FD2 , relative to those for the vowel [a], for
the six categories of sounds for the male voice, is given in Table 4.7. The average values of the changes
in these features, computed across the three types of occurrences, are given for each sound category. In
each case, the relative increase or decrease (i.e., the direction of change) in the average values of these
features, in comparison to those for the vowel [a], is marked as (+) or (-), respectively. The significant
changes are in F0 due to acoustic loading effect on the glottal vibration, and in FD1 due to changes in the
system characteristics. The summary table clearly illustrates the changes in different sound categories
as discussed before.
Table 4.8 Changes in features due to effect of acoustic loading of the vocal tract system on the glottal vibration, for 6 categories of sounds (in female voice). Columns (a)-(d) show percentage changes in F0, SoE, FD1 and FD2, respectively. The direction of change in a feature in comparison to that for vowel [a] is marked with +/- sign. Note: 'alfric'/'vefric' denotes alveolar/velar fricative and 'alnasal'/'venasal' denotes alveolar/velar nasal.

Sl.# | Sound category | IPA symbol | (a) ∆F0 (%) | (b) ∆ψ (%) | (c) ∆FD1 (%) | (d) ∆FD2 (%)
1 | trill | [r] | +15.5 | -49.1 | +72.4 | +30.9
2 | alfric | [z] | -29.4 | -116.2 | +273.3 | +43.8
3 | lateral | [l] | +2.9 | +35.0 | -51.7 | +17.1
4 | alnasal | [n] | +6.6 | +46.4 | -108.7 | -21.0
5 | venasal | [N] | +6.6 | +22.5 | -72.4 | -26.3
The summary of changes in features F0 , SoE, FD1 and FD2 for a subset of the data collected for
a female voice is given in Table 4.8 in columns (a), (b), (c) and (d), respectively. Tables 4.7 and 4.8
show some differences. The extent of changes in F0, SoE, FD1 and FD2 for [r] seems to be smaller for the female speaker. The signs of the changes in SoE (∆ψ) for the two nasals ([n] and [N]), as compared to the vowel context [a], also differ between the two speakers. This could possibly be related to the higher average pitch of the female voice in comparison to the male voice. Another reason for the differences in Tables 4.7
and 4.8 could be that the data for Table 4.7 was obtained from an expert phonetician, whereas the data
for Table 4.8 was obtained from a research scholar with basic training in phonetics.
4.3.7 Discussion on the results
In this preliminary study, we have examined the effect of acoustic loading of the vocal tract system
on the vibration characteristics of the vocal folds. Involuntary changes in glottal vibrations in the production of some specific categories of sounds are examined, which are due to acoustic loading of the
vocal tract system and system-source interaction.
A selected set of six categories of sounds is considered for illustration. The sounds differ in the
size, type and location of stricture in the vocal tract. We have considered features describing the glottal
vibration source and the vocal tract system to demonstrate the effect of system-source coupling. Further,
this study concentrates on only the sounds uttered in the context of vowel [a]. Single, geminated and
prolonged occurrences are examined for each sound category.
The speech signal along with EGG signal were studied for each case. Changes were examined in the
amplitudes of the waveforms of speech signal, EGG, dEGG and ZFF output, and in features F0 , SoE,
FD1 and FD2 . The glottal source features F0 and SoE are derived from the speech signal using the
zero-frequency filtering method. The vocal tract system characteristics are represented through the two
dominant resonance frequencies FD1 and FD2 .
The acoustic loading effect on the glottal vibration depends on the size, type and location of the
stricture in the vocal tract. The general observation is that the glottal vibration is not affected when there
is a relatively free flow of air from the lungs passing through the vocal tract system, as is seen from the
EGG, dEGG and F0 contours for apical lateral approximant ([l]) and nasals ([n] and [N]). However, the
speech signal waveform and the feature SoE could be affected by changes in both the vocal tract system
and in the glottal source of excitation. Significant changes occur in the glottal vibration, mainly when
there is acoustic loading of the vocal tract, as in the case of apical trill ([r]) and alveolar fricative ([z])
sounds. The stricture in the vocal tract is cyclic/steady high (i.e., constriction is narrow) in these cases.
Glottal vibration is affected less in the case of velar fricative ([È]), because of lesser effect of acoustic
loading due to relatively more open constriction in the vocal tract. Associated changes in the dominant
resonance frequencies FD1 and FD2 are primarily due to changing shape of the vocal tract system,
during production of these consonant sounds.
4.4 Summary
In this chapter, we have examined the role of source-system interaction in the production of some
special sounds in normal (verbal) speech. First, the relative roles of the source and the system, and the source-system coupling effect in the production of trills, are examined using an analysis-by-synthesis approach
and perceptual evaluation. Experiments are conducted to understand the perceptual significance of the
excitation source characteristics in production of speech sounds of sustained trill and approximant pair,
and apical trills produced by four different places of articulation. The glottal excitation source seems to
contribute relatively more than the vocal tract system component, in the perception of apical trill sounds.
Glottal vibration, which can be controlled voluntarily for producing some sounds, may also undergo significant involuntary changes in the production of some specific sounds. Hence, the effects of acoustic loading on the glottal vibration characteristics are examined for the production of six types of consonant sounds.
Apical trills, apical lateral approximants, alveolar and velar variants of voiced fricatives and voiced
nasals are studied in the vowel context [a] in modal voicing. Qualitative observations are made from the
waveforms, and quantitative examination is carried out using features derived from both the acoustic and
EGG signals. Features such as instantaneous fundamental frequency, strength of impulse-like excitation
and dominant resonance frequencies are extracted from the speech signal using zero-frequency filtering
method, linear prediction analysis and group delay function. Results indicate that the high stricture in
the vocal tract causing obstruction to the free flow of air, produces significant acoustic loading effect on
the glottal excitation, for example in the production trill ([r]) or alveolar fricative ([z]) sounds.
The study examines the nature of involuntary changes in the glottal vibration characteristics due
to the acoustic loading effect, along with associated changes in the vocal tract system characteristics,
only for a few sounds. More variety of sounds and their variants need to be studied further. Also,
different vowel contexts need to be examined. Different features may also be needed to understand the
differences in variations of sounds from the production point of view. Production of nonverbal speech
sounds also involves source-system coupling and involuntary changes in the glottal vibration. Hence,
this study would be helpful for the analysis of nonverbal speech sounds, studied in further chapters.
Chapter 5
Analysis of Shouted Speech
5.1 Overview
Production of shout can be emotional or situational. Shouted speech is normally produced when the
speaker is emotionally charged during interaction with other human beings or with a machine. Some
emotions, like anger or disgust, may also be related to the production of shouted speech. Production of
shouted speech can also be situational, for example to warn or to raise alarm or to seek help. Irrespective
of whether the production of shouted speech is emotional or situational, its production characteristics
are expected to deviate significantly from those of normal speech of a person. Hence, in the emotional
speech category of nonverbal speech sounds, shouted speech in particular is considered for the detailed
analysis. Most of these changes in the production characteristics of shout are expected to be in the
source of excitation. These characteristics are reflected in the speech signal, and they are perceived and
discerned well by human listeners.
The production characteristics of shout appear to change significantly in the excitation source, mainly due to differences in the vibration of the vocal folds at the glottis. Hence, in this study, the production characteristics of shout are examined from both the EGG and the speech signals. The effect of coupling of the excitation source with the vocal tract system is examined for four levels of loudness. The source-system coupling leads to significant changes in the spectral energy in the low frequency region relative to
that in the high frequency region for shouted speech. Their ratio as well as the degree of fluctuations in
the low frequency spectral energy reduce for shout as compared to those for normal speech.
The chapter is organized as follows: Different loudness levels of shouted speech considered for
analysis are described in Section 5.2. In Section 5.3, the features related to both the excitation source and
vocal tract system characteristics of shouted speech are described. Section 5.4 gives details of the data
collected for analysis in this study. In Section 5.5, the EGG signals for normal and shouted speech are
analysed to study changes in the characteristics of glottal vibration during shouting. In Section 5.6,
features are derived from the speech signal to discriminate shouted speech from normal. Some of these
features can be related to the characteristics of the glottal vibration. Analysis results are discussed in
Section 5.7. Finally, a summary of this chapter is given in Section 5.8, along with research contributions.
5.2 Different loudness levels in emotional speech
Production of emotional speech at different volume levels, especially in the case of shout, depends
upon the context information. Out of numerous possible contexts and scenarios, it is relatively easier to
investigate the volume level of production in the context of vowel regions, due to relatively steady behaviour of the vocal tract and excitation features in these regions. Hence, the production characteristics
of shouted speech signal are examined in different vowel contexts in this study.
In this study, we examine the production characteristics of shouted speech, also termed as shout,
which is a kind of extreme deviation from normal speech in terms of loudness level. In order to understand the characteristics of shout relative to those of normal speech, the loudness levels are further
sub-classified into whisper, soft, normal, loud and shout, in the increasing order of loudness level [219].
Among these, whisper is mostly unvoiced speech, and the other four, i.e., soft, normal, loud and shout,
are voiced speech. Although many subtle variations at the intermediate levels of loudness can possibly
be produced by people, these four coarse loudness levels in voiced speech are examined in detail from
the production point of view, with focus on discriminating shout from normal speech.
Variation in loudness is used in normal conversation to communicate different meaning, context,
semantics, emotion, expression and intent. “We produce differences in loudness by using greater activity
of our respiratory muscles so that more air is pushed out of lungs. Very often when we do this we also
tense the vocal folds so that pitch goes up” [97]. The vibrations of the vocal folds are caused by the air
pressure from the lungs, which pushes them apart and brings them together, repeatedly. In the case of
shout it is possible that the vocal folds vibrate faster. This increased rate of vibration perhaps leads to the
perception of higher pitch for shouted speech compared to normal speech. Changes in the vibrations of
the vocal folds also affect the open and closed phases in each glottal cycle. The degree of these changes
in the case of shout in relation to normal speech is expected to be much larger than the changes in the
case of soft or loud speech with respect to normal. There exists coupling between the excitation source
and the vocal tract system. This source-system coupling is reflected in the changes in the characteristics
of both the excitation source and the vocal tract system.
5.3 Features for analysis of shouted speech
In this study, apart from the known features like F0 and signal energy, features like the proportion
of the close phase region within glottal cycle, ratio of energies in the low and high frequency regions
of the short-time spectrum, and the standard deviation of the temporal fluctuations in the low frequency
spectral energy are investigated. Among these, the last feature, namely the temporal fluctuations in
the low frequency spectral energy, seems to play a major role in discriminating shout from normal
speech. Methods used for extracting these features are described in this section. It is to be noted (as
explained in later sections) that while most of these features may be obtained using the conventional
speech processing methods, such as short-time spectrum analysis and linear prediction analysis, some
Figure 5.1 (a) Signal waveform (xin [n]), (b) EGG signal (ex [n]), (c) LP residual (rx [n]) and (d) glottal pulse obtained from LP residual (grx [n]) for a segment of normal speech. Note that all plots are
normalized to the range of -1 to +1.
Figure 5.2 (a) Signal waveform (xin [n]), (b) EGG signal (ex [n]), (c) LP residual (rx [n]) and (d) glottal pulse obtained from LP residual (grx [n]) for a segment of shouted speech. Note that all plots are
normalized to the range of -1 to +1.
new methods like ZTL described in this section seem to help in highlighting some of the excitation
features better than the conventional methods do.
(A) Excitation source features
The glottal excitation source features F0 and SoE are extracted from the speech signal using the
zero-frequency filtering (ZFF) method [130, 216] discussed in Section 3.5, and are shown in Fig. 4.2.
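The ZFF method itself is described in Section 3.5; the following is only a minimal illustrative sketch of the general recipe in Python (two zero-frequency resonators followed by repeated trend removal). The window length, the number of trend-removal passes and the SoE normalization used here are assumptions for illustration, not the exact settings of this work.

```python
import numpy as np

def zff(s, fs, avg_pitch_hz=120.0):
    """Sketch of zero-frequency filtering (ZFF) for estimating epochs (GCIs),
    instantaneous F0 and strength of excitation (SoE)."""
    x = np.diff(np.asarray(s, dtype=float), prepend=float(s[0]))  # differenced speech
    y = x.copy()
    for _ in range(4):                        # two zero-frequency resonators
        y = np.cumsum(y)
    win = int(1.5 * fs / avg_pitch_hz) | 1    # ~1.5 average pitch periods (odd length)
    kernel = np.ones(win) / win
    for _ in range(3):                        # repeated local-mean (trend) removal
        y = y - np.convolve(y, kernel, mode="same")
    zc = np.where((y[:-1] < 0) & (y[1:] >= 0))[0]      # +ve zero crossings = epochs
    zc = zc[(zc > 0) & (zc < len(y) - 1)]
    f0 = fs / np.diff(zc)                              # instantaneous F0 (Hz)
    soe = np.abs(y[zc + 1] - y[zc - 1]) / 2.0          # slope of ZFF output at each epoch
    soe = soe / (soe.max() + 1e-12)                    # normalized SoE
    return y, zc, f0, soe
```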
(B) Glottal pulse shape features
The EGG and the differenced EGG (dEGG) signals are used to study the changes in the features of
the glottal pulse for shouted speech. The open and closed phase regions of glottal cycles of an EGG
signal are shown in Figure 4.6. The open and closed phase regions are identified using the differenced
EGG signal shown in Figure 4.6(b). Note that the amplitude of the EGG signal is larger during the
closed phase region due to low impedance across the vocal folds.
The features of the glottal pulse are also present in the acoustic signal. But it is not always possible
to derive the shape of the glottal pulse from the speech signal using inverse filtering, except in a few
cases of clean signals [50, 58, 59]. Linear prediction (LP) residual signal is obtained by passing the
speech signal through the inverse filter derived from LP analysis [112]. A 12th order LP analysis on
the signal sampled at 10000 samples per second is used. The sequence of glottal pulses is obtained
by integrating the LP residual signal twice, as the differenced (preemphasized) signal is used for LP
analysis. Figures 5.1 and 5.2 show the speech signal waveform, the EGG signal, the LP residual and
the derived glottal pulse, for normal and shouted speech signals, respectively. The glottal pulse shapes
obtained from the EGG signals are shown in Figures 5.1(b) and 5.2(b), and those obtained from the LP
residual are shown in Figures 5.1(d) and 5.2(d). It can be seen that it is difficult to clearly demarcate
the closed and open phase regions of glottal pulses in Figures 5.1(d) and 5.2(d). On the other hand
the EGG waveforms in Figures 5.1(b) and 5.2(b) are clearer in terms of the closed and open phase
regions as discussed with reference to Figure 4.6(b). Note that improvements in inverse filtering for
deriving the glottal pulses do not seem to give the open and closed phase characteristics of the EGG
waveform, mainly because of the difficulty in cancelling out the effect of the vocal tract system by
inverse filtering [2, 4, 3].
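A minimal sketch of the procedure described above is given below (in Python, using librosa for the LP analysis and scipy for filtering). The preemphasis coefficient and the leaky integrators used to limit drift during the double integration are assumptions for illustration, not values taken from this work.

```python
import numpy as np
from scipy.signal import lfilter
import librosa

def glottal_pulse_from_lp(s, fs=10000, order=12, leak=0.99):
    """Sketch: 12th-order LP analysis of speech at 10 kHz, inverse filtering to
    get the LP residual, and double (leaky) integration of the residual to
    approximate the glottal pulse sequence."""
    x = lfilter([1.0, -0.97], [1.0], np.asarray(s, dtype=float))  # preemphasis (differencing)
    a = librosa.lpc(x, order=order)                               # LP coefficients, a[0] = 1
    residual = lfilter(a, [1.0], x)                               # inverse filtering
    g = lfilter([1.0], [1.0, -leak], residual)                    # first integration
    g = lfilter([1.0], [1.0, -leak], g)                           # second integration
    return residual, g / (np.abs(g).max() + 1e-12)                # normalized to [-1, 1]
```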
(C) Spectral features
The excitation source seems to contribute to the discrimination between shouted and normal speech.
The variations due to excitation source can be captured well if the temporal variations of the spectral
features are extracted. Ideally, it is desirable (if possible) to derive the spectral information in the signal
around each sampling instant. Spectrograms derived using short-time spectrum analysis of speech are
generally used to provide visual representation of information of both the excitation source and the vocal
tract system. The spectrograms do reveal differences between normal and shouted speech, in the energy
distributions in the low and high frequency regions, and also in the separation of the pitch harmonics.
But it is difficult to isolate the contributions of the excitation source and the vocal tract system effectively
in the normal spectrograms, due to trade-off between spectral and temporal resolution.
In this study, the ZTL method [40, 38] discussed in Section 3.6 is used, to capture the spectral features
of the speech signal with improved temporal resolution. A 3-dimensional (3D) HNGD plot for a segment
of shouted speech signal is shown in Figure 3.4(a), in which the HNGD values are shown for every
sampling instant. The HNGD spectrum sliced temporally at every glottal closure instant (epoch) is
shown in Figure 3.4(b), in a 3D mesh form. The temporal variations in the instantaneous HNGD spectra
over the open and closed phases of glottal pulses are exploited in this study to discriminate between
shouted and normal speech, discussed further in Section 5.6.1.
(D) System resonance features
The production characteristics of speech characterise the combined role of both the excitation source
and the vocal tract system. The vocal tract resonance characteristics can be derived from the LP spectrum obtained using the LP analysis method, as discussed in Section 3.7. The shape of the LP spectrum
represents the resonance characteristics of the vocal tract shape for a frame of speech signal, as illus94
trated for a frame of speech signal, as illustrated in Fig. 3.5. These resonance features extracted for
shout and normal speech are used in the analysis further in Section 5.6.3.
5.4 Data for analysis
Speech data for this study was collected from a total of 17 speakers (10 males and 7 females), each
is a researcher in the Speech and Vision Lab at IIIT, Hyderabad. The data was collected in an ordinary
laboratory environment, where other members of the laboratory were working during collection of data.
The ambient noise level was about 50 dB. Simultaneous recordings were made using close speaking
microphone (Sennheiser ME-03) and the electroglottograph (EGGs for Singers [121]). The data was
recorded using a sampling rate of 48000 samples per second and 16-bits per sample. The data was downsampled to 10000 samples per second for analysis using ZFF, LP and ZTL to derive the HNGD spectra.
Each of the 17 speakers was asked to speak 3 sentences, each in 5 distinct volume (loudness) levels
of utterances. The 3 sentences are: (i) “Where were you?” (ii) “Don’t stop, go on.” (iii) “Please
help!” The 5 different loudness levels in which each speaker produced utterances for each text are:
(a) normal, (b) soft, (c) whisper, (d) loud and (e) shout. Each speaker repeated the utterances twice, and
the utterances judged (by listening) to have the best 5 distinct levels of loudness were chosen for each speaker for each text. The reason for this choice is that some speakers found it difficult to produce speech at some loudness levels. Even so, some speakers still found it difficult to produce soft speech. This is
reflected in the objective evaluations described in Sections 5.5 and 5.6. The data in whisper mode was
collected only for differentiating it from soft voice, while recording. Since whispered speech is unvoiced
and/or unaspirated most of the time, that data is not used in this study. Thus the database consists of a
total of 204 utterances (17 speakers, 3 sentences and 4 levels of loudness). The text of the utterances
was chosen to encompass different vowel contexts.
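For reference, the downsampling from 48000 to 10000 samples per second mentioned above can be done with a standard polyphase resampler; a minimal sketch follows (the file name and the use of the soundfile library are assumptions for illustration).

```python
import soundfile as sf
from scipy.signal import resample_poly

# Hypothetical file recorded at 48000 samples per second, 16 bits per sample.
x, fs = sf.read("S1_text1_shout.wav")
assert fs == 48000
x_10k = resample_poly(x, up=5, down=24)   # 48000 * 5 / 24 = 10000 samples per second
```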
5.5 Production characteristics of shout from EGG signal
The EGG signal is directly proportional to the current flow across the two electrodes placed on either
side of the glottis. Since the resistance encountered is lower during the closed phase of the vocal folds
(shown in Fig. 4.6), the current flow is higher as compared to that in the open phase in each glottal
cycle. The same can also be verified from the dEGG signal, where the location of the positive peak
corresponds approximately to the instant of glottal closure.
Shout signals are produced by the vibration of the vocal folds at the glottis. It is possible that these vocal folds vibrate faster, with a changed rate of their opening/closing, in the case of shout. This increased
rate of vibration, manifested as shorter period compared to normal, perhaps leads to the perception of
higher pitch. It is also possible that if the vocal folds at the glottis are opening/closing at a changed
rate, it reflects in the corresponding change in the ratio of the open/closed phase region to the period of
the glottal cycle for shouted speech. Hence, a comparative study was carried out to examine the glottal
pulse characteristics, especially the relative durations of open/closed phase regions in each glottal cycle,
for normal and shouted speech.
Differences in the production characteristics of normal and shouted speech are clearly visible in the
EGG/dEGG signals, as shown in Figures 5.1(b) and 5.2(b), respectively. Increase in the instantaneous
fundamental frequency (F0 ) is evident in Fig. 5.2(b). In the EGG waveforms, the closed phase regions
are identified automatically as shown in Fig. 4.6. That is, the region between the positive peak and
the negative peak for each glottal cycle in the differenced EGG is marked as closed phase region. The
average values of the ratio (α) in percentage (%) of the closed phase region to the period of the glottal
cycle for different utterances and by different people are computed for soft, normal, loud and shouted
speech. These α values are computed over only the voiced regions of the utterance in each case. Then
average α value is computed for each speaker for all 3 utterances (of 3 texts) for a specific loudness
level. The percentage changes (∆α) in the average values of α for soft, loud and shout with respect to
normal speech are given in columns (a), (b) and (c) in Table 5.1, respectively, for each speaker. From
these results, it is evident that the change in ratio (α), i.e., ∆α, generally increases in the case of loud
or shouted speech, and is mostly reduced (i.e., -ve) in the case of soft voice. In a few cases like for
speaker 6 (S6(F)), speaker 16 (S16(F)), and speaker 4 (S4(F)) (particularly female speakers), the ∆α
value is observed to be positive for soft as compared to normal speech. This may possibly be due to
difficulty in the production of soft speech by the speaker. However, for two of these three speakers the
values of ∆α for shout are observed to be higher. For one speaker (S16(F)), of course the ∆α values
are lower for shout in comparison with soft voice, as the speaker was not able to control the voice at
different loudness levels.
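A minimal sketch of this computation of α from the EGG signal is given below (in Python, using scipy.signal.find_peaks); the peak-picking thresholds and the F0 limits used to skip irregular cycles are assumptions for illustration.

```python
import numpy as np
from scipy.signal import find_peaks

def closed_phase_ratio(egg, fs, min_f0=80.0, max_f0=500.0):
    """Sketch of alpha, the ratio of the closed phase duration to the glottal
    period: in the differenced EGG (dEGG), the positive peak marks the glottal
    closure instant and the following negative peak marks the opening instant."""
    degg = np.diff(np.asarray(egg, dtype=float))
    min_dist = int(fs / max_f0)
    closures, _ = find_peaks(degg, distance=min_dist, height=0.3 * degg.max())
    openings, _ = find_peaks(-degg, distance=min_dist, height=0.3 * (-degg).max())
    alphas = []
    for c1, c2 in zip(closures[:-1], closures[1:]):
        if not (min_f0 <= fs / (c2 - c1) <= max_f0):
            continue                              # skip irregular / unvoiced stretches
        o = openings[(openings > c1) & (openings < c2)]
        if o.size:
            alphas.append((o[0] - c1) / float(c2 - c1))
    return float(np.mean(alphas)) if alphas else float("nan")
```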
In general, it was observed that speakers expressed difficulty in producing speech in soft voice,
whereas most speakers felt that it is easier to produce shouted speech. The values for individual speakers
and utterances were computed to observe the variability, and then the average values of the features were
computed over all utterances for each speaker. The average values across all speakers and utterances for
each loudness level indeed show that for shouted speech the proportion of closed phase in a glottal cycle
is much higher than for normal, and this and related features are exploited in discriminating shouted
speech from normal.
5.6 Analysis of shout from speech signal
As mentioned earlier, it is difficult to derive the features of the glottal pulse shape from the speech
signal itself by using an inverse filter. Non-availability of the EGG signal may restrict the application of the α
feature in practice. Moreover, speech signal carries much more information about the excitation than the
EGG, due to the effect of the glottal vibration on the pressure of air from the lungs. In this section we
shall examine methods of deriving features related to the excitation information and some other features
from the speech signal in order to discriminate shouted speech from normal speech.
Table 5.1 The percentage change (∆α) in the average values of α for soft, loud and shout with respect to that of normal speech is given in columns (a), (b) and (c), respectively. Note: Si below means speaker number i (i = 1 to 17).

Speaker # (M/F) | (a) ∆αSoft (%) | (b) ∆αLoud (%) | (c) ∆αShout (%)
S1 (M) | -13.0 | 10.1 | 14.9
S2 (M) | 2.0 | 16.3 | 29.9
S3 (F) | 1.7 | 9.4 | 6.1
S4 (F) | 9.7 | 10.7 | 25.7
S5 (M) | -15.0 | 7.2 | 8.1
S6 (F) | 18.5 | 2.2 | 25.8
S7 (F) | -7.5 | 13.8 | 20.3
S8 (M) | -7.2 | 0.6 | 8.5
S9 (M) | -2.6 | 6.3 | 6.4
S10 (M) | -10.4 | 11.2 | 24.1
S11 (F) | -0.7 | 4.9 | 1.6
S12 (M) | -12.6 | 7.0 | 13.7
S13 (M) | -16.8 | -1.0 | 7.4
S14 (F) | -14.9 | 9.2 | 16.9
S15 (M) | -10.6 | 4.4 | 5.5
S16 (F) | 13.4 | 5.5 | 2.2
S17 (M) | -13.4 | 11.5 | 25.5
Average | -4.71 % | 7.53 % | 14.29 %

5.6.1 Analysis from spectral characteristics
Variation in the value of α for speech at different loudness levels does have an effect on the spectral
characteristics. Coupling of the sub-glottal cavities with the supra-glottal cavities during the open phase
of the vocal folds results in a resonance in the low frequency (< 400 Hz) region due to increase in the
effective length of the vocal tract. This in turn results in higher energy in the low frequency region for
soft and normal speech, compared to loud and shouted speech. The lower energy in the low frequency
region for loud and shouted speech is also because of the smaller open phase region in the glottal cycles in these
cases, compared to soft and normal speech.
To examine the effects of excitation from the speech signal, spectral features with good temporal
resolution are derived. For this purpose, the HNGD spectrum (as described in Section 3.6) is computed
at each sampling instant of time using a window of 5 msec. The HNGD spectrum is normalized by dividing the spectrum values at every instant of time by their sum. The
normalized HNGD spectra at every sampling instant of time are shown in Fig. 5.3, for segments of
speech in soft, normal, loud and shout modes.
The HNGD spectrum plots clearly bring out the distribution of energy in the low frequency region in
the speech signal. The higher temporal resolution of the spectral features helps in observing the effects of
open and closed regions of glottal cycles on the spectrum. The presence of low frequency (< 400 Hz)
spectral energy due to increased open phase of the glottal cycle can be seen clearly for soft speech
Figure 5.3 HNGD spectra along with signal waveforms for a segment of (a) soft, (b) normal, (c) loud
and (d) shouted speech. The segment is for the word ‘help’ in the utterance of the text ‘Please help!’.
Arrows point to the low frequency regions.
(pointed by arrows in Fig. 5.3). The low frequency region has higher energy (darker regions) and larger
fluctuations (less uniform distribution of energy) in the case of soft mode in comparison with that for
shout mode. The diminishing low frequency spectral energy is due to gradual reduction in the open
phase for loud and shout cases, as compared to that for normal speech. Examining these source features
through the HNGD spectra of the speech signal appears to be a promising proposition.
The energies in the low (0-400 Hz) and high (800-5000 Hz) frequency bands of the normalized
HNGD spectra are averaged over a moving window of 10 msec duration for every sampling instant. The
smoothing effect of this window helps to highlight the gross temporal variations in energies, for different
cases of loudness levels. The temporal variations in the low frequency spectral energy (LFSE) and the
high frequency spectral energy (HFSE) over the duration of a word for soft, normal, loud and shouted
speech are shown in Figures 5.4 and 5.5, respectively. It may be observed from these figures that there
is gradual decrease in the average level of the low frequency HNGD energy (the dotted line region in
Fig. 5.4), and increase in the average level of the high frequency HNGD energy (the dotted line region
in Fig. 5.5) in the vowel regions, for the 4 cases of speech with increasing loudness levels. There is
gradual reduction in the amplitudes of the fluctuations in the LFSE and gradual increase in the frequency
of these fluctuations for the 4 cases of loudness levels, although the spread of these fluctuations is not
seen clearly in Figures 5.4 and 5.5 due to smoothing of the LFSE and HFSE contours.
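A minimal sketch of deriving the smoothed LFSE and HFSE contours from the HNGD spectra is given below; it assumes the HNGD spectra of Section 3.6 are already available as a 2-D NumPy array with one spectrum per sampling instant, together with the corresponding frequency axis (both are assumptions for illustration).

```python
import numpy as np

def band_energy_contours(hngd, freqs, fs, low=(0, 400), high=(800, 5000),
                         smooth_ms=10.0):
    """Sketch: normalize each instantaneous HNGD spectrum by its sum, take the
    energies in the 0-400 Hz (LFSE) and 800-5000 Hz (HFSE) bands, and smooth
    both contours with a 10 msec moving-average window."""
    spec = hngd / (hngd.sum(axis=1, keepdims=True) + 1e-12)   # normalize per instant
    lfse = spec[:, (freqs >= low[0]) & (freqs < low[1])].sum(axis=1)
    hfse = spec[:, (freqs >= high[0]) & (freqs < high[1])].sum(axis=1)
    win = max(1, int(fs * smooth_ms / 1000.0))
    kernel = np.ones(win) / win
    return (np.convolve(lfse, kernel, mode="same"),
            np.convolve(hfse, kernel, mode="same"))
```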
The ratio (β) of the average levels of LFSE and HFSE of the HNGD spectra is computed over the
vowel region (marked in Figure 5.4) for different texts and for different speakers. As an illustration,
the values of β computed for 2 different vowel contexts (/6/ in word ‘stop’ and /e/ in word ‘help’)
Figure 5.4 Energy of HNGD spectrum in low frequency (0-400Hz) region (LFSE) for a segment of (a)
soft, (b) normal, (c) loud and (d) shouted speech. The segment is for the word ‘help’ in utterances of
text ‘Please help!’. The vowel regions (V) are marked in these figures.
Figure 5.5 Energy of HNGD spectrum in high frequency (800-5000 Hz) region (HFSE) for a segment
of (a) soft, (b) normal, (c) loud and (d) shouted speech. The segment is for the word ‘help’ in utterances
of text ‘Please help!’. The vowel regions (V) are marked in these figures.
are given for 8 different speakers in Table 5.2 for soft, normal, loud and shouted speech. It may be
observed from Table 5.2 that, since the low frequency energy of the HNGD spectrum decreases and
the high frequency energy increases, their ratio (β) decreases fairly consistently with increasing levels
of loudness.
The visual representation of the feature β can be seen in the distribution of HFSE vs LFSE computed
over the vowel region (/e/ in ‘help’), as shown in Fig. 5.6(a). The distributions of HFSE vs LFSE computed for different vowels (/6/ in ‘stop’, /u/ in ‘you’ and /o:/ in ‘go’) are shown in Figures 5.6(b), (c) and (d).
All these figures show distinct clusters for the 4 different levels of loudness.
Standard deviation (σ) values computed for the LFSE are given in Table 5.3, for the 4 different
loudness levels and in 2 different vowel contexts (/6/ in word ‘stop’ and /e/ in word ‘help’) for 8 different
speakers. It can be observed from Table 5.3 that the fluctuations in the LF energy (LFSE) are far less in
the case of shout as compared to that for normal speech. A similar trend is observed for fluctuations in
the HF energy (HFSE) of the HNGD spectrum, although these are not as prominent as in the LFSE.
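Given the LFSE and HFSE contours, the two features β and σ over a marked vowel region reduce to a few lines; a minimal sketch follows, in which the (start, end) sample indices of the vowel region are assumed to be available from manual marking. For shouted speech both β and σ are expected to be much smaller than for soft or normal speech, consistent with Tables 5.2 and 5.3.

```python
import numpy as np

def beta_sigma(lfse, hfse, vowel_start, vowel_end):
    """Sketch of the two discriminating features computed over a vowel region:
    beta, the ratio of average LFSE to average HFSE, and sigma, the standard
    deviation of the temporal fluctuations of the LFSE."""
    lo = np.asarray(lfse[vowel_start:vowel_end], dtype=float)
    hi = np.asarray(hfse[vowel_start:vowel_end], dtype=float)
    beta = lo.mean() / (hi.mean() + 1e-12)
    sigma = lo.std()
    return beta, sigma
```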
As an illustration, the values of the β and σ features are given in Tables 5.2 and 5.3 for 2 vowel contexts (/6/ in ‘stop’ and /e/ in ‘help’) for 8 speakers. Similar trends are observed for other vowel contexts in the
Figure 5.6 Distribution of high frequency spectral energy (HFSE) vs low frequency spectral energy
(LFSE) of HNGD spectral energy computed in 4 different vowel contexts for the 4 loudness levels. The
4 vowel region contexts are: (a) vowel /e/ in word ‘help’, (b) vowel /6/ in word ‘stop’, (c) vowel /u/ in
word ‘you’ and (d) vowel /o:/ in word ‘go’. The segments are taken from the utterances by same speaker
(S4). (Color online)
Table 5.2 The ratio (β) of the average levels of LFSE and HFSE computed over a vowel segment of a speech utterance in (a) soft, (b) normal, (c) loud and (d) shout modes. Note: The numbers given in columns (a) to (d) are β values multiplied by 100 for ease of comparison.

Vowel context | Speaker # (M/F) | (a) βSoft ×100 | (b) βNormal ×100 | (c) βLoud ×100 | (d) βShout ×100
/6/ | S1 (M) | 62.9 | 22.8 | 10.8 | 6.7
/6/ | S2 (M) | 63.8 | 22.5 | 11.6 | 11.4
/6/ | S3 (F) | 19.4 | 12.0 | 5.2 | 5.0
/6/ | S4 (F) | 92.8 | 33.3 | 11.3 | 5.7
/6/ | S5 (M) | 62.6 | 35.5 | 26.2 | 8.6
/6/ | S6 (F) | 17.4 | 10.1 | 5.1 | 3.7
/6/ | S7 (F) | 17.6 | 11.7 | 6.0 | 8.5
/6/ | S8 (M) | 29.9 | 13.3 | 17.5 | 17.9
/e/ | S1 (M) | 43.9 | 15.1 | 6.0 | 3.6
/e/ | S2 (M) | 46.2 | 10.8 | 11.6 | 5.6
/e/ | S3 (F) | 15.1 | 11.9 | 8.0 | 6.3
/e/ | S4 (F) | 72.8 | 33.2 | 10.8 | 5.5
/e/ | S5 (M) | 83.2 | 42.4 | 30.4 | 16.6
/e/ | S6 (F) | 16.0 | 7.8 | 3.6 | 2.0
/e/ | S7 (F) | 18.1 | 17.1 | 5.4 | 3.0
/e/ | S8 (M) | 57.9 | 41.7 | 26.7 | 32.2
data, like for vowel /u/ in the word ‘you’, /i/ in the word ‘please’ and /o:/ in the word ‘go’, across all the
17 speakers. Thus, these features (β, σ) are useful to discriminate shout from normal speech.
It is tempting to infer that similar discriminating characteristics can be observed in the usual short-time spectra computed using a 5 msec window at every sampling instant. The normalized short-time
spectra are computed for the same segments considered in Fig. 5.3. The gross features, such as the
increase in the high frequency energy and pitch frequency with increase in loudness can be observed
in the normalized short-time spectra also. But the finer details of the spectral variations caused by the
glottal vibrations, especially in the low frequency region, are not visible prominently in the short-time
spectra in comparison with the HNGD spectra.
The difference in one such feature, namely the temporal variation of energy in the low frequency
band (0-400 Hz) of the normalized HNGD spectra is considered here for illustration. The low frequency
spectral energy (LFSE) is computed at every sampling instant. The spread of the LFSE computed from
HNGD spectra is shown in Fig. 5.7, by plotting the histogram of the LFSE values in the vowel region
for each of the four cases of loudness levels. The relative spread of the LFSE is lowest for the shouted
speech and is largest for the soft speech. This can also be observed from the spread of the LFSE values
from the cluster plots shown in Fig. 5.6.
Fig. 5.8 gives the relative spread of the LFSE computed for the same segments using short-time
spectra. The square root of the short-time spectra are used to reduce the dynamic range of the spectra.
In this case the discrimination among the four cases of loudness levels is not as clear as in Fig. 5.7.
Similar difficulty in discrimination was observed in the cluster plots as in Fig. 5.6 when they are obtained
Table 5.3 Average values of standard deviation (σ), capturing temporal fluctuations in LFSE, computed over a vowel segment of a speech utterance in (a) soft, (b) normal, (c) loud and (d) shout modes. Note: The numbers given in columns (a) to (d) are σ values multiplied by 1000 for ease of comparison.

Vowel context | Speaker # (M/F) | (a) σSoft ×1000 | (b) σNormal ×1000 | (c) σLoud ×1000 | (d) σShout ×1000
/6/ | S1 (M) | 88.6 | 35.1 | 16.7 | 13.5
/6/ | S2 (M) | 95.9 | 44.4 | 18.0 | 23.5
/6/ | S3 (F) | 35.2 | 16.4 | 11.3 | 10.7
/6/ | S4 (F) | 67.6 | 27.5 | 11.9 | 9.2
/6/ | S5 (M) | 96.3 | 42.3 | 40.5 | 13.3
/6/ | S6 (F) | 25.3 | 15.2 | 16.4 | 11.2
/6/ | S7 (F) | 14.2 | 9.6 | 9.5 | 9.1
/6/ | S8 (M) | 72.8 | 36.2 | 42.6 | 32.6
/e/ | S1 (M) | 86.1 | 23.6 | 17.5 | 5.4
/e/ | S2 (M) | 71.8 | 16.0 | 15.6 | 8.1
/e/ | S3 (F) | 30.7 | 22.9 | 11.2 | 15.2
/e/ | S4 (F) | 66.5 | 34.1 | 9.7 | 6.0
/e/ | S5 (M) | 93.4 | 50.4 | 51.4 | 29.3
/e/ | S6 (F) | 31.2 | 13.2 | 11.0 | 4.6
/e/ | S7 (F) | 20.6 | 21.9 | 14.8 | 6.7
/e/ | S8 (M) | 69.7 | 67.3 | 66.3 | 72.0
from short-time spectra for different loudness levels. Thus these comparative studies between HNGD
spectrum and short-time spectrum bring out the importance of the temporal resolution of the spectral
features in highlighting the features of excitation, especially the effect of coupling of the excitation
source and the vocal tract system.
5.6.2 Analysis from excitation source characteristics
The instantaneous fundamental frequency (F0), representing pitch, is computed from the speech signal using the ZFF method. The average F0 values for each speaker are computed over the voiced regions for utterances in soft, normal, loud and shout modes. In Table 5.4, the percentage changes in the average pitch frequency (∆F0), relative to that for normal speech, are given in columns (a), (b) and (c), respectively, for soft, loud and shouted speech.
Note that while the shouted speech is associated with rise in pitch with respect to the pitch of normal
speech, the opposite is not true. A rise in pitch does not necessarily increase the loudness level, nor would audio with raised pitch alone sound like shouted speech. To verify this assertion, speech
from 5 speakers (3 males, 2 females) for utterances of 3 different texts was recorded with pitch raised
intentionally, but without letting it sound as loud or shout. For comparison, the speech was also recorded
in normal and shout mode for each speaker. The EGG output was also collected along with the acoustic
signal. Table 5.5 gives the average values of the ratio (α) of closed phase region to the period of glottal
cycle for normal, high-pitch (non-shout) and shouted speech, in columns (a), (b) and (c), respectively,
Figure 5.7 Relative spread of low frequency spectral energy (LFSE) of ‘HNGD spectra’ computed over
a vowel region segment (LF SE(vowel)), for the 4 loudness levels, i.e., soft, normal, loud and shout.
The segment is for the vowel /e/ in the word ‘help’ in the utterance of the text ‘Please help!’.
Figure 5.8 Relative spread of low frequency spectral energy (LFSE) of ‘Short-time spectra’ computed
over a vowel region segment (LF SE(vowel)), for the 4 loudness levels, i.e., soft, normal, loud and
shout. The segment is for the vowel /e/ in the word ‘help’ in the utterance of the text ‘Please help!’.
for 5 different speakers. The corresponding average values of the instantaneous fundamental frequency
(F0 ) are given in columns (d), (e) and (f). It can be observed from Table 5.5 that utterances with raised
pitch (F0 in column (e)) result in reduced average values of α (column (b)), as compared to that for the
normal speech (column (a)). These α values for normal and high pitched voices are in sharp contrast
with the corresponding values for the shouted speech, given in column (c) in Table 5.5.
The average F0 and SoE values for normal and shouted speech are given in Table 5.6, in columns
(d) and (e), and (f) and (g), respectively. These values are given for 5 different vowel contexts from the
dataset, as representative cases. The changes in F0 and SoE for shouted speech, i.e., ∆F0 (= F0Shout −
F0N ormal ) and ∆SoE (= SoEShout − SoEN ormal ) are given in columns (h) and (j), respectively. The
changes in F0 and SoE, in terms of the percentage of their values for normal speech, are given in
columns (i) and (k), respectively. It may be observed from these results that F0 always increases in the
case of shouted speech, with respect to F0 of normal speech. However, SoE may increase in some cases
Table 5.4 The percentage change (∆F0) in the average F0 for soft, loud and shout with respect to that of normal speech is given in columns (a), (b) and (c), respectively. Note: Si below means speaker number i (i = 1 to 17).

Speaker # (M/F) | (a) ∆F0Soft (%) | (b) ∆F0Loud (%) | (c) ∆F0Shout (%)
S1 (M) | -16.7 | 29.9 | 84.6
S2 (M) | -8.0 | 26.2 | 82.4
S3 (F) | 3.6 | 10.6 | 17.5
S4 (F) | -6.9 | 28.4 | 49.1
S5 (M) | -10.5 | 10.6 | 58.4
S6 (F) | -3.8 | 22.2 | 53.2
S7 (F) | -11.4 | 7.6 | 11.1
S8 (M) | -3.4 | 16.7 | 27.6
S9 (M) | -3.6 | 30.4 | 68.3
S10 (M) | -2.3 | 8.5 | 82.9
S11 (F) | -5.9 | 3.1 | 1.9
S12 (M) | -28.5 | 20.3 | 38.5
S13 (M) | -3.6 | 13.1 | 57.9
S14 (F) | -3.9 | 17.0 | 69.7
S15 (M) | -4.0 | 15.0 | 43.8
S16 (F) | -7.4 | -0.7 | 1.4
S17 (M) | -1.0 | 18.0 | 36.6
Average | -6.94 % | 16.29 % | 46.24 %
and may decrease in some other cases, though a significant magnitude of change in SoE, i.e., |∆SoE|,
is another indication of shouted speech.
5.6.3 Analysis using dominant frequency feature
It is generally difficult to see the changes in the characteristics of excitation through spectral features,
as these features represent the combined effect of the characteristics of both the vocal tract system and
the excitation, with the vocal tract characteristics dominating. It is difficult to derive the spectral
characteristics with good temporal resolution to provide discrimination between the closed and open phase
regions of glottal vibration [40]. However, the manifestation of the dominating vocal tract characteristics,
along with the changes in excitation characteristics, can be observed in spectral characteristics like
the LP spectrum. But the challenge of achieving good temporal resolution remains, since the LP spectrum
is computed over a frame of the speech signal.
It is observed from the comparison of the LP spectra of corresponding frames of speech signal for
normal and shout modes, that there are changes in the locations and magnitudes of spectral peaks in
the case of shouted speech. The locations of spectral peaks in the LP spectrum, represented in terms of
frequencies, indicate the combined effect of resonance characteristics of the vocal tract as well as of the
excitation source characteristics. The location of these spectral peaks may sometimes be closely related
to the formants but it is not always necessary. It is also observed from the comparison of LP spectra that
Table 5.5 The average values of the ratio (α) of the closed phase to the glottal period for (a) normal,
(b) raised pitch (non-shout) and (c) shouted speech, respectively. Columns (d), (e) and (f) are the
corresponding average fundamental frequency (F0) values in Hz. The values are averaged over 3 utterances
(for 3 texts) for each speaker. Note: Si below means speaker number i (i = 1 to 5). F0 values are
rounded to the nearest integer.

Speaker # (M/F)   (a) αNormal   (b) αHighPitch   (c) αShout   (d) F0Normal (Hz)   (e) F0HighPitch (Hz)   (f) F0Shout (Hz)
S1 (M)               0.445           0.368           0.555            151                  416                   264
S2 (F)               0.368           0.337           0.385            219                  430                   313
S3 (M)               0.515           0.303           0.535            160                  502                   293
S4 (F)               0.496           0.461           0.540            289                  622                   395
S5 (M)               0.447           0.397           0.499            116                  336                   211
Table 5.6 Results to show changes in the average F0 and SoE values for normal and shouted speech, for
5 different vowel contexts. Notations: Nm indicates Normal, Sh indicates Shout, S# indicates Speaker
number, T# indicates Text number and M/F indicates Male/Female. Note: IPA symbols for the vowels
in English phonetics are shown for the vowels used in this study.

(a) Vowel   (b) Word    (c) Speaker,   (d) F0Nm   (e) SoENm   (f) F0Sh   (g) SoESh   (h) ∆F0   (i) ∆F0/F0Nm   (j) ∆SoE   (k) ∆SoE/SoENm
context                 Text (M/F)       (Hz)                   (Hz)                  (Hz)         (%)                        (%)
/6/         /stop/      S1,T2 (M)        115       .5250        213       .2744        98         85.22        -.2506        -47.73
/e/         /help/      S4,T3 (F)        230       .7592        378       .2284       148         64.40        -.5308        -69.92
/u/         /you/       S2,T1 (M)        181       .4066        278       .8556        97         53.86         .4490        110.43
/oU/        /go/        S6,T2 (F)        204       .1788        292       .3959        88         43.23         .2171        121.42
/i:/        /please/    S5,T3 (M)        149       .1790        241       .6472        92         61.84         .4682        261.56
the location of the highest peak of the LP spectrum changes for the shouted speech as compared to that
for normal speech. This change is more prominent for certain frames taken at different instants of time,
from speech signals in shout and normal modes.
The location of the highest peak in the LP spectrum seems to have the dominating effect and characterises the frame of the signal at that particular instant of time. Hence, we have termed it the dominant
frequency (FD) in this study. The dominant frequency (FD) at some time instant t in the speech signal can
be obtained from the location of the highest peak in the LP spectrum of a signal frame at t, as is shown
in Fig. 3.5. The FD value derived from speech signal for each sampling instant provides a production
characteristic of speech signal with good temporal resolution.
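As a rough sketch of how FD could be obtained at a given instant, the following Python snippet (assuming only NumPy) computes the LP coefficients of a short frame by the autocorrelation method (Levinson-Durbin) and picks the frequency of the highest peak of the LP magnitude spectrum. The 5th-order LP and 10 ms frame follow the settings mentioned later in this section; the FFT length and windowing are illustrative choices, not the exact implementation of this study.

import numpy as np

def lp_coefficients(frame, order):
    # LP coefficients a[0..order] (a[0] = 1) via the Levinson-Durbin recursion
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_new = a.copy()
        a_new[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a_new[i] = k
        a = a_new
        err *= (1.0 - k * k)
    return a

def dominant_frequency(frame, fs, order=5, nfft=1024):
    # FD: frequency (Hz) of the highest peak of the LP spectrum 1/|A(e^{jw})|
    windowed = frame * np.hamming(len(frame))
    a = lp_coefficients(windowed, order)
    lp_spectrum = 1.0 / (np.abs(np.fft.rfft(a, nfft)) + 1e-12)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    return freqs[np.argmax(lp_spectrum)]

# Example (hypothetical): 10 ms frame of a 10 kHz signal around instant t
# fd = dominant_frequency(signal[t:t + 100], fs=10000, order=5)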
The illustrations of changes in FD for normal and shouted speech are shown in Figures 5.9(d)
and 5.10(d), respectively. The changes in the corresponding F0 and SoE contours are also shown
in each figure. It may be observed that changes in the characteristics of the excitation source (F0 , SoE),
and that of the system and the source combined (FD ), are significant in certain segments of normal
and shouted speech. These changes are more prominent around vowel regions. The mean and standard
deviation of these temporally changing FD values are computed, to help detection of shouted speech.
Figure 5.9 Illustration of (a) signal waveform, (b) F0 contour, (c) SoE contour and (d) FD contour for
a segment “you” of normal speech in male voice.

Figure 5.10 Illustration of (a) signal waveform, (b) F0 contour, (c) SoE contour and (d) FD contour for
a segment “you” of shouted speech in male voice.
The changes in FD values for corresponding segments of normal and shouted speech are given in
Table 5.7, for the words in the context of 5 different vowels in English phonetics. The mean and
standard deviation (σ) of FD values for normal speech are given in columns (d) and (e), and for
shouted speech in columns (f) and (g). The change in the mean of FD values, i.e., ∆FD = FDShout −
FDNormal, and the percentage change in FD with respect to FDNormal, i.e., ∆FD/FDNormal, are given
in columns (h) and (i), respectively. The changes in the ratios of standard deviation (σ) to mean FD
values from normal to shouted speech, i.e., (σFDShout/FDShout − σFDNormal/FDNormal), are given in
column (j). A 5th order LP analysis is used to obtain the LP spectrum at each time instant. The FD
values are derived from the LP spectra for different frame sizes (5, 10, and 20 msec) taken at each instant
of time. Among these, the frame size of 10 msec is considered the most suitable.
It may be observed from the Table 5.7, that changes in the mean FD values are consistent across all
the vowel contexts. Similar changes are also observed for words in other vowel contexts in the dataset.
Table 5.7 Results to show changes in the Dominant frequency (FD) values for normal and shouted
speech, for 5 different vowel contexts. Notations: Nm indicates Normal, Sh indicates Shout, S# indicates
Speaker number, T# indicates Text number and M/F indicates Male/Female. Note: IPA symbols for the
vowels in English phonetics are shown for the vowels used in this study.

(a) Vowel   (b) Word    (c) Speaker,   (d) FDNm   (e) σFDNm   (f) FDSh   (g) σFDSh   (h) ∆FD   (i) ∆FD/FDNm   (j) σFDSh/FDSh − σFDNm/FDNm
context                 Text (M/F)       (Hz)       (Hz)        (Hz)       (Hz)       (Hz)         (%)                     (%)
/6/         /stop/      S1,T2 (M)        775        295          945        193        171         22.04                 -17.73
/e/         /help/      S4,T3 (F)        538        224         1689        999       1150        213.77                  17.61
/u/         /you/       S2,T1 (M)        424        230         1448        658       1024        241.28                  -8.72
/oU/        /go/        S6,T2 (F)        859        331         1162        962        303         35.29                  44.24
/i:/        /please/    S5,T3 (M)        595        337         1122        758        526         88.40                  10.90
The changes in the range of fluctuations in FD values, represented by the standard deviation (σ) of
FD values, do not exhibit any definite trend. However, the changes in the mean FD values for shouted
speech as compared to that for normal speech, indicate that changes in the mean FD value could be a
key characteristic of shouted speech.
Combined evidence of changes in the mean FD, along with F0 and SoE, may help in detecting
shouts in continuous speech. It may be noted that these discriminating features are relatively simple to
derive and computationally inexpensive. A decision logic can be devised using these production
features for automatic detection of shouted speech.
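For instance, a decision logic of the kind mentioned above could combine the relative changes in F0, SoE and mean FD with simple thresholds. The sketch below is purely hypothetical: the threshold values and the reference (normal-speech) statistics are assumed placeholders, not results of this study.

def is_shout(f0_mean, soe_mean, fd_mean,
             f0_ref, soe_ref, fd_ref,
             delta_f0_min=0.3, delta_soe_min=0.4, delta_fd_min=0.2):
    # Hypothetical decision logic: flag a segment as shouted speech when the
    # relative rise in F0 and in mean FD, and the magnitude of change in SoE,
    # all exceed assumed thresholds (placeholder values).
    rise_f0 = (f0_mean - f0_ref) / f0_ref            # F0 always rises for shout
    change_soe = abs(soe_mean - soe_ref) / soe_ref   # |dSoE| may rise or fall
    rise_fd = (fd_mean - fd_ref) / fd_ref            # mean FD increases for shout
    return (rise_f0 > delta_f0_min) and (change_soe > delta_soe_min) and (rise_fd > delta_fd_min)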
5.7 Discussion on the results
From the above discussion, the following are the features that are useful to discriminate shouted
speech from normal speech:
• Instantaneous fundamental frequency (F0 )
• Ratio (α) of closed phase region to the period of glottal pulse cycle
• Ratio (β) of LFSE to HFSE computed over vowel region, i.e., βvowel , where LFSE and HFSE
are the energies of the normalized HNGD spectra in the low (0-400 Hz) and high (800-5000 Hz)
frequency regions of the vowel segments, respectively.
• Standard deviation (σ) of the low frequency energy of the normalized HNGD spectrum, computed
over a vowel region, i.e., σ for LF SE(vowel)
• Dominant frequency (FD ) of resonances in vocal tract system
Features of speech production mechanism that contribute to the characteristics of shout are examined
in this study. Production characteristics of shout appear to be changing significantly in the excitation
source, mainly due to differences in the vibration of the vocal folds at the glottis. The consistent increase
in the average ratio (α) of the closed phase region to the glottal cycle period for increasing levels of
loudness confirms that the vocal folds at the glottis indeed have a longer closed phase for higher levels
of loudness. The vocal folds are also more tense and vibrate faster in the case of shout, thereby leading
to the perception of higher pitch. However, a mere increase in pitch does not necessarily indicate that it is
due to a shout.
The low frequency spectral energy (LFSE) decreases and the high frequency spectral energy (HFSE)
increases, with increase in the level of loudness, due to increased proportion of the closed phase within
each glottal cycle. The ratio (β) of LFSE to HFSE is lower for shouted speech as compared to that for
normal speech. The degree of temporal fluctuations (σ) in the low frequency spectral energy (LFSE)
also reduces with increasing loudness level.
5.8 Summary
Features of speech production mechanism that contribute to the characteristics of shout are examined
in this study. Production characteristics of shout appear to be changing significantly in the excitation
source, mainly due to differences in the vibration of the vocal folds at the glottis. The consistent increase
in the average ratio (α) of the closed phase region to the glottal cycle period for increasing levels of
loudness confirms that the vocal folds at the glottis indeed have a longer closed phase for higher levels
of loudness. The vocal folds are also more tense and vibrate faster in the case of shout, thereby leading
to the perception of higher pitch. However, a mere increase in pitch does not necessarily indicate that it is
due to a shout.
The effect of coupling of the excitation source with the vocal tract system is examined for 4 levels
of loudness. This coupling leads to significant change in the spectral energy in low frequency region
relative to that in the high frequency region for shouted speech. The change is consistent in all the vowel
regions. The low frequency spectral energy decreases and the high frequency spectral energy increases
with increasing loudness level, due to increased closed phase quotient in each glottal cycle. Hence,
the ratio (β) of LFSE to HFSE is lower for shouted than for normal speech. The degree of temporal
fluctuations (σ) in the LFSE also reduces with increasing loudness level.
To study the effect of coupling between the system and the source characteristics, it is necessary
to extract the spectral characteristics of speech production mechanism with high temporal resolution,
which is still a challenging task. At present the computational complexity of the HNGD spectrum is
a limiting factor for practical applications of the proposed features. As a step towards developing a
system for automatic shout detection in continuous speech, computing the dominant frequency (FD),
which captures the resonances of the vocal tract system, is proposed.
The features proposed in this study, along with the features like F0 and signal energy, may be useful
for detection of shout in continuous speech. These studies may also help in understanding the role
of excitation component in the production of shout-related emotions like anger or disgust. In the next
chapter, we examine the role of these characteristics in the production of paralinguistic nonverbal speech
sounds such as laughter.
Chapter 6
Analysis of Laughter Sounds
6.1 Overview
Production of variations in normal speech and emotional speech sounds involves the effects of
source-system interaction. Also, changes in the characteristics of excitation source in particular can
be controlled voluntarily in these cases. But, in the case of paralinguistic sounds like laughter, rapid
changes occur in the source characteristics, and the changes are produced involuntarily. Hence, different
signal processing techniques are needed for the analysis of this type of nonverbal speech sounds.
Production characteristics of laughter sounds are expected to be different from normal speech. In
this study, we examine changes in the characteristics of the glottal excitation source and the vocal
tract system, during production of laughter. Laughter at bout and call levels is analysed. Production
characteristics of the laughter-speech continuum are analysed in three categories, two of laughter as
(i) nonspeech-laugh (NSL) and (ii) laughed-speech (LS), and third (iii) normal speech (NS) as reference.
Only voiced nonspeech-laugh, produced spontaneously, is considered. Data was recorded for natural
laugh responses. In each case, both EGG [50] and acoustic signals are examined.
Changes in the glottal vibration are examined using features such as the closed phase quotient in glottal cycles [123] and F0, both derived using the differenced EGG signal [58]. Excitation source features
are also extracted from the acoustic signal, using a modification of the ZFF method [130] proposed in this study.
First, the excitation source is represented in terms of a time-domain sequence of impulse-like excitation pulses. Then, features such as the instantaneous fundamental frequency (F0), strength of
impulse-like excitation (SoE) around glottal closure instants, and density of excitation impulses (dI)
are derived. Associated changes in the vocal tract system characteristics are examined using the first two
dominant frequencies (FD1 and FD2) [124], derived from the acoustic signal using LP analysis [112]
and the group delay function [128]. Production features are also derived in terms of the sharpness of the peaks
in the Hilbert envelope [170] of the LP residual [114] of the acoustic signal. The decision on voiced/nonvoiced
regions [39] uses the framewise energy of the modified ZFF output signal. Parameters are derived to measure the degree of changes and the temporal changes in the production features that discriminate well among the
three cases NS, LS and NSL. Performance evaluation is carried out on a database with ground truth.
The chapter is organized as follows. Section 6.2 discusses the details of data collected for this
study. Changes in the glottal source characteristics during production of laughter are examined from the
EGG signal in Section 6.3. A modified zero-frequency filtering method to study the excitation source
characteristics of laughter sounds is proposed in Section 6.4. Section 6.5 discusses changes in the source
and the system characteristics examined from the acoustic signal. Excitation source characteristics are
derived using the modified zero-frequency filtering method, associated changes in the vocal tract system
using LP and group delay analysis, and some production characteristics using the sharpness of peaks
in the Hilbert envelope of LP residual of the acoustic signal. Results of the study are discussed in
Section 6.6. A summary and contributions of this chapter are given in Section 6.7.
6.2 Data for analysis
Laughter consists of both nonspeech-laugh and laughed-speech. Nonspeech-laugh (NSL), referred
to as ‘pure-laugh’ in [135], may have both voiced and unvoiced regions, but only voiced regions are
considered in this study. Laughed-speech (LS), referred to as ‘laugh-speech’ in [118], has regions of
laughter interspersed with speech, the degree of which is difficult to predict or quantify. Hence, wide
variations in the acoustic features of a laughed-speech bout are probable.
The data was collected by eliciting natural laughter responses of the subjects, playing hilarious
comedy audio-visual clips or audio clips of jokes, sourced from online media resources. The subjects were
asked to express their responses, in case they really liked the comedy or joke, as one of the following
three texts: (i) “It is a good joke.” (ii) “It is really funny.” (iii) “I have enjoyed.” The idea of using
predefined texts was to help subjects express their natural responses with minimal text-related variability
in acoustic features. The laughter (LS and NSL) responses and normal speech (NS) were recorded in
each case. Normal speech for speakers was also recorded for the utterance of a fourth (neutral) text
(iv) “This is my normal voice.” Both acoustic and EGG signals were recorded in parallel, in each case.
Data was recorded for 11 speakers (7 males and 4 females), all research students at IIIT, Hyderabad.
The data has a total of 191 (NSL/LS) laugh calls in 32 utterances, by 11 speakers. The nonspeech-laugh calls
occur mostly prior to (or sometimes after) the laughed-speech calls for a text. The data also consists
of 130 voiced segments of normal speech in 35 utterances, by 10 speakers. Thus the database has
191 natural laugh (NSL/LS) calls and 130 (NS) voiced segments in total 67 utterances, by 11 speakers.
The EGG signal was recorded using an EGG recording device [EGGs for Singers [121]]. The device
records the current flow between two disc-shaped electrodes placed across the glottis. Thus changes in the
glottal impedance during vocal fold vibration are captured in the EGG signal, but not the changes in
air pressure in the glottal region. The corresponding acoustic signal was recorded using a close-speaking
stand-mounted microphone (Sennheiser ME-03), kept at a distance of around 15 cm from the speaker. The
data was collected in normal room conditions, with ambient noise at about 50 dB, using a sampling rate
of 48 kHz with 16 bits/sample. The data was downsampled to 10 kHz for the analysis. The ground
truth for the data of nonspeech-laugh and laughed-speech was established by listening to each audio file.
Figure 6.1 Illustration of (a) speech signal, and (b) α, (c) F0 and (d) SoE contours for three calls in a
nonspeech-laugh bout after utterance of text “it is a good joke”, by a female speaker.

Figure 6.2 Illustration of (a) speech signal, and (b) α, (c) F0 and (d) SoE contours for a laughed-speech
segment of the utterance of text “it is a good joke”, by a female speaker.
6.3 Analysis of laughter from EGG signal
In the production of laughter, the vocal folds vibrate in a manner similar to normal speech. Hence,
the glottal vibration characteristics of laughter are examined from the EGG signal [50]. Changes are
examined in the closed phase quotient (α) in each glottal cycle for both the cases of laughter, i.e.,
nonspeech-laugh and laughed-speech, with reference to normal speech. The open/closed phase durations are computed using the differenced EGG (dEGG) signal [58], as illustrated in Fig. 4.6. Peaks and
valleys in the dEGG signal (dex[n]) correspond nearly to the positive going and negative going zero-crossings in the EGG signal (ex[n]), respectively. Peaks in the dEGG indicate glottal closure instants
(GCIs). The region in a glottal cycle from peak to valley in the dEGG is considered the closed phase, of duration Tc, and the region from valley to peak the open phase, of duration To. The proportion α is
computed as α = Tc/(To + Tc).

Figure 6.3 Illustration of (a) signal waveform (xin[n]), (b) EGG signal ex[n], (c) dEGG signal dex[n]
and (d) αdex contours, along with V/NV regions (dashed lines). The segment consists of calls in
a nonspeech-laugh bout (marked 1-4 in (d)) and a laughed-speech bout (marked 5-8 in (d)) for the text
“it is really funny”, produced by a male speaker.

An illustration of α contours (obtained from the dEGG signal dex[n]) for the calls in nonspeech-laugh
and laughed-speech (voiced) segments produced by a female speaker is shown in Fig. 6.1(b) and
Fig. 6.2(b), respectively.
A comparative study is carried out for the calls in nonspeech-laugh and laughed-speech bouts. An
illustration of the acoustic signal (xin[n]), EGG signal (ex[n]), dEGG (dex[n]) and the α contour is given
in Fig. 6.3(a), Fig. 6.3(b), Fig. 6.3(c) and Fig. 6.3(d), respectively, for calls in a NSL bout and a LS bout.
The NSL calls are marked in Fig. 6.3(d) as regions 1, 2, 3, 4 and the LS calls as 5, 6, 7, 8. It may be
observed in Fig. 6.3(d) that the spread (fluctuation) of α is distinctly larger for nonspeech-laugh calls,
in comparison with laughed-speech calls, which have a relatively smoother α contour.
6.3.1 Analysis using the closed phase quotient (α)
In Table 6.1, the average α values (αave ) computed for each speaker are given in columns (a), (b)
and (c), for normal speech, laughed-speech and nonspeech-laugh, respectively. The corresponding standard deviations in α (σα ) are given in columns (d), (e) and (f). In general, the αave is observed to be
lower and σα higher for laughter (NSL/LS) calls, in comparison with normal speech. Hence, changes
in α are better represented by a parameter βα (i.e., β), computed as
βα = (σα / αave) × 100                                                        (6.1)
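As a minimal sketch of how α (per glottal cycle, from the dEGG signal) and the parameter βα of Eq. (6.1) could be computed, assuming NumPy/SciPy, with scipy.signal.find_peaks used to locate the dEGG peaks (GCIs) and valleys; the prominence value is an assumed placeholder, not a setting from this study.

import numpy as np
from scipy.signal import find_peaks

def alpha_per_cycle(degg, fs, prominence=0.05):
    # Closed phase quotient alpha = Tc / (To + Tc) per glottal cycle,
    # using peaks (GCIs) and valleys of the dEGG signal.
    peaks, _ = find_peaks(degg, prominence=prominence)      # glottal closure instants
    valleys, _ = find_peaks(-degg, prominence=prominence)   # glottal opening instants
    alphas = []
    for p0, p1 in zip(peaks[:-1], peaks[1:]):
        v = valleys[(valleys > p0) & (valleys < p1)]
        if len(v) == 0:
            continue
        tc = (v[0] - p0) / fs          # peak -> valley: closed phase duration
        to = (p1 - v[0]) / fs          # valley -> next peak: open phase duration
        alphas.append(tc / (tc + to))
    return np.array(alphas)

def beta_alpha(alphas):
    # beta_alpha = (sigma_alpha / alpha_ave) x 100, as in Eq. (6.1)
    return 100.0 * np.std(alphas) / np.mean(alphas)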
Table 6.1 Changes in α and F0EGG for laughed-speech and nonspeech-laugh, with reference to normal
speech. Columns (a)-(c) are average α, (d)-(f) are σα, (g)-(i) are average βα and (l)-(n) are average
F0EGG (Hz) for NS, LS and NSL. Columns (j), (k) are ∆βα (%) and (o), (p) are ∆F0EGG (%) for the
LS and NSL cases. Note: Si indicates speaker number (i = 1 to 11) and M/F male/female.

Speaker   (a)   (b)   (c)    (d)    (e)    (f)    (g)  (h)  (i)   (j)   (k)   (l)   (m)   (n)    (o)    (p)
(M/F)     αNS   αLS   αNSL   σαNS   σαLS   σαNSL  βNS  βLS  βNSL  ∆βLS  ∆βNSL F0NS  F0LS  F0NSL  ∆F0LS  ∆F0NSL
S1 (M)    .50   .48   .39    .113   .113   .116   22   23   29      5    31   168   168   256      0     52
S2 (M)    .47   .41   .36    .076   .136   .162   16   33   45    105   178   148   179   183     20     24
S3 (F)    .44   .49   .38    .067   .109   .206   15   22   54     46   257   249   280   281     12     13
S4 (M)    .51   .49   .47    .037   .044   .115    7    9   24     25   237   147   149   183      2     25
S5 (M)    .44   .37   .34    .051   .087   .129   12   23   38    101   227   152   218   235     43     54
S6 (F)    .48   .48   .46    .123   .120   .153   26   25   33     -3    28   235   282   378     20     61
S7 (M)    .52   .47   .42    .047   .066   .106    9   14   25     57   179   138   126   187     -9     36
S8 (M)    .15   .53   .46    .034   .131   .138   22   25   30     13    36   133   194   221     46     66
S9 (F)    .38   .49   .42    .084   .174   .197   22   35   47     60   111   212   256   291     21     37
S10 (F)   .49   .44   .42    .186   .077   .138   38   17   33    -55   -13   208   355   384     71     85
S11 (M)   .41   .43   .35    .090   .092   .185   22   22   53     -1   142   143   176   182     23     27
Average   -     -     -      -      -      -      19   23   38     -     -     -     -     -      23     44
The βα values for the three cases of NS, LS and NSL are given in Table 6.1 in columns (g), (h)
and (i), respectively. Changes in βα for LS and NSL from NS, i.e., ∆βLS (%) = ((βLS − βNS)/βNS) × 100
and ∆βNSL (%) = ((βNSL − βNS)/βNS) × 100, are given in columns (j) and (k), respectively. The βα and ∆βα
values are rounded to integers. The average βα values for NS, LS and NSL cases are 19, 23 and 38,
respectively. Across speakers, the increase in βα is higher for nonspeech-laugh than laughed-speech,
with reference to normal speech. It means that, across speakers, the closed phase quotient (α) reduces more
for nonspeech-laugh than for laughed-speech. It is possible that this reduction in the closed phase quotient
is related to the abrupt closure of the vocal folds in each glottal cycle, during the production of laughter.
This possibility is examined later, in Section 6.5.5.
6.3.2 Analysis using F0 derived from the EGG signal
The observation that the instantaneous fundamental frequency (F0 ) rises in a laughter call [16, 11,
85, 118], is confirmed in this study by the analysis from EGG signal. The F0 values (i.e., F0EGG )
are computed from the inverse of the glottal cycle periods (T0EGG ) obtained using the dEGG signal,
as illustrated in Fig. 4.6. In Table 6.1, average F0 (in Hz) for the three cases NS, LS and NSL, are
given in columns (l), (m) and (n), respectively. Changes in average F0 for LS/NSL from NS, i.e.,
F
−F0N S
F
−F
× 100 are given in columns (o)
∆F0LS (%) = 0LSF0 0N S × 100 and ∆F0N SL (%) = 0N SL
F
0
NS
NS
and (p), respectively. The values are rounded to integers. Across speakers, the changes in average F0
are larger for nonspeech-laugh than for laughed-speech. The average rise in F0 for nonspeech-laugh
and laughed-speech, with reference to normal speech, is 44% and 23%, respectively.
Figure 6.4 (Color online) Illustration of inter-calls changes in the average values of ratio α and F0, for
4 calls each in a nonspeech-laugh bout (solid line) and a laughed-speech bout (dashed line), produced
by a male speaker and by a female speaker: (a) αave for NSL/LS male calls, (b) αave for NSL/LS female
calls, (c) F0ave for NSL/LS male calls, and (d) F0ave for NSL/LS female calls.
6.3.3 Inter-call changes in α and F0
It is interesting to observe that the average α values (αave ) for calls within a NSL/LS laugh bout also
show some inter-calls increasing/decreasing trend. An illustration of inter-calls changes in αave for the
calls in NSL and LS bouts is given in the voice of a male speaker and a female speaker, in Fig. 6.4(a) and
Fig. 6.4(b), respectively. For nonspeech-laugh, there is a decrease in αave for successive calls in a bout. The
inter-calls trend in the average F0 values (F0ave ) also is observed for the calls in NSL and LS bouts of a
male speaker and a female speaker, as illustrated in Fig. 6.4(c) and Fig. 6.4(d), respectively. The trends
of inter-calls changes in αave and F0ave are indicative of the (inter-calls) growth/decay characteristics
of natural laughter bouts. Higher F0 and lower α for NSL calls, compared to LS calls, can also be
observed for both speakers.
The analysis from EGG/dEGG signal highlights some interesting changes in the glottal vibration
characteristics during the production of laughter. But two practical limitations in using the EGG signal are:
(i) the EGG captures changes in the glottal impedance (related to air flow), not the air pressure in the glottal
region, and (ii) collecting the EGG signal may not be practically feasible at all times. Hence, the excitation
source characteristics are examined from the acoustic signal in the following sections.
6.4 Modified zero-frequency filtering method for the analysis of laughter
Production characteristics of laughter are analysed in terms of acoustic features derived from both
the EGG and acoustic signals. Since there are some practical limitations in collecting the EGG signal,
the production characteristics are derived from the acoustic signal as well. The excitation source
characteristics of modal voicing can be derived from the speech signal using the zero-frequency filtering
(ZFF) method [130, 216]. But laughter involves rapid changes in the glottal vibration characteristics,
such as F0 [159]. Hence, in order to capture these changes better, a modified ZFF method is proposed.

Figure 6.5 Illustration of (a) acoustic signal waveform (xin[n]), (b) the output (y2[n]) of cascaded pairs
of ZFRs, (c) modified ZFF (modZFF) output (zx[n]), and (d) voiced/nonvoiced (V/NV) regions (voiced
marked as ‘V’), for calls in a nonspeech-laugh bout of a female speaker.
Steps involved in the proposed modified zero-frequency filtering (modZFF) method for deriving the
source characteristics from the acoustic signal of paralinguistic sounds like laughter, are as follows:
(a) Preprocess the input acoustic signal xin[n] to obtain a differenced signal s[n], in order to minimize
the effect of any slowly varying component in the recording and produce a zero-mean signal.
(b) Pass the differenced speech signal s[n] through a cascade of two ideal digital filters, called zero-frequency resonators (ZFRs) [130]. Each ZFR has a pair of poles located on the unit circle in the
z-plane. The output of both ZFRs is given by
y1[n] = − Σ_{k=1}^{2} ak y1[n−k] + s[n],
y2[n] = − Σ_{k=1}^{2} ak y2[n−k] + y1[n],                                     (6.2)
where a1 = −2, a2 = 1. Successive integration-like operations in the cascaded ZFRs result in a
polynomial growth/decay kind of trend build-up in the output y2[n], as illustrated in Fig. 6.5(b), for
the nonspeech-laugh bout signal in Fig. 6.5(a).
115
(c) The trend removal for modal voicing [130] is normally carried out by subtracting local mean computed over a window of size 2N + 1 samples, whose duration is between 1-2 times the average pitch
period (T0). But for nonmodal voices such as laughter, which have rapid changes in the glottal vibration, the trend removal operation is carried out in stages, using gradually reducing window sizes.
Window sizes of 20 ms, 10 ms and 5 ms, and then 3 ms, 2 ms and 1 ms, are used for computing the
local mean (for the gross trend first, then finer changes). This is the key step that differs from the ZFF
method discussed in Section 3.5(c). The resultant output after each stage is given by
ỹ2[n] = y2[n] − (1/(2N+1)) Σ_{l=−N}^{N} y2[n+l]                               (6.3)
where each gradually reducing window consists of 2N + 1 samples. The final trend-removed output is called the modified zero-frequency filtered (modZFF) output (zx[n]). An illustration of the
modZFF output signal is shown in Fig. 6.5(c), for the acoustic signal in Fig. 6.5(a). The detection of voiced/nonvoiced (V/NV) regions [39] is based upon the framewise energy of the modZFF
signal (zx[n]). An illustration of V/NV regions in a laughter acoustic signal is shown in Fig. 6.5(d).
(d) This modZFF signal (zx[n]) carries information about the glottal vibrations within each glottal cycle, due to
the opening/closing of the vocal folds, and also information related to high pitch frequencies. In order to highlight
the glottal cycle characteristics better, in particular the locations of glottal closure instants (GCIs),
i.e., epochs, the Hilbert envelope is obtained from the modZFF output signal zx [n]. The Hilbert
envelope (hz [n]) of the signal z[n] (i.e., zx [n]) is given by [170, 137]
hz[n] = √( z²[n] + zH²[n] ),                                                  (6.4)
where zH [n] denotes the Hilbert transform of z[n]. The Hilbert transform zH [n] of the signal z[n]
is given by [170, 137]
zH[n] = IFT(ZH(ω)),                                                           (6.5)
where IFT denotes the inverse Fourier transform, and ZH(ω) is given by [137]

ZH(ω) = +jZ(ω) for ω ≤ 0,  and  −jZ(ω) for ω > 0.                             (6.6)
Here Z(ω) denotes the Fourier transform of the signal z[n]. An illustration of the Hilbert envelope (hz[n]) of a few cycles of the modZFF output is shown in Fig. 6.6(c), along with the modZFF output
signal (zx[n]) in Fig. 6.6(b) and the acoustic signal (xin[n]) in Fig. 6.6(a). The corresponding EGG
signal (ex[n]) is shown in Fig. 6.6(d), for comparison with the glottal pulse characteristics.
(e) In the case of modal voicing, the locations of (negative to positive going) zero-crossings of ZFF
signal correspond to the GCIs (epochs) [130]. The instantaneous fundamental frequency (F0 ) is
computed from the inverse of the period (T0 ), i.e., interval between successive epochs [216]. But,
in the case of a nonmodal voice like laughter, the epochs (i.e., GCIs) are obtained using the Hilbert
envelope of the modZFF signal (hz[n]).

Figure 6.6 Illustration of (a) signal (xin[n]), (b) modZFF output (zx[n]), (c) Hilbert envelope of
modZFF output (hz[n]), and (d) EGG signal (ex[n]), for a nonspeech-laugh call, by a female speaker.

Epochs for a nonspeech-laugh call are shown marked
with arrows in Fig. 6.6(c). The interval between these successive epochs gives the period (T0), and
the inverse of T0 gives the F0 (i.e., F0ZFF) for the nonmodal voices. The relative strength of impulse-like excitation (SoE) is derived from the slope of the Hilbert envelope of the ZFF signal (hz[n]) at
these epoch locations. An illustration of changes in the F0 and SoE contours for the calls in nonspeech-laugh and laughed-speech bouts of a female speaker is shown in Fig. 6.1(c) and (d), and
Fig. 6.2(c) and (d), respectively. For a better demonstration of the relative changes in nonspeech-laugh vs
normal speech, an illustration of the F0 and SoE contours for laughter in the acoustic signal is given in Fig. 6.7(b)
and (c), respectively, for the speech signal shown in Fig. 6.7(a).
(f) The intervals between all successive (negative to positive going) zero-crossings of the modZFF output signal (zx[n]) for laughter are used to compute the density of excitation impulses (dI). Since
all negative to positive going zero-crossings (not necessarily epochs alone) indicate significant excitation at those instants, they are considered useful in examining the characteristics of laughter.
Changes in the excitation impulse density (dI) are observed to be significantly higher for laughter
than for normal speech, details of which are discussed in the next section.
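A minimal sketch of steps (a)-(d) above, assuming NumPy/SciPy: the cascaded zero-frequency resonators implement the recursion of Eq. (6.2), the staged trend removal follows Eq. (6.3) with the window sizes listed in step (c), and the envelope uses Eq. (6.4). The final normalisation and the integer half-window rounding are conveniences added here, not part of the description above.

import numpy as np
from scipy.signal import hilbert, lfilter

def modzff(x, fs, windows_ms=(20, 10, 5, 3, 2, 1)):
    # Steps (a)-(c): differenced signal -> two cascaded zero-frequency
    # resonators (Eq. 6.2) -> staged local-mean (trend) removal (Eq. 6.3).
    s = np.diff(x, prepend=x[0])                     # (a) differenced, near zero-mean signal
    y1 = lfilter([1.0], [1.0, -2.0, 1.0], s)         # (b) ZFR: y[n] = 2y[n-1] - y[n-2] + input[n]
    y2 = lfilter([1.0], [1.0, -2.0, 1.0], y1)        #     second ZFR in the cascade
    z = y2
    for w_ms in windows_ms:                          # (c) gross trend first, then finer changes
        half = max(1, int(fs * w_ms / 1000.0) // 2)  # half-window N -> window of 2N+1 samples
        kernel = np.ones(2 * half + 1) / (2 * half + 1)
        z = z - np.convolve(z, kernel, mode="same")  # subtract the local mean
    return z / (np.max(np.abs(z)) + 1e-12)           # normalised modZFF output z_x[n]

def hilbert_envelope(z):
    # Step (d): Hilbert envelope h_z[n] = sqrt(z^2[n] + z_H^2[n]) (Eq. 6.4);
    # the magnitude of the analytic signal from scipy.signal.hilbert is exactly this.
    return np.abs(hilbert(z))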
6.5 Analysis of source and system characteristics from acoustic signal
Figure 6.7 Illustration of (a) signal (xin[n]), and contours of (b) F0, (c) SoE (ψ), and (d) FD1 (“•”) and
FD2 (“◦”) with V/NV regions (dashed lines), for calls in a nonspeech-laugh bout of a male speaker.

The excitation source characteristics of laughter are derived from the acoustic signal using the modified ZFF (modZFF) method discussed in Section 6.4. Three features, (i) instantaneous fundamental
frequency (F0ZFF), (ii) density of excitation impulses (dI) and (iii) strength of impulse-like excitation
(SoE) are extracted from the modZFF signal (zx [n]). Changes in the source characteristics are analysed in two ways: (i) by measuring the degree of changes, and (ii) by capturing the temporal changes
in these features. Parameters capturing the degree of changes in features are computed using the average values and standard deviation. Temporal parameters are derived by capturing the temporal changes
in features. Changes in the features and the parameters derived are compared for laughed-speech and
nonspeech-laugh, with reference to normal speech.
The negative to positive going zero-crossings of the modZFF signal (zx [n]) indicate impulse-like
excitation at those time-instants. Some of these zero-crossings correspond to epochs (i.e., GCIs). But,
all zero-crossings are important in the production of laughter. Hence, a feature ‘density of the excitation
impulses’ (dI ), computed at the instants of all zero-crossings of the modZFF signal (zx [n]), is used for
examining changes in the excitation around these instants. Changes in dI are observed to be higher
for laughter, in comparison to normal speech. These additional time-instants are captured by using the
modified ZFF method, which otherwise are neither captured in the EGG/dEGG signal nor highlighted by
the ZFF method [130, 216]. An illustration of a few glottal cycles of the acoustic signal (xin[n]), EGG signal
(ex [n]) and modZFF output (zx [n]) is given in Fig. 6.8(a), Fig. 6.8(b) and Fig. 6.8(c), respectively. The
presence of impulse-like excitation in-between some successive epochs can be observed in the acoustic
signal and in modZFF signal in Fig. 6.8(a) and Fig. 6.8(c), respectively. But, the same is neither visible
in the EGG signal in Fig. 6.8(b), nor in the dEGG. This presence of more than one pulse in a pitch period
is possibly related to secondary excitation, also observed in the context of LPC vocoders [67, 8].
“These results suggest that the excitation for voiced speech should consist of several pulses in a pitch
period, rather than just one pulse at the beginning of the period” [8]. The role of these secondary
excitation pulses is examined later in this section.
Figure 6.8 Illustration of a few glottal cycles of (a) acoustic signal (xin[n]), (b) EGG signal ex[n] and
(c) modified ZFF output signal zx[n], for a nonspeech-laugh call produced by a male speaker.
6.5.1 Analysis using F0 derived by modZFF method
The instantaneous fundamental frequency (F0 , i.e., F0ZF F ) is also derived from the acoustic signal
using the modified ZFF method proposed in Section 6.4. The F0 value is computed from the inverse of
the interval (T0 ) between successive epochs that are derived using the Hilbert envelope (hz [n]) of the
modZFF output signal (zx [n]). A trend-removed and smoothed Hilbert envelope is used. In order to
distinguish from F0EGG , we refer to it as F0ZF F . The intra-call changes in F0ZF F in each laugh call
show an increasing (or decreasing) trend, as can be observed in Fig. 6.7(b). The observations are in line
with the changes in T0 for laughter reported earlier [185]. Changes in F0ZF F are analysed by measuring
the degree of changes and temporal (intra-call) changes in the F0ZF F contour.
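Continuing the sketch given after Section 6.4, F0ZFF and SoE could be obtained from the Hilbert envelope of the modZFF output roughly as follows, assuming SciPy's find_peaks for epoch picking; the upper bound max_f0 and the use of the local envelope slope as the SoE value are implementation assumptions for illustration.

import numpy as np
from scipy.signal import find_peaks

def f0_and_soe(h, fs, max_f0=600.0):
    # Epochs (GCIs) as peaks of the (trend-removed, smoothed) Hilbert envelope h;
    # F0ZFF from the interval between successive epochs, SoE from the local
    # slope of the envelope at each epoch.
    epochs, _ = find_peaks(h, distance=max(1, int(fs / max_f0)))  # assumed F0 upper bound
    t0 = np.diff(epochs) / fs             # pitch periods T0 (sec)
    f0 = 1.0 / t0                         # instantaneous F0 (Hz), one value per epoch interval
    soe = np.abs(np.gradient(h))[epochs]  # relative strength of excitation at the epochs
    return epochs, f0, soe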
[A.] Measuring the degree of changes in F0
In Table 6.2, the average F0ZF F values (F0ave ) for each speaker are given in columns (a), (b) and (c)
for NS, LS and NSL, respectively. Corresponding standard deviations in F0ZF F (σF0 ) are given in
columns (d), (e) and (f). The values are rounded to integers. Two observations can be made. (i) The trend
in average F0ZF F for the NS, LS and NSL cases in columns (a)-(c) in Table 6.2, is mostly in line with
the trend in average F0EGG for the three cases in columns (l)-(n) in Table 6.1. (ii) In columns (a)-(c) and
(d)-(f) in Table 6.2, the F0ave and σF0 (spread in F0 ) are higher for laughter (LS/NSL), in comparison
with normal speech. Though F0EGG is expected to be closer to the true F0 than F0ZFF, since F0ZFF and the
other features are derived from the acoustic signal, which can be collected more easily, and since F0EGG and
F0ZFF are mostly comparable, F0ZFF is simply termed F0 to represent the instantaneous fundamental
frequency further in this study.
Table 6.2 Changes in F0ZFF and the temporal measure for F0 (i.e., θ) for laughed-speech and nonspeech-laugh, with reference to normal speech. Columns (a)-(c) are average F0ZFF (Hz), (d)-(f) are σF0 (Hz),
(g)-(i) are average γ1 and (l)-(n) are average θ values for NS, LS and NSL. Columns (j), (k) are ∆γ1 (%)
and (o), (p) are ∆θ (%) for the LS and NSL cases. Note: Si indicates speaker number (i = 1 to 11) and
M/F male/female.

Speaker   (a)   (b)   (c)    (d)    (e)    (f)     (g)  (h)  (i)   (j)    (k)    (l)   (m)   (n)    (o)   (p)
(M/F)     F0NS  F0LS  F0NSL  σF0NS  σF0LS  σF0NSL  γ1NS γ1LS γ1NSL ∆γ1LS  ∆γ1NSL θNS   θLS   θNSL   ∆θLS  ∆θNSL
S1 (M)    184   193   269    57     85     98      10   16   27     56    153    .25   .53   1.94   114   689
S2 (M)    187   217   223    58     66     73      11   14   16     33     52    .40   .51   1.40    28   252
S3 (F)    273   279   308    76     89     144     21   25   44     19    113    .32  1.95   7.05   516  2126
S4 (M)    146   150   236    19     31     100      3    5   24     71    762    .31   .60   1.87    91   498
S5 (M)    158   219   259    26     58     97       4   13   25    208    513    .29   .63   1.89   116   545
S6 (F)    261   276   378    69     83     72      18   23   27     27     50    .46   .30    .67   -35    46
S7 (M)    157   152   294    40     46     129      6    7   38     10    502    .34   .32    .51    -5    49
S8 (M)    142   211   260    30     85     97       4   18   25    317    487    .34   .33   1.29    -4   275
S9 (F)    240   259   298    63     60     117     15   16   35      4    133    .71   .64   1.02   -11    42
S10 (F)   262   339   344    43     79     160     11   27   55    140    392    .74   .95    .99    29    34
S11 (M)   169   186   209    51     48     88       9    9   18      4    113    .60   .56    .40    -6   -33
Average   -     -     -      -      -      -       10   16   30      -      -    .43   .67   1.73     -     -
Since both F0ave and σF0 increase for laughter, a parameter γ1 that reflects changes in the F0ZF F is
γ1 = (F0ave × σF0) / 1000                                                     (6.7)
Here F0ave and σF0 are computed over each laughter (LS or NSL) call or NS voiced segment. In
Table 6.2, average γ1 for each speaker are given in columns (g), (h) and (i) for the three cases NS, LS
and NSL, respectively. Changes in average γ1 for LS and NSL from NS, i.e., ∆γ1LS (%) = ((γ1LS − γ1NS)/γ1NS) ×
100 and ∆γ1NSL (%) = ((γ1NSL − γ1NS)/γ1NS) × 100, are given in columns (j) and (k), respectively. The values
are rounded to integers. The average γ1 values for NS, LS and NSL are 10, 16 and 30, respectively.
Across speakers, the ∆γ1 are higher for NSL than LS (columns (j), (k)). The changes in γ1 and ∆γ1 for
the three cases indicate larger degree of changes in F0ZF F for nonspeech-laugh than laughed-speech.
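A minimal sketch of the degree-of-change parameter γ1 of Eq. (6.7); the same pattern applies to the parameters γ2 and γ3 introduced in the later subsections, with (dIave, σdI) and (ψave, σψ) in place of (F0ave, σF0).

import numpy as np

def gamma1(f0_values):
    # gamma_1 = (F0_ave x sigma_F0) / 1000, per call or voiced segment (Eq. 6.7)
    return np.mean(f0_values) * np.std(f0_values) / 1000.0

def percent_change(value, reference):
    # Change relative to the normal-speech reference, in percent
    return 100.0 * (value - reference) / reference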
[B.] Measuring temporal changes in F0
Temporal gradient-related changes in F0 (i.e., F0ZF F ) contour are captured through a parameter θ,
computed for each laughter (LS and NSL) call and (NS) voiced segment. Temporal parameter θ has two
constituents, a monotonicity factor and a duration factor. (i) The monotonicity factor (mF0 ) captures the
monotonically increasing (or decreasing for some speakers) trend of F0 within a call. It is the sum of
∆F0 of similar signs, computed over each window of size of 5 successive pitch periods. Here, ∆F0 is
the change in F0 over successive epochs in a laugh call or NS voiced segment. The factor mF0 is:
mF0 = Σ_{i=1}^{n} Σ_{j=1}^{5} ∆F0 |+(or −) ,   i = 1, 2, ..., n,  j = 1, 2, ..., 5                 (6.8)
where n is number of such windows in a call and j is index of ∆F0 values of same sign within each
window. (ii) The duration factor (δtF0 ) captures the percentage duration of a call that has similar signs
of ∆F0 in each window. It is given by
δtF0 = (NGC+(or −) / NGCseg) × (1 / tdseg)                                    (6.9)
where NGC+(or −) is number of glottal cycles (epochs) having ∆F0 of same sign (+ or −, whichever
has larger count), and NGCseg is total number of epochs in a laugh call. The tdseg is call duration (ms),
used in denominator for normalization. The temporal parameter θ = |mF0 × δtF0 | for a call is given by

θ = [ Σ_{i=1}^{n} Σ_{j=1}^{5} ∆F0 |+(or −) ] × (NGC+(or −) / NGCseg) × (1 / tdseg)                 (6.10)
where i is index of the window within a call and j is index of ∆F0 of same signs in each window.
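A sketch of the temporal parameter θ following Eqs. (6.8)-(6.10), assuming the epoch-wise F0 values and the call duration in milliseconds are available; determining the dominant sign globally over the call (rather than per window) is an implementation assumption made here for simplicity.

import numpy as np

def theta(f0, call_duration_ms, win=5):
    # Temporal parameter theta = |m_F0 x delta_t_F0| (Eqs. 6.8-6.10).
    # f0: epoch-wise F0 values within one laugh call or NS voiced segment.
    df0 = np.diff(f0)                              # Delta F0 over successive epochs
    dominant_sign = 1.0 if np.sum(df0 > 0) >= np.sum(df0 < 0) else -1.0
    m_f0 = 0.0
    n_same = 0                                     # glottal cycles with Delta F0 of the dominant sign
    for start in range(0, len(df0), win):          # windows of 5 successive pitch periods
        w = df0[start:start + win]
        same = w[np.sign(w) == dominant_sign]
        m_f0 += np.sum(same)                       # monotonicity factor, Eq. (6.8)
        n_same += len(same)
    dt_f0 = (n_same / max(1, len(df0))) / call_duration_ms   # duration factor, Eq. (6.9)
    return abs(m_f0 * dt_f0)                       # Eq. (6.10)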
In Table 6.2, average θ values for each speaker are given in columns (l), (m) and (n) for NS, LS
and NSL, respectively. Changes in average θ for LS and NSL bouts from NS, i.e., ∆θLS (%) =
((θLS − θNS)/θNS) × 100 and ∆θNSL (%) = ((θNSL − θNS)/θNS) × 100, are given in columns (o) and (p), respectively.
The average θ for NS, LS and NSL is 0.43, 0.67 and 1.73, respectively. A higher value of θ (e.g., above
1.0) indicates a larger laughter content. Larger changes in the temporal parameter θ are observed for
nonspeech-laugh than for laughed-speech. For laughter, the temporal changes in F0ZFF observed using
the parameter θ (in columns (l)-(p)), are in line with degree of changes in F0ZF F observed using the
parameter γ1 (in columns (g)-(k)).
6.5.2 Analysis using the density of excitation impulses (dI)
It may be observed in Fig. 6.8(c) that the modified ZFF signal (zx[n]) has some extra zero-crossings
within almost every glottal cycle, in addition to those corresponding to the epochs. Whether these additional zero-crossings are related to the characteristics of the excitation source or of the vocal tract system can
be ascertained by observing the spectrograms in Fig. 6.9, for a few nonspeech-laugh calls. The spectrogram of an epoch sequence that has impulses located at epochs with amplitude as SoE, in Fig. 6.9(c),
shows only the source characteristics. It can be compared with the spectrogram shown in Fig. 6.9(d),
for a sequence of impulses located at all negative to positive going zero-crossings of zx[n], with amplitude representing the strength of excitation at these instants. Both spectrograms in Fig. 6.9(c) and
Fig. 6.9(d) are quite similar. It may also be noticed that the spectrogram in Fig. 6.9(d) does not show
any formants-like system characteristics that can be observed in the spectrogram in Fig. 6.9(b) for the
acoustic signal. Similar observations are made from spectrograms of laughter calls of other speakers.
Hence, it may be inferred that the additional zero-crossings in zx [n] are related to the characteristics of
the excitation source only, and not of the vocal tract system.
In order to highlight the glottal cycle characteristics (at epochs) shown in Fig. 6.6(c), the additional
zero-crossings can be suppressed by using the Hilbert envelope (hz [n]) of the modZFF output, as discussed in Section 6.4. But, these additional zero-crossings can also help in discriminating the three
cases (NS, LS and NSL).

Figure 6.9 Illustration of (a) acoustic signal (xin[n]), and spectrograms of (b) the signal, (c) the epoch sequence
(using the modified ZFF method) and (d) the sequence of impulses at all (negative to positive going)
zero-crossings of the zx[n] signal, for a few nonspeech-laugh calls produced by a male speaker.

A feature ‘density of excitation impulses’ (dI) is obtained from all successive negative to positive going
zero-crossings of the modZFF signal (zx[n]). Feature dI represents the instantaneous density of the
impulse-like excitation at such zero-crossings, in the unit ‘number of impulses per
sec’. Like for F0 , the changes in dI are also analysed in two ways, by measuring the degree of changes
and temporal (intra-call) changes in the dI contour.
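A minimal sketch of the feature dI as described above, assuming the modZFF output z and sampling rate fs from the earlier sketch are available.

import numpy as np

def impulse_density(z, fs):
    # d_I: instantaneous density of excitation impulses (impulses/sec), taken as
    # the reciprocal of the interval between successive negative-to-positive
    # zero-crossings of the modZFF signal z_x[n].
    zc = np.where((z[:-1] < 0) & (z[1:] >= 0))[0]   # negative-to-positive crossings
    return zc[1:], fs / np.diff(zc)                 # crossing instants (samples) and d_I values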
[A.] Measuring the degree of changes in feature dI
In Table 6.3, the average dI values (dIave ) for each speaker are given in columns (a), (b) and (c) for
the three cases NS, LS and NSL, respectively. Corresponding standard deviations in dI (σdI ) are given
in columns (d), (e) and (f). The values are rounded to integers. It may be observed from columns (a)-(c)
and (d)-(f) that average values of dI and its spread (σdI ) are in general higher for NSL than for LS.
Since, both dIave and σdI increase for laughter, a parameter γ2 that reflects changes in dI is
γ2 = (dIave × σdI) / 1000                                                     (6.11)
where dIave and σdI are computed over each LS/NSL call or a NS voiced region. For each speaker, the
average γ2 values for the three cases NS, LS and NSL are given in columns (g), (h) and (i), respectively.
Changes in γ2 for LS and NSL from NS, i.e., ∆γ2LS (%) = ((γ2LS − γ2NS)/γ2NS) × 100 and ∆γ2NSL (%) =
((γ2NSL − γ2NS)/γ2NS) × 100, are given in columns (j) and (k), respectively. The values are rounded to integers.
The average γ2 values for NS, LS and NSL are 54, 48 and 69, respectively. For most of the speakers,
|∆γ2|NSL > |∆γ2|LS, i.e., the degree of changes in γ2 is larger for nonspeech-laugh than for laughed-speech, with reference to normal speech.
Table 6.3 Changes in dI and the temporal measure for dI (i.e., φ) for laughed-speech and nonspeech-laugh,
with reference to normal speech. Columns (a)-(c) are average dI (Imps/sec), (d)-(f) are σdI (Imps/sec),
(g)-(i) are average γ2 and (l)-(n) are average φ values for NS, LS and NSL. Columns (j), (k) are ∆γ2 (%)
and (o), (p) are ∆φ (%) for the LS and NSL cases. Note: Si indicates speaker number (i = 1 to 11) and
M/F male/female.

Speaker   (a)   (b)   (c)    (d)    (e)    (f)     (g)  (h)  (i)   (j)    (k)    (l)  (m)  (n)   (o)   (p)
(M/F)     dINS  dILS  dINSL  σdINS  σdILS  σdINSL  γ2NS γ2LS γ2NSL ∆γ2LS  ∆γ2NSL φNS  φLS  φNSL  ∆φLS  ∆φNSL
S1 (M)    436   426   581    106    99     137     46   42   79     -9     71     89   75   147   -16    66
S2 (M)    461   486   520    132    134    165     61   65   86      7     40    109  114   193     4    77
S3 (F)    424   495   540    142    90     70      60   45   38    -26    -37    122   57    84   -54   -31
S4 (M)    378   382   532    78     83     156     29   32   83      7    181     46   47   178     3   288
S5 (M)    423   445   486    100    74     125     42   33   61    -22     44     78   60   152   -22    96
S6 (F)    486   482   557    162    137    162     79   66   90    -16     15    126  102   146   -19    16
S7 (M)    428   455   557    129    143    199     55   65   111    17    100     95  103   220     9   133
S8 (M)    425   453   489    118    89     98      49   41   48    -19     -4     89   60   119   -32    33
S9 (F)    458   469   498    136    136    180     62   64   90      2     44    100  114   146    14    46
S10 (F)   427   408   467    125    84     59      53   34   28    -36    -48     91   44    31   -52   -66
S11 (M)   428   402   376    136    93     134     58   38   50    -36    -14    109   77   101   -29    -8
Average   -     -     -      -      -      -       54   48   69      -      -     96   78   138     -     -
[B.] Measuring temporal changes in feature dI
Changes in the dI contours are observed to be more rapid than those in F0ZF F contours. Hence,
temporal changes in the dI contour are captured by a parameter φ, that is computed using ∆dI between
all successive negative to positive going zero-crossings of the modZFF signal (zx [n]). The temporal
measure for dI , i.e., parameter φ is given by
φ = (1/N) Σ_{i=1}^{N} (∆/∆t)(∆dI/∆t),   i = 1, 2, ..., N                      (6.12)
where N is number of (negative to positive going) zero-crossings of zx [n]. Hence, parameter φ captures
the rate of temporal change in dI for a call, computed per second. An illustration of the temporal
measure φ is given in Fig. 6.10. Larger changes in ∆dI can be observed for NSL calls than for LS calls. Also,
the parameter φ is helpful in discriminating between regions of nonspeech-laugh and laughed-speech.
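A sketch of the temporal measure φ under the reconstruction of Eq. (6.12), assuming the negative-to-positive zero-crossing instants (in samples) of the modZFF signal are available; since the exact normalisation used to obtain the tabulated values is not spelled out here, this should be treated as schematic only.

import numpy as np

def phi(zero_crossings, fs):
    # Temporal measure phi for d_I (Eq. 6.12, as reconstructed): average rate of
    # change of (Delta d_I / Delta t) between successive negative-to-positive
    # zero-crossings of the modZFF signal. Requires at least four crossings.
    t = np.asarray(zero_crossings) / fs       # crossing instants (sec)
    dt = np.diff(t)                           # intervals between crossings
    d_i = 1.0 / dt                            # local impulse density d_I (impulses/sec)
    rate = np.diff(d_i) / dt[1:]              # Delta d_I / Delta t
    return np.mean(np.abs(np.diff(rate) / dt[2:]))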
In Table 6.3, average φ values for each speaker are given in columns (l), (m) and (n), for NS, LS
and NSL, respectively. Changes in φ for LS and NSL in comparison with NS, i.e., ∆φLS (%) =
((φLS − φNS)/φNS) × 100 and ∆φNSL (%) = ((φNSL − φNS)/φNS) × 100, are given in columns (o) and (p), respectively.
Average φ values for NS, LS and NSL, are 96, 78 and 138, respectively. In general, higher changes in
average φ for nonspeech-laugh can be observed in column (p). Interestingly, the changes in the temporal measure φ (i.e., ∆φ) in columns (o) and (p), are mostly in line with changes in the parameter γ2 (i.e.,
∆γ2 ) that captures degree of changes in dI , in columns (j), (k). Both parameters γ2 and φ, derived from
the feature dI (impulse density), are helpful in discriminating between laughter and normal speech.
Figure 6.10 Illustration of changes in the temporal measure for dI, i.e., φ, for NSL and LS calls.
(a) Acoustic signal (xin[n]). (b) φ for NSL and LS calls, i.e., regions 1-4 and 5-8, respectively. The
signal segment is for the text “it is really funny” produced by a male speaker.
6.5.3 Analysis using the strength of excitation (SoE)
In the production of laughter, changes in the characteristics of the glottal excitation source are reflected in two ways: (i) in the locations of epochs (GCIs) or of all positive going zero-crossings of the
modZFF signal (zx[n]), which are manifested as changes in the features F0ZFF and dI, respectively, and
(ii) in the amplitude, represented as the strength of impulse-like excitation (SoE, i.e., ψ) at the epochs.
Similar to F0ZFF and dI, changes in the SoE for laughter (LS/NSL) calls, with reference to normal
speech (NS), are also analysed in two ways, by measuring the degree of changes and the temporal (intra-call) changes in the SoE contour.
[A.] Measuring the degree of changes in SoE (ψ)
In Table 6.4, the average SoE values (ψave ) for each speaker are given in columns (a), (b) and (c)
for NS, LS and NSL, respectively. Corresponding standard deviations in ψ (i.e., σψ ) are given in
columns (d), (e) and (f). Changes in both ψave and σψ are captured by a parameter γ3, given by
γ3 = (σψ / ψave) × 100                                                        (6.13)
where σψ and ψave are computed for each LS/NSL call and NS voiced region. The average γ3 computed for each speaker are given in Table 6.4 in columns (g), (h) and (i), for the three cases NS, LS
and NSL, respectively. Changes in γ3 for LS and NSL from NS, i.e., ∆γ3LS (%) = ((γ3LS − γ3NS)/γ3NS) × 100
and ∆γ3NSL (%) = ((γ3NSL − γ3NS)/γ3NS) × 100, are given in columns (j) and (k), respectively. The values are
rounded to integers. The average γ3 values for NS, LS and NSL are 57, 60 and 65, respectively. For most
speakers, the changes in parameter γ3 , with reference to normal speech, are larger for nonspeech-laugh
than for laughed-speech.
Table 6.4 Changes in SoE (i.e., ψ) and the temporal measure for SoE (i.e., ρ) for laughed-speech and
nonspeech-laugh, with reference to normal speech. Columns (a)-(c) are average ψ, (d)-(f) are σψ, (g)-(i)
are average γ3 and (l)-(n) are average ρ values for NS, LS and NSL. Columns (j), (k) are ∆γ3 (%) and
(o), (p) are ∆ρ (%) for the LS and NSL cases. Note: Si indicates speaker number (i = 1 to 11) and M/F
male/female.

Speaker   (a)    (b)    (c)    (d)    (e)    (f)    (g)  (h)  (i)   (j)    (k)    (l)  (m)  (n)   (o)   (p)
(M/F)     ψNS    ψLS    ψNSL   σψNS   σψLS   σψNSL  γ3NS γ3LS γ3NSL ∆γ3LS  ∆γ3NSL ρNS  ρLS  ρNSL  ∆ρLS  ∆ρNSL
S1 (M)    .262   .238   .222   .126   .137   .161   48   58   71     21     47    16   35   60    116   265
S2 (M)    .289   .318   .260   .181   .199   .128   60   63   52      4    -14    36   35   27     -3   -25
S3 (F)    .282   .278   .178   .172   .140   .084   60   51   63    -15      5    39   46   75     15    89
S4 (M)    .289   .233   .320   .152   .120   .175   49   49   55      0     13    41   25   43    -39     5
S5 (M)    .321   .278   .341   .158   .131   .176   48   45   55     -6     16    37   27   73    -28    94
S6 (F)    .303   .207   .316   .195   .149   .256   65   95   81     48     25    39   30   119   -21   208
S7 (M)    .327   .449   .204   .184   .258   .152   54   61   80     13     47    36   37   26      5   -28
S8 (M)    .363   .294   .310   .223   .167   .184   62   57   58     -7     -5    59   29   10    -51   -84
S9 (F)    .299   .283   .332   .176   .162   .210   59   58   66     -3     11    41   50   124    20   199
S10 (F)   .278   .132   .162   .164   .079   .099   59   62   64      5      7    54   19   17    -65   -68
S11 (M)   .273   .251   .192   .163   .149   .126   60   57   69     -5     14    38   53   27     39   -29
Average   -      -      -      -      -      -      57   60   65      -      -    39   35   54      -     -
[B.] Measuring temporal changes in SoE (ψ)
Temporal changes in the SoE (ψ) contour are captured through a parameter ρ, similar to the parameter θ
for F0ZFF. Like θ, the parameter ρ also comprises two factors, a monotonicity factor (mψ) and
a duration factor (δtψ). (i) The monotonicity factor (mψ) is computed in a similar way as mF0,
using (6.8). The only difference is that here only the positive-signed ∆ψ values are considered in each
window. This is because the ψ contour is expected to have more regions of monotonically increasing SoE
within a laugh call. (ii) The duration factor (δtψ) is computed in the same way as δtF0, using (6.9). The
temporal measure for ψ, i.e., parameter ρ = mψ × δtψ , is computed as:
mψ = Σ_{i=1}^{n} Σ_{j=1}^{5} ∆ψ |+ ,   i = 1, 2, ..., n,  j = 1, 2, ..., 5                 (6.14)

δtψ = (NGC+ / NGCseg) × (1 / tdseg)                                           (6.15)

ρ = [ Σ_{i=1}^{n} Σ_{j=1}^{5} ∆ψ |+ ] × (NGC+ / NGCseg) × (1 / tdseg)                      (6.16)
where i is index of window (each of size of 5 successive pitch periods) in a call, n is number of such
windows in a laugh call, and j is index of ∆ψ values of same (+ve) sign within each window. Here,
NGC+ is number of glottal cycles (epochs) having ∆ψ of same (+)ve sign, and NGCseg is total number
of epochs within a laugh call or NS voiced segment of duration tdseg (in ms).
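Since ρ has the same structure as θ but uses only positive-signed ∆ψ, the θ sketch given earlier can be adapted directly; a minimal version under that assumption, with epoch-wise SoE values as input:

import numpy as np

def rho(soe, call_duration_ms, win=5):
    # Temporal parameter rho = m_psi x delta_t_psi (Eqs. 6.14-6.16), using only
    # the positive Delta SoE values within each window of 5 successive pitch periods.
    dpsi = np.diff(soe)                         # Delta psi over successive epochs
    m_psi, n_pos = 0.0, 0
    for start in range(0, len(dpsi), win):
        w = dpsi[start:start + win]
        pos = w[w > 0]
        m_psi += np.sum(pos)                    # monotonicity factor, Eq. (6.14)
        n_pos += len(pos)
    dt_psi = (n_pos / max(1, len(dpsi))) / call_duration_ms   # duration factor, Eq. (6.15)
    return m_psi * dt_psi                       # Eq. (6.16)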
In Table 6.4, the average values of parameter ρ for each speaker are given in columns (l), (m) and (n)
for NS, LS and NSL, respectively. Changes in average ρ for LS and NSL from NS, i.e., ∆ρLS (%) =
125
Table 6.5 Changes in FD1ave and σFD1 for laughed-speech (LS) and non-speech laugh (NSL) in comparison to those for normal speech (NS). Columns (a),(b),(c) are FD1ave (Hz) and (d),(e),(f) are σFD1 (Hz)
for the three cases NS, LS and NSL. Columns (g),(h),(i) are the average ν1 values computed for NS, LS
and NSL, respectively. Columns (j) and (k) are ∆ν1 (%) for LS and NSL, respectively. Note: Si below
means speaker number i (i = 1 to 11), and M/F indicates male/female.
Speaker (M/F) | (a) FD1ave NS (Hz) | (b) FD1ave LS (Hz) | (c) FD1ave NSL (Hz) | (d) σFD1 NS (Hz) | (e) σFD1 LS (Hz) | (f) σFD1 NSL (Hz) | (g) ν1NS | (h) ν1LS | (i) ν1NSL | (j) ∆ν1LS (%) | (k) ∆ν1NSL (%)
S1 (M)  | 1604 | 1637 | 1296 | 413 | 378 | 355 | 66.3 |  62.0 | 46.1 |  -6.49 | -30.53
S2 (M)  | 1689 | 1700 | 1738 | 309 | 242 | 228 | 52.2 |  41.1 | 39.6 | -21.31 | -24.22
S3 (F)  | 1324 | 1555 | 1659 | 675 | 449 | 442 | 89.4 |  69.8 | 73.4 | -21.97 | -17.94
S4 (M)  | 1503 | 1505 | 1346 | 458 | 486 | 369 | 68.8 |  73.2 | 49.7 |   6.37 | -27.80
S5 (M)  | 1244 | 1251 | 1173 | 501 | 457 | 512 | 62.3 |  57.2 | 60.1 |  -8.24 |  -3.64
S6 (F)  | 1088 | 1175 | 1242 | 671 | 735 | 275 | 73.0 |  86.3 | 34.1 |  18.28 | -53.23
S7 (M)  | 1588 | 1326 | 1422 | 383 | 637 | 318 | 60.9 |  84.5 | 45.1 |  38.79 | -25.79
S8 (M)  | 1429 | 1483 | 1502 | 499 | 374 | 301 | 71.3 |  55.5 | 45.2 | -22.10 | -36.57
S9 (F)  | 1279 | 1379 | 1537 | 538 | 781 | 623 | 68.9 | 107.7 | 95.8 |  56.39 |  39.06
S10 (F) | 1265 | 1314 |  939 | 747 | 508 | 267 | 94.5 |  66.8 | 25.1 | -29.36 | -73.49
S11 (M) | 1759 | 1978 | 1869 | 406 | 336 | 488 | 71.3 |  66.5 | 91.2 |  -6.72 |  27.93
Average |      |      |      |     |     |     | 70.8 |  70.1 | 55.0 |        |
The
values are rounded to integers. The average values for NS, LS and NSL are 39, 35 and 54, respectively.
In general, the degree of change in parameter ρ (i.e., ∆ρ) is larger for nonspeech-laugh than for laughed-speech, with reference to normal speech. Also, the temporal changes in SoE (ψ) measured using the parameter ρ are mostly in line with the degree of changes in SoE captured using the parameter γ3 .
6.5.4 Analysis of vocal tract system characteristics of laughter
In the production of laughter, since rapid changes occur in the excitation source characteristics, it is possible that associated changes also occur in the vocal tract system characteristics.
Hence, system characteristics are also examined. Features such as first two dominant frequencies (FD1
and FD2 ) are derived from the speech signal using LP analysis [112] and group delay method [128, 129],
discussed in Section 3.7.
In Table 6.5, the average FD1 values (FD1ave ) are given in columns (a), (b) and (c), for the three cases
NS, LS and NSL, respectively. The corresponding average values of standard deviation in FD1 (σFD1 )
for all speakers are given in columns (d), (e) and (f). All the values are rounded to the nearest integer. In general, the values of average FD1 (FD1ave ) and its spread (σFD1 ) are observed to be lower for laughter (LS/NSL) in comparison to those for normal speech, as can be observed from columns (a)-(c) and (d)-(f). Hence, a single parameter ν1 representing both features FD1ave and σFD1 is computed as

ν1 = (FD1ave × σFD1) / 10000        (6.17)
Figure 6.11 (Color online) Illustration of distribution of FD2 vs FD1 for nonspeech-laugh (“•”) and
laughed-speech (“◦”) bouts of a male speaker. The points are taken at GCIs in respective calls.
where FD1ave and σFD1 are computed over a voiced region of NS or a LS/NSL call. The values of ν1 for NS, LS and NSL are given in columns (g), (h) and (i), respectively. Average ν1 values across speakers are 70.8, 70.1 and 55.0 for NS, LS and NSL, respectively. Percentage changes in ν1 for LS/NSL bouts in comparison to that for NS, i.e., ∆ν1LS = (ν1LS − ν1NS)/ν1NS × 100 and ∆ν1NSL = (ν1NSL − ν1NS)/ν1NS × 100, are
given in columns (j) and (k), respectively. Similarly, the average values of FD2 (FD2ave ) and standard
deviation in FD2 (σFD2 ) are obtained for the three cases. The single parameter ν2 representing both
features FD2ave and σFD2 , and percentage changes in ν2 , i.e., ∆ν2LS and ∆ν2N SL are computed in a
way similar to that for FD1 .
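As an illustration of how the parameter ν1 (and, analogously, ν2) can be computed from the dominant-frequency values, a minimal Python sketch is given below; the function names are hypothetical, and the FD values are assumed to be available per glottal cycle over a voiced region or call.

import numpy as np

def nu_parameter(fd_values_hz):
    """nu = (average FD x sigma_FD) / 10000, as in Eq. (6.17)."""
    fd = np.asarray(fd_values_hz, dtype=float)
    return fd.mean() * fd.std() / 10000.0

def delta_nu_percent(nu_laugh, nu_ns):
    """Percentage change of nu for a LS/NSL bout relative to normal speech."""
    return (nu_laugh - nu_ns) / nu_ns * 100.0

# Example with the rounded values of speaker S1 in Table 6.5:
# 1604 * 413 / 10000 = 66.2, close to the 66.3 listed in column (g).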
The average values of FD1 and FD2 (FD1ave and FD2ave ) are computed for each speaker, for the
three cases (NS, LS and NSL). The corresponding values of standard deviation in FD1 (σFD1 ) and in
FD2 (σFD2 ) are also computed. An illustration of distribution of FD2 vs FD1 for LS and NSL bouts
produced by a male speaker is given in Fig. 6.11. The points (FD1ave , FD2ave ) are marked as centroids
for LS and NSL bouts. It may be observed that distinct clusters are formed by the relative distribution of
FD1 and FD2 for nonspeech laugh and laughed speech, discriminating the two. Also, for some speakers,
the average FD1 and FD2 values (FD1ave and FD2ave ) and their respective spread (σFD1 and σFD2 ) are
observed to be lower for laughter (NSL and LS) than for normal speech. However, the observation is
not consistent across all speakers.
The average values of parameter ν1 representing changes in both FD1ave and σFD1 for NS, LS and
NSL, are 70.8, 70.1 and 55.0, respectively. The average values of the parameter ν2 representing changes
in FD2ave and σFD2 , for NS, LS and NSL, are 195.4, 169.4 and 173.8, respectively. For some speakers, the reduction in parameters ν1 and ν2 is observed to be larger for nonspeech laugh than for laughed speech, but the changes in ν1 and ν2 are not consistent across all speakers.
Figure 6.12 Illustration of (i) input acoustic signal (xin [n]) and (ii) peaks of the Hilbert envelope of LP residual (hp ), for the three cases: (a) normal speech, (b) laughed-speech and (c) nonspeech-laugh.
6.5.5 Analysis of other production characteristics of laughter
Apart from the glottal excitation source and vocal tract system characteristics examined earlier in
this section, the acoustic signal consisting of laughter seems to carry some additional information that
humans can perceive easily. This information may possibly be extracted from LP residual [114] of
the acoustic signal. This additional information of the production characteristics of laughter is derived
from the Hilbert envelope [137] of LP residual of the acoustic signal. Two features are extracted, the
amplitude (hp ) and sharpness measure (η) of peaks in the Hilbert envelope (HE) of LP residual at GCIs
[170]. The sharpness measure (η) is observed to be useful in discriminating the NS, LS and NSL cases.
An illustration of peaks in the Hilbert envelope of LP residual at GCIs is given in Fig. 6.12(a),
Fig. 6.12(b) and Fig. 6.12(c), for the three cases NS, LS and NSL, respectively. Normalized values of
HE peaks (hp ) are used. The peaks are narrower and sharper for nonspeech-laugh in comparison with
laughed-speech calls. Also, the width of peaks (near half-height level) is relatively more for NS than
for LS/NSL calls. The degree of sharpness of these peaks can be compared in terms of the sharpness
measure η [170]
η = (1/N) Σ_{i=1}^{N} [ σhn(xw) / µhn(xw) ] ,   with xw = xi − l1 to xi + l2,  i = 1, 2, ..., N        (6.18)
where i is index of epoch within a laugh (NSL/LS) call or NS voiced region, and N is total number of
epochs in the segment. Here, σhn and µhn are standard deviation and mean of the normalized values of
Table 6.6 Changes in average η and ση for laughed speech (LS) and nonspeech laugh (NSL) with
reference to normal speech (NS). Columns (a)-(c) are average η, (d)-(f) are ση and (g)-(i) are average ξ
values for NS, LS and NSL. Columns (j) and (k) are ∆ξ (%) for LS and NSL, respectively. Note: Si
indicates speaker number (i = 1 to 11) and M/F male/female.
Speaker (M/F) | (a) ηNS | (b) ηLS | (c) ηNSL | (d) σηNS | (e) σηLS | (f) σηNSL | (g) ξNS | (h) ξLS | (i) ξNSL | (j) ∆ξLS (%) | (k) ∆ξNSL (%)
S1 (M)  | .588 | .559 | .509 | .137 | .142 | .160 | 239 | 260 | 308 |   9 |  29
S2 (M)  | .574 | .547 | .516 | .194 | .169 | .158 | 333 | 306 | 308 |  -8 |  -7
S3 (F)  | .514 | .511 | .537 | .124 | .135 | .139 | 242 | 260 | 266 |   7 |  10
S4 (M)  | .575 | .572 | .499 | .150 | .154 | .137 | 262 | 271 | 276 |   3 |   5
S5 (M)  | .507 | .505 | .505 | .139 | .127 | .110 | 279 | 252 | 219 |  -9 | -21
S6 (F)  | .530 | .527 | .546 | .141 | .128 | .129 | 265 | 244 | 237 |  -8 | -10
S7 (M)  | .487 | .518 | .492 | .117 | .154 | .129 | 241 | 298 | 264 |  24 |  10
S8 (M)  | .579 | .592 | .513 | .176 | .155 | .131 | 302 | 262 | 256 | -13 | -15
S9 (F)  | .538 | .523 | .521 | .153 | .131 | .117 | 283 | 249 | 226 | -12 | -20
S10 (F) | .533 | .577 | .539 | .124 | .147 | .099 | 233 | 255 | 183 |   9 | -22
S11 (M) | .522 | .569 | .492 | .139 | .164 | .135 | 266 | 234 | 204 | -12 | -23
Average |      |      |      |      |      |      | 268 | 263 | 250 |     |
Hilbert envelope (hn ), computed over a window (xw ) of size xi −l1 to xi +l2 located at xi for ith epoch.
Normalized (hn ) values are obtained by dividing the hp values in the window at xi , by the amplitude at
xi . A lower value of η indicates a comparatively sharper (i.e., less spread) peak.
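A minimal Python sketch of the sharpness measure in (6.18) and the parameter ξ in (6.19) is given below. It assumes the Hilbert envelope of the LP residual and the epoch locations (in samples) are already available; the default window extents l1, l2 are placeholders, and grouping the per-epoch ratios within a call to obtain ηave and ση is one possible reading of the definitions above, not necessarily the exact procedure used here.

import numpy as np

def peak_sharpness_ratios(he, epochs, l1, l2):
    """Per-epoch ratio sigma_hn/mu_hn of the normalised Hilbert envelope (cf. Eq. 6.18)."""
    ratios = []
    for x_i in epochs:
        win = he[max(x_i - l1, 0): x_i + l2 + 1]
        h_n = win / he[x_i]                    # normalise by the amplitude at the epoch
        ratios.append(np.std(h_n) / np.mean(h_n))
    return np.asarray(ratios)

def eta_and_xi(he, epochs, l1=8, l2=8):
    """eta as the mean ratio over the N epochs of a call; xi = (sigma_eta/eta_ave)*1000 (Eq. 6.19)."""
    r = peak_sharpness_ratios(he, epochs, l1, l2)
    eta_ave, sigma_eta = r.mean(), r.std()
    return eta_ave, sigma_eta / eta_ave * 1000.0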
In Table 6.6, for each speaker, the average η (ηave ) values are given in columns (a), (b) and (c), for
NS, LS and NSL cases, respectively, with corresponding standard deviation (ση ) in columns (d)-(f).
Since ση reduces and ηave increases more for NSL than LS calls, a parameter ξ is computed as
ξ = (ση / ηave) × 1000        (6.19)
where ηave and ση are average and standard deviation of η, computed for a call. The average values of
parameter ξ for each speaker are given in columns (g), (h) and (i), for NS, LS and NSL, respectively.
Changes in ξ for LS and NSL cases, i.e., ∆ξLS = (ξLS − ξNS)/ξNS × 100 and ∆ξNSL = (ξNSL − ξNS)/ξNS × 100,
are given (in %) in columns (j) and (k), respectively. The values are rounded to integers. The average ξ
for NS, LS and NSL, are 268, 263 and 250, respectively. For most speakers, larger changes (mostly
reduction) in ξ occur for nonspeech-laugh than for laughed-speech, with reference to normal speech. It
indicates that peaks of the Hilbert envelope of LP residual (at GCIs) are generally sharper (less spread) for laughter (NSL/LS) calls than for normal speech. It is possible that this increased sharpness of HE peaks is related to a faster rate of closing (abrupt closure) of the vocal folds during production of laughter.
6.6 Discussion on the results
Production characteristics of laughter are examined in this study using the EGG and acoustic signals.
The following features are extracted: (i) source features α, F0 (i.e., F0ZFF), dI and SoE (i.e., ψ), (ii) system features FD1 and FD2 , and (iii) other production features hp and η. Parameters are derived from these features that distinguish laughter (LS/NSL) calls and NS voiced regions. Parameters (βα , γ1 , γ2 , γ3 , ν1 , ν2 and ξ) capturing the degree of changes use average values and standard deviations of all these features, whereas temporal parameters (θ, φ and ρ) capture the intra-call temporal changes in the source
features (F0 , dI and SoE). All the parameters derived in this study can be summarized as follows:
(i) parameter βα , derived from the closed phase quotient (α) using the EGG signal
(ii) parameters γ1 and θ, derived from F0 (i.e., F0ZFF) using the acoustic signal
(iii) parameters γ2 and φ, derived from the source feature dI (impulse-density)
(iv) parameters γ3 and ρ, derived from the source feature SoE (i.e., ψ)
(v) parameters ν1 and ν2 , derived from the system features FD1 and FD2
(vi) parameter ξ, derived from the other production feature η
Analysis from EGG signal indicated that the closed phase quotient (α) within each glottal cycle is
reduced for laughter, in comparison to normal speech (Table 6.1). The feature α is reduced more for
nonspeech-laugh than laughed-speech. Changes in the closed phase quotient (α) are reflected better in
a parameter βα . Across all speakers, the increase in βα from normal speech (i.e., ∆βα ) is more for
nonspeech-laugh (Table 6.1). The reduction of closed phase quotient (α) for laughter (NSL/LS) calls
is possibly related to abrupt closure of vocal folds, which is perhaps reflected in sharper HE peaks, examined using features hp and η in Section 6.5.5. Also, the glottal cycle period (T0 ) is reduced more for
nonspeech-laugh than laughed-speech, which causes the average F0EGG to increase more for nonspeech-laugh (Table 6.1). Thus, analysis from the EGG signal highlights that significant changes in the characteristics of the glottal source of excitation indeed take place during production of laughter. Due to limitations in collecting the EGG signal, the production characteristics are analysed from the acoustic signal.
The excitation source characteristics are analysed in terms of features F0 , dI and SoE derived from
the acoustic signal using a modified ZFF method proposed in Section 6.4. The trends of changes in
average F0EGG and F0ZF F for LS/NSL laugh relative to NS are quite similar in Tables 6.1 and 6.2. In a
few cases the average values of F0ZF F are marginally higher than F0EGG . It may be due to possible rare
presence of the secondary excitation pulses that are otherwise suppressed well in the Hilbert envelope
of the modified ZFF output signal used for computing the F0ZF F . Theoretically, F0EGG should be more
reliable than F0ZF F , and is used as ground truth reference. But, for relative convenience in collecting the
acoustic signal data over EGG, and the reason that other features are also derived from the same signal,
the F0ZF F is used as F0 in this study. Changes in F0 (i.e., F0ZF F ), that are captured better by using
a parameter γ1 , are larger for nonspeech-laugh than laughed-speech, with reference to normal speech
(Table 6.2). Interestingly, there occurs a gradual inter-call increasing/decreasing trend in the average α over successive calls in a (LS/NSL) laugh bout (Fig. 6.4). The inter-call rising/falling trend is also observed in the average F0 values for calls in a bout, which is in line with an earlier study [11].
In the production of laughter, there possibly exists amplitude modulation of some higher frequency
content (around 500-1000 Hz) [159], as can be observed in the acoustic signal shown in Fig. 6.8(a). This higher frequency content is not noticeable in the EGG signal (Fig. 6.8(b)). It is possibly related
to the presence of secondary excitation pulses in each pitch period [67, 8]. The instants of this secondary impulse-like excitation in the case of laughter, seem to appear as negative to positive going
zero-crossings that are additional to regular GCIs (epochs). These instants can be captured better by
using some special signal processing technique such as the modified ZFF method (Fig. 6.8(c)). These
additional zero-crossing instants can be exploited for discriminating laughter and normal speech, by
using a feature dI that represents the density of excitation impulses located at all positive going zero-crossings of the modZFF signal (zx [n]). Changes in the feature dI are examined for the three cases
NS, LS and NSL. In general, the average dI (dIave ), intra-call fluctuations in dI (σdI ) and a parameter γ2 representing changes in dI , are observed to be higher for nonspeech-laugh than for laughed-speech
(Table 6.3). Changes in the source characteristics are also examined in terms of the strength of excitation (SoE), i.e., feature ψ. The parameter γ3 , representing changes in SoE, shows larger changes for
nonspeech-laugh than for laughed-speech, for most speakers (Table 6.4).
Temporal changes in the source characteristics are examined to validate the results. Temporal parameters θ, φ and ρ capture temporal changes in the features F0 , dI and SoE (ψ), respectively. Parameter θ
captures changes in the intra-call rising (or falling in some cases) gradient of F0 contour, parameter φ the
absolute rate of change in the density of excitation impulses (dI ), and parameter ρ captures the (mostly
rising) gradient of SoE, within each laugh call. Larger changes in the parameters θ, φ and ρ can be
observed for nonspeech-laugh than laughed-speech, with reference to normal speech, in Tables 6.2, 6.3
and 6.4, respectively. Similarity of results in discriminating laughter and normal speech by two different
approaches, i.e., by using both the parameters γ1 , γ2 and γ3 and the temporal parameters θ, φ and ρ,
validates the utility of these parameters as well as the results.
Associated changes in the vocal tract system characteristics during production of laughter are examined in terms of features such as dominant frequencies FD1 and FD2 , derived from the acoustic signal
using LP analysis and group delay function. The distribution of FD2 vs FD1 (Fig. 6.11) highlights the
ability of FD1 and FD2 in discriminating between NSL and LS bouts, in some cases, which is similar
to formants-clusters used in [11]. Parameters ν1 and ν2 derived using averages and fluctuations in these
features, show larger changes for nonspeech-laugh than for laughed-speech, for some speakers. But, the
observations are not consistent for all speakers. Hence, there is scope for better features of the vocal tract system, to help distinguish laughter from normal speech.
The additional information present in the acoustic signal consisting of laughter, is examined using
the Hilbert envelope (HE) of LP residual of the signal. Two features hp and η are extracted, that measure
the amplitude and sharpness of HE peaks, respectively. Parameter ξ captures changes in the feature η.
Larger changes in the parameter ξ observed for nonspeech-laugh than for laughed-speech (Table 6.6),
are mostly in line with changes in the parameters (βα , γ1 , γ2 and γ3 ) for the source characteristics and
the parameters (ν1 and ν2 ) for the system characteristics. From all the parameters derived it is observed
that, in general, larger changes take place in the production characteristics for nonspeech-laugh than
for laughed-speech, with reference to normal speech.
6.7 Summary
In this study, the production characteristics of laughter are examined from both EGG and acoustic
signals. The speech-laugh continuum is analysed in three categories, namely, normal speech, laughed-speech and nonspeech-laugh. Data was collected by eliciting natural laughter responses. Three texts were used for comparing the laughed-speech and normal speech. Laughter data is examined at call and bout levels. Only voiced cases of spontaneous laughter are considered. The excitation source features
are extracted from both the EGG and acoustic signals. The vocal tract system features are extracted
from the acoustic signal, along with some production characteristics related to both source and system.
Parameters representing changes in these features are derived, to distinguish the three cases.
The closed phase quotient (α) of glottal cycles and the instantaneous fundamental frequency (F0 )
are obtained from the analysis of the EGG signal. Average α reduces and F0 increases more for nonspeech-laugh than laughed-speech, with reference to normal speech. The average values of α and F0 also
exhibit some inter-calls decreasing/increasing trend for (LS/NSL) laughter bouts. The excitation source
characteristics are derived from the acoustic signal using a modZFF method proposed. In the acoustic
signal of laughter, the likely presence of secondary impulse-like excitation is examined in terms of density of impulses (dI ) and the strength of excitation (SoE), derived from the modZFF signal (zmx [n]).
Parameters βα , γ2 and γ3 represent the degree of changes in the source features α, dI and SoE, respectively. Results are validated using two temporal parameters φ and ρ derived from the source features dI
and SoE, respectively. These parameters also discriminate well the three cases (NS, LS and NSL).
Changes in the vocal tract system characteristics are examined in terms of the first two dominant
frequencies FD1 and FD2 derived from the acoustic signal using LP analysis. The features discriminate
between laughter and normal speech, in some cases. Additional excitation information present in the
acoustic signal of laughter is examined using the Hilbert envelope of the LP residual around epochs, in
terms of sharpness (η) of HE peaks. Parameter ξ derived from the feature η helps in further discriminating the three cases.
In this study, the unvoiced grunt-like or snort-like laughs are not considered. Also, changes in the vowel-like nature of different laughter-types may be studied further. It would also be interesting to examine expressive voices, which are trained voices and involve voluntary control over the speech production mechanism. However, the production source characteristics of laughter examined in this
study and the parameters derived, would be useful in further developing systems for automatic detection
of laughter in continuous speech.
Chapter 7
Analysis of Noh Voices
7.1 Overview
Production characteristics are changed under voluntary control of speech production mechanism in
the case of nonverbal emotional speech sounds, and involuntarily in the case of paralinguistic sounds.
But, in the case of expressive voices such as opera or Noh singing, the trained voice is produced through
the voluntary control exercised by the artist, which is achieved after years of training and practice.
This chapter analyzes the significance of aperiodic component of excitation in contributing to the
voice quality of expressive voices, in particular, the Noh Voice. The study highlights the feasibility
of representing the excitation source characteristics in expressive voice signals, through a time domain
sequence of excitation impulses which are related to pitch-perception of aperiodicity. The aperiodic
component is represented in terms of a sequence of impulses with amplitudes representing the relative
strengths of the impulses. The frequency characteristics of the impulse sequence explain the perception
of pitch and its subharmonics. The availability of the aperiodic component in the form of sequence of
impulses with relative amplitudes helps in studying the significance of excitation in contributing to the
quality of expressive voices, both by analysis and by synthesis. The role of amplitude/frequency modulation (AM/FM) in the excitation component of expressive voice signal is examined using synthetic
AM/FM sequences. A signal processing method is proposed for deriving the impulse sequence representation of the excitation source information in expressive voices. Validation of results is carried out
using spectrograms, a pitch perception measure and signal synthesis. A method is also proposed for F0
extraction from expressive voices in the regions of harmonics/subharmonics and aperiodicity.
The chapter is organised as follows. Section 7.2 discusses issues in representing the excitation source
component in expressive voices, in terms of a time domain sequence of impulses having amplitudes corresponding to the strength of excitation around those impulse locations. Analysis approach adopted in
this study is discussed in Section 7.3. Section 7.4 discusses the proposed method of analysing the aperiodic components in terms of saliency of the pitch-perception. The effect of rapid changes in pitch periods in expressive voices on saliency, and the effect of different window lengths for the trend removal
operation in ZFF method are analyzed for synthetic AM and FM sequences. Section 7.5 discusses the
excitation impulse sequence representation for expressive voices, using a modified zero-frequency filtering (modZFF) method proposed, to minimize the effect of window length for trend removal operation
in ZFF method. In Section 7.6, the characteristics of aperiodic excitation for the segments of Noh voice
studied by the XSX method [55] are examined using the proposed modZFF method. Perception of
subharmonic characteristics and rapid variations in the source characteristics are examined using spectrograms of the source represented in terms of an aperiodic sequence of excitation impulses. That the derived SoE impulse sequence represents the source characteristics only is ascertained by decomposition of the speech signal into source and system characteristics. Results are validated by comparing the saliency
plots with results of the XSX based method [55], and with ground truth obtained from the LP residual.
Section 7.7 discusses the significance of aperiodicity in expressive voices. Synthesis of speech signal
and F0 extraction, using the information of pitch perception in the case of expressive voices, are also
demonstrated. Section 7.8 gives a summary and research contributions in this chapter.
7.2 Issues in representing excitation source in expressive voices
The zero-frequency filtering method (ZFF) is a simple and effective method for deriving the sequence
of epochs and relative strengths of impulse-like excitation at epochs [130, 216]. The method involves
passing the differenced speech signal through a cascade of two zero-frequency resonators (ZFRs). Each
ZFR is an ideal digital resonator with the pair of poles on the unit circle in the z-plane at 0 Hz. The effect
of passing the signal through the cascade of ZFRs is equivalent to successive integration operation, as
shown in Fig. 7.1(b). The trend in the output is removed by subtracting the local mean computed over a
window of length in the range of one to two pitch periods. The resulting signal is called zero-frequency
filtered (ZFF) signal. The instants of negative to positive zero crossings in the ZFF signal correspond
to the glottal closure instants (GCIs), termed as epochs [130]. The slope of the ZFF signal around the
epochs is used to represent the relative strengths of the impulses at epochs [130, 216]. It is termed as
relative strength of significant excitation (SoE). The steps involved in deriving the ZFF signal, epochs
and strength of excitation [130] from speech signal, for modal voicing are discussed in Section 4.2.2.
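The ZFF steps described above can be sketched in Python as follows; this is a simplified illustration (cumulative sums standing in for the ideal zero-frequency resonators, and a single trend-removal stage), not the reference implementation of [130, 216].

import numpy as np

def zff_epochs(s, fs, win_ms=10.0):
    """Simplified ZFF sketch: epoch locations and strengths of excitation (SoE)."""
    s = np.asarray(s, dtype=float)
    x = np.diff(s, prepend=s[0])               # differenced speech signal
    y = x.copy()
    for _ in range(4):                         # cascade of two ZFRs ~ four integrations
        y = np.cumsum(y)
    N = int(win_ms * 1e-3 * fs / 2)            # trend-removal window of 2N+1 samples
    kernel = np.ones(2 * N + 1) / (2 * N + 1)
    zff = y - np.convolve(y, kernel, mode='same')
    zc = np.where((zff[:-1] < 0) & (zff[1:] >= 0))[0]   # negative-to-positive crossings
    soe = np.abs(zff[zc + 1] - zff[zc])                  # slope around each epoch
    return zc, soe / (soe.max() + 1e-12)

With win_ms chosen in the range of one to two pitch periods, the extracted epoch locations are largely insensitive to the exact choice, as illustrated below for Fig. 7.1.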
The results of the ZFF method for three different window lengths on the extracted epochs and their
strengths are shown in Fig. 7.1 for a segment of voiced speech, whose average pitch period is about 9 ms.
Note that the epoch locations and their strengths remain same for a range of window lengths within about
one to two pitch periods. For smaller window lengths more locations for epochs are identified, and some
of them may be attributed to either excitation impulses with lower strengths or may be spurious epochs.
The choice of the window length is not critical if changes in the pitch period are not rapid, which is
the case for modal voicing. In cases where the pitch period changes rapidly, the ZFF method [130, 216]
needs to be modified in order to capture the variations in the pitch period, as in the case of laughter [185].
In expressive voices also, there could be significant changes in the intervals of successive pitch periods.
An illustration of the effect of shorter window lengths for trend removal is shown in Fig. 7.1. The ZFF
output signal along with epochs (marked with downward arrows) is shown in Fig. 7.1(c), (e) and (g),
Figure 7.1 Results of the ZFF method for different window lengths for trend removal for a segment of
voiced speech. Epoch locations are indicated by inverted arrows.
for window lengths of 12ms, 8ms and 4ms, respectively. Impulses at epochs with respective strength
of excitation (SoE) are shown in Fig. 7.1(d), (f) and (h). It may be observed in Fig. 7.1(h) that shorter
window lengths may highlight relatively more information, which may be useful for signals having
rapid pitch variations, e.g., expressive voices. But, sometimes it is difficult to interpret epochs derived
using small window length (< one pitch period), especially in the case of modal voicing where pitch
does not vary rapidly. Some of these epochs could also be spurious, which may not actually correspond
to the instants of impulse-like excitation. Hence, there is need to reduce the effect of spurious epochs
that can occur when a small window length is used for trend removal operation in the ZFF method.
7.3 Approach adopted in this study
In this study, we take a different approach for analyzing the characteristics of the aperiodic components. The approach is based on representing the characteristics of the excitation signal in time domain
in terms of sequence of impulses and their relative strengths. The strengths of the impulses in the aperiodic excitation are shown as amplitudes of the nonzero sample values, at the locations of the impulses
corresponding to the instants of significant excitation or epochs. The irregular intervals between epochs
along with the variable strengths of the impulses are used to characterize the unique excitation characteristics of the expressive voices. The information extracted using the voice signal samples around
epochs may reflect the net effect of the vocal tract system. The effects of the nonuniform intervals of
the location of impulses, with nonuniform strengths, and the vocal tract system characteristics can be
studied to examine which of these components contribute to the unique characteristics of the expressive voices. The illustrations of Noh voice [55] are considered for comparing the characteristics of the
aperiodic components studied by the XSX method [55, 79, 81] with the proposed time domain method.
The impulse sequence in time domain is initially extracted using the zero-frequency filtering (ZFF)
method [130, 216], which is suitable mainly for modal voicing. The effect of different window lengths
for trend removal operation in the ZFF method is examined. The impulse sequence for aperiodic signal
is analysed in terms of saliency [55, 79, 80], to signify the perceived pitch. Effect of different window
lengths for trend removal on the derived sequence of excitation impulses at epoch locations is examined
in terms of saliency, for two synthetic cases. Two different sequences of impulses formed by amplitude
modulation (AM) and frequency modulation (FM) of a unit impulse sequence are used for studying this
effect on saliency. In order to eliminate the need for selecting an appropriate window length and also
to minimize the effect of window length on the derived impulse sequence for aperiodic signal like Noh
voice, a modified zero-frequency filtering (modZFF) method is proposed. Deriving the time domain impulse sequence using the modZFF also involves preprocessing of the input signal, advantage of which is
validated first by using the Hilbert envelope of LP residual of signal. The characteristics of aperiodicity
in expressive voices is then examined in terms of impulse sequence derived using the modZFF method
and the saliency [55] computed for this derived impulse sequence. The instantaneous fundamental frequency (F0 ) for expressive voices is obtained from the saliency information. The results obtained by
using the proposed signal processing methods are validated through saliency plots, spectrograms and visual comparison with results [55] obtained by XSX based TANDEM STRAIGHT method [55, 79, 81].
Analysis-synthesis approach is adopted for further validation and application of the results.
7.4 Method to compute saliency of expressive voices
For aperiodic excitation signals, it is difficult to fix the appropriate window length for trend removal,
as the intervals between successive epochs may vary rapidly and randomly. Moreover, due to aperiodicity, the perception of pitch is also difficult to express. The term “saliency” is used to express the
significance of perceived pitch [55].
In this study, we consider the autocorrelation function derived from the signal to express the saliency
of the perceived pitch frequency. The autocorrelation function is computed using the inverse discrete
Fourier transform (IDFT) [137] of the low pass (cut-off frequency 800 Hz) filtered spectrum of a 40 ms
Hann windowed segment of the signal. The locations of the peaks in the normalized autocorrelation
function (for lags > 0.5 ms) are used as estimates of the perceived pitch periods, and the magnitudes
of the peaks are used to represent the saliency (importance) of the estimates. The magnitudes of the
normalized autocorrelation function are displayed in terms of gray levels as a function of pitch frequency
(1/τ ) for each analysis frame, where τ is the time lag of the autocorrelation function. The gray level
display of saliency values as a function of frequency and analysis frame index (frame size = 40 ms and
frame shift = 1 ms) gives a spectrogram-like display. The resulting plot is called saliency plot.
The following steps are used to obtain the ‘saliency plot’ for a signal:
1. Select a segment sw [n] of 40 ms of the signal s[n] for every 1 ms.
2. Multiply the segment with Hann window w1 [n].
3. Compute the squared magnitude of short-time DFT [137] (Xw [k]) of the Hann windowed segment
xw [n], after appending with sufficiently large number of zeros to obtain adequate samples in the
frequency domain. It can be expressed as
Xw[k] = Σ_{n=0}^{N−1} xw[n] exp(−j2πnk/N)        (7.1)
where xw[n] = sw[n]·w1[n] and N is the number of points in the DFT. Here, N is a power of 2, and is
taken sufficiently large.
4. Multiply the spectrum Xw [k] with a (half Hamming) window function W2 [k] (in frequency domain) to obtain an approximate low pass (< 800 Hz) filtered spectrum (Xw2 [k]).
5. Compute the inverse DFT (IDFT) [137] of the filtered spectrum (Xw2 [k]) to obtain the autocorrelation function r[τ ]. It can be expressed as
r[τ] = (1/N) Σ_{k=0}^{N−1} |Xw2[k]|² exp(j2πτk/N)        (7.2)
where Xw2[k] = Xw[k]·W2[k] and N, which is a power of 2, is the number of points in the IDFT.
6. Use the normalized autocorrelation function r[τ ] from lag τ = 0.5 ms to τ = 40 ms to obtain it as
a function of frequency (1/τ ).
7. Plot amplitudes of the autocorrelation function r[τ ] as a function of the inverse of the time
lag (τ ) (i.e., frequency represented by 1/τ ), as gray levels for each analysis frame (frame rate =
1000 frames per sec). The resulting plot is the saliency plot.
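The steps above can be sketched for a single analysis frame as follows. The code is a minimal Python illustration under the stated settings (40 ms Hann window, approximate low-pass limit of 800 Hz via a half-Hamming taper, lags from 0.5 ms to 40 ms); the zero-padding factor and the function name are assumptions, not the exact implementation used in this study.

import numpy as np

def saliency_frame(seg, fs, f_cut=800.0, max_lag_ms=40.0, min_lag_ms=0.5):
    """One analysis frame of the saliency plot (steps 2-6 above)."""
    xw = seg * np.hanning(len(seg))                     # step 2: Hann window
    nfft = int(2 ** np.ceil(np.log2(len(seg) * 8)))     # step 3: large zero-padded DFT
    power = np.abs(np.fft.rfft(xw, nfft)) ** 2          # squared magnitude spectrum

    # Step 4: half-Hamming taper as an approximate low-pass (< f_cut) filter.
    freqs_hz = np.fft.rfftfreq(nfft, 1.0 / fs)
    k_cut = np.searchsorted(freqs_hz, f_cut)
    w2 = np.zeros_like(power)
    w2[:k_cut] = np.hamming(2 * k_cut)[k_cut:]          # falling half of a Hamming window

    # Step 5: IDFT of the filtered power spectrum gives the autocorrelation r[tau].
    r = np.fft.irfft(power * w2, nfft)
    r = r / (r[0] + 1e-12)                              # normalise

    # Step 6: keep lags from min_lag_ms to max_lag_ms, expressed as frequency 1/tau.
    lags = np.arange(1, int(max_lag_ms * 1e-3 * fs)) / fs
    keep = lags >= min_lag_ms * 1e-3
    return 1.0 / lags[keep], r[1:len(lags) + 1][keep]

Sliding this over the signal with a 1 ms hop and stacking the returned saliency values as gray levels against 1/τ gives the saliency plot of step 7.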
Saliency plots are useful for studying the effects of window lengths on the extracted epoch sequences.
In order to study the effect of window size for trend removal operation in ZFF method [130, 216],
on the estimated locations of epochs, these epoch locations are extracted for two cases of synthetic
aperiodic sequences of impulses: an (i) amplitude modulated (AM) pulse train and a (ii) frequency
modulated (FM) pulse train. Fig. 7.2(a) shows the saliency plot of the AM pulse sequence shown in
Fig. 7.2(b), whose base fundamental frequency is 160 Hz and subharmonic components are at 80 Hz,
i.e., at F0 /2. Fig. 7.3 shows the saliency plots and the corresponding epoch sequences derived by using
ZFF method on the sequence in Fig. 7.2(b), for different window lengths for trend removal. Since
the intervals between successive pulses are nearly the same, the ZFF method gives epoch locations reliably
when the window length for trend removal is in the range of one to two periods, as in Fig. 7.3(b) for the
(fixed) window size of 7 ms.
Figure 7.2 (a) Saliency plot of the AM pulse train and (b) the synthetic AM sequence.
Figure 7.3 Saliency plots ((a),(c),(e),(g)) of the synthetic AM pulse train and the epoch sequences ((b),(d),(f),(h)) derived by using different window lengths for trend removal: 7 ms ((a),(b)),
3 ms ((c),(d)) and 1 ms ((e),(f)). In (g) and (h) are the saliency plot for AM sequence and the cleaned
SoE sequence for 1 ms window length, respectively.
Figure 7.4 (a) Saliency plot of the FM pulse train and (b) the synthetic FM sequence.
Figure 7.5 Saliency plots ((a),(c),(e),(g)) of the synthetic FM pulse train and the epoch sequences ((b),(d),(f),(h)) derived by using different window lengths for trend removal: 7 ms ((a),(b)),
3 ms ((c),(d)) and 1 ms ((e),(f)). In (g) and (h) are the saliency plot for FM sequence and the cleaned
SoE sequence for 1 ms window length, respectively.
It is interesting to note that for the aperiodic sequence derived using a small window length (1 ms)
the saliency plot (Fig. 7.3(e)) matches well with the original (Fig. 7.2(a)). This indicates that the epoch
sequence along with respective strengths can be derived using a smaller window length for the trend
removal. But, there appear many spurious epochs with smaller strengths (Fig. 7.3(f)). Some of these
can be eliminated by retaining the epoch with the largest strength within a 1 ms interval (arbitrarily chosen) of the current epoch, which gives a cleaned SoE sequence. The resultant cleaned epoch sequence
(Fig. 7.3(h)) matches well with the original AM sequence in Fig. 7.2(b). The saliency plot (Fig. 7.3(g))
of the epoch sequence in Fig. 7.3(h), also matches well with that of the original sequence (Fig. 7.2(a)).
For aperiodic signals such as the AM sequence, longer window lengths for trend removal do not give a proper epoch sequence, as can be seen from Fig. 7.3(b) and 7.3(d) in comparison with Fig. 7.2(b). Also,
it is difficult to remove the spurious epochs from Fig. 7.3(d), as compared to those in Fig. 7.3(f).
Fig. 7.4 shows a synthetic FM pulse sequence (base fundamental frequency = 160 Hz) and the corresponding saliency plot. Note that, since the intervals between epochs decrease with time, there appears
a split in saliency at 160 Hz as two diverging lines. Hence for extracting very small intervals as in the
region 0.4 sec to 0.45 sec, it is necessary to use a small window length for trend removal. Fig. 7.5 shows
the extracted epoch sequences in (b), (d) and (f) using three different window lengths for trend removal.
The corresponding saliency plots are also shown on the left hand side in (a), (c) and (e). It is clear that
use of 1 ms window for trend removal produces all the epochs correctly as shown in Fig. 7.5(f). But
there are many spurious epochs, which can be removed by selecting the epochs with highest strength
within 1 ms (arbitrarily chosen in this case) interval of the current epoch. The resulting cleaned epoch
sequence and the corresponding saliency plot are shown in Fig. 7.5(h) and (g), respectively.
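The cleaning rule used above (retain only the strongest epoch within the arbitrarily chosen 1 ms neighbourhood) can be sketched as follows; this is a minimal Python illustration with hypothetical names, assuming the epochs are sample indices in increasing order with corresponding SoE values.

import numpy as np

def clean_epochs(epochs, soe, fs, min_gap_ms=1.0):
    """Keep only the strongest epoch within each min_gap_ms neighbourhood."""
    min_gap = int(min_gap_ms * 1e-3 * fs)
    kept_loc, kept_soe = [], []
    for e, s in zip(epochs, soe):
        if kept_loc and e - kept_loc[-1] < min_gap:
            if s > kept_soe[-1]:               # the stronger epoch wins the slot
                kept_loc[-1], kept_soe[-1] = e, s
        else:
            kept_loc.append(e)
            kept_soe.append(s)
    return np.array(kept_loc), np.array(kept_soe)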
7.5 Modified zero-frequency filtering method for analysis of Noh voices
7.5.1 Need for modifying the ZFF method
As discussed in Section 3.5, the characteristics of the excitation signal can be represented in time
domain in the terms of locations of impulses and their relative strengths, i.e., epoch sequence. The ZFF
method [130, 216] used for deriving the impulse sequence representation for modal voicing has two
limitations when applied for expressive voices: (i) Shorter window length would be required for trend
removal for higher F0 [159]. (ii) The impulse sequence for aperiodic signals is affected by the choice of
window length for trend removal. The analysis of epoch sequence representation of aperiodic signal in
terms of saliency, using synthetic AM/FM pulse trains in Section 7.4, establishes two points: (i) Shorter
window lengths bring out the information better, for nonverbal sounds with high degree of aperiodicity.
(ii) Some additional zero-crossings obtained by using short window lengths may be spurious.
First, we examine whether these additional zero-crossings obtained by using short window lengths
are related more to the excitation source component or the vocal tract system. The signal (s[n]) is
downsampled to 8000 Hz and 14th order LP analysis is carried out. In order to suppress the system
Figure 7.6 Illustration of waveforms of (a) input speech signal, (b) LP residual, (c) Hilbert envelope
of LP residual and (d) modZFF output, and (e) the SoE impulse sequence derived using the modZFF
method. The speech signal is a segment of Noh singing voice used in Fig. 3 in [55].
component, the linear prediction (LP) residual [112, 114, 153] e[n] (= x[n] − x̂[n]) is obtained as the difference of the downsampled signal x[n] and the predicted signal x̂[n] (using LP coefficients {ak}). Then, the excitation source component in the LP residual (e[n]) is highlighted by taking its Hilbert envelope (he [n]) [153, 137]. This signal he [n], now carrying predominantly the excitation source information, is used as input to the proposed modified zero-frequency filtering (modZFF) method.
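The chain just described (downsampling to 8000 Hz, 14th order LP analysis, LP residual and its Hilbert envelope) can be sketched as below; this is a minimal illustration using the autocorrelation method for the LP coefficients, not the exact implementation used here.

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import hilbert, resample_poly

def lp_residual_hilbert_env(s, fs, order=14):
    """LP residual e[n] of the 8 kHz signal and its Hilbert envelope he[n]."""
    x = resample_poly(s, 8000, int(fs)) if fs != 8000 else np.asarray(s, dtype=float)
    r = np.correlate(x, x, mode='full')[len(x) - 1:]            # autocorrelation
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])  # LP coefficients {a_k}
    x_hat = np.convolve(x, np.concatenate(([0.0], a)), mode='full')[:len(x)]
    e = x - x_hat                                               # LP residual
    he = np.abs(hilbert(e))                                     # Hilbert envelope
    return e, he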
Both limitations in ZFF method for the case of expressive voices, mentioned before, are addressed
in modZFF method by using gradually reducing window lengths for the trend removal operation. The
trend is removed coarsely first, using window lengths 20 ms, 10 ms and 5 ms. Then window lengths
3 ms, 2 ms and 1 ms are used successively, to capture the finer variations in the excitation component.
The impulse sequence is then obtained from the resultant modZFF output, whose positive to negative
going zero-crossings give impulse locations and its slope around zero-crossings the amplitudes (SoE).
Fig. 7.6 is an illustration of the signal (s[n]), LP residual (e[n]), Hilbert envelope of LP residual
(he [n]), modZFF output (zh [n]) and the SoE impulse sequence (ψ[n]) for a segment of Noh voice [55].
It may be observed in Fig. 7.6(b), (c) and (e) that the impulses of larger amplitude coincide with instants
of significant excitation, which may possibly correspond to the glottal closure instants (GCIs), i.e.,
epochs. The impulses of smaller amplitudes are located at intermediate points between two epochs.
Since, the vocal tract system component was substantially suppressed by taking first the LP residual and
then its Hilbert envelope, these impulses of smaller amplitudes most likely correspond to the excitation
component, and not the system. Though it is also possible that some of these impulses may be spurious.
Figure 7.7 Illustration of waveforms of (a) input speech signal, (b) preprocessed signal and (c) modZFF
output, and (d) the SoE impulse sequence derived using the modZFF method. The speech signal is a
segment of Noh singing voice used in Fig. 3 in [55].
A closer observation of the LP residual, modZFF output and impulse sequence representation obtained for different segments of Noh voice indicated that the excitation component is less likely to be
present beyond 1000 Hz. Hence, a preprocessing step prior to ZFF step is proposed, in place of computing the LP residual and its Hilbert envelope. The preprocessing step involves downsampling the signal
to 8000 Hz, smoothing over m sample points so as to get equivalent effect of low-pass filtering with
cut-off frequency (Fc ) around 1000 Hz, and then upsampling back to original sampling frequency (fs )
of the signal. Then modZFF output is obtained by performing the ZFF steps and trend removal operation
(using gradually reducing window lengths) on this preprocessed signal. In Fig. 7.7, an illustration of signal (s[n]), preprocessed signal (sp [n]), modZFF output (zm [n]) and the SoE impulse sequence (ψ[n])
is shown for the same segment of Noh voice as in Fig. 7.6. The impulse sequence (ψ[n]) (in Fig. 7.7(d))
has the same or a smaller number of impulses than the impulse sequence obtained by using the Hilbert envelope of LP residual as input (in Fig. 7.6(e)). Visual comparison between zh [n] in Fig. 7.6(d) and zm [n] in Fig. 7.7(c) indicates a reduction in the number of impulses by using the preprocessing step, which can also be seen by comparing Fig. 7.7(d) with Fig. 7.6(e). This is likely due to a reduction in spurious impulses.
7.5.2 Key steps in the modZFF method
Key steps in the proposed modified zero-frequency filtering method can be summarized as follows:
1. Preprocess the input signal (s[n]) by downsampling the signal to 8000 Hz, smoothing over m sample points to obtain an equivalent effect of low-pass filtering with cut-off frequency (Fc ) around
1000 Hz, and then upsampling back to original sampling frequency (fs ) of signal.
Figure 7.8 Illustration of waveforms of speech signal (in (a)), modZFF outputs (in (b),(d),(f),(h)) and
SoE impulse sequences (in (c),(e),(g),(i)), for the choice of last window lengths as 2.5 ms, 2.0 ms, 1.5 ms
and 1.0 ms. The speech signal is a segment of Noh voice used in Fig. 3 in [55].
2. Get the differenced signal (x̃[n]) from the preprocessed signal (sp [n]), to obtain a zero-mean signal.
3. Pass the differenced signal (x̃[n]) through a cascade of two zero-frequency resonators (ZFRs) as in (8.1).
4. Remove the trend in the output of the cascaded ZFRs (ỹ1[n]), coarsely first by using the gradually reducing window lengths 20 ms, 10 ms and 5 ms in successive stages, and then using smaller window lengths 3 ms, 2 ms and 1 ms successively, to highlight the information related to aperiodicity better. In each trend removal stage, implemented as in (8.2), the window has 2N + 1 sample points. Its final output is called the modified zero-frequency filtered (modZFF) signal (zm [n]). An illustration of the modZFF output signal (zm [n]) is shown in Fig. 7.7(c) for a segment of Noh voice.
5. The positive to negative going zero-crossings of the modZFF signal (zm [n]) give locations of
impulses. The slope of the modZFF signal (zm [n]) around each of these locations indicates the
strength of excitation (SoE) of the impulse around that location. An illustration of the SoE impulse
sequence (ψ[n]) is shown in Fig. 7.7(d).
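The five steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions (a moving-average span of m = 8 samples at 8 kHz to approximate the ~1000 Hz low-pass, cumulative sums standing in for the zero-frequency resonators of (8.1), and mean subtraction for each trend-removal stage of (8.2)); it is not the reference implementation.

import numpy as np
from scipy.signal import resample_poly

def modzff_soe(s, fs, windows_ms=(20, 10, 5, 3, 2, 1)):
    """Sketch of the modZFF steps: preprocessing, ZFRs and staged trend removal."""
    s = np.asarray(s, dtype=float)
    # Step 1: downsample to 8 kHz, smooth (moving average), upsample back to fs.
    x8 = resample_poly(s, 8000, int(fs))
    m = 8                                             # assumed span: first null near 8000/8 = 1000 Hz
    x8 = np.convolve(x8, np.ones(m) / m, mode='same')
    sp = resample_poly(x8, int(fs), 8000)[:len(s)]

    # Step 2: differenced (zero-mean) signal.
    x = np.diff(sp, prepend=sp[0])

    # Step 3: cascade of two zero-frequency resonators ~ four successive integrations.
    y = x.copy()
    for _ in range(4):
        y = np.cumsum(y)

    # Step 4: trend removal with gradually reducing window lengths.
    z = y
    for w_ms in windows_ms:
        N = int(w_ms * 1e-3 * fs / 2)
        kernel = np.ones(2 * N + 1) / (2 * N + 1)
        z = z - np.convolve(z, kernel, mode='same')

    # Step 5: impulse locations at positive-to-negative zero crossings; SoE from local slope.
    zc = np.where((z[:-1] > 0) & (z[1:] <= 0))[0]
    soe = np.abs(z[zc + 1] - z[zc])
    return zc, soe / (soe.max() + 1e-12)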
7.5.3 Impulse sequence representation of source using modZFF method
The advantage of modZFF method is its ability to derive an impulse sequence that represents the
excitation component of an aperiodic signal. The only issue remaining to be addressed in it now, appears
Table 7.1 Effect of preprocessing on number of impulses: (a) last window length (wlast) (ms), #impulses obtained (b) without preprocessing (Norig), (c) with preprocessing (Nwpp), and (d) difference ∆Nimps = (Norig − Nwpp)/Norig (%). The 3 Noh voice segments correspond to Figures 6, 7 and 8 in [55].

Voice segment (Noh voice)  | (a) wlast (ms) | (b) Norig | (c) Nwpp | (d) ∆Nimps (%)
Segment1: Fig. 6 in [55]   | 2.5 | 112 | 111 |  0.89
                           | 2.0 | 120 | 115 |  4.17
                           | 1.5 | 133 | 130 |  2.26
                           | 1.0 | 137 | 135 |  1.46
                           | 0.5 | 152 | 147 |  3.29
Segment2: Fig. 7 in [55]   | 2.5 | 182 | 172 |  5.49
                           | 2.0 | 201 | 195 |  2.99
                           | 1.5 | 210 | 209 |  0.48
                           | 1.0 | 211 | 210 |  0.47
                           | 0.5 | 375 | 220 | 41.33
Segment3: Fig. 8 in [55]   | 2.5 | 300 | 270 | 10.00
                           | 2.0 | 319 | 316 |  0.94
                           | 1.5 | 326 | 325 |  0.31
                           | 1.0 | 327 | 327 |  0.00
                           | 0.5 | 381 | 353 |  7.35
to be: what should be the last window length in the trend removal operation? It also needs to be verified how sensitive the locations of zero-crossings of the modZFF output (zm [n]) are to the choice of this last window length. In order to examine this, the trend removal operation was performed for a Noh
voice segment in 4 different iterations, each using the last window length as 2.5 ms, 2.0 ms, 1.5 ms
or 1.0 ms, respectively. Fig. 7.8 shows an illustration of input speech signal (s[n]) and the modZFF
outputs (zmj [n]) (where j = 1, 2, 3, 4) obtained by using these last window lengths. Corresponding
SoE impulse sequence (ψj [n]) is also shown for each case. It is interesting to observe from Fig. 7.8 that
the locations of zero-crossings of modZFF output and also of the impulses having larger amplitudes are
nearly the same, for the last window length (wlast) taken in the range of 1.0 ms to 2.5 ms. The number of impulses is, though, marginally increased for some segments when shorter last window lengths are used. It may be inferred that the locations and amplitudes of impulses in the SoE sequence obtained by the modZFF method are not sensitive to the choice of last window length in the 1.0 ms to 2.5 ms range.
The main advantage of using the preprocessing step in the modZFF method is the smaller number of impulses obtained with preprocessing (Nwpp), in comparison to those obtained without any preprocessing (Norig), i.e., using the original signal itself. This reduction in the number of impulses resulting from the preprocessing step is possibly related to a reduction in spurious impulses. In Table 7.1,
the number of impulses obtained by the modZFF method with/without preprocessing are given for 3 different segments of Noh voice [55]. The number of impulses obtained without preprocessing (Norig ) and
with preprocessing (Nwpp ) are given in columns (b) and (c), respectively, for the choice of last window length (wlast ) in column (a). The percentage difference in the number of these impulses relative to
Norig, i.e., ∆Nimps = (Norig − Nwpp)/Norig (%), is given in column (d). Fig. 7.9 shows the results given in Table 7.1
Figure 7.9 Selection of last window length: Difference (∆Nimps )(%) in the number of impulses obtained with/without preprocessing vs choice of last window length (wlast ) (ms), for 3 different segments
of Noh singing voice [55]. [Solid line: segment1, dashed line: segment2, dotted line: segment3]
Figure 7.10 (a) Saliency plot and (b) the SoE impulse sequence derived using modZFF method (last
window length=1 ms), for the input (synthetic) AM sequence.
by plotting the percentage difference (∆Nimps ) vs choice of the last window length (wlast ), and fitting
a curve of 4th order polynomial to the data points. Table 7.1 and Fig. 7.9 indicate that the difference in
the number of impulses obtained with/without preprocessing in modZFF method, is near minimum for
the choice of last window length as 1 ms (i.e., wlast =1 ms), for each of the 3 signal segments considered.
The ability of modZFF method in giving an impulse sequence representation of the excitation component is verified for the synthetic AM/FM pulse trains (see Section 7.4). Fig. 7.10 shows the saliency plot
of the impulse sequence derived using the modZFF, for a synthetic AM sequence. Likewise, Fig. 7.11
shows the saliency plot of the impulse sequence derived using the modZFF method, for FM sequence.
It is interesting to observe that the derived impulse sequence and the saliency plot (Fig. 7.10) are very
close to those for the original AM sequence (Fig. 7.2). Similarly, the derived impulse sequence and
saliency (Fig. 7.11) are quite similar to those shown for the original FM sequence (Fig. 7.4). The similarity between original AM/FM sequences (Fig. 7.2 and Fig. 7.4) and those derived using the modZFF
method (Fig. 7.10 and Fig. 7.11), with similarity of respective saliency plots, thus validates the proposed
modZFF method and indicates its usefulness in capturing aperiodicity in expressive voices.
Figure 7.11 (a) Saliency plot and (b) the SoE impulse sequence derived using modZFF method (last
window length=1 ms), for the input (synthetic) FM sequence.
Figure 7.12 (a) Signal waveform and spectrograms of (b) signal, its (c) LP residual and (d) SoE impulse
sequence obtained using the modZFF method, for a Noh voice segment corresponding to Fig. 1 in [55].
7.6 Analysis of aperiodicity in Noh voice
The feasibility of representing the information of excitation source through a sequence of impulses
was examined in previous sections. It is premised that the locations and amplitudes of impulses in this sequence also carry the perceptually significant information of aperiodicity in expressive voices. Hence, it is necessary to first verify whether the impulse sequence derived using the modZFF method really represents the excitation source component, or whether it also carries information of the vocal tract system. After ascertaining this, we then examine the aperiodicity in expressive voices later in this section.
7.6.1 Aperiodicity in source characteristics
The aperiodic impulse sequences with relative amplitudes (i.e., SoE), representing the excitation
source characteristics, can be obtained for different segments of Noh voice signals by using the modZFF
Figure 7.13 (a) Signal waveform and spectrograms of (b) signal, its (c) LP residual and (d) SoE impulse
sequence obtained using the modZFF method, for a Noh voice segment corresponding to Fig. 2 in [55].
method. The spectrograms of the epoch sequences reflect the aperiodic characteristics better than the
spectrogram of the signal waveform, as the effect of resonances of the vocal tract is suppressed in these.
The spectrogram of the epoch sequence also highlights the excitation characteristics such as harmonics,
subharmonics, pitch modulations, pitch rise and fall etc. better than the spectrogram of the signal.
An important feature of the spectrogram of the aperiodic signals is that the spectral features of the
aperiodicity will be mixed with spectral features of the vocal tract system. Hence, it is usually difficult to identify the formant features corresponding to the vocal tract shape in the spectrogram. Linear
prediction (LP) residual [112] of the speech signal is expected to reflect mainly the source characteristics [112, 114, 153]. The spectrogram of the LP residual may thus be used as a reference for representing
the excitation source characteristics, although it does show some vocal tract system features also.
Illustrations of the spectrograms for the speech signal, LP residual and the SoE impulse sequence
are shown in Fig. 7.12(b), (c) and (d), respectively. The spectrogram in Fig. 7.12(d) for the SoE impulse
sequence displays the excitation source characteristics clearly. It may also be observed in the region
of 9-10 sec that the features of harmonics and subharmonics change temporally. The overall spectral
characteristics due to aperiodic components are quite distinct from the spectral characteristics of modal
voicing (compare region of 9-10 sec with the region around 10-11 sec in Fig. 7.12(b) and (d)). It is
likely that the nature and extent of aperiodicity may be different in different short segments, as shown
in the spectrogram in Fig. 7.12(d) for the SoE impulse sequence. Spectrograms obtained in a similar
fashion for the SoE impulse sequences derived using the modZFF method, for the other two segments
of Noh singing voice (corresponding to Fig. 2 and Fig. 3 in [55]) are shown in Fig. 7.13 and Fig. 7.14.
A visual comparison between spectrograms of signal (Fig. 7.12(b)-Fig. 7.14(b)) and LP residual
(Fig. 7.12(c)-Fig. 7.14(c)) indicates that the LP residual also carries the information of the vocal tract
Figure 7.14 (a) Signal waveform and spectrograms of (b) signal, its (c) LP residual and (d) SoE impulse
sequence obtained using the modZFF method, for a Noh voice segment corresponding to Fig. 3 in [55].
In contrast, the vocal tract system information is completely suppressed in the spectrograms of the corresponding SoE impulse sequences. This is evident from the comparison of the spectrograms of the LP residuals (Fig. 7.12(c)-Fig. 7.14(c)) and of the SoE impulse sequences (Fig. 7.12(d)-Fig. 7.14(d)). The dark broader contours visible in the spectrograms of the signal (Fig. 7.12(b)-Fig. 7.14(b)) indeed carry the system information, as is shown later in this section.
7.6.2 Presence of subharmonics and aperiodicity
The regions of aperiodicity in expressive voices, such as the regions between 9-10 sec in Fig. 7.12, 8.9-9.5 sec and 13-14.5 sec in Fig. 7.13, and 8.0-9.5 sec and 13-14.5 sec in Fig. 7.14, possibly also contain subharmonics. In order to confirm this, the spectrograms expanded in the frequency range 0-800 Hz were examined more closely. An illustration of the expanded spectrograms of the signal and the SoE impulse sequence for the Noh voice signal, in the region between 13-14.5 sec in Fig. 7.14 and for the frequency range 0-800 Hz, is shown in Fig. 7.15(b) and (c), respectively. The differences between the two spectrograms in Fig. 7.15 can be observed in three distinct regions: (i) R1: 13.8-14.2 sec, (ii) R2: 14.25-14.45 sec and (iii) R3: 14.45-14.65 sec. The differences in these regions are more visible in the spectrogram of the SoE impulse sequence (in (c)). In the first region R1 (13.8-14.2 sec) (in Fig. 7.15(c)), the presence of periodicity, and thereby of harmonics, is indicated by the regularity of the harmonic peaks. In the second region R2 (14.25-14.45 sec), subharmonics are clearly visible around 100 Hz.
Figure 7.15 Expanded (a) signal waveform, and spectrograms of (b) signal and its (c) SoE impulse sequence obtained using the modZFF method, for the Noh voice segment corresponding to Fig. 3 in [55].
In the third region R3 (14.45-14.65 sec), there is neither periodicity nor harmonics/subharmonics; the signal exhibits randomness and noise-like behaviour in this region. Hence, the second region, and possibly the third region also in some cases, may be called regions of 'aperiodicity', while the first region R1 is a region of 'periodicity'. It is interesting to note that the presence of subharmonics, indicated by a dark band around 100 Hz in the region R2 (14.25-14.45 sec), is highlighted better in the spectrogram of the SoE impulse sequence.
Similar regions of aperiodicity are also observed in the other two segments of Noh voice corresponding to Fig. 7.12 and Fig. 7.13. Hence, it may be inferred that regions of aperiodicity in expressive voices can indeed be analysed better from the source characteristics, which are represented by the SoE impulse sequence derived using the modZFF method.
7.6.3 Decomposition of signal into source and system characteristics
Decomposition of speech signals for the analysis of aperiodic components of excitation was proposed in [214], using the ZFF method [130, 216], in which the vocal tract system characteristics were derived using the group delay [128, 129]. In another study [81, 78, 79], the system characteristics were derived using the TANDEM-STRAIGHT method, with the XSX method [55] used for extracting the pitch information. The SoE impulse sequence obtained in the previous sections represents the excitation source and carries the aperiodicity information. But is this impulse sequence representation enough to derive the characteristics of both the source and the system? Also, do the dark broader contours visible in the spectrograms of the signal (in Fig. 7.12(b)-Fig. 7.14(b)), and partially visible in the spectrograms of the LP residual (in Fig. 7.12(c)-Fig. 7.14(c)), really pertain to the system component? We examine both of these questions now.
Figure 7.16 (a) Signal waveform, and spectrograms of (b) signal, and its decomposition into (c) source
characteristics and (d) system characteristics, for a Noh voice segment corresponding to Fig. 3 in [55].
Assuming that the speech signal s[n] can be decomposed into the excitation source characteristics es[n] (i.e., the epoch sequence) and the vocal tract system characteristics hs[n] (i.e., the filter characteristics), the signal can be broadly represented as the convolution of the two, according to the source-filter model [153]. The power spectrum of the signal is given by

PT(ω) = Es(ω) Hs(ω)    (7.3)

where PT(ω), Es(ω) and Hs(ω) are the power spectra of the signal, the epoch sequence and the vocal tract system, respectively. Using the discrete frequency domain representations and taking only the magnitude spectrum part, PT(ω) corresponds to |S[k]|^2, Es(ω) to |Es[k]|^2 and Hs(ω) to |Hs[k]|^2, where S[k], Es[k] and Hs[k] are the DFTs [137] of s[n], es[n] and hs[n], respectively. Using these relations and (7.3), the magnitude spectrum of the system is given by

|Hs[k]|^2 = |S[k]|^2 / |Es[k]|^2    (7.4)

The system characteristics hs[n] can be obtained by taking the IDFT [137] of Hs[k].
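A minimal sketch of this spectral-division decomposition is given below, assuming the signal s and the SoE impulse sequence e are available as equal-length numpy arrays at sampling rate fs; the STFT settings are illustrative rather than the ones used for the figures.

    import numpy as np
    from scipy.signal import stft

    def system_magnitude_spectrogram(s, e, fs, nperseg=256, eps=1e-10):
        """Frame-wise estimate of |Hs[k]|^2 as |S[k]|^2 / |Es[k]|^2, cf. (7.4)."""
        _, _, S = stft(s, fs=fs, nperseg=nperseg)   # spectrogram of the signal
        _, _, E = stft(e, fs=fs, nperseg=nperseg)   # spectrogram of the epoch (SoE impulse) sequence
        # Spectral division; eps avoids division by zero in silent frames.
        return (np.abs(S) ** 2) / (np.abs(E) ** 2 + eps)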
Hence, the vocal tract system characteristics can be broadly obtained from the signal spectrum, given the excitation source characteristics. In the illustration shown in Fig. 7.16(d), the wideband spectrogram of the vocal tract system characteristics is derived from the spectrograms of the signal and the source characteristics (by using (7.4)). The wideband spectrogram of the corresponding Noh voice signal and the signal waveform are shown in Fig. 7.16(b) and (a), respectively. The narrowband spectrogram of the same signal segment can be seen in Fig. 7.14(b). For better clarity, the narrowband spectrogram of the source characteristics (i.e., the SoE impulse sequence derived using the modZFF method) is shown in Fig. 7.16(c). The visual similarity between the wideband spectrograms of the signal (Fig. 7.16(b)) and of the system (Fig. 7.16(d)) indicates that the spectrogram in Fig. 7.16(d) represents mainly the system characteristics. It also indicates that the dark broader contours in Fig. 7.16(b) (and also in Fig. 7.14(b)) indeed represent the formant contours, which are suppressed in all the spectrograms of the SoE impulse sequences shown in Fig. 7.12(d)-Fig. 7.14(d). The spectrogram of the excitation component in Fig. 7.16(c) can be contrasted with that of the vocal tract system in Fig. 7.16(d). The system characteristics derived in a similar fashion for the other segments of Noh voice also exhibit a similar distinction between the spectrograms of the excitation source characteristics and the vocal tract system characteristics.
7.6.4 Analysis of aperiodicity using saliency
In this subsection, we validate the ability of the impulse sequence representation of the excitation source information to capture the perceptually significant pitch information. The representation in the form of SoE impulse sequences with different intervals and amplitudes is validated by using the saliency measure [55]. Saliency plots are first obtained for different segments of the Noh voice signal using the LP residual, which can be used as a reference or ground truth. The saliency plots obtained using the SoE impulse sequences are then compared with these reference plots, and also with the saliency plots obtained using the XSX method [55]. Aperiodicity in expressive voices is analysed in terms of saliency, which is computed from the SoE impulse sequence representation of the excitation.
Fig. 7.17(a) is the saliency plot for a segment of the Noh voice signal (the region 9.4-9.8 sec in Fig. 7.12(b)), computed from its LP residual obtained by 14th order LP analysis [112, 114, 153] of the signal downsampled to 8000 Hz. This saliency plot may be considered as a reference, since it is computed from the derived excitation component of the signal. It may be noted that the higher harmonics (≥ 300 Hz) are not very visible in this saliency plot. Fig. 7.17(b) is the saliency plot obtained by the XSX method [55, 81], which is meant to represent the perceptually significant pitch information as discussed in [55, 80]. It may be observed that Fig. 7.17(b) captures the prominent features (indicated by darker lines) of the saliency plot in Fig. 7.17(a) computed from the LP residual. Fig. 7.17(c) is the saliency plot computed from the SoE impulse sequence derived using the modZFF method. The saliency plot in Fig. 7.17(c) (especially the dark bands indicating large saliency) matches well with those in Fig. 7.17(a) and (b). It may be observed that in Fig. 7.17(c) both the prominent features visible in the reference plot (in Fig. 7.17(a)) and the higher harmonics visible in the results of the XSX-based method (in Fig. 7.17(b)) can be seen clearly.
The frequencies of large saliency may be interpreted as the perceived pitch harmonics and subharmonics. But, due to large deviations from periodicity in the epoch (or SoE impulse) sequence, it is difficult to interpret the resulting perception as a small deviation from periodicity. It is indeed difficult to determine which frequency components of the excitation are perceived well by a human listener, as the significance of a frequency component need not be based only on its saliency. The saliency plots from the LP residual seem noisy, whereas the saliency plot from the SoE impulse sequence seems to pick up the high-salience frequency components well. Thus, the SoE impulse sequence may be used as a good representation of the excitation source from the perception point of view also.
Figure 7.17 Saliency plots computed with LP residual (in (a),(d),(g)), using XSX method (copied
from [55]) (in (b),(e),(h)), and computed with SoE impulse sequence derived using the modZFF method
(in (c),(f),(i)). The signal segments S1, S2 and S3 correspond, respectively, to the vowel regions
[o:](Fig. 6 in [55]), [i](Fig. 7 in [55]), and [o](with pitch rise)(Fig. 8 in [55]) in Noh singing voice [55].
Even for representation and manipulation, the epoch sequence with the amplitudes of impulses reflecting the strength of excitation (SoE) is a better choice than the LP residual.
Detailed analyses of the aperiodic components in the specific regions considered in [55] are carried out by using the SoE impulse sequences, in order to compare with the analyses made for the same segments using the XSX method [55, 81, 80]. The specific regions of the Noh voice signal selected for detailed study are the vowel segments considered in [55], namely 36.9-37.3 sec in Fig. 6, 56.2-57 sec in Fig. 7 and 109.4-110.3 sec in Fig. 8 of [55]. In this thesis, these regions correspond to the regions between 9.39-9.81 sec in Fig. 7.12, 13.7-14.5 sec in Fig. 7.13, and 13.88-14.81 sec in Fig. 7.14, respectively. We compare the saliency plots computed from the SoE impulse sequence derived using the modZFF method with the saliency plots derived using the XSX method [55, 81], to show that all the important features are preserved. The SoE impulse sequence thus provides an alternative representation of the aperiodic component of the voiced excitation in expressive voices.
Fig. 7.17(f) and (i) show the saliency plots obtained from the SoE impulse sequences for the other two vowel segments of Noh voice [55], with the corresponding saliency plots obtained by the XSX method (Fig. 7.17(e), (h)) taken from [55]. The effects of the nonuniform intervals and amplitudes of the impulses in the epoch sequences are similar to those shown in the XSX-based plots (Fig. 7.17(e), (h)), in terms of the regions of large saliency (dark lines). In the saliency plots obtained from the SoE impulse sequences, the time intervals of the harmonic and subharmonic pitch regions can be seen more clearly. The saliency plots (Fig. 7.17(f), (i)) for the SoE impulse sequences obtained using the modZFF method show the prominent features of the reference saliency plots computed with the LP residual (Fig. 7.17(d), (g)). Note that it is difficult to set a threshold on the saliency to determine the significance of the corresponding pitch frequency. It is likely that human perception takes all the values into account to appreciate the artistic features of the voice in the excitation, rather than characterizing it in terms of a few harmonic and subharmonic components.
From Fig. 7.17, a visual comparison of the saliency plots for the three vowel segments computed from the SoE impulse sequences (Fig. 7.17(c), (f) and (i)) with those obtained from the LP residual (Fig. 7.17(a), (d) and (g)) validates three points: (i) the epoch sequence representation of the excitation source characteristics is sufficient to represent the aperiodicity of the source characteristics in expressive voices, since the system characteristics can also be derived from it in some cases; (ii) the epoch sequence representation is better than the LP residual for representing the excitation source characteristics, since the latter may also carry traces of the spectral characteristics of the vocal tract system, as observed in the spectrograms in Fig. 7.12(c)-Fig. 7.14(c); and (iii) the locations of impulses and their relative amplitudes in the epoch sequence indeed capture the perceptually significant pitch information, which is represented better through saliency plots.
7.7 Significance of aperiodicity in expressive voices
The effect of the nonuniform epoch intervals/amplitudes in the impulse sequence representation of the speech signal on the resulting harmonic features in spectrograms was examined in Section 7.6. From Fig. 7.12(d), Fig. 7.13(d) and Fig. 7.14(d), it can be inferred that the aperiodic structure preserves the harmonic and subharmonic structure. Theoretically, if these nonuniform epoch intervals are made uniform, then most of the subharmonic components will be lost. The study of the AM and FM sequences and their saliency, examined in Section 7.4, also highlights the effect of amplitude/frequency modulation of the signal on pitch perception, which is likely to be present in expressive voices. The inferences drawn from these earlier sections lead to a few assertions, which are as follows:
(1) Expressive voices like Noh singing voice have regions of aperiodicity, apart from regions of periodicity and randomness. The regions of aperiodicity are highlighted better in the saliency plots.
(2) Regions of aperiodicity are more likely to have subharmonic structure, apart from the harmonic
structure. Presence of subharmonics is better seen in spectrograms.
(3) The instantaneous fundamental frequency (F0 ) changes rapidly in the regions of aperiodicity. Changes
in F0 in these regions may appear to be random to some extent.
Figure 7.18 (a) FM sequence excitation, (b) synthesized output and (c) derived SoE impulse sequence.
(4) Human perception is likely to take into account all likely values of the changing pitch frequency in
the regions of aperiodicity.
(5) Aperiodicity, subharmonics and changes in pitch perception in expressive voice signals are more
related to the excitation component, which can be represented through an SoE impulse sequence.
7.7.1 Synthesis using impulse sequence excitation
The effectiveness of the aperiodic sequence in capturing the perceptually salient excitation information can also be studied using synthesis and subjective listening. Speech is synthesized by exciting a 14th order LP model, computed for every 10 ms frame shifted by 1 ms, with one of the following three excitation sequences: (a) an impulse sequence with local averaging of the pitch period, (b) an impulse sequence with the actual epoch intervals and constant amplitudes (unit impulses), and (c) an impulse sequence with the actual epoch intervals along with their respective amplitudes (i.e., SoE). Speech is synthesized for each of the three Noh voice signals corresponding to Fig. 7.12, Fig. 7.13 and Fig. 7.14. Speech synthesized with excitation (c), i.e., the SoE impulse sequence, sounds relatively better than in the other two cases. It is interesting to note that it is the aperiodicity that contributes more to the expressive voice quality; the amplitudes of the impulses are not very critical. However, naturalness is lost if the excitation consists of only the aperiodic sequence of impulses, as it does not have the other residual information. Moreover, the 14th order LP model computed every 10 ms smoothes the spectral information pertaining to the vocal tract, and hence the fast changes in the vocal tract system characteristics are also not reflected in the synthesis.
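A minimal sketch of case (c), i.e., exciting an all-pole LP model with the SoE impulse sequence, is given below; for brevity a single 14th order model is estimated for the whole segment instead of the 10 ms frame-wise models, and the epoch locations and SoE values are assumed to be already available.

    import numpy as np
    import librosa
    from scipy.signal import lfilter

    def synthesize_from_soe(signal, epochs, soe, order=14):
        """Excite an all-pole LP model of 'signal' with an SoE impulse sequence.

        epochs : sample indices of the epochs
        soe    : relative amplitudes (strength of excitation) at those epochs
        """
        a = librosa.lpc(signal, order=order)          # all-pole model of the segment
        excitation = np.zeros(len(signal))
        excitation[np.asarray(epochs)] = soe          # impulses at epoch locations
        return lfilter([1.0], a, excitation)          # synthesis filter 1/A(z)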
The usefulness of the modZFF method in deriving an aperiodic impulse sequence representation of the source characteristics of Noh voice was also examined by a speech-synthesis experiment. A synthetic AM/FM sequence was used for exciting a 14th order all-pole model [153, 112, 114] derived from the vowel [a] in modal voice.
Figure 7.19 (a) AM sequence excitation, (b) synthesized output and (c) derived SoE impulse sequence.
The synthesized output thus consists of the excitation by the AM/FM sequence and the system characteristics of the vowel. In order to highlight the source features better, the LP residual of this synthesized output was taken, and the SoE impulse sequence was then derived from it using the modZFF method. Fig. 7.18(a), (b) and (c) show the excitation FM sequence (xFM[n]), the synthesized output (ssynth[n]) and the derived SoE impulse sequence (ψssynth[n]), respectively. It is interesting to note that the locations of the impulse-like pulses in the synthesized output (Fig. 7.18(b)) correspond well to those in the excitation FM sequence (Fig. 7.18(a)). In the derived SoE impulse sequence (Fig. 7.18(c)) also, the locations of the impulses correspond fairly well to those in the excitation FM sequence (Fig. 7.18(a)). Changes in the amplitudes of the impulses are primarily due to the effect of the system characteristics.
The synthesized output using the AM sequence as excitation is shown in Fig. 7.19, with similar results. In both cases of excitation by the synthetic AM/FM sequences, the locations of the impulses derived from the synthesized output using the modZFF method correspond well to the excitation sequence, although some spurious impulses are also present. The modZFF method is thus helpful in recovering the locations of the impulses in the AM/FM sequence excitation, which carry the information of the harmonic/subharmonic structure discussed in Section 7.6.2 and shown through the saliency plots in Fig. 7.17. Retrieving this finer information, in the form of closely spaced impulses (in the FM sequence), from the synthesized output is otherwise difficult. The observations made using the synthetic AM/FM excitation may apply to expressive voices also, which do exhibit amplitude/frequency modulation of the excitation signal. These observations suggest that: (i) aperiodicity is more related to the excitation source than to the system (this was also observed in Section 7.6), and (ii) the information of aperiodicity is embedded perhaps more in the locations than in the amplitudes of the impulses.
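The sketch below shows one plausible way to construct such an FM excitation: an impulse train whose instantaneous F0 is modulated sinusoidally. The modulation parameters are illustrative and are not those used for Fig. 7.18 and Fig. 7.19; an AM sequence can be obtained analogously by modulating the impulse amplitudes instead of the intervals.

    import numpy as np

    def fm_impulse_train(fs=8000, dur=0.5, f0=100.0, depth=0.3, rate=5.0):
        """Impulse train with sinusoidally modulated instantaneous F0 (an FM sequence)."""
        x = np.zeros(int(dur * fs))
        t = 0.0
        while t < dur:
            n = int(t * fs)
            if n < len(x):
                x[n] = 1.0
            # Instantaneous F0 varies sinusoidally around f0.
            f_inst = f0 * (1.0 + depth * np.sin(2 * np.pi * rate * t))
            t += 1.0 / f_inst
        return x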
Figure 7.20 Illustration of (a) speech signal waveform, (b) SoE impulse sequence derived using the
modZFF method and (c) F0 contour extracted using the saliency information. The voice signal corresponds to the vowel region [o](with pitch rise) in Noh singing voice (Ref. Fig. 8 in [55]).
7.7.2 F0 extraction in regions of aperiodicity
In general, it is difficult to compute F0 for an aperiodic signal. It is even more challenging to derive F0 information that is guided by the pitch perception, especially in the case of expressive voices. A method for F0 extraction was proposed in [55, 81, 80] by utilizing the pitch perception information captured through saliency, which was computed using the TANDEM-STRAIGHT method. In this thesis, an alternative method is proposed that computes the saliency from the impulse sequence derived using the modZFF method, which does capture the pitch perception information. From the saliency plot, which is actually the autocorrelation (r[τ]) of the low-pass filtered magnitude spectrum of the signal, the N highest peaks are taken (in descending order of magnitude) for each frame; this autocorrelation r[τ] is computed as described earlier for the saliency measure. The inverse of the location of the highest among these peaks, i.e., the time-lag (τ|(max(r[τ]))), gives the frequency (F0) of the perceived pitch for the frame taken at that time instant. Hence, the F0 information (as F0 = 1/(τ|(max(r[τ])))), i.e., the frequency of the most salient pitch perceived in a frame at a particular time instant, is computed from the lag of the highest peak in the autocorrelation r[τ]. An illustration of the signal waveform, the SoE impulse sequence derived using the modZFF method, and the F0 contour extracted using the saliency computed for a segment of Noh voice signal, is shown in Fig. 7.20(a), (b) and (c), respectively. Similarly, F0 contours are obtained for the three segments of Noh voice signal considered in Fig. 7.17(a)-(c), (d)-(f) and (g)-(i) (i.e., the segments corresponding to Fig. 6, Fig. 7 and Fig. 8 in [55]). It is interesting to note that the saliency (pitch perception) information is thus useful in extracting the F0 information for expressive voices, which is otherwise difficult to obtain, especially in the regions of aperiodicity.
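A minimal sketch of this per-frame computation is given below. It approximates the low-pass filtering of the magnitude spectrum by a simple moving-average smoothing and assumes the frame spans a few pitch periods, so it illustrates the F0 = 1/τ step rather than reproducing the exact saliency computation used in this chapter.

    import numpy as np

    def f0_from_saliency(frame, fs, fmin=40.0, fmax=600.0, smooth_bins=5):
        """F0 of one frame from the highest peak of r[tau] in the pitch range."""
        mag = np.abs(np.fft.rfft(frame))
        # Smooth ('low-pass filter') the magnitude spectrum.
        mag = np.convolve(mag, np.ones(smooth_bins) / smooth_bins, mode="same")
        # Lag-domain function r[tau] obtained from the smoothed spectrum.
        r = np.fft.irfft(mag ** 2)
        lag_min, lag_max = int(fs / fmax), int(fs / fmin)
        lag = lag_min + np.argmax(r[lag_min:lag_max])
        return fs / lag          # F0 = 1 / tau, with tau = lag / fs seconds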
7.8 Summary
The aperiodic characteristics of the excitation component of expressive voice signals were studied in
the context of Noh voices. The aperiodic information is captured in the form of sequence of impulses
with relative amplitudes. It was shown that the perceptual features of these voices are well preserved
in the epoch/impulse sequence, and that the epoch sequence is derived directly from the speech signal,
without computing the short-time spectrum. The aperiodic epoch sequence gives pitch estimation similar to what was obtained using the fluctuating harmonic components in the short-time spectrum [55, 80].
Since the aperiodicity information is available in the time domain, it is much easier to control the excitation by modifying the epoch sequence in a desired manner. Synthesis of Noh voices using LP model
for system and epoch sequence for excitation indicates that the aperiodic component mainly contributes
to the peculiarities of these voices.
The key contributions of this study are two signal processing methods: first, a modZFF method for deriving an impulse/epoch sequence representation of the excitation component in expressive voice signals, and second, a method for computing saliency that captures the pitch-perception information related to aperiodicity in expressive voices. The embedded harmonic/subharmonic structure is also examined using spectrograms. The epoch sequence representation is considered adequate to represent the excitation source component in the speech signal. In some cases, it is also possible to derive the approximate vocal tract system characteristics from this representation. The role of amplitude/frequency modulation in aperiodicity and in the harmonic/subharmonic structure is examined by using two synthetic AM/FM pulse trains. Examining the saliency plots for these and for different segments of the Noh voice signal, it is confirmed that the epoch sequence representation captures the aperiodicity and the salient pitch perception information quite well. In the impulse sequence representation, the information of aperiodicity is related more to the time intervals among the impulses than to their amplitudes. Validation of the results is carried out using spectrograms, saliency plots and an analysis-synthesis approach. Extraction of F0 information that captures the pitch perception information for expressive voices is also demonstrated.
Only one set of samples of Noh voice is used in this study. Due to the inherent peculiarities of Noh singing voice, analysing its aperiodicity from the production characteristics is useful for analysing other expressive voices also. It is assumed that similar regions of aperiodicity and harmonic/subharmonic structure are present in other types of natural expressive voices. The effectiveness of the signal processing methods, such as modZFF, saliency computation and F0 extraction, has also been tested on other speech signals, such as laughter and modal voice. The focus of this study is the representation of the excitation source information through an epoch/impulse sequence that also adequately captures the pitch perception information. Further, it may be interesting to explore whether the impulse sequence can be generated directly from the pitch perception information extracted from the expressive voice signal. This study may be useful in characterizing and analysing the excitation source characteristics of natural expressive voices (laughter, cry, etc.) that involve aperiodicity in the speech signal.
Chapter 8
Automatic Detection of Acoustic Events in Continuous Speech
8.1 Overview
Analysis of the nonverbal speech sounds has helped in identifying a few of their unique characteristics. Exploiting these, a few distinct features are extracted and parameters are derived that discriminate these sounds well from normal speech. Towards applications of the outcome of this research work, experimental systems are developed for the detection of a few acoustic events in continuous speech. Three prototype systems are developed for automatic detection of trills, shouts and laughter, named the automatic trills detection system, the shout detection system and the laughter detection system, respectively. In this chapter, we discuss the details, performance evaluation and results of these three systems. In Section 8.2, the feasibility of developing an 'Automatic Trills Detection System' (ATDS) is discussed, along with the results of limited testing of the experimental system. In Section 8.3, a prototype 'Shout Detection System' (SDS) is developed for automatic detection of regions of shout in continuous speech. The parameters are derived from the production features studied in Section 5.6, and an algorithm is proposed for taking the shout decision using these parameters. Performance evaluation results are also discussed in comparison with those of other methods. In Section 8.4, an experimental 'Laughter Detection System' (LDS) is discussed, which can be developed further using the laughter production features and parameters derived earlier in Section 6.5. A summary of the chapter is given in Section 8.5.
8.2 Automatic Trills Detection System
The system uses the production features of apical trills, studied in Section 4.2. The excitation source features F0 and SoE are used, along with an autocorrelation-lag based feature proposed in [40]. In the second phase, synthesis based on LF-model pulses is used for confirming the trills detected in the first phase. Limited testing of the ATDS has been carried out on a database consisting of 397 trills in different contexts, recorded in the voice of an expert male phonetician. The system gives a trill detection rate of 84.13%, with an accuracy of 98.82%, on this test data. The experimental ATDS is developed mainly to examine the feasibility of developing an automatic system for spotting trills in continuous speech.
Figure 8.1 Schematic block diagram of prototype shout detection system
8.3 Shout Detection System
Automatic detection of shouted speech (or shout, in short) in continuous speech in real-life practical scenarios is a challenging task. We aim to exploit the changes in the production characteristics of shouted speech for automatic detection of shout regions in continuous speech. Changes in the characteristics of the vibration of the vocal folds, and the associated changes in the vocal tract system, for shout relative to normal speech are exploited in discriminating the two modes. Significant changes apparently take place in the excitation source characteristics, such as the vibration of the vocal folds at the glottis, during the production of shouted speech, but very few attempts have been made to use these for automatic detection of shouted speech. We have studied the excitation source characteristics of shouted and normal speech signals along with Electroglottograph (EGG) signals, in Section 5.6 and Section 5.5, respectively. The closed phase quotient (α) is observed to be larger for shout than for normal speech [123]. Also, the spectral band energy ratio (β) is higher for shout in comparison to normal speech (refer to Section 5.6.1). The excitation source features F0 and strength of excitation (SoE) are derived from the speech signal using the zero-frequency filtering (ZFF) method, and the effect of the associated changes in the vocal tract system is studied through the dominant frequency feature (FD), discussed in Section 5.6.3.
In this section, we develop a decision logic for automatic detection of regions of shout in continuous speech. An experimental shout detection system is developed to examine the efficacy of the decision logic and of the production features β, F0, SoE and FD. The feature β is computed using the short-time spectrum. The decision logic uses the degree of deviation in these features for the production of shout as compared to normal speech. Multiple evidences for the decision of shout are collected for each speech segment. The temporal nature of the changes in the features, and their mutual relations, are also exploited. Parameters capturing these changes are used in the decision of shout for each speech segment, and the decision for each segment is taken by a linear classifier that uses these eight decision criteria. A schematic block diagram of the experimental shout detection system is shown in Fig. 8.1. Performance of the shout detection system is tested on four datasets of continuous speech in three languages.
8.3.1 Production features for shout detection
Comparison of differenced EGG signals for normal and shouted speech in Section 5.5 shows that, in
the case of shouted speech, the duration of glottal cycle period reduces, and the closed phase quotient
within each glottal cycle period increases. The reduction in the period of the glottal cycle, i.e., the rise
in the instantaneous fundamental frequency (F0 ) gives perception of higher pitch. The larger closed
phase quotient in each glottal cycle period is related to increased air pressure at the glottis, and also
to higher resonance frequencies. The study of spectral characteristics in Section 5.6 also indicates that
the spectral energy in the higher frequency band (500-5000 Hz), i.e., EHF , increases and in the lower
frequency band (0-400 Hz), i.e., ELF , reduces for shout in comparison to normal speech.
The effect of coupling between the excitation source and the vocal tract system is usually examined through spectral features like MFCCs or short-time spectra. But the effect can be seen better in the dominant resonance frequency (FD) of the vocal tract system, discussed in Section 5.6.3. Use of the spectral feature FD is also computationally convenient in comparison to the HNGD spectrum discussed in Section 5.6.1. The spectral energies EHF and ELF are computed using the short-time Fourier spectrum, instead of the HNGD spectrum, for computational convenience. The excitation source features F0 and SoE are computed for each segment of the speech signal, using the ZFF method discussed in Section 3.5. The relative changes in the features F0, SoE and FD for shout in comparison to normal speech can be observed in the illustrations given in Figures 5.9 and 5.10. It may be noted that the features F0, SoE, FD and β are derived directly from the speech signal, using computationally efficient methods. This makes these features suitable for developing a decision logic for the shout detection system, discussed next.
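As an illustration, the sketch below computes β = EHF/ELF from the short-time Fourier spectrum of a speech segment, using the band edges mentioned above (0-400 Hz and 500-5000 Hz); the frame length is an assumption.

    import numpy as np
    from scipy.signal import stft

    def band_energy_ratio(segment, fs, lo=(0.0, 400.0), hi=(500.0, 5000.0)):
        """Spectral band energy ratio beta = E_HF / E_LF for one speech segment."""
        f, _, Z = stft(segment, fs=fs, nperseg=int(0.02 * fs))   # ~20 ms frames
        power = np.abs(Z) ** 2
        e_lf = power[(f >= lo[0]) & (f <= lo[1]), :].sum()
        e_hf = power[(f >= hi[0]) & (f <= hi[1]), :].sum()
        return e_hf / (e_lf + 1e-12)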
8.3.2 Parameters for shout decision logic
The degree of change in the production features F0, SoE and FD for shout indicates the extent of deviation from normal speech, while the nature of the change relates to its temporal pattern. Parameters capturing both these aspects of the changes are exploited in the algorithm used in the shout detection system for the decision of shout. Changes in the spectral energies for shout are captured by the spectral band energy ratio (β), i.e., the ratio of spectral energies in the high and low frequency bands (β = EHF/ELF), as in Section 5.6. The average values of F0, FD and β increase in the case of shouted speech, and the degree of change in F0, FD and β above respective threshold values is used for shout detection. These thresholds can be obtained either from average values computed for reference normal speech, or dynamically for each block of the input speech data. The temporal nature of changes in the contours of F0, SoE and FD can also be exploited for the detection of shout. It is interesting to note the relative fall/rise patterns in the contours of F0, SoE and FD; their pairwise mutual relations can actually help in discriminating shouted speech from normal speech. For example, in some regions SoE decreases for shout whereas F0 increases, and vice versa. Likewise, some patterns of fall/rise in the FD contour, relative to the F0 and SoE contours, can also be observed for shout. Such changes are negligible for normal speech.
In total, eight parameters are computed from the degree and nature of changes in F0, SoE, FD and the spectral energies, and these are used in the decision logic. Three parameters capture the degree of changes in these features, two capture the temporal nature (mainly gradients) of the changes, and the remaining three capture their mutual relations. Average values of these features are computed for each speech segment, and the gradients of the F0, SoE and FD contours are computed from the changes in their average values for successive segments. This smoothing helps in reducing the transient effects of stray fluctuations in these features, and is also computationally convenient. Using these eight parameters, a decision is then taken for each speech segment: does this segment belong to a shout region or not?
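A minimal sketch of this segment averaging and gradient computation is given below; the segment length (in frames) is left as a parameter, since the actual value used is an implementation choice.

    import numpy as np

    def segment_averages(contour, seg_len):
        """Average a frame-level feature contour (F0, SoE or FD) over
        non-overlapping segments of seg_len frames."""
        contour = np.asarray(contour)
        n_seg = len(contour) // seg_len
        return contour[:n_seg * seg_len].reshape(n_seg, seg_len).mean(axis=1)

    def segment_gradients(avg):
        """Gradient between successive segment averages; only its sign
        (rise/fall) is used when comparing different features."""
        return np.diff(avg)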
8.3.3 Decision logic for automatic shout detection system
In this section, we develop an experimental system for shout detection using the features β, F0, SoE and FD, whose average values are computed for each speech segment. The decision logic for shout detection uses eight parameters, capturing the degree of deviation in these features and the temporal nature of changes in their contours. Multiple decision criteria (d1, d2, d3, d4, d5, d6, d7, d8) are used to collect evidences of shout from these eight parameters for each speech segment. For a speech segment, a higher number of such evidences gives higher confidence in deciding that segment as shout. The key steps are as follows:
Step 1: The decision criteria d1, d2 and d3 are derived from the parameters using thresholds on the average values of the features F0, FD and β, respectively. The decision criteria d4 to d8 are derived from the parameters capturing the temporal nature of changes in the features F0, SoE and FD. The criteria d4 and d5 use thresholds on the gradients of changes in the F0 and FD contours, respectively. The criteria d6, d7 and d8 use parameters capturing the pairwise mutual relations of the temporal changes in the F0, SoE and FD contours. Since the scales and units of these features are different, only the directions (signs) of the changes in their gradients (rise/fall patterns) are used.
The schematic diagram in Fig. 8.2 shows different scenarios for considering the pairwise mutual relations of temporal changes in the gradients of the F0, SoE and FD contours. The six segments marked as shout candidates in Fig. 8.2(d) illustrate three possible evidences for shout: (i) Case 1 (d6): the changes in the gradients of the F0 and SoE contours are in opposite directions, e.g., segments 1 and 2 in Fig. 8.2(d); (ii) Case 2 (d7): the changes in the gradients of the SoE and FD contours are in opposite directions, e.g., segments 3 and 4; and (iii) Case 3 (d8): the changes in the gradients of both the F0 and FD contours are in the same direction, e.g., segments 5 and 6. Cases 1, 2 and 3 correspond to the decision criteria d6, d7 and d8, respectively. The eight decision criteria {di}, based upon the parameters derived from the production features, are summarized below, where θ denotes a threshold, g a gradient and G a high gradient. The gradients (g and G) are computed between the average values of the features (∆F0, ∆SoE and ∆FD) for successive segments.
Figure 8.2 Schematic diagram for the decision criteria (d6, d7, d8) using the direction of change in the gradients of the (a) F0, (b) SoE and (c) FD contours, for the decision of (d) shout candidate segments 1 & 2 (d6), 3 & 4 (d7), and 5 & 6 (d8).
Ave(F0) > θF0                      ⇒  d1            (8.1)
Ave(FD) > θFD                      ⇒  d2            (8.2)
Ave(β) > θβ                        ⇒  d3            (8.3)
g∆F0 > θg∆F0                       ⇒  d4, G∆F0      (8.4)
g∆FD > θg∆FD                       ⇒  d5, G∆FD      (8.5)
sign(G∆F0) = −sign(g∆SoE)          ⇒  d6            (8.6)
sign(g∆SoE) = −sign(G∆FD)          ⇒  d7            (8.7)
sign(G∆F0) = sign(G∆FD)            ⇒  d8            (8.8)
Step 2: The shout decision for each speech segment uses these eight decision criteria {di} for classifying the segment as shout or normal speech. A confidence score for each speech segment is computed from these decision criteria {di}, and a total confidence score above a desired level decides that segment as a shout segment.
Step 3: The final decision for a shout region is taken for each utterance or block of speech data. Only contiguous segments of shout candidates are considered in the final decision. This also minimizes spurious cases of wrong detection, since sporadic shout-candidate segments are likely to be false alarms.
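A minimal sketch of Steps 1-3 is given below. It combines the eight criteria into a per-segment evidence count and keeps only contiguous runs of shout candidates. The thresholds, the required score and the minimum run length are placeholders (the SDS computes the thresholds dynamically), and the distinction between gradients (g) and high gradients (G) is not modelled here.

    import numpy as np

    def shout_segments(F0, SoE, FD, beta, th, min_score=4, min_run=3):
        """Per-segment shout decision from criteria d1-d8, cf. (8.1)-(8.8).

        F0, SoE, FD, beta : segment-averaged feature contours (equal length)
        th : thresholds, e.g. {'F0':..., 'FD':..., 'beta':..., 'gF0':..., 'gFD':...}
        """
        F0, SoE, FD, beta = map(np.asarray, (F0, SoE, FD, beta))
        gF0, gSoE, gFD = np.diff(F0), np.diff(SoE), np.diff(FD)
        n = len(gF0)                                   # segments with a defined gradient
        d = np.zeros((8, n), dtype=bool)
        d[0] = F0[1:] > th['F0']                       # d1
        d[1] = FD[1:] > th['FD']                       # d2
        d[2] = beta[1:] > th['beta']                   # d3
        d[3] = gF0 > th['gF0']                         # d4
        d[4] = gFD > th['gFD']                         # d5
        d[5] = np.sign(gF0) == -np.sign(gSoE)          # d6
        d[6] = np.sign(gSoE) == -np.sign(gFD)          # d7
        d[7] = np.sign(gF0) == np.sign(gFD)            # d8
        candidate = d.sum(axis=0) >= min_score         # Step 2: evidence count
        # Step 3: keep only contiguous runs of shout candidates.
        keep = np.zeros(n, dtype=bool)
        start = None
        for i, c in enumerate(np.append(candidate, False)):
            if c and start is None:
                start = i
            elif not c and start is not None:
                if i - start >= min_run:
                    keep[start:i] = True
                start = None
        return keep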
8.3.4 Performance evaluation
There is a large variability across speakers and languages in speech consisting of shout. Hence, it is challenging to use the distinguishing features of shouted speech (as identified in Section 5.6) for its automatic detection. Since there is large variability in the values of the features, parameters and thresholds across different scenarios, and since the aim is to develop a speaker/language independent SDS, a heuristics-based approach is adopted.
Table 8.1 Results of performance evaluation of shout detection: number of speech regions (a) as per ground truth (GT), (b) detected correctly (TD), (c) with missed shout detection (MD) and (d) wrongly detected as shout (WD); and rates of (e) true detection (TDR), (f) missed detection (MDR) and (g) false alarm (FAR). Note: CS is concatenated speech, NCS is natural continuous speech and MixS is mixed speech.
Test set     Data type   (a) GT   (b) TD   (c) MD   (d) WD   (e) TDR (%)   (f) MDR (%)   (g) FAR (%)
Test set 1   CS              44       40        4        0          90.9           9.1           0
Test set 2   NCS             92       85        6        1          92.4           6.5           1.9
Test set 3   MixS           184      133       14       37          72.3           7.6          20.1
Test set 4   MixS2          591      471       45       75          79.7           7.6          12.7
Dynamic changes in the distinguishing features are captured, and empirical values of the thresholds are computed dynamically from the derived parameters, thereby factoring in the variability across scenarios. Performance evaluation of the shout decision logic has two limitations: the absence of any labelled shout database, and the nonavailability of ground truth in the natural data sourced from different media resources. Hence, the experimental SDS was tested on four sets of test data.
Test set 1: Concatenated speech (CS) data. It consists of 6 concatenated pairs of utterances of same
text in normal and shout modes, by 6 speakers. The data was drawn from 51 such pairs (by 17 speakers,
for 3 different texts) recorded in the Speech and Vision Lab, IIIT, Hyderabad in English (see Section 5.4).
Test set 2: Natural continuous speech (NCS) data. It has 6 audio files having shout content. Data
was drawn from IIIT-H AVE database of 1172 audio-visual clips sourced from movies/TV chat shows.
Test set 3: Mixed speech (MixS) data. It consists of 184 utterances (47 (27 neutral, 20 anger)+72 (41
neutral, 31 anger)+65 (30 neutral, 35 shout)= 98 neutral, 86 shout), by 24 speakers in 3 languages. Data
was drawn from 3 databases: (i) Berlin EMO-DB emotion database [19] in German, having 535 utterances for 7 emotions, (ii) IIIT-H emotion database in Telugu, having 171 utterances for 4 emotions [89],
and (iii) IIIT-H AVE database in English, having 1172 utterances for 19 affective states.
Test set 4: Mixed speech (MixS2) data. It is the same as test set 3, but the shout decision is taken for every 1 sec block instead of at the utterance level. The data in test sets 3 and 4 amounts to 645 sec (591 sec voiced).
The assumption made here is that anger speech is usually associated with the presence of shout regions; hence, the test data includes anger regions. Ground truth was established by listening to the speech data.
Testing results of the SDS on the 4 test sets are given in Table 8.1. The total numbers of ground-truth speech utterances (or 1 sec blocks), speech regions detected correctly as shout/normal speech, missed shout detections, and normal speech regions detected wrongly as shout are given in columns (a), (b), (c) and (d), respectively. Three performance measures are used: (i) true detection rate (TDR) (= b/a), (ii) missed detection rate (MDR) (= c/a) and (iii) false alarm rate (FAR) (= d/a). The TDR, MDR and FAR for each test set are given (in %) in columns (e), (f) and (g), respectively. The testing results obtained with a granularity of 1 sec shout-decision blocks for test set 4 are better than those obtained at the utterance level for test set 3. This is because shout is an unsustainable state, and the utterances in test set 3 are up to 15 sec long, during which shout and normal speech often get interspersed, reducing the decision accuracy.
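For reference, the three measures can be computed directly from the counts in columns (a)-(d), as in the small sketch below (a restatement of the definitions above, not part of the evaluation code).

    def detection_rates(gt, td, md, wd):
        """TDR, MDR and FAR in percent: TD/GT, MD/GT and WD/GT."""
        return 100.0 * td / gt, 100.0 * md / gt, 100.0 * wd / gt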
The shout/normal-speech detection performance of 72.3-92.4%, with a false alarm rate of 1.9-20.1%, obtained using the proposed SDS is better than the 64.6-92% detection and 22.6-35.4% false alarm rates reported in [132]. This performance is also better than the test results of the Gaussian mixture model (GMM) based classifier used in [219], which reported a shout detection performance (TDR) of 67.5%. The results are also comparable with the test results of the multiple-model framework approach using a hidden Markov model with a support vector machine or GMM classifier, proposed in [217], which reported a success rate of 63.8-83.3%, with a 5.6-11% miss rate (MDR) and an 11.1-25.3% error rate (FAR). In fact, the MDR of 6.5-9.1% and the FAR of 1.9-20.1% achieved using the proposed algorithm are comparatively better.
8.4 Laughter Detection System
Excitation source features F0 and SoE are extracted for each voiced segment, using the modified zero-frequency filtering method. Parameters are then derived from the pitch period (T0), the strength of excitation (SoE), the ratio of strength of excitation to pitch period (R), the slope of the pitch period contour (Slopepp), and the slope of the strength of excitation contour (SlopeSoE) [185].
The proposed method for laughter spotting consists of the following steps: (a) the signal is first segmented into voiced and nonvoiced regions; (b) the five features are then extracted for every epoch in each voiced region; (c) if, for at least 60% of the features, a voiced segment has more epochs than determined by the 'fraction threshold', then that segment is considered a laughter segment. Regions of duration less than 30 ms are not considered for detection, so as to minimize spurious detections.
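A minimal sketch of the voting in step (c) is given below; the per-feature epoch criteria and the fraction threshold are placeholders, since only the 60% feature-majority rule is stated explicitly above.

    import numpy as np

    def is_laughter_segment(epoch_features, criteria, fraction_threshold=0.5,
                            feature_majority=0.6):
        """Step (c): a voiced segment is marked as laughter if, for at least
        60% of the five features, the fraction of epochs satisfying that
        feature's criterion exceeds the fraction threshold.

        epoch_features : dict, feature name -> per-epoch values (T0, SoE, R, ...)
        criteria       : dict, feature name -> boolean function on those values
        """
        votes = sum(
            np.mean(criteria[name](np.asarray(values))) >= fraction_threshold
            for name, values in epoch_features.items()
        )
        return votes >= feature_majority * len(epoch_features)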
Performance of the LDS is evaluated on a limited dataset taken from the IIIT-H audio-visual emotion (AVE) database collected in the Speech and Vision Lab, IIIT Hyderabad. Some audio files drawn from movies and TV soap-operas are used. Over a dataset of 180 laugh calls, a detection rate of 75% is achieved by the prototype, with a false detection rate of 19.44%. The performance is comparable with the state of the art, and can be improved by using the other features and parameters discussed in Section 6.5.
8.5 Summary
In this chapter, the feasibility of automatic detection of trills, shouted speech and laughter in continuous speech is examined, and three prototype systems are developed. Using the source features of apical trills studied in Section 4.2, a trill detection system is developed. Limited testing of the ATDS, carried out on a trill database, gave encouraging results, indicating the feasibility of developing a complete automatic system for spotting trills in continuous speech using the production characteristics of apical trills. Testing of the ATDS has limitations, since only a few languages are rich in the usage of apical trills, and the experimental system is tested only for apical trills and not for other types of trills.
Automatic shout detection in real-life practical scenarios is challenging, yet important for a range of applications. Changes in the characteristics of the vibration of the vocal folds, and the associated changes in the vocal tract system, for shout relative to normal speech are exploited in discriminating the two modes. Parameters capturing the changes in the production features β, F0, SoE and FD are used in developing the SDS. The decision criteria use parameters capturing the extent of deviation and the temporal nature of changes in these features, and the decision for shout is taken for each segment. Performance of the SDS is evaluated on four test sets drawn from three different databases. The results are comparable to, and in some cases better than, other reported results. Further, an online SDS can be developed for live real-life data.
The feasibility of automatic detection of laughter (nonspeech-laugh or laughed-speech) in continuous speech is also explored in this chapter, by developing an experimental LDS. The initial performance, evaluated on a limited dataset, is encouraging, and can be improved by using the other features and parameters discussed in Section 6.5. Using these features and parameters, a complete online system can be developed further for automatic detection of laughter in continuous speech in different scenarios.
Chapter 9
Summary and Conclusions
9.1 Summary of the work
Nonverbal speech sounds are analysed in this research work by examining the differences in their production characteristics from those of normal speech. Four categories of sounds are considered, namely normal speech, emotional speech, paralinguistic sounds and expressive voices. These sound categories differ in the increasing order of the rapidity of pitch changes and the degree of aperiodicity. Voluntary control of the excitation source and the vocal tract system during the production of these sounds, or involuntary changes in their production characteristics, are other differences. The effects of source-system coupling and acoustic loading on the glottal vibration are studied first, for variations in normal speech sounds such as trills, fricatives and nasals. The source-system coupling effect, also present in emotional speech sounds, is studied next; shouted speech is examined in this category of sounds. Laughter sounds and Noh voice are examined in the paralinguistic sounds and expressive voices categories, respectively.
The production characteristics, those of the glottal excitation source in particular, are examined from both acoustic and EGG signals in each case. The excitation source features such as F0 and SoE are derived mainly using the zero-frequency filtering method and the modifications to it proposed in this thesis for nonverbal speech sounds. Changes in the source characteristics are also examined in terms of glottal pulse shape characteristics, such as the open/closed phase durations and the closed phase quotient, derived from the EGG signal. Associated changes in the resonance characteristics of the vocal tract system are examined using the dominant frequencies FD1 and FD2, derived using LP analysis and the group delay function. Changes in the spectral characteristics are examined using the Hilbert envelope of the numerator of the group delay (HNGD) spectrum, derived using the zero-time liftering (ZTL) method. Other production characteristics, derived using the Hilbert envelope of the LP residual of the acoustic signal, are also used in the analysis in a few cases.
Apart from using some standard signal processing techniques such as short-time spectrum and recently proposed methods such as ZFF and ZTL, a few new methods are proposed in this thesis. A
modified ZFF method, a method to compute first two dominant resonance frequencies, an alternative
method for computing saliency of pitch-perception, and a method for extracting F0 in the regions of
aperiodicity are proposed. Using these, the features are extracted and parameters derived that reflect
changes in the production characteristics of nonverbal speech sounds from those for normal speech.
Efficacy of these features and parameters in discriminating the nonverbal sounds from normal speech is
evaluated through three prototype systems developed, for automatic detection of acoustic events such as
trills, shouts and laughter in continuous speech.
The representation of the excitation source in terms of a time domain impulse sequence, which has
been sought in speech coding methods, is a challenge for nonverbal speech sounds. But using this
representation is immensely useful for deriving the production features and later manipulating these for
speech synthesis. One such method, the zero-frequency filtering (ZFF) method, is suitable mainly for
modal voicing. Using this, we have examined the effects of acoustic loading of the vocal tract system
on the vibration characteristics of the vocal folds, and the system-source interaction, in the production of a selected set of sounds from six categories. Modifications to the ZFF method are used for extracting this impulse sequence representation of the source characteristics for the nonverbal sounds examined.
To study the effect of coupling between the system and the source characteristics in the case of some
emotional speech sounds, it is necessary to extract the spectral characteristics of speech production
mechanism with high temporal resolution, which is still a challenging task. Signal processing methods
like HNGD (ZTL) that can represent the fine temporal variations of the spectral features, are explored
in this study. Production characteristics of speech in four loudness levels, i.e., soft, normal, loud and
shout are examined. It is shown that these temporal variations indeed capture the features of glottal
excitation that can discriminate shout vs normal speech. The effect of coupling between the excitation
source and the vocal tract system during production of shouted speech is examined in different vowel
contexts, using dominant frequency computation, along with source features such as F0 and SoE.
The production characteristics of paralinguistic sounds like laughter are studied from changes in the
vibration characteristics of the glottal excitation source, using the modified ZFF method. Three cases
namely normal speech, laughed-speech and nonspeech-laugh are considered. Associated changes in
the vocal tract system characteristics are examined using first two dominant frequencies FD1 and FD2 .
Other production characteristics of laughter are also examined using features derived from the Hilbert
envelope of the LP residual of the speech signal. Parameters are derived that represent the changes in these features and help in distinguishing the three cases.
The proposed modZFF method is used for deriving an impulse sequence representation of the excitation component in expressive voice signal. A newly proposed method is used to compute saliency,
that captures the pitch-perception information related to aperiodicity in expressive voices. The role
of amplitude/frequency modulation in aperiodicity and in harmonic/subharmonic structure in the expressive voices such as Noh voice, is examined by using two synthetic AM/FM pulse trains. Examining the saliency plots for these AM/FM sequences and for different segments of Noh voice signal, it
is confirmed that the impulse/epoch sequence representation does capture the aperiodicity and salient
pitch-perception information quite well. This sequence is considered adequate to represent the excitation source component in the speech signal, and is helpful in some cases to derive the approximate
vocal tract system characteristics as well. In the impulse sequence, the information of aperiodicity is
more related to the time intervals among impulses, than to their relative amplitudes. The embedded
harmonic/subharmonic structure is examined using the spectrograms. Validation of the results is carried
out also using saliency plots and an analysis-synthesis approach. Extraction of F0 that captures the
pitch-perception information in expressive voices, is also demonstrated.
The analyses of the production characteristics of nonverbal speech sounds has helped in identifying
their few unique characteristics. Using these, three experimental systems are developed for automatic
detection of trills, shouts and laughter in continuous speech. The automatic shout detection system
(SDS) is developed to an extent which is much closer to the complete online system. Performance evaluation of these systems, using specifically collected databases labelled with ground truth, gave encouraging results. The results indicate the feasibility of developing these prototype systems into complete
systems for automatic detection of such acoustic events in continuous speech, in real-life scenarios.
These experimental systems confirm that analyses of production features are indeed insightful in the
case of nonverbal speech sounds. Also, the excitation impulse sequence representation of the source
characteristics which is guided by the pitch perception, is further helpful not only in analyses of these
sounds, but also in diverse purposes that possibly includes synthesis of natural-sounding speech.
In this study, only a few representative sounds in each of the four categories are examined. The
study is expected to be helpful in providing further insight into production of these sounds and also in
developing systems for real-life applications, that need to be tested on large databases.
9.2 Major contributions of this work
The key contributions of this research work can be listed as follows:
(i) The role of system-source coupling, and the effect of acoustic loading of vocal tract system on the
glottal vibration are studied for a few dynamic voiced sounds such as trills, laterals, fricatives and
nasals. These sounds are examined in vowel context [a] on both sides, in modal voicing.
(ii) Four categories of sounds, namely, normal speech, emotional speech, paralinguistic sounds and expressive voices are analysed, both from speech production and perception points of view. Shouted
speech, laughter and Noh voice in particular, are studied, by examining changes in their source
and system characteristics. Features are extracted to distinguish these sounds from normal speech.
(iii) An impulse-sequence representation of the excitation information in the acoustic signal is proposed
for each category of sounds. For the Noh voice and laughter sounds, this impulse-sequence
representation is guided by pitch perception.
(iv) A few new signal processing methods are proposed, such as the modified zero-frequency filtering
method, a method to compute saliency (i.e., a measure of pitch perception), a dominant frequency
computation method and a method for F0 extraction in regions of aperiodicity (a sketch of the
basic zero-frequency filtering idea is given after this list).
(v) Three prototype systems are developed to demonstrate the efficacy of the extracted features and
derived parameters in distinguishing these nonverbal sounds from normal speech. Performance
evaluation results of these prototype systems indicate the feasibility of further developing complete
systems for the automatic detection of shouts, trills and laughter in continuous speech.
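As a reference point for item (iv) above, a simplified sketch (in Python) of the standard zero-frequency filtering operation for epoch extraction is given below. The proposed modZFF differs mainly in how the trend-removal window is chosen when F0 changes rapidly, which is not shown here; the window length and the number of trend-removal passes are assumed values.

    # Simplified sketch of zero-frequency filtering (ZFF) for epoch extraction.
    import numpy as np

    def zff_epochs(x, fs, win_ms=10.0):
        x = np.asarray(x, dtype=float)
        # Difference the signal to remove any slowly varying (DC-like) component.
        dx = np.diff(x, prepend=x[0])
        # Pass twice through a zero-frequency resonator, realised here as two
        # cascaded running sums (integrators).
        y = np.cumsum(np.cumsum(dx))
        # Remove the growing trend by repeatedly subtracting a local mean computed
        # over a window of about one average pitch period (assumed here: 10 ms).
        wl = int(win_ms * 1e-3 * fs) | 1          # force an odd window length
        kernel = np.ones(wl) / wl
        for _ in range(3):
            y = y - np.convolve(y, kernel, mode='same')
        # Epochs (approximate glottal closure instants) are taken at the
        # negative-to-positive zero crossings of the filtered signal.
        epochs = np.where((y[:-1] < 0) & (y[1:] >= 0))[0] + 1
        return y, epochs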
9.3 Research issues raised and directions for future work
The impulse-sequence representation of the excitation source characteristics has been explored earlier in speech
coding methods, mainly for normal speech. Whether this representation of the excitation
source information can also be obtained from the acoustic signal for nonverbal speech is explored in this
work. The key challenge for these sounds is the rapid change in F0 and pitch perception. Further,
the presence of regions of aperiodicity and subharmonics poses another set of challenges in extracting
F0 in those regions. Hence, it is important to explore whether the impulse-sequence representation of the
excitation source characteristics can be guided by pitch-perception information for nonverbal sounds.
Signal processing methods like ZFF, which work well for modal voicing in normal speech, exhibit limitations in the case of nonverbal speech sounds. The proposed modified zero-frequency filtering
(modZFF) method helps in deriving the impulse-sequence representation of the excitation for laughter sounds and the Noh voice, which have rapid changes in F0 and pitch. A method is proposed to compute
saliency, i.e., a measure of pitch-perception information, in regions of subharmonics and aperiodicity. Using saliency and the impulse-sequence representation of the excitation, F0 is extracted for
these nonverbal sounds, which is otherwise a difficult task. Whether the impulse-sequence representation
of the excitation can be obtained from the pitch-perception information alone remains an interesting
problem and a research challenge.
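The following sketch (in Python) illustrates one simple proxy for this idea: over short windows of the excitation impulse sequence, the normalized autocorrelation peak in the pitch range is taken as a salience-like measure, and its lag as an F0 estimate. This is only an illustrative stand-in for the saliency measure proposed in this work, and all parameters are assumed values.

    # Illustrative salience/F0 proxy computed from an excitation impulse sequence.
    import numpy as np

    def saliency_and_f0(impulses, fs, win=0.04, hop=0.01, fmin=60.0, fmax=800.0):
        impulses = np.asarray(impulses, dtype=float)
        wl, hl = int(win * fs), int(hop * fs)
        lo, hi = int(fs / fmax), int(fs / fmin)
        times, sal, f0 = [], [], []
        for s in range(0, len(impulses) - wl, hl):
            seg = impulses[s:s + wl]
            acf = np.correlate(seg, seg, mode='full')[wl - 1:]
            times.append(s / fs)
            if acf[0] <= 0:                       # no impulses in this window
                sal.append(0.0); f0.append(0.0)
                continue
            lag = lo + np.argmax(acf[lo:min(hi, wl - 1)])
            sal.append(acf[lag] / acf[0])         # salience: strength of periodicity
            f0.append(fs / lag)                   # pitch estimate from the best lag
        return np.array(times), np.array(sal), np.array(f0)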
Analysis of nonverbal speech sounds from the production and perception points of view has been
a challenging research issue. Some studies have investigated the vocal tract system characteristics, but
the excitation source characteristics have not been explored much. In this study, these sounds are analysed by
deriving these characteristics from both the EGG and the acoustic signals. However, whether the glottal pulse
shape characteristics, which can be derived easily from the EGG signal, can also be derived reliably from the acoustic
signal is still a challenge, in spite of several approaches such as inverse filtering, DYPSA and ZFF.
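For reference, the simplest acoustic-signal route to an approximate excitation estimate is linear prediction (LP) inverse filtering, sketched below in Python. The LP residual only roughly approximates the glottal excitation and does not yield reliable glottal pulse-shape details, which is precisely the open issue noted above; the LP order and windowing are assumed values.

    # LP inverse filtering of one short frame (autocorrelation / Yule-Walker method).
    import numpy as np

    def lp_residual(frame, order=12):
        frame = np.asarray(frame, dtype=float) * np.hamming(len(frame))
        r = np.correlate(frame, frame, mode='full')[len(frame) - 1:len(frame) + order]
        # Normal equations R a = r, with R a Toeplitz matrix of autocorrelations.
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        a = np.linalg.solve(R + 1e-8 * np.eye(order), r[1:order + 1])
        inv_filter = np.concatenate(([1.0], -a))  # A(z) = 1 - sum_k a_k z^{-k}
        # Filtering the frame with A(z) gives the residual, a crude estimate of the
        # excitation (related to the derivative of the glottal flow).
        return np.convolve(frame, inv_filter)[:len(frame)]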
Distinguishing features are extracted and parameters are derived for these sounds, which may help in the
automatic detection of acoustic events like trills, shouts and laughter in continuous speech. Systems for
their automatic detection in natural speech may be developed further for diverse applications in real-life
scenarios. However, developing these systems for natural environments would be different from investigating
them in a laboratory environment, and may well unearth a new set of challenges.
Deriving the impulse-sequence representation of the excitation information from the saliency of
pitch perception may be attempted in future, and would be especially interesting for
nonverbal sounds. Also, deriving the glottal pulse-shape characteristics reliably from the acoustic signal
(rather than the EGG signal) would provide further insight into the changes in the production characteristics of these sounds. Using the production features, systems can be developed for the detection
of more acoustic events, such as laughter or expressive voices, in natural conversational speech.
Bibliography
[1] T. Abe, T. Kobayashi, and S. Imai. The IF spectrogram: A new spectral representation. In Proc. International Symposium on Simulation, Visualization and Auralization for Acoustics Research and Education,
ASVA’97, pages 423–430, April 1997.
[2] P. Alku. Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering. Speech Communication, 11(2-3):109–118, June 1992.
[3] P. Alku, T. Bäckström, and E. Vilkman. Normalized amplitude quotient for parametrization of the glottal
flow. The Journal of the Acoustical Society of America, 112(2):701–710, Feb. 2002.
[4] P. Alku and E. Vilkman. Amplitude domain quotient for characterization of the glottal volume velocity
waveform estimated by inverse filtering. Speech Communication, 18(2):131–138, 1996.
[5] B. Atal and M. R. Schroeder. Predictive coding of speech signals and subjective error criteria. IEEE
Transactions on Acoustics, Speech and Signal Processing, 27(3):247–254, 1979.
[6] B. S. Atal and B. E. Caspers. Periodic repetition of multi-pulse excitation. The Journal of the Acoustical
Society of America, 74(S1):S51–S51, 1983.
[7] B. S. Atal and S. L. Hanauer. Speech Analysis and Synthesis by Linear Prediction of the Speech Wave.
Journal of the Acoustical Society of America, 50(2B):637–655, 1971.
[8] B. S. Atal and J. R. Remde. A new model of LPC excitation for producing natural-sounding speech at low
bit rates. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, volume 1, pages 614–617, May
1982.
[9] A. V. Oppenheim and R. W. Schafer. Digital Signal Processing, chapter 3, pages 87–121. PHI Learning Private
Limited, New Delhi, India, 2nd edition, 1975.
[10] J. A. Bachorowski and M. J. Owren. Not all laughs are alike: voiced but not unvoiced laughter readily
elicits positive affect. Psychological Science, 12(3):252–257, May 2001.
[11] J. A. Bachorowski, M. J. Smoski, and M. J. Owren. The acoustic features of human laughter. The Journal
of the Acoustical Society of America, 110(3):1581–1597, 2001.
[12] A. Barney, C. H. Shadle, and P. O. A. L. Davies. Fluid flow in a dynamic mechanical model of the
vocal folds and tract. I. Measurements and theory. The Journal of the Acoustical Society of America,
105(1):444–455, 1999.
[13] A. Barney, A. D. Stefano, and N. Henrich. The effect of glottal opening on the acoustic response of the
vocal tract. Acta Acustica united with Acustica, 93(6):1046–1056, 2007.
[14] A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, V. Aharonson, L. Kessous, and N. Amir. Whodunnit – Searching for the Most Important Feature Types Signalling
Emotion-Related User States in Speech. Computer Speech and Language, Special Issue on Affective
Speech in real-life interactions, 25(1):4–28, 2011.
[15] M. Berouti, H. Garten, P. Kabal, and P. Mermelstein. Efficient computation and encoding of the multipulse
excitation for LPC. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing,
ICASSP ’84, volume 9, pages 384–387, 1984.
[16] C. A. Bickley and S. Hunnicutt. Acoustic analysis of laughter. In Proc. Second International Conference
on Spoken Language Processing, 1992 (ICSLP’92), pages 927–930. ISCA, Oct 13-16 1992.
[17] D. Bitouk, R. Verma, and A. Nenkova. Class-level spectral features for emotion recognition. Speech
Communication, 52(7-8):613–625, 2010.
[18] B. Boashash. Estimating and interpreting the instantaneous frequency of a signal. I. Fundamentals. Proceedings of the IEEE, 80(4):520–538, April 1992.
[19] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss. A database of German emotional
speech. In Proceedings of Interspeech, Lisbon, pages 1517–1520, 2005.
[20] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. Chang, S. Lee, and S. S. Narayanan.
Iemocap: Interactive emotional dyadic motion capture database. Journal of Language Resources and
Evaluation, 42(4):335–359, Dec. 2008.
[21] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and
S. Narayanan. Analysis of emotion recognition using facial expressions, speech and multimodal information. In Sixth International Conference on Multimodal Interfaces (ICMI 2004), pages 205–211. ACM
Press, 2004.
[22] C. Busso, S. Lee, and S. Narayanan. Analysis of emotionally salient aspects of fundamental frequency for
emotion detection. IEEE Transactions on Audio, Speech and Language Processing, 17(4):582–596, 2009.
[23] R. Cai, L. Lu, H.-J. Zhang, and L.-H. Cai. Highlight sound effects detection in audio stream. In Proc.
IEEE International Conference on Multimedia and Expo, 2003 (ICME’03), volume 3, pages 37–40, July
2003.
[24] N. Campbell, H. Kashioka, and R. Ohara. No laughing matter. In Proc. 9th European Conference on
Speech Communication and Technology, 2005 (INTERSPEECH’05), pages 465–468, Sep. 4-8 2005.
[25] J. R. Carson and T. C. Fry. Variable Frequency Electric Circuit Theory with Application to the Theory of
Frequency Modulation. Bell System Technical Journal, 16:513–540, Oct. 1937.
[26] B. Caspers and B. Atal. Role of multi-pulse excitation in synthesis of natural-sounding voiced speech.
In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP ’87, volume 12, pages 2388–2391, 1987.
[27] B. E. Caspers and B. S. Atal. Changing pitch and duration in LPC synthesized speech using multipulse
excitation. The Journal of the Acoustical Society of America, 73(S1):S5–S5, 1983.
[28] J. C. Catford. Fundamental problems in phonetics, pages 1–278. Indiana University Press, Bloomington,
USA, 1977.
[29] J. C. Catford. A Practical Introduction to Phonetics, chapter four, pages 59–69. Oxford University Press
Inc., New York, USA, second edition, 2001.
[30] R. W. Chan and I. R. Titze. Dependence of phonation threshold pressure on vocal tract acoustics and vocal
fold tissue mechanics. The Journal of the Acoustical Society of America, 119(4):2351–2362, 2006.
[31] T. Chen and R. R. Rao. Audio-visual integration in multimodal communication. In Proc. IEEE, pages
837–852, 1998.
[32] X. Chi and M. Sonderegger. Subglottal coupling and its influence on vowel formants. The Journal of the
Acoustical Society of America, 122(3):1735–1745, 2007.
[33] Z. Ciota. Emotion recognition on the basis of human speech. In IEEE ICECom2005, pages 1–4. IEEE,
2005.
[34] L. Colantoni. Increasing periodicity to reduce similarity: An acoustic account of deassibilation in rhotics.
In M. Diaz-Campos, editor, Selected Proceedings of the 2nd Conference on Laboratory Approaches to
Spanish Phonetics and Phonology, pages 22–34. Cascadilla Proceedings Project, Somerville, MA, 2006.
[35] R. Cowie and R. R. Cornelius. Describing the emotional states that are expressed in speech. Speech
Communication, 40(1-2):5–32, Apr. 2003.
[36] D. Crystal. Prosodic Systems and Intonation in English, chapter 2, pages 62–79. Cambridge Studies in
Linguistics. Cambridge University Press, Cambridge, UK, 1976.
[37] L. Deng and D. O’Shaughnessy. Speech Processing: A Dynamic and Optimization-oriented Approach,
chapter seven, pages 213–226. Signal Processing and Communications Series. Marcel Dekker Incorporated, New York, USA, first edition, 2003.
[38] N. Dhananjaya. Signal processing for excitation-based analysis of acoustic events in speech. PhD thesis,
Dept. of Computer Science and Engineering, IIT Madras, Chennai, Oct. 2011. (last viewed Sep. 23, 2013).
[39] N. Dhananjaya and B. Yegnanarayana. Voiced/nonvoiced detection based on robustness of voiced epochs.
IEEE Signal Processing Letters, 17(3):273–276, Mar. 2010.
[40] N. Dhananjaya, B. Yegnanarayana, and P. Bhaskararao. Acoustic analysis of trill sounds. The Journal of
the Acoustical Society of America, 131(4):3141–3152, 2012.
[41] M. Diaz-Campos. Variable production of the trill in spontaneous speech: Sociolinguistic implications. In
L. Colantoni and J. Steele, editors, Selected Proceedings of the 3rd Conference on Laboratory Approaches
to Spanish Phonology, pages 115–127. Cascadilla Proceedings Project, Somerville, MA, 2008.
[42] W. G. Ewan. Can the Intrinsic F0 Differences between Vowels Be Explained by Source/Tract Coupling?
Status Report on Speech Research, Haskins Laboratories, SR-51/52:197–199, 1977.
[43] W. G. Ewan and J. J. Ohala. Can intrinsic vowel F0 be explained by source/tract coupling? The Journal
of the Acoustical Society of America, 66(2):358–362, 1979.
[44] F. Eyben, M. Wöllmer, A. Graves, B. Schuller, E. Douglas-Cowie, and R. Cowie. On-line Emotion Recognition in a 3-D Activation-Valence-Time Continuum using Acoustic and Linguistic Cues. Journal on
Multimodal User Interfaces, Special Issue on Real-Time Affect Analysis and Interpretation: Closing the
Affective Loop in Virtual Agents and Robots, 3(1–2):7–12, March 2010.
[45] C. G. Fant. Descriptive analysis of the acoustic aspects of speech. LOGOS, 5(1):3–17, 1962.
[46] G. Fant. Acoustic Theory of Speech Production, chapter 1.1, pages 15–24. Second printing. Mouton & Co.
N.V. Publishers, The Hague, Netherlands, first edition, 1970.
[47] G. Fant. Glottal source and excitation analysis. Speech Transmission Laboratory, KTH, Sweden, Quarterly
Progress and Status Report, 20(1):85–107, 1979.
[48] G. Fant. SPEECH ACOUSTICS AND PHONETICS Selected Writings, chapter 4.1, pages 143–161. Text,
Speech and Language Technology, Volume 24. Kluwer Academic Publishers, Dordrecht, The Netherlands,
first edition, 2004.
[49] G. Fant and Q. Lin. Glottal source - vocal tract acoustic interaction. Speech Transmission Laboratory,
KTH, Sweden, Quarterly Progress and Status Report, 28(1):13–27, 1987.
[50] G. Fant, Q. Lin, and C. Gobl. Notes on glottal flow interaction. Speech Transmission Laboratory, KTH,
Sweden, Quarterly Progress and Status Report, 26(2-3):21–45, 1985.
[51] M. Filippelli, R. Pellegrino, I. Iandelli, G. Misuri, J. R. Rodarte, R. Duranti, V. Brusasco, and G. Scano.
Respiratory dynamics during laughter. Journal of Applied Physiology, 90(4):1441–1446, Apr. 2001.
[52] J. L. Flanagan. Speech Analysis Synthesis and Perception. Springer-Verlag, 2nd edition, 1972.
[53] A. Fourcin and E. Abberton. First application of a new laryngograph. Medical and Biological Illustration,
21(3):172–182, Jul 1971.
[54] M. Fratti, G. A. Mian, and G. Riccardi. An Approach to Parameter Reoptimization in Multipulse-Based
Coders. IEEE Transactions on Speech and Audio Processing, 1(4):463–465, Oct. 1993.
[55] O. Fujimura, K. Honda, H. Kawahara, Y. Konparu, M. Morise, and J. C. Williams. Noh voice quality.
Logopedics Phoniatrics Vocology, 34(4):157–170, 2009.
[56] D. Gabor. Theory of communication. Part 1: The analysis of information. Journal of the Institution of
Electrical Engineers - Part III: Radio and Communication Engineering, 93(26):429–441, 1946.
[57] P. K. Ghosh and S. S. Narayanan. Joint source-filter optimization for robust glottal source estimation in
the presence of shimmer and jitter. Speech Communication, 53(1):98–109, 2011.
[58] C. Gobl. Voice source dynamics in connected speech. STL-QPSR, KTH, Sweden, 29(1):123–159, 1988.
[59] C. Gobl. A preliminary study of acoustic voice quality correlates. STL-QPSR, KTH, Sweden, 30(4):9–22,
1989.
[60] M. Gordon and P. Ladefoged. Phonation types: a cross-linguistic overview. Journal of Phonetics, pages
383–406, 2001.
[61] W. Granzow, B. Atal, K. Paliwal, and J. Schroeter. Speech coding at 4 kb/s and lower using single-pulse
and stochastic models of LPC excitation. In Proc. International Conference on Acoustics, Speech, and
Signal Processing, 1991. ICASSP-91, 1991, volume 1, pages 217–220, 1991.
[62] G. S. Hall and A. Allin. The psychology of tickling, laughing, and the comic. The American Journal of
Psychology, 9(1):1–41, 1897.
[63] W. Hamza, R. Bakis, E. M. Eide, M. A. Picheny, and J. F. Pitrelli. The IBM expressive speech synthesis
system. In Proc. of the 8th International Conference on Spoken Language Processing, Jeju, Korea, pages
14–16, 2004.
[64] H. Hatzikirou, W. T. Fitch, and H. Herzel. Voice instabilities due to source-tract interactions. Acta Acustica
united with Acustica, 92(3):468–475, 2006.
[65] N. Henrich, C. d’Alessandro, B. Doval, and M. Castellengo. Glottal open quotient in singing: Measurements and correlation with laryngeal mechanisms, vocal intensity, and fundamental frequency. The
Journal of the Acoustical Society of America, 117(3):1417–1430, 2005.
[66] N. C. Henriksen and E. W. Willis. Acoustic characterization of phonemic trill production in Jerezano
Andalusian Spanish. In M. Ortega-Llebaria, editor, Selected Proceedings of the 4th Conference on Laboratory Approaches to Spanish Phonology, pages 115–127. Cascadilla Proceedings Project, Somerville,
MA, 2010.
[67] J. Holmes. Formant excitation before and after glottal closure. In Proc. IEEE International Conference on
Acoustics, Speech, and Signal Processing, ICASSP’76., volume 1, pages 39–42, 1976.
[68] P. Hong, Z. Wen, and T. S. Huang. Real-time speech-driven face animation with expressions using neural
networks. IEEE Trans. Neural Networks, 13:916–927, 2002.
[69] M. S. Howe and R. S. McGowan. On the role of glottis-interior sources in the production of voiced sound.
The Journal of the Acoustical Society of America, 131(2):1391–1400, 2012.
[70] W. Huang, T. K. Chiew, H. Li, T. S. Kok, and J. Biswas. Scream detection for home applications. In The
5th IEEE Conference on Industrial Electronics and Applications (ICIEA), 2010, pages 2115–2120, Jun.
2010.
[71] A. I. Iliev, M. S. Scordilis, J. P. Papa, and A. X. Falco. Spoken emotion recognition through optimum-path
forest classification using glottal features. Computer Speech and Language, 24(3):445–460, 2010.
[72] T. Irino and R. D. Patterson. Segregating information about the size and shape of the vocal tract using
a time-domain auditory model: The stabilised wavelet-mellin transform. Speech Communication, pages
181–203, 2002.
[73] C. T. Ishi, H. Ishiguro, and N. Hagita. Automatic extraction of paralinguistic information using prosodic
features related to F0, duration and voice quality. Speech Communication, 50(6):531–543, 2008.
[74] N. S. Jayant. Digital coding of speech waveforms: PCM, DPCM and DM Quantization. In Proc. IEEE,
volume 62, pages 611–632, May 1974.
[75] M. A. Joseph, S. Guruprasad, and B. Yegnanarayana. Extracting formants from short segments of speech
using group delay functions. pages 1009–1012, Pittsburgh PA, USA, Sep. 2006.
[76] N. Kamaruddin and A. Wahab. Speech emotion verification system (SEVS) based on MFCC for real time
applications. In Proc. 4th International Conference on Intelligent Environments (IE 08), pages 1–7. IEEE,
2008.
[77] H. Kawahara, H. Katayose, A. de Cheveigné, and R. D. Patterson. Fixed point analysis of frequency to
instantaneous frequency mapping for accurate estimation of F0 and periodicity. In Proc. Eurospeech’99,
volume 6, pages 2781–2784, 1999.
[78] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role
of a repetitive structure in sounds. Speech Communication, 27(3-4):187–207, 1999.
[79] H. Kawahara and M. Morise. Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework. Sadhana, 36(5):713–727, 2011.
[80] H. Kawahara, M. Morise, T. Takahashi, R. Nisimura, H. Banno, and T. Irino. A unified approach for F0
extraction and aperiodicity estimation based on a temporally stable power spectral representation. In ISCA
ITRW, Speech, Analysis and Processing for Knowledge Discovery, June 4-6 2008.
[81] H. Kawahara, M. Morise, T. Takahashi, R. Nisimura, T. Irino, and H. Banno. TANDEM-STRAIGHT: A
temporally stable power spectral representation for periodic signals and applications to interference-free
spectrum, F0, and aperiodicity estimation. In IEEE International Conference on Acoustics, Speech and
Signal Processing, 2008 (ICASSP 2008), pages 3933 –3936, April 4 2008.
[82] M. Kaynak, Q. Zhi, A. Cheok, K. Sengupta, and K. C. Chung. Audio-visual modeling for bimodal speech
recognition. In Systems, Man, and Cybernetics, 2001 IEEE International Conference on, volume 1, pages
181–186, 2001.
[83] L. S. Kennedy and D. P. W. Ellis. Laughter detection in meetings. In Proc. NIST ICASSP 2004 Meeting
Recognition Workshop, pages 118–121, Montreal, Mar. 2004.
[84] S. Z. K. Khine, T. L. Nwe, and H. Li. Speech/laughter classification in meeting audio. In Proc. 9th Annual
Conference of the International Speech Communication Association, 2008 (INTERSPEECH’08), pages
793–796, Sep. 22-26 2008.
[85] S. Kipper and D. Todt. The role of rhythm and pitch in the evaluation of human laughter. Journal of
Nonverbal Behavior, 27(4):255–272, 2003.
[86] M. T. Knox and N. Mirghafori. Automatic laughter detection using neural networks. In Proc. 8th Annual
Conference of the International Speech Communication Association, 2007 (INTERSPEECH’07), pages
2973–2976, Aug. 27-31 2007.
[87] K. J. Kohler. ‘Speech-smile’, ‘Speech-laugh’, ‘Laughter’ and their sequencing in dialogic interaction.
Phonetica, 65:1–18, 2008.
[88] S. G. Koolagudi, A. Barthwal, S. Devliyal, and K. S. Rao. Real Life Emotion Classification from Speech
Using Gaussian Mixture Models. In IC3, pages 250–261, 2012.
[89] S. G. Koolagudi, S. Maity, A. K. Vuppala, S. Chakrabarti, and K. S. Rao. IITKGP-SESC: Speech database
for emotion analysis. In S. Ranka, S. Aluru, R. Buyya, Y.-C. Chung, S. Dua, A. Grama, S. K. S. Gupta,
R. Kumar, and V. V. Phoha, editors, IC3, volume 40 of Communications in Computer and Information
Science, pages 485–492. Springer, 2009.
[90] S. G. Koolagudi and K. S. Rao. Emotion recognition from speech using source, system, and prosodic
features. Int. J. Speech Technol., 15(2):265–289, June 2012.
[91] A. Kounoudes, P. A. Naylor, and M. Brookes. The DYPSA algorithm for estimation of glottal closure
instants in voiced speech. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP),
2002, volume 1, pages 349–352, May 2002.
[92] J. Kreiman and D. V. L. Sidtis. Foundations of Voice Studies. Wiley-Blackwell, Malden, 2011.
[93] P. Kroon and B. Atal. Quantization procedures for the excitation in CELP coders. In IEEE International
Conference on Acoustics, Speech, and Signal Processing, ICASSP ’87, volume 12, pages 1649–1652,
1987.
[94] P. Kroon, E. Deprettere, and R. Sluyter. Regular-pulse excitation–a novel approach to effective and efficient
multipulse coding of speech. IEEE Transactions on Acoustics, Speech and Signal Processing, 34(5):1054–
1063, 1986.
[95] H. Kuwabara and Y. Sagisaka. Acoustic characteristics of speaker individuality: Control and conversion.
Speech Communication, 16(2):165–173, 1995.
[96] P. Ladefoged. Vowels And Consonants: An Introduction To The Sounds Of Languages, chapter 13, pages
149–150. Blackwell Pub., 2003.
[97] P. Ladefoged. Vowels And Consonants: An Introduction To The Sounds Of Languages, chapter 2, pages
18–24. Blackwell Pub., 2003.
[98] P. Ladefoged, A. Cochran, and S. F. Disner. Laterals and trills. 7:46–54, 1977.
[99] P. Ladefoged and K. Johnson. A course in Phonetics, chapter One, pages 4–7. Cengage Learning India
Private Limited, Delhi, India, sixth edition, 2011.
[100] P. Ladefoged and I. Maddieson. Sounds of World’s Languages, chapter 7, pages 217–236. Blackwell
publishing, Oxford, UK, 1996.
[101] E. Lasarcyk and J. Trouvain. Imitating conversational laughter with an articulatory speech synthesis. In
Proc. of the Interdisciplinary Workshop on The Phonetics of Laughter, pages 43–48, Aug. 4-5 2007.
[102] J. Laver. Principles of Phonetics, chapter five, pages 119–158. Cambridge Textbooks in Linguistics.
Cambridge University Press, 1994.
[103] T. Li and M. Ogihara. Content-based music similarity search and emotion detection. In Acoustics, Speech,
and Signal Processing, 2004. Proceedings. (ICASSP ’04). IEEE International Conference on, volume 5,
pages 705–708, May 2004.
[104] W.-H. Liao and Y.-K. Lin. Classification of non-speech human sounds: Feature selection and snoring
sound analysis. In IEEE International Conference on Systems, Man and Cybernetics, 2009 (SMC 2009),
pages 2695–2700, Oct. 2009.
[105] M. Lindau. The story of /r/. In V. Fromkin, editor, Phonetic Linguistics: Essays in Honor of P. Ladefoged,
pages 157–167. Academic Press, Orlando, USA, 1985.
[106] J. Lipski. Latin American Spanish, pages 1–440. Longman Linguistics Library, New York, USA, 1994.
[107] A. Lockerd and F. Mueller. LAFCam: Leveraging affective feedback camcorder. In L. G. Terveen and
D. R. Wixon, editors, Extended abstracts of the 2002 Conference on Human Factors in Computing Systems,
CHI 2002, Minneapolis, Minnesota, USA, April 20-25, 2002, pages 574–575, New York, USA, 2002.
ACM.
[108] J. C. Lucero, K. G. Lourenço, N. Hermant, A. V. Hirtum, and X. Pelorson. Effect of source–tract acoustical
coupling on the oscillation onset of the vocal folds. The Journal of the Acoustical Society of America,
132(1):403–411, 2012.
[109] E. S. Luschei, L. O. Ramig, E. M. Finnegan, K. K. Baker, and M. E. Smith. Patterns of laryngeal electromyography and the activity of the respiratory system during spontaneous laughter. Journal of Neurophysiology, 96(1):442–450, Jul. 2006.
[110] I. Maddieson. Patterns of sounds, pages 1–422. Cambridge University Press, Cambridge, UK, 1984.
[111] M. M. Makagon, E. S. Funayama, and M. J. Owren. An acoustic analysis of laughter produced by congenitally deaf and normally hearing college students. The Journal of the Acoustical Society of America,
124(1):472–483, 2008.
[112] J. Makhoul. Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4):561–580, Apr. 1975.
[113] J. D. Markel and A. H. Gray. A Linear Prediction Vocoder Simulation Based upon the Autocorrelation
Method. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-22(2):124–134, April
1974.
[114] J. E. Markel and A. H. Gray. Linear Prediction of Speech. Springer-Verlag New York, Inc., Secaucus, NJ,
USA, 1982.
[115] H. Masubuchi and H. Kobayashi. An acoustic abnormal detection system. In Proceedings., 2nd IEEE
International Workshop on Robot and Human Communication, 1993, pages 237–242, Nov 1993.
[116] R. S. McGowan. Tongue-tip trills and vocal-tract wall compliance. The Journal of the Acoustical Society
of America, 91(5):2903–2910, 1992.
[117] H. McGurk and J. W. MacDonald. Hearing lips and seeing voices. Nature, 264:746–748, 1976.
[118] C. Menezes and Y. Igarashi. The speech laugh spectrum. In Proc. 6th International Seminar on Speech
Production, 2006 (ISSP’06), pages 157–524, Dec. 13-15 2006.
[119] A. Metallinou, S. Lee, and S. Narayanan. Audio-visual emotion recognition using Gaussian mixture models
for face and voice. In Proceedings of the IEEE International Symposium on Multimedia, pages 250–257,
Berkeley, CA, Dec. 2008.
[120] A. Metallinou, M. Wöllmer, A. Katsamanis, F. Eyben, B. Schuller, and S. Narayanan. Context-Sensitive
Learning for Enhanced Audiovisual Emotion Classification. IEEE Transactions on Affective Computing,
3(2):184–198, April – June 2012.
[121] D. G. Miller. EGGs for Singers, 2012. (last viewed Apr. 1, 2013).
[122] V. K. Mittal, N. Dhananjaya, and B. Yegnanarayana. Effect of tongue tip trilling on the glottal excitation
source. In Proc. INTERSPEECH 2012, 13th Annual Conference of the International Speech Communication Association, Portland, Oregon, USA, Sep. 2012.
[123] V. K. Mittal and B. Yegnanarayana. Effect of glottal dynamics in the production of shouted speech. The
Journal of the Acoustical Society of America, 133(5):3050–3061, May 2013.
[124] V. K. Mittal and B. Yegnanarayana. Production features for detection of shouted speech. In Proc. 10th
Annual IEEE Consumer Communications and Networking Conference, 2013 (CCNC’13), pages 106–111,
Jan. 11-14, 2013.
[125] P. Moore and H. Von Leden. Dynamic variations of the vibratory pattern in the normal larynx. Folia
Phoniat (Basel), 10(4):205–238, 1958.
[126] D. Morrison, R. Wang, and L. C. D. Silva. Ensemble methods for spoken emotion recognition in call-centres. Speech Communication, 49(2):98–112, 2007.
[127] E. Mower, M. J. Mataric, and S. Narayanan. A framework for automatic human emotion classification
using emotion profiles. Trans. Audio, Speech and Lang. Proc., 19(5):1057–1070, Jul. 2011.
[128] H. A. Murthy and B. Yegnanarayana. Formant extraction from group delay function. Speech Communication, 10(3):209 – 221, 1991.
[129] H. A. Murthy and B. Yegnanarayana. Group delay functions and its applications in speech technology.
Sadhana, 36(5):745–782, 2011.
[130] K. S. R. Murty and B. Yegnanarayana. Epoch extraction from speech signals. IEEE Transactions on
Audio, Speech, and Language Processing, 16(8):1602–1613, 2008.
[131] H. Nanjo, H. Mikami, S. Kunimatsu, H. Kawano, and T. Nishiura. A fundamental study of novel speech
interface for computer games. In IEEE 13th International Symposium on Consumer Electronics, 2009
(ISCE ’09), pages 558–560, May 2009.
[132] H. Nanjo, T. Nishiura, and H. Kawano. Acoustic-based security system: Towards robust understanding
of emergency shout. In Fifth International Conference on Information Assurance and Security, 2009 (IAS
’09), volume 1, pages 725–728, Aug. 2009.
[133] P. A. Naylor, A. Kounoudes, J. Gudnason, and M. Brookes. Estimation of glottal closure instants in voiced
speech using the DYPSA algorithm. IEEE Transactions on Audio, Speech, and Language Processing,
15(1):34 –43, Jan. 2007.
[134] E. E. Nwokah, P. Davies, A. Islam, H. C. Hsu, and A. Fogel. Vocal affect in three-year-olds: a quantitative
acoustic analysis of child laughter. The Journal of the Acoustical Society of America, 94(6):3076–3090, Dec 1993.
[135] E. E. Nwokah, H.-C. Hsu, P. Davies, and A. Fogel. The integration of laughter and speech in vocal
communication: A dynamic systems perspective. J Speech Lang Hear Res, 42(4):880–894, 1999.
[136] J. J. Ohala and B. W. Eukel. Explaining the intrinsic pitch of vowels. In Channon and Shockey, editors,
In Honor of Ilse Lehiste, pages 207–215, 1987.
[137] A. V. Oppenheim and R. W. Schafer. Digital Signal Processing, chapter 7, pages 337–365. Prentice Hall,
Englewood Cliffs, New Jersey, USA, 1975.
[138] A. V. Oppenheim, R. W. Schafer, and J. R. Buck. Discrete-Time Signal Processing (2nd Edition) (Prentice
Hall Signal Processing Series), chapter 2, pages 42–96. Pearson Prentice Hall, New Delhi, India, 2 edition,
Jan. 1999.
[139] M. J. Owren and J.-A. Bachorowski. Reconsidering the evolution of nonlinguistic communication: The
case of laughter. Journal of Nonverbal Behavior, 27(3):183–200, 2003.
[140] K. Ozawa and T. Araseki. Low bit rate multi-pulse speech coder with natural speech quality. In Proc.
IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP ’86., volume 11,
pages 457–460, 1986.
[141] A. Panat and V. Ingole. Affective State Analysis of Speech for Speaker Verification: Experimental Study,
Design and Development. In Proc. International Conference on Computational Intelligence and Multimedia Applications, pages 255–261, Los Alamitos, CA, USA, 2007. IEEE Computer Society.
[142] T.-L. Pao, Y.-T. Chen, and J.-H. Yeh. Emotion recognition from Mandarin speech signals. In Chinese
Spoken Language Processing, 2004 International Symposium on, pages 301–304, 2004.
[143] T.-L. Pao, W.-Y. Liao, T.-N. Wu, and C.-Y. Lin. Automatic visual feature extraction for Mandarin audio-visual speech recognition. In SMC, pages 2936–2940. IEEE, 2009.
[144] J. S. Perkell and M. H. Cohen. An indirect test of the quantal nature of speech in the production of the
vowels /i/, /a/ and /u/. Journal of Phonetics, 17:123–133, 1989.
[145] A. Perkis, E. B. Ribbum, and E. T. Ramstad. Improving subjective quality in waveform coders by the use
of postfiltering. Department of Elec. Eng. and Comp. Science, pages 60–65, 1985.
[146] J. Pohjalainen, P. Alku, and T. Kinnunen. Shout detection in noise. In IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2011, pages 4968–4971, May 2011.
[147] J. Pohjalainen, T. Raitio, S. Yrttiaho, and P. Alku. Detection of shouted speech in noise: Human and
machine. The Journal of the Acoustical Society of America, 133(4):2377–2389, Apr. 2013.
[148] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior. Recent advances in the automatic recognition of audio-visual speech. In PROC. IEEE, pages 1306–1326, 2003.
[149] S. R. M. Prasanna and D. Govind. Analysis of excitation source information in emotional speech. In
INTERSPEECH, pages 781–784, 2010.
[150] R. R. Provine. Laughter: A Scientific Investigation. Viking, New York, USA, 2000.
[151] R. R. Provine and K. R. Fischer. Laughing, smiling, and talking: Relation to sleeping and social context
in humans. Ethology, 83(4):295–305, 1989.
[152] R. R. Provine and Y. L. Yong. Laughter: A stereotyped human vocalization. Ethology, 89(2):115–124,
1991. (published by Blackwell Publishing Ltd).
[153] L. Rabiner, B. H. Juang, and B. Yegnanarayana. Fundamentals of Speech Recognition, chapter third, pages
88–113. Pearson Education Inc., New Delhi, India, Indian subcontinent adaptation, first edition edition,
2009.
[154] K. S. Rao and S. G. Koolagudi. Characterization and recognition of emotions from speech using excitation
source information. International Journal of Speech Technology, 16:181–201, 2013.
[155] D. Recasens. On the production characteristics of apicoalveolar taps and trills. 19:267–280, 1991.
[156] G. Rigoll, R. Müller, and B. Schuller. Speech Emotion Recognition Exploiting Acoustic and Linguistic
Information Sources. In G. Kokkinakis, editor, Proceedings 10th International Conference Speech and
Computer, SPECOM 2005, volume 1, pages 61–67, Patras, Greece, October 2005.
[157] P. Roach. English Phonetics and Phonology: A practical course, chapter 4, pages 26–35. Cambridge
University Press, Cambridge, UK, 1998.
[158] M. Rothenberg. Acoustic interaction between the glottal source and the vocal tract. In Vocal fold physiology, pages 305–323. University of Tokyo Press, Tokyo, 1981. edited by K. N. Stevens and M. Hirano.
[159] H. Rothganger, G. Hauser, A. C. Cappellini, and A. Guidotti. Analysis of laughter and speech sounds in
italian and german students. Naturwissenschaften, 85(8):394–402, 1998.
[160] J.-L. Rouas, J. Louradour, and S. Ambellouis. Audio events detection in public transport vehicle. In IEEE
Intelligent Transportation Systems Conference, 2006 (ITSC ’06), pages 733–738, Sep. 2006.
[161] W. Ruch and P. Ekman. The Expressive Pattern of Laughter. Emotion, Qualia, and Consciousness, pages
426–443, 2001. edited by A. W. Kaszniak (Word Scientific, Tokyo).
[162] M. Ruhlen. A Guide to the World’s Languages, Vol. 1: Classification, pages 1–492. Stanford University
Press, Stanford, USA, 1987.
[163] N. Ruty, X. Pelorson, and A. V. Hirtum. Influence of acoustic waveguides lengths on self-sustained oscillations: Theoretical prediction and experimental validation. The Journal of the Acoustical Society of
America, 123(5):3121–3121, 2008.
[164] R. W. Schafer and L. R. Rabiner. System for automatic formant analysis of voiced speech. The Journal of
the Acoustical Society of America, 47(2B):634–648, 1970.
[165] M. Schroeder and B. Atal. Code-excited linear prediction(CELP): High-quality speech at very low bit rates.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP ’85, volume 10,
pages 937–940. IEEE, April 1985.
[166] M. R. Schroeder. Recent Progress in Speech Coding at Bell Telephone Laboratories. In Proc. III Int.
Congress on Acoustics. Elsevier Publishing Co., Amsterdam.
[167] B. Schuller, A. Batliner, S. Steidl, and D. Seppi. Recognising Realistic Emotions and Affect in Speech:
State of the Art and Lessons Learnt from the First Challenge. Speech Communication, Special Issue
on Sensing Emotion and Affect - Facing Realism in Speech Processing, 53(9/10):1062–1087, November/December 2011.
[168] B. Schuller, B. Vlasenko, F. Eyben, M. Wöllmer, A. Stuhlsatz, A. Wendemuth, and G. Rigoll. Cross-Corpus Acoustic Emotion Recognition: Variances and Strategies. IEEE Transactions on Affective Computing, 1(2):119–131, July-December 2010.
[169] B. Schuller, Z. Zhang, F. Weninger, and F. Burkhardt. Synthesized Speech for Model Training in Cross-Corpus Recognition of Human Emotion. International Journal of Speech Technology, Special Issue on
New and Improved Advances in Speaker Recognition Technologies, 15(3):313–323, 2012.
[170] G. Seshadri and B. Yegnanarayana. Perceived loudness of speech based on the characteristics of glottal
excitation source. The Journal of the Acoustical Society of America, 126(4):2061–2071, 2009.
[171] C. H. Shadle. Intrinsic fundamental frequency of vowels in sentence context. The Journal of the Acoustical
Society of America, 78:1562–1567, 1985.
[172] M. Shami and W. Verhelst. An evaluation of the robustness of existing supervised machine learning
approaches to the classification of emotions in speech. Speech Communication, 49(3):201–212, 2007.
[173] S. Singhal. Optimizing pulse amplitudes in multipulse excitation. The Journal of the Acoustical Society
of America, 74(S1):S51–S51, 1983.
[174] S. Singhal and B. Atal. Improving performance of multi-pulse LPC coders at low bit rates. In Proc. IEEE
Int. Conf. Acoust., Speech, Signal Processing, volume 9, pages 9–12, 1984.
[175] M. J. Solé. Aerodynamic characteristics of trills and phonological patterning. 30:655–688, 2002.
[176] M. A. Sonderegger. Subglottal coupling and vowel space: An investigation in quantal theory. Physics B.
S. Thesis, Massachusetts Institute of Technology, Cambridge, MA, 2004.
[177] M. Song, J. Bu, C. Chen, and N. Li. Audio-Visual Based Emotion Recognition - A New Approach. In Proc.
IEEE Conference on Computer Vision and Pattern Recognition, 2004 (CVPR 2004), volume 2, pages 1020–1025, 2004.
[178] S. Spajic, P. Ladefoged, and P. Bhaskararao. The trills of Toda. 26(1):1–21, 1996.
[179] S. Steidl, A. Batliner, D. Seppi, and B. Schuller. On the Impact of Children’s Emotional Speech on
Acoustic and Language Models. EURASIP Journal on Audio, Speech, and Music Processing, Special
Issue on Atypical Speech, 2010(Article ID 783954), 2010.
[180] K. N. Stevens. Airflow and turbulence noise for fricative and stop consonants: Static considerations. The
Journal of the Acoustical Society of America, 50(4B):1180–1192, 1971.
[181] K. N. Stevens. Physics of laryngeal behavior and larynx modes. Phonetica, 34(4):264–279, 1977.
[182] K. N. Stevens. Acoustic Phonetics, chapter two, pages 55–126. Current Studies in Linguistics 30. The
MIT Press, Cambridge, Massachusetts, London, first edition, 1998.
[183] K. N. Stevens. Acoustic Phonetics, chapter three, pages 167–198. Current Studies in Linguistics 30. MIT
Press, Cambridge, first edition, 2000.
[184] K. N. Stevens, D. N. Kalikow, and T. R. Willemain. A miniature accelerometer for detecting glottal
waveforms and nasalization. Journal of Speech, Language, and Hearing Research, 18:594–599, 1975.
[185] K. Sudheer Kumar, M. Sri Harish Reddy, K. Sri Ram Murty, and B. Yegnanarayana. Analysis of laugh
signals for detecting in continuous speech. In Proc. 10th Annual Conference of the International Speech
Communication Association, 2009 (INTERSPEECH’09), pages 1591–1594. ISCA, Sep 6-10 2009.
[186] S. Sundaram and S. Narayanan. Automatic acoustic synthesis of human-like laughter. The Journal of the
Acoustical Society of America, 121(1):527–535, 2007.
[187] H. Tanaka and N. Campbell. Acoustic features of four types of laughter in natural conversational speech.
In Proc. 17th International Congress of Phonetic Sciences, 2011 (ICPhS XVII), pages 1958–1961, Aug.
17-21 2011.
[188] T. Tanaka, T. Kobayashi, D. Arifianto, and T. Masuko. Fundamental frequency estimation based on instantaneous frequency amplitude spectrum. In Proc. IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP), 2002, volume 1, pages I–329–I–332, 2002.
[189] I. Titze, T. Riede, and P. Popolo. Nonlinear source–filter coupling in phonation: Vocal exercises. The
Journal of the Acoustical Society of America, 123(4):1902–1915, Apr. 2008.
[190] I. R. Titze. The physics of small-amplitude oscillation of the vocal folds. Journal of the Acoustical Society
of America, 83(4):1536–1552, 1988.
[191] I. R. Titze. Theory of glottal airflow and source-filter interaction in speaking and singing. Acta Acustica
united with Acustica, 90(4):641–648, 2004.
[192] I. R. Titze. Nonlinear source–filter coupling in phonation: Theory. The Journal of the Acoustical Society
of America, 123(5):2733–2749, 2008.
[193] I. R. Titze and B. H. Story. Acoustic interactions of the voice source with the lower vocal tract. The
Journal of the Acoustical Society of America, 101(4):2234–2243, 1997.
[194] K. P. Truong and D. A. V. Leeuwen. Automatic detection of laughter. In Proc. of 9th European Conference
on Speech Communication and Technology, 2005 (INTERSPEECH’05), pages 485–488, Sep. 4-8 2005.
[195] K. P. Truong and D. A. V. Leeuwen. Automatic discrimination between laughter and speech. Speech
Communication, 49(2):144–158, Feb. 2007.
[196] K. P. Truong and S. Raaijmakers. Automatic recognition of spontaneous emotions in speech using acoustic
and lexical features. In A. Popescu-Belis and R. Stiefelhagen, editors, MLMI, volume 5237 of Lecture
Notes in Computer Science, pages 161–172. Springer, 2008.
[197] C. K. Un and D. T. Magill. The Residual-Excited Linear Prediction Vocoder with Transmission Rate
Below 9.6 kbits/s. IEEE Transactions on Communications, 23(12):1466–1474, 1975.
[198] J. Urbain, R. Niewiadomski, E. Bevacqua, T. Dutoit, A. Moinet, C. Pelachaud, B. Picart, J. Tilmanne, and
J. Wagner. AVLaughterCycle. Journal on Multimodal User Interfaces, 4(1):47–58, 2010.
[199] G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti. Scream and gunshot detection and
localization for audio-surveillance systems. In IEEE Conference on Advanced Video and Signal Based
Surveillance, 2007 (AVSS 2007), pages 21–26, Sep. 2007.
[200] J. Van Den Berg. Myoelastic-aerodynamic theory of voice production. Journal of Speech and Hearing
Research, 1(3):227–244, 1958.
[201] B. Van Der Pol. The fundamental principles of frequency modulation. Journal of the Institution of Electrical Engineers - Part III: Radio and Communication Engineering, 93(23):153–158, 1946.
[202] P. W. J. Van Hengel and T. C. Andringa. Verbal aggression detection in complex social environments.
In IEEE Conference on Advanced Video and Signal Based Surveillance, 2007 (AVSS 2007), pages 15–20,
Sep. 2007.
[203] D. Ververidis and C. Kotropoulos. Emotional speech recognition: Resources, features, and methods.
Speech Communication, 48(9):1162–1181, 2006.
[204] J. Ville. Theory and Applications of the Notion of Complex Signal, volume 2A. RAND Corporation, Santa
Monica, CA.
[205] T. Vogt, E. André, and J. Wagner. Automatic recognition of emotions from speech: a review of the literature
and recommendations for practical realisation. In LNCS 4868, pages 75–91, 2008.
[206] J. Wagner, J. Kim, and E. André. From physiological signals to emotions: Implementing and comparing
selected methods for feature extraction and classification. In ICME, pages 940–943. IEEE, 2005.
[207] M. Wöllmer, M. Kaiser, F. Eyben, B. Schuller, and G. Rigoll. LSTM-Modeling of Continuous Emotions
in an Audiovisual Affect Recognition Framework. Image and Vision Computing, Special Issue on Affect
Analysis in Continuous Input, 31(2):153–163, February 2013.
[208] M. Wöllmer, M. Kaiser, F. Eyben, F. Weninger, B. Schuller, and G. Rigoll. Fully Automatic Audiovisual
Emotion Recognition – Voice, Words, and the Face. In T. Fingscheidt and W. Kellermann, editors, Proceedings of Speech Communication; 10. ITG Symposium, pages 1–4, Braunschweig, Germany, September
2012. ITG, IEEE.
[209] J. Yamagishi, K. Onishi, T. Masuko, and T. Kobayashi. Acoustic modeling of speaking styles and emotional expressions in HMM-based speech synthesis. IEICE - Trans. Inf. Syst., E88-D(3):502–509, March
2005.
[210] B. Yang and M. Lugger. Emotion recognition from speech signals using new harmony features. Signal
Processing, 90(5):1415–1423, 2010.
[211] N. Yang, R. Muraleedharan, J. Kohl, I. Demirkol, W. Heinzelman, and M. Sturge-Apple. Speech-based
emotion classification using multiclass svm with hybrid kernel and thresholding fusion. In SLT, pages
455–460. IEEE, 2012.
[212] B. Yegnanarayana. Formant extraction from linear prediction phase spectra. The Journal of the Acoustical Society of America, 63(5):1638–1640, May 1978.
[213] B. Yegnanarayana and N. G. Dhananjaya. Spectro-temporal analysis of speech signals using zero-time
windowing and group delay function. Speech Communication, 55(6):782–795, 2013.
[214] B. Yegnanarayana, M. A. Joseph, V. G. Suryakanth, and N. Dhananjaya. Decomposition of speech signals
for analysis of aperiodic components of excitation. In Proc. IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP 2011), pages 5396 –5399, May 2011.
[215] B. Yegnanarayana and H. A. Murthy. Significance of group delay functions in spectrum estimation.
IEEE Transactions on Signal Processing, 40(9):2281–2289, September 1992.
[216] B. Yegnanarayana and K. S. R. Murty. Event-based instantaneous fundamental frequency estimation from
speech signals. IEEE Transactions on Audio, Speech, and Language Processing, 17(4):614–624, 2009.
[217] P. Zelinka, M. Sigmund, and J. Schimmel. Impact of vocal effort variability on automatic speech recognition. Speech Communication, 54(6):732 – 742, 2012.
[218] Z. Zeng, J. Tu, B. Pianfetti, and T. S. Huang. Audio-visual affective expression recognition through
multistream fused hmm. IEEE Transactions on Multimedia, 10(4):570–577, 2008.
[219] C. Zhang and J. H. L. Hansen. Analysis and classification of speech mode: whispered through shouted.
pages 2289–2292, Antwerp, Belgium, 2007. ISCA.
[220] S. Zhang, Y. Xu, J. Jia, and L. Cai. Analysis and modelling of affective audio visual speech based on pad
emotion space. 2008.
[221] Z. Zhang, J. Neubauer, and D. A. Berry. The influence of subglottal acoustics on laboratory models of
phonation. The Journal of the Acoustical Society of America, 120(3):1558–1569, 2006.
List of Publications
Papers in refereed Journals
1. V. K. Mittal and B. Yegnanarayana, “Effect of glottal dynamics in the production of shouted speech”,
The Journal of the Acoustical Society of America, vol. 133, no. 5, pp. 3050-3061, May 2013.
2. V. K. Mittal, B. Yegnanarayana and P. Bhaskararao, “Study of the effects of vocal tract constriction
on glottal vibration”, The Journal of the Acoustical Society of America, vol. 136, no. 4, pp. 1932-1941, Oct. 2014.
3. Vinay Kumar Mittal and Bayya Yegnanarayana, “Analysis of production characteristics of laughter”,
Computer Speech and Language, published by Elsevier, http://dx.doi.org/10.1016/j.csl.2014.08.004,
Sep. 2014.
4. Vinay Kumar Mittal and Bayya Yegnanarayana, “Study of characteristics of aperiodicity in expressive voices”, submitted to The Journal of the Acoustical Society of America, (under review since
28 July 2014).
Papers in Conferences
1. V. K. Mittal, N. Dhananjaya and B. Yegnanarayana, “Effect of Tongue Tip Trilling on the Glottal
Excitation Source”, in Proc. INTERSPEECH 2012, 13th Annual Conference of the International
Speech Communication Association, Sep. 9-13, 2012, Portland, Oregon, USA.
2. V. K. Mittal and B. Yegnanarayana, “Production Features for Detection of Shouted Speech”, in
Proc. 10th Annual IEEE Consumer Communications and Networking Conference, 2013 (CCNC’13),
pp. 106-111, Jan. 11-14, 2013, USA.
3. Vinay Kumar Mittal and B. Yegnanarayana, “Study of Changes in Glottal Vibration Characteristics
During Laughter”, in Proc. INTERSPEECH 2014, 15th Annual Conference of the International
Speech Communication Association, pp. 1777-1781, Sep. 14-18, 2014, Singapore.
4. Vinay Kumar Mittal and B. Yegnanarayana, “Significance of Aperiodicity in the Pitch Perception
of Expressive Voices”, in Proc. INTERSPEECH 2014, 15th Annual Conference of the International
Speech Communication Association, pp. 504-508, Sep. 14-18, 2014, Singapore.
5. V. K. Mittal and B. Yegnanarayana, “An Automatic Shout Detection System Using Speech Production Features”, in Proc. Workshop on Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction, INTERSPEECH 2014 (15th Annual Conference of ISCA), (to appear in
LNAI, Springer, Dec. 2014), Sep. 14, 2014, Singapore.
Other related Papers
1. P. Gangamohan, V. K. Mittal and B. Yegnanarayana, “Relative Importance of Different Components
of Speech Contributing to Perception of Emotion”, in Proc. 6th International Conference on Speech
Prosody (ISCA), pp. 657-660, May 22-25, 2012, Shanghai, China.
2. P. Gangamohan, V. K. Mittal and B. Yegnanarayana, “A Flexible Analysis Synthesis Tool (FAST)
for studying the characteristic features of emotion in speech”, in Proc. 9th Annual IEEE Consumer
Communications and Networking Conference, 2012 (CCNC’12), pp. 250-254, Jan. 14-17, 2012,
USA.