ANALYSIS OF NONVERBAL SPEECH SOUNDS

Thesis submitted in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
in
ELECTRONICS AND COMMUNICATION ENGINEERING
by
VINAY KUMAR MITTAL
Roll Number: 201033001
[email protected]

SPEECH AND VISION LABORATORY
International Institute of Information Technology
Hyderabad - 500 032, INDIA
NOVEMBER 2014

Copyright © Vinay Kumar Mittal, 2014
All Rights Reserved

International Institute of Information Technology
Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled "Analysis of Nonverbal Speech Sounds" by VINAY KUMAR MITTAL, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                        Adviser: Prof. B. YEGNANARAYANA

To my parents, Shri Anil Kumar Mittal and Smt. Padamlata Mittal

Acknowledgments

"You see things twice, first in the mind and then in reality", said Walt Disney. My first gratitude is towards my mother, Mrs. Padamlata Mittal, who saw me completing a Ph.D. in her mind years ago, even before I started believing in this dream of mine. Next, my deepest gratitude is towards Shri R. N. Saxena, father of my friend Sanjay and one of my life-coaches, who convinced me of the need for a Ph.D. degree and the feasibility of completing it even in my late middle age. (Yes, I am 47 years young at this juncture.) Of course, the dream could not have been actualised without my wife, Mrs. Ambika, and our sons Yash and Shubham, who not only stood by me in all my career decisions and supported me in all my pursuits, but also endured the relative financial hardship due to this change of my profession. I feel truly blessed to have such wonderful people around me, all the time.

This goal of mine probably would not have been achieved if Prof. B. Yegnanarayana, with whom I have been associated since my M.Tech. days at IIT Madras (1996-98), had not advised me to join as a full-time researcher instead of part-time. That actually meant I had to leave the corporate world. It was his wise counsel and the magnanimity of accepting me as his student that enabled me to leave behind the lure and comfort of a corporate employer like Microsoft IDC, and take the plunge. In the pursuit of this goal, the kind-heartedness of Prof. Rajeev Sangal, then director of IIIT Hyderabad, will remain etched in my mind for a long time. He offered me both a teaching job and Ph.D. admission in the very first meeting I had with him. That facilitated my decision.

At IIIT Hyderabad, I got an opportunity to interact closely with some of its leading faculty members, from whom I learnt many things, not only inside the classroom but also at a personal level. I take this opportunity to express my gratefulness towards Prof. P. J. Narayanan, Prof. Peri Bhaskararao, Prof. Jayanthi Sivaswamy, Prof. C. V. Jawahar and Dr. Kishore S. Prahalad, who have been my great teachers here, while also treating me as a colleague at the same time. Hats off to their balancing act and depth as human beings!

The Speech and Vision Lab carried its legacy of 'technical depth, discipline and sharing' from IIT Madras to IIIT Hyderabad, mainly because of Prof. Yegnanarayana. But the credit also goes to its sincere and dedicated researchers, then students at the lab, like Guru (Dr. Guruprasad Sheshadri), Dhanu (Dr. N. Dhananjaya), Anand (Mr. Anand M. Joseph) and a few others.
When I joined the lab at IIIT Hyderabad in 2010, these three became my friends and hand-held me during my initial days of struggle in the lab, while I was understanding concepts, learning tools and brushing up things I had forgotten. I offer them my sincere thanks and wish them good luck in their current pursuits. Gangamohan, a sincere research student in the lab with whom I worked on a sponsored project, has been of great help in several ways, and we could publish a couple of joint papers on 'emotions', my initial topic of research work. I wish him success ahead. At this juncture, I wish to thank all the lab members who have directly or indirectly contributed to whatever little I could accomplish in these last four-plus years. I have kept these words and this place reserved for Dr. Suryakanth V. Gangashetty, who as colleague, friend and philosopher has been constantly by my side, even while I faced some ups and downs.

A roller-coaster journey like this is a discovery within. But any journey also involves many supporting players, some of whom remain behind the curtain most of the time. My deepest thanks to all those who have been part of the successful near-completion of this phase of my journey as a researcher. Lastly, I seek pardon from those whose names I may have missed mentioning; their role is nevertheless acknowledged and much appreciated. Best wishes.

Vinay Kumar Mittal

Abstract

Nonverbal speech sounds such as emotional speech, paralinguistic sounds and expressive voices produced by human beings are different from normal speech. Whereas normal speech conveys a linguistic message and has a clear articulatory description, these nonverbal sounds carry mostly nonlinguistic information, without any clear description of articulation. Also, these sounds are mostly unusual, irregular, spontaneous and nonsustainable. Examples of emotional speech are shouted, happy, angry and sad speech, and examples of paralinguistic sounds are laughter, cry and cough. Besides these, expressive voices like Noh voice or opera singing are trained voices used to convey intense emotions. Emotional speech, paralinguistic sounds and expressive voices differ in the degree of pitch changes. Another categorisation, based upon voluntary control and involuntary changes in the speech production mechanism, is also possible.

Production of nonverbal sounds occurs in short bursts of time and involves significant changes in the glottal source of excitation. Hence, the production characteristics of these sounds differ from those of normal speech, mostly in the vibration characteristics of the vocal folds. Associated changes in the characteristics of the vocal tract system are also possible. In some cases of normal speech such as trills, or emotional speech like shouts, the glottal vibration characteristics are also affected by the acoustic loading of the vocal tract system and source-system coupling. Hence, the characteristics of these nonverbal sounds need to be studied from the speech production and perception points of view, to better understand their differences from normal speech.

Impulse-sequence representation of the excitation source component of the speech signal has been of considerable interest in speech research over the past three decades. The presence of secondary impulses within a pitch period was also observed in some studies. This impulse-sequence representation was mainly aimed at achieving low bit rates in speech coding and higher voice quality of synthesized speech.
However, its advantages and role in the analysis of nonverbal speech sounds have not been explored much. The differences in the locations of these excitation impulse-like pulses in the sequence, and their relative amplitudes, possibly cause the differences among various categories of acoustic sounds. In the case of nonverbal speech sounds, these impulse-like pulses also occur at rapidly changing or nearly random intervals, along with rapid or sudden changes in their amplitudes. Aperiodicity in the excitation component may be considered an important feature of expressive voices like 'Noh voice'. Characterizing changes in pitch perception, which can be rapid in the case of expressive voices, and extracting F0, especially in the regions of aperiodicity, are major challenges that need to be investigated in detail.

In this research work, the production characteristics of nonverbal speech sounds are examined from both the electroglottograph (EGG) and acoustic signals. These sounds are examined in four categories, which differ in the periodicity (or aperiodicity) of glottal excitation and the rapidness of changes in pitch perception. These categories are: (a) normal speech in modal voicing, which includes the study of trill, lateral, fricative and nasal sounds, (b) emotional speech, which includes four loudness-level variations in speech, namely soft, normal, loud and shouted speech, (c) paralinguistic sounds like laughter in speech, and (d) expressive voices like the Noh singing voice. The effects of source-system coupling and acoustic loading of the vocal tract system on the glottal excitation are also examined.

Signal processing methods like zero-frequency filtering, zero-time liftering, the Hilbert transform and the group delay function are used for feature extraction. Existing methods like linear prediction (LP) coefficients, Mel-frequency cepstral coefficients and the short-time Fourier spectrum are also used. New signal processing methods are proposed in this work, such as modified zero-frequency filtering (modZFF), computation of dominant frequencies (FD) using the LP spectrum or the group delay function, and saliency computation (a measure of pitch perception). A time-domain impulse sequence representation of the excitation source is proposed, which also takes into account the pitch perception and aperiodicity in expressive voices. Using this representation, a method is proposed for extracting F0 even in the regions of subharmonics and aperiodicity, which otherwise is a challenging task. Validation of results is carried out using spectrograms, the saliency measure, perceptual studies and synthesis.

The efficacy of the signal processing methods proposed in this work, and of the features and parameters derived, is also demonstrated through some applications. Three prototype systems are developed for automatic detection of trills, shout and laughter in continuous speech. These systems use features extracted to capture the unique characteristics of each of the respective sound categories examined. Parameters that help in distinguishing these sounds from normal speech are derived from the features. Specific databases are collected, with ground truth in each case, for developing and testing these systems. Performance evaluation is carried out using the proposed measures such as saliency, along with perceptual studies and synthesis. The encouraging results indicate the feasibility of developing these prototypes into complete systems for diverse practical purposes and a range of real-life applications.
Contents

1 Issues in Analysis of Nonverbal Speech Sounds
   1.1 Verbal and nonverbal sounds
   1.2 Signal processing and other issues in the analysis of nonverbal sounds
   1.3 Objective of the studies in this thesis
   1.4 Analysis tools, methodology and scope
   1.5 Organization of thesis

2 Review of Methods for Analysis of Nonverbal Speech Sounds
   2.1 Overview
   2.2 Speech production
   2.3 Analytic signal and parametric representation of speech signal
   2.4 Review of studies on source-system interaction and few special sounds
      2.4.1 Studies on special sounds such as trills
      2.4.2 Studies on source-system interaction and acoustic loading
   2.5 Review of studies on analysis of emotional speech and shouted speech
      2.5.1 Studies on emotional speech
      2.5.2 Studies on shouted speech
   2.6 Review of studies on analysis of paralinguistic sounds and laughter
      2.6.1 Need for studying paralinguistic sounds like laughter
      2.6.2 Different types and classifications of laughter
      2.6.3 Studies on acoustic analysis of laughter and research issues
   2.7 Review of studies on analysis of expressive voices and Noh voice
      2.7.1 Need for studying expressive voices
      2.7.2 Studies on representation of source characteristics and pitch-perception
      2.7.3 Studies on aperiodicity in expressive voices such as Noh singing and F0 extraction
   2.8 Review of studies for spotting the acoustic events in continuous speech
      2.8.1 Studies towards trill detection
      2.8.2 Studies on shout detection
      2.8.3 Studies on laughter detection
   2.9 Summary

3 Signal Processing Methods for Feature Extraction
   3.1 Overview
   3.2 Impulse-sequence representation of excitation in speech coding
   3.3 All-pole model of excitation in LPC vocoders
   3.4 Methods to estimate the excitation impulse sequence representation
      3.4.1 MPE-LPC model of the excitation
      3.4.2 Methods for estimating the amplitudes of pulses
      3.4.3 Methods for estimating the positions of pulses
      3.4.4 Variations in MPE model of the excitation
   3.5 Zero-frequency filtering method
   3.6 Zero-time liftering method
   3.7 Methods to compute dominant frequencies
      3.7.1 Computing dominant frequency from LP spectrum
      3.7.2 Computing dominant frequency using group delay method and LP analysis
   3.8 Challenges in the existing methods and need for new approaches
   3.9 Summary

4 Analysis of Source-System Interaction in Normal Speech
   4.1 Overview
   4.2 Role of source-system coupling in the production of trills
      4.2.1 Production of apical trills
      4.2.2 Impulse-sequence representation of excitation source
      4.2.3 Analysis by synthesis of trill and approximant sounds
      4.2.4 Perceptual evaluation of the relative role
      4.2.5 Discussion on the results
   4.3 Effects of acoustic loading on glottal vibration
      4.3.1 What is acoustic loading?
      4.3.2 Speech data for analysis
      4.3.3 Features for the analysis
      4.3.4 Observations from EGG signal
      4.3.5 Discussion on acoustic loading through EGG and speech signals
      4.3.6 Quantitative assessment of the effects of acoustic loading
      4.3.7 Discussion on the results
   4.4 Summary

5 Analysis of Shouted Speech
   5.1 Overview
   5.2 Different loudness levels in emotional speech
   5.3 Features for analysis of shouted speech
   5.4 Data for analysis
   5.5 Production characteristics of shout from EGG signal
   5.6 Analysis of shout from speech signal
      5.6.1 Analysis from spectral characteristics
      5.6.2 Analysis from excitation source characteristics
      5.6.3 Analysis using dominant frequency feature
   5.7 Discussion on the results
   5.8 Summary

6 Analysis of Laughter Sounds
   6.1 Overview
   6.2 Data for analysis
   6.3 Analysis of laughter from EGG signal
      6.3.1 Analysis using the closed phase quotient (α)
      6.3.2 Analysis using F0 derived from the EGG signal
      6.3.3 Inter-call changes in α and F0
   6.4 Modified zero-frequency filtering method for the analysis of laughter
   6.5 Analysis of source and system characteristics from acoustic signal
      6.5.1 Analysis using F0 derived by modZFF method
      6.5.2 Analysis using the density of excitation impulses (dI)
      6.5.3 Analysis using the strength of excitation (SoE)
      6.5.4 Analysis of vocal tract system characteristics of laughter
      6.5.5 Analysis of other production characteristics of laughter
   6.6 Discussion on the results
   6.7 Summary

7 Analysis of Noh Voices
   7.1 Overview
   7.2 Issues in representing excitation source in expressive voices
   7.3 Approach adopted in this study
   7.4 Method to compute saliency of expressive voices
   7.5 Modified zero-frequency filtering method for analysis of Noh voices
      7.5.1 Need for modifying the ZFF method
      7.5.2 Key steps in the modZFF method
      7.5.3 Impulse sequence representation of source using modZFF method
   7.6 Analysis of aperiodicity in Noh voice
      7.6.1 Aperiodicity in source characteristics
      7.6.2 Presence of subharmonics and aperiodicity
      7.6.3 Decomposition of signal into source and system characteristics
      7.6.4 Analysis of aperiodicity using saliency
   7.7 Significance of aperiodicity in expressive voices
      7.7.1 Synthesis using impulse sequence excitation
      7.7.2 F0 extraction in regions of aperiodicity
   7.8 Summary

8 Automatic Detection of Acoustic Events in Continuous Speech
   8.1 Overview
   8.2 Automatic Trills Detection System
   8.3 Shout Detection System
      8.3.1 Production features for shout detection
      8.3.2 Parameters for shout decision logic
      8.3.3 Decision logic for automatic shout detection system
      8.3.4 Performance evaluation
   8.4 Laughter Detection System
   8.5 Summary

9 Summary and Conclusions
   9.1 Summary of the work
   9.2 Major contributions of this work
   9.3 Research issues raised and directions for future work

Bibliography

List of Figures

2.1 Schematic diagram of human speech production mechanism (figure taken from [153])
2.2 Cross-sectional expanded view of vocal folds (figure taken from [157])
2.3 Illustration of waveforms of (a) speech signal, (b) EGG signal, (c) LP residual and (d) glottal pulse information derived using LP residual
2.4 Schematic view of vibration of vocal folds for different cases: (a) open, (b) open at back side, (c) open at front side and (d) closed state (figure taken from [157])
2.5 Schematic views of glottal configurations for various phonation types: (a) glottal stop, (b) creak, (c) creaky voice, (d) modal voice, (e) breathy voice, (f) whisper, (g) voicelessness. Parts marked in (g): 1. glottis, 2. arytenoid cartilage, 3. vocal fold and 4. epiglottis. (Figure is taken from [60])
3.1 Schematic block diagram of the ZFF method
3.2 Results of the ZFF method for different window lengths for trend removal for a segment of Noh singing voice. Epoch locations are indicated by inverted arrows.
3.3 Schematic block diagram of the ZTL method
3.4 HNGD plots through ZTL analysis. (a) 3D HNGD spectrum (perspective view). (b) 3D HNGD spectrum plotted at epoch locations (mesh form). The speech segment is for the word 'stop'.
3.5 Illustration of LP spectrum for a frame of speech signal.
4.1 Illustration of stricture for (a) an apical trill, (b) theoretical approximant and (c) an approximant in reality. The relative closure/opening positions of the tongue tip (lower articulator) with respect to upper articulator are shown.
4.2 (Color online) Illustration of waveforms of (a) input speech and (b) ZFF output signals, and contours of features (c) F0, (d) SoE, (e) FD1 ("•") and FD2 ("◦") derived from the acoustic signal for the vowel context [a].
4.3 (a) Signal waveform, (b) F0 contour and (c) SoE contour of excitation sequence, and (d) Synthesized speech waveform (x13), for a sustained apical trill-approximant pair (trill: 0-2 sec, approximant: 3.5-5.5 sec). Source information is changed (system only retained) in synthesized speech.
4.4 (a) Signal waveform, (b) F0 contour and (c) SoE contour of excitation sequence, and (d) Synthesized speech waveform (x11), for the sustained apical trill-approximant pair (trill: 0-2 sec, approximant: 3.5-5.5 sec). Both system and source information that of original speech are retained in synthesized speech.
4.5 Illustration of strictures for voiced sounds: (a) stop, (b) trill, (c) fricative and (d) approximant. Relative difference in the stricture size between upper articulator (teeth or alveolar/palatal/velar regions of palate) and lower articulator (different areas of tongue) is shown schematically, for each case. Arrows indicate the direction of air flow passing through the vocal tract.
4.6 Illustration of open/closed phase durations, using (a) EGG signal and (b) differenced EGG signal for the vowel [a].
4.7 Illustration of waveforms of (a) input speech signal, (b) EGG signal and (c) dEGG signal, and the α contour for geminated occurrence of apical trill ([r]) in vowel context [a]. The sound is for [arra], produced in female voice.
4.8 Illustration of waveforms of (a) input speech signal, (b) EGG signal and (c) dEGG signal, and the α contour for geminated occurrence of apical lateral approximant ([l]) in vowel context [a]. The sound is for [alla], produced in female voice.
4.9 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal, (c) dEGG signal and (d) ZFF output, and features (e) F0, (f) SoE, (g) FD1 ("•") and FD2 ("◦") for geminated occurrence of apical trill ([r]) in vowel context [a]. The sound is for [arra], produced in male voice.
4.10 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal, (c) dEGG signal and (d) ZFF output, and features (e) F0, (f) SoE, (g) FD1 ("•") and FD2 ("◦") for geminated occurrence of alveolar fricative ([z]) in vowel context [a]. The sound is for [azza], produced in male voice.
4.11 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal, (c) dEGG signal and (d) ZFF output, and features (e) F0, (f) SoE, (g) FD1 ("•") and FD2 ("◦") for geminated occurrence of velar fricative ([È]) in vowel context [a]. The sound is for [aÈÈa], produced in male voice.
4.12 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal, (c) dEGG signal and (d) ZFF output, and features (e) F0, (f) SoE, (g) FD1 ("•") and FD2 ("◦") for geminated occurrence of apical lateral approximant ([l]) in vowel context [a]. The sound is for [alla], produced in male voice.
4.13 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal, (c) dEGG signal and (d) ZFF output, and features (e) F0, (f) SoE, (g) FD1 ("•") and FD2 ("◦") for geminated occurrence of alveolar nasal ([n]) in vowel context [a]. The sound is for [anna], produced in male voice.
4.14 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal, (c) dEGG signal and (d) ZFF output, and features (e) F0, (f) SoE, (g) FD1 ("•") and FD2 ("◦") for geminated occurrence of velar nasal ([N]) in vowel context [a]. The sound is for [aNNa], produced in male voice.
5.1 (a) Signal waveform (xin[n]), (b) EGG signal (ex[n]), (c) LP residual (rx[n]) and (d) glottal pulse obtained from LP residual (grx[n]) for a segment of normal speech. Note that all plots are normalized to the range of -1 to +1.
5.2 (a) Signal waveform (xin[n]), (b) EGG signal (ex[n]), (c) LP residual (rx[n]) and (d) glottal pulse obtained from LP residual (grx[n]) for a segment of shouted speech. Note that all plots are normalized to the range of -1 to +1.
5.3 HNGD spectra along with signal waveforms for a segment of (a) soft, (b) normal, (c) loud and (d) shouted speech. The segment is for the word 'help' in the utterance of the text 'Please help!'. Arrows point to the low frequency regions.
5.4 Energy of HNGD spectrum in low frequency (0-400 Hz) region (LFSE) for a segment of (a) soft, (b) normal, (c) loud and (d) shouted speech. The segment is for the word 'help' in utterances of text 'Please help!'. The vowel regions (V) are marked in these figures.
5.5 Energy of HNGD spectrum in high frequency (800-5000 Hz) region (HFSE) for a segment of (a) soft, (b) normal, (c) loud and (d) shouted speech. The segment is for the word 'help' in utterances of text 'Please help!'. The vowel regions (V) are marked in these figures.
5.6 Distribution of high frequency spectral energy (HFSE) vs low frequency spectral energy (LFSE) of HNGD spectral energy computed in 4 different vowel contexts for the 4 loudness levels. The 4 vowel region contexts are: (a) vowel /e/ in word 'help', (b) vowel /6/ in word 'stop', (c) vowel /u/ in word 'you' and (d) vowel /o:/ in word 'go'. The segments are taken from the utterances by same speaker (S4). (Color online)
5.7 Relative spread of low frequency spectral energy (LFSE) of 'HNGD spectra' computed over a vowel region segment (LFSE(vowel)), for the 4 loudness levels, i.e., soft, normal, loud and shout. The segment is for the vowel /e/ in the word 'help' in the utterance of the text 'Please help!'.
5.8 Relative spread of low frequency spectral energy (LFSE) of 'Short-time spectra' computed over a vowel region segment (LFSE(vowel)), for the 4 loudness levels, i.e., soft, normal, loud and shout. The segment is for the vowel /e/ in the word 'help' in the utterance of the text 'Please help!'.
5.9 Illustration of (a) signal waveform, (b) F0 contour, (c) SoE contour and (d) FD contour for a segment "you" of normal speech in male voice.
5.10 Illustration of (a) signal waveform, (b) F0 contour, (c) SoE contour and (d) FD contour for a segment "you" of shouted speech in male voice.
6.1 Illustration of (a) speech signal, and (b) α, (c) F0 and (d) SoE contours for three calls in a nonspeech-laugh bout after utterance of text "it is a good joke", by a female speaker.
6.2 Illustration of (a) speech signal, and (b) α, (c) F0 and (d) SoE contours for a laughed-speech segment of the utterance of text "it is a good joke", by a female speaker.
6.3 Illustration of (a) signal waveform (xin[n]), and (b) EGG signal ex[n], (c) dEGG signal dex[n] and (d) αdex contours along with V/NV regions (dashed lines). The segment consists of calls in a nonspeech-laugh bout (marked 1-4 in (d)) and a laughed-speech bout (marked 5-8 in (d)) for the text "it is really funny", produced by a male speaker.
6.4 (Color online) Illustration of inter-call changes in the average values of ratio α and F0, for 4 calls each in a nonspeech-laugh bout (solid line) and a laughed-speech bout (dashed line), produced by a male speaker and by a female speaker: (a) αave for NSL/LS male calls, (b) αave for NSL/LS female calls, (c) F0ave for NSL/LS male calls, and (d) F0ave for NSL/LS female calls.
6.5 Illustration of (a) acoustic signal waveform (xin[n]), (b) the output (y2[n]) of cascaded pairs of ZFRs, (c) modified ZFF (modZFF) output (zx[n]), and (d) voiced/nonvoiced (V/NV) regions (voiced marked as 'V'), for calls in a nonspeech-laugh bout of a female speaker.
6.6 Illustration of (a) signal (xin[n]), and (b) modZFF output (zx[n]), (c) Hilbert envelope of modZFF output (hz[n]), and (d) EGG signal (ex[n]) for a nonspeech-laugh call, by a female speaker.
6.7 Illustration of (a) signal (xin[n]), and contours of (b) F0, (c) SoE (ψ), and (d) FD1 ("•") and FD2 ("◦") with V/NV regions (dashed lines), for calls in a nonspeech-laugh bout of a male speaker.
6.8 Illustration of few glottal cycles of (a) acoustic signal (xin[n]), (b) EGG signal ex[n] and (c) modified ZFF output signal zx[n], for a nonspeech-laugh call produced by a male speaker.
6.9 Illustration of (a) acoustic signal (xin[n]), and spectrograms of (b) signal, (c) epoch sequence (using the modified ZFF method) and (d) sequence of impulses at all (negative to positive going) zero-crossings of zx[n] signal, for few nonspeech-laugh calls produced by a male speaker.
6.10 Illustration of changes in the temporal measure for dI, i.e., φ, for NSL and LS calls. (a) Acoustic signal (xin[n]). (b) φ for NSL and LS calls, i.e., regions 1-4 and 5-8, respectively. The signal segment is for the text "it is really funny" produced by a male speaker.
6.11 (Color online) Illustration of distribution of FD2 vs FD1 for nonspeech-laugh ("•") and laughed-speech ("◦") bouts of a male speaker. The points are taken at GCIs in respective calls.
6.12 Illustration of (a) input acoustic signal (xin[n]) and few (b) peaks of Hilbert envelope of LP residual (hp) for 3 cases: (i) normal speech, (ii) laughed-speech and (iii) nonspeech-laugh.
7.1 Results of the ZFF method for different window lengths for trend removal for a segment of voiced speech. Epoch locations are indicated by inverted arrows.
7.2 (a) Saliency plot of the AM pulse train and (b) the synthetic AM sequence.
7.3 Saliency plots ((a),(c),(e),(g)) of the synthetic AM pulse train and the epoch sequences ((b),(d),(f),(h)) derived by using different window lengths for trend removal: 7 ms ((a),(b)), 3 ms ((c),(d)) and 1 ms ((e),(f)). In (g) and (h) are the saliency plot for AM sequence and the cleaned SoE sequence for 1 ms window length, respectively.
7.4 (a) Saliency plot of the FM pulse train and (b) the synthetic FM sequence.
7.5 Saliency plots ((a),(c),(e),(g)) of the synthetic FM pulse train and the epoch sequences ((b),(d),(f),(h)) derived by using different window lengths for trend removal: 7 ms ((a),(b)), 3 ms ((c),(d)) and 1 ms ((e),(f)). In (g) and (h) are the saliency plot for FM sequence and the cleaned SoE sequence for 1 ms window length, respectively.
7.6 Illustration of waveforms of (a) input speech signal, (b) LP residual, (c) Hilbert envelope of LP residual and (d) modZFF output, and (e) the SoE impulse sequence derived using the modZFF method. The speech signal is a segment of Noh singing voice used in Fig. 3 in [55].
7.7 Illustration of waveforms of (a) input speech signal, (b) preprocessed signal and (c) modZFF output, and (d) the SoE impulse sequence derived using the modZFF method. The speech signal is a segment of Noh singing voice used in Fig. 3 in [55].
7.8 Illustration of waveforms of speech signal (in (a)), modZFF outputs (in (b),(d),(f),(h)) and SoE impulse sequences (in (c),(e),(g),(i)), for the choice of last window lengths as 2.5 ms, 2.0 ms, 1.5 ms and 1.0 ms. The speech signal is a segment of Noh voice used in Fig. 3 in [55].
7.9 Selection of last window length: Difference (∆Nimps) (%) in the number of impulses obtained with/without preprocessing vs choice of last window length (wlast) (ms), for 3 different segments of Noh singing voice [55]. [Solid line: segment1, dashed line: segment2, dotted line: segment3]
7.10 (a) Saliency plot and (b) the SoE impulse sequence derived using modZFF method (last window length = 1 ms), for the input (synthetic) AM sequence.
7.11 (a) Saliency plot and (b) the SoE impulse sequence derived using modZFF method (last window length = 1 ms), for the input (synthetic) FM sequence.
7.12 (a) Signal waveform and spectrograms of (b) signal, its (c) LP residual and (d) SoE impulse sequence obtained using the modZFF method, for a Noh voice segment corresponding to Fig. 1 in [55].
7.13 (a) Signal waveform and spectrograms of (b) signal, its (c) LP residual and (d) SoE impulse sequence obtained using the modZFF method, for a Noh voice segment corresponding to Fig. 2 in [55].
7.14 (a) Signal waveform and spectrograms of (b) signal, its (c) LP residual and (d) SoE impulse sequence obtained using the modZFF method, for a Noh voice segment corresponding to Fig. 3 in [55].
7.15 Expanded (a) signal waveform, and spectrograms of (b) signal and its (e) SoE impulse sequence obtained using the modZFF method, for Noh voice segment corresponding to Fig. 3 in [55].
7.16 (a) Signal waveform, and spectrograms of (b) signal, and its decomposition into (c) source characteristics and (d) system characteristics, for a Noh voice segment corresponding to Fig. 3 in [55].
7.17 Saliency plots computed with LP residual (in (a),(d),(g)), using XSX method (copied from [55]) (in (b),(e),(h)), and computed with SoE impulse sequence derived using the modZFF method (in (c),(f),(i)). The signal segments S1, S2 and S3 correspond, respectively, to the vowel regions [o:] (Fig. 6 in [55]), [i] (Fig. 7 in [55]), and [o] (with pitch rise) (Fig. 8 in [55]) in Noh singing voice [55].
7.18 (a) FM sequence excitation, (b) synthesized output and (c) derived SoE impulse sequence.
7.19 (a) AM sequence excitation, (b) synthesized output and (c) derived SoE impulse sequence.
7.20 Illustration of (a) speech signal waveform, (b) SoE impulse sequence derived using the modZFF method and (c) F0 contour extracted using the saliency information. The voice signal corresponds to the vowel region [o] (with pitch rise) in Noh singing voice (Ref. Fig. 8 in [55]).
8.1 Schematic block diagram of prototype shout detection system
8.2 Schematic diagram for decision criteria (d5, d6, d7) using the direction of change in gradients of (a) F0, (b) SoE and (c) FD contours, for the decision of (d) shout candidate segments 1 & 2 (d5), 3 & 4 (d6), and 5 & 6 (d7).

List of Tables

4.1 Criterion for similarity score for perceptual evaluation of two trill sounds (synthesized and original speech)
4.2 Experiment 1: Results of perceptual evaluation. Average similarity scores between synthesized speech files (x11, x12, x13 and x14) and original speech file (x10) are displayed.
4.3 Experiment 2: Results of perceptual evaluation. Average similarity scores between each place of articulation in synthesized speech files (x21, x22, x23 and x24), and corresponding sound in original speech file (x20) are displayed.
4.4 Comparison between sound types based on stricture differences for geminated occurrences. Abbreviations: alfric: alveolar fricative [z], vefric: velar fricative [È], approx/appx: approximant [l], frics: fricatives ([z], [È]), alnasal: alveolar nasal [n], venas: velar nasal [N], stric: stricture, H/L indicates relative degree of low stricture.
4.5 Changes in glottal source features F0 and SoE (ψ) for 6 categories of sounds (in male voice). Column (a) is F0 (Hz) for vowel [a], (b) and (c) are F0min and F0max for the specific sound, and (d) is ∆F0/F0[a] (%). Column (e) is SoE (i.e., ψ) for vowel [a], (f) and (g) are ψmin and ψmax for the specific sound, and (h) is ∆ψ/ψ[a] (%). Sl.# are the 6 categories of sounds. Suffixes a, b and c in the first column indicate single, geminated or prolonged occurrences, respectively. Note: 'alfric'/'vefric' denotes alveolar/velar fricative and 'alnasal'/'venasal' denotes alveolar/velar nasal.
4.6 Changes in vocal tract system features FD1 and FD2 for 6 categories of sounds (in male voice). Column (a) is FD1 (Hz) for vowel [a], (b) and (c) are FD1min and FD1max for the specific sound, and (d) is ∆FD1/FD1[a] (%). Column (e) is FD2 (Hz) for vowel [a], (f) and (g) are FD2min and FD2max for the specific sound, and (h) is ∆FD2/FD2[a] (%). Sl.# are the 6 categories of sounds. Suffixes a, b and c in the first column indicate single, geminated or prolonged occurrences, respectively. Note: 'alfric'/'vefric' denotes alveolar/velar fricative and 'alnasal'/'venasal' denotes alveolar/velar nasal.
4.7 Changes in features due to effect of acoustic loading of the vocal tract system on the glottal vibration, for 6 categories of sounds (in male voice). Columns (a)-(d) show percentage changes in F0, SoE, FD1 and FD2, respectively. The direction of change in a feature in comparison to that for vowel [a] is marked with +/- sign. Note: 'alfric'/'vefric' denotes alveolar/velar fricative and 'alnasal'/'venasal' denotes alveolar/velar nasal.
4.8 Changes in features due to effect of acoustic loading of the vocal tract system on the glottal vibration, for 6 categories of sounds (in female voice). Columns (a)-(d) show percentage changes in F0, SoE, FD1 and FD2, respectively. The direction of change in a feature in comparison to that for vowel [a] is marked with +/- sign. Note: 'alfric'/'vefric' denotes alveolar/velar fricative and 'alnasal'/'venasal' denotes alveolar/velar nasal.
5.1 The percentage change (∆α) in the average values of α for soft, loud and shout with respect to that of normal speech is given in columns (a), (b) and (c), respectively. Note: Si below means speaker number i (i = 1 to 17).
5.2 The ratio (β) of the average levels of LFSE and HFSE computed over a vowel segment of a speech utterance in (a) soft, (b) normal, (c) loud and (d) shout modes. Note: The numbers given in columns (a) to (d) are β values multiplied by 100 for ease of comparison.
5.3 Average values of standard deviation (σ), capturing temporal fluctuations in LFSE, computed over a vowel segment of a speech utterance in (a) soft, (b) normal, (c) loud and (d) shout modes. Note: The numbers given in columns (a) to (d) are σ values multiplied by 1000 for ease of comparison.
5.4 The percentage change (∆F0) in the average F0 for soft, loud and shout with respect to that of normal speech is given in columns (a), (b) and (c), respectively. Note: Si below means speaker number i (i = 1 to 17).
5.5 The average values of the ratio (α) of the closed phase to the glottal period for (a) normal, (b) raised pitch (non-shout) and (c) shouted speech, respectively. Columns (d), (e) and (f) are the corresponding average fundamental frequency (F0) values in Hz. The values are averaged over 3 utterances (for 3 texts) for each speaker. Note: Si below means speaker number i (i = 1 to 5). F0 values are rounded to nearest integer value.
5.6 Results to show changes in the average F0 and SoE values for normal and shouted speech, for 5 different vowel contexts. Notations: Nm indicates Normal, Sh indicates Shout, S# indicates Speaker number, T# indicates Text number and M/F indicates Male/Female. Note: IPA symbols for the vowels in English phonetics are shown for the vowels used in this study.
5.7 Results to show changes in the Dominant frequency (FD) values for normal and shouted speech, for 5 different vowel contexts. Notations: Nm indicates Normal, Sh indicates Shout, S# indicates Speaker number, T# indicates Text number and M/F indicates Male/Female. Note: IPA symbols for the vowels in English phonetics are shown for the vowels used in this study.
6.1 Changes in α and F0EGG for laughed-speech and nonspeech-laugh, with reference to normal speech. Columns (a)-(c) are average α, (d)-(f) are σα, (g)-(i) are average βα and (l)-(n) are average F0EGG (Hz) for NS, LS and NSL. Columns (j), (k) are ∆βα (%) and (o), (p) are ∆F0EGG (%) for LS and NSL cases. Note: Si indicates speaker number (i = 1 to 11) and M/F male/female.
6.2 Changes in F0ZFF and temporal measure for F0 (i.e., θ) for laughed-speech and nonspeech-laugh, with reference to normal speech. Columns (a)-(c) are average F0ZFF (Hz), (d)-(f) are σF0 (Hz), (g)-(i) are average γ1 and (l)-(n) are average θ values for NS, LS and NSL. Columns (j), (k) are ∆γ1 (%) and (o), (p) are ∆θ (%) for LS and NSL cases. Note: Si indicates speaker number (i = 1 to 11) and M/F male/female.
6.3 Changes in dI and temporal measure for dI (i.e., φ) for laughed-speech and nonspeech-laugh, with reference to normal speech. Columns (a)-(c) are average dI (Imps/sec), (d)-(f) are σdI (Imps/sec), (g)-(i) are average γ2 and (l)-(n) are average φ values for NS, LS and NSL. Columns (j), (k) are ∆γ2 (%) and (o), (p) are ∆φ (%) for LS and NSL cases. Note: Si indicates speaker number (i = 1 to 11) and M/F male/female.
6.4 Changes in SoE (i.e., ψ) and temporal measure for SoE (i.e., ρ) for laughed-speech and nonspeech-laugh, with reference to normal speech. Columns (a)-(c) are average ψ, (d)-(f) are σψ, (g)-(i) are average γ3 and (l)-(n) are average ρ values for NS, LS and NSL. Columns (j), (k) are ∆γ3 (%) and (o), (p) are ∆ρ (%) for LS and NSL cases. Note: Si indicates speaker number (i = 1 to 11) and M/F male/female.
6.5 Changes in FD1ave and σFD1 for laughed-speech (LS) and non-speech laugh (NSL) in comparison to those for normal speech (NS). Columns (a),(b),(c) are FD1ave (Hz) and (d),(e),(f) are σFD1 (Hz) for the three cases NS, LS and NSL. Columns (g),(h),(i) are the average ν1 values computed for NS, LS and NSL, respectively. Columns (j) and (k) are ∆ν1 (%) for LS and NSL, respectively.
Note: Si below means speaker number i (i = 1 to 11), and M/F indicates male/female.
6.6 Changes in average η and ση for laughed speech (LS) and nonspeech laugh (NSL) with reference to normal speech (NS). Columns (a)-(c) are average η, (d)-(f) are ση and (g)-(i) are average ξ values for NS, LS and NSL. Columns (j) and (k) are ∆ξ (%) for LS and NSL, respectively. Note: Si indicates speaker number (i = 1 to 11) and M/F male/female.
7.1 Effect of preprocessing on number of impulses: (a) Last window length (wlast) (ms), #impulses obtained (b) without preprocessing (Norig), (c) with preprocessing (Nwpp), and (d) difference ∆Nimps = (Norig − Nwpp)/Norig (%). The 3 Noh voice segments correspond to Figures 6, 7 and 8 in [55].
8.1 Results of performance evaluation of shout detection: number of speech regions (a) as per ground truth (GT), (b) detected correctly (TD), (c) (shout) missed detection (MD) and (d) wrongly detected as shouts (WD), and rates of (e) true detection (TDR), (f) missed detection (MDR) and (g) false alarm (FAR). Note: CS is concatenated, NCS is natural continuous and MixS is mixed speech.

Abbreviations

AM - Amplitude modulation
CELP - Code-excited linear predictive coding
Codec - Coder-decoder
dEGG - Differenced electroglottograph
DFT - Discrete Fourier transform
EGG - Electroglottograph
F0 - Instantaneous fundamental frequency
FAR - False alarm rate
FM - Frequency modulation
GCI - Glottal closure instant
GMM - Gaussian mixture model
HE - Hilbert envelope
HFSE - High-frequency band spectral energy
HMM - Hidden Markov model
HNGD - Hilbert envelope of double differenced NGD spectrum
IF - Instantaneous frequency
IDFT - Inverse discrete Fourier transform
LFSE - Low-frequency band spectral energy
LP - Linear prediction
LPCs - Linear prediction coefficients
MDR - Missed detection rate
modZFF - Modified zero-frequency filtering
MPE - Multi-pulse excitation
NGD - Numerator of group delay function
PDF - Probability density function
RPE - Regular-pulse excitation
SPE - Single-pulse excitation
SPE-CELP - Single-pulse excitation for CELP coding
Std dev - Standard deviation
STFT - Short-time Fourier transform
SVM - Support vector machine
TDR - True detection rate
ZFF - Zero-frequency filtering
ZFR - Zero-frequency resonator
ZTL - Zero-time liftering

Chapter 1

Issues in Analysis of Nonverbal Speech Sounds

1.1 Verbal and nonverbal sounds

Human speech can be broadly categorized into verbal and nonverbal speech. Verbal speech is associated with a language to convey some message. Since a clear description of articulation exists for these sounds, they have reproducible characteristics. Some verbal sounds have complex articulation, making analysis of their production characteristics from the speech signal both interesting and challenging. For example, trills and some consonant clusters (e.g., fricatives, nasals) are interesting sounds for analysis.

Nonverbal speech sounds convey mostly nonlinguistic information, although in some cases they may also carry a linguistic message. Production of these sounds is mostly involuntary and spontaneous, and hence they do not have any clear description of articulation. These sounds can be divided into three categories: emotional speech, paralinguistic sounds and expressive voices. Emotional speech communicates the linguistic message along with emotions such as anger (shout), happiness, sadness, fear and surprise.
Analysis of emotional speech involves characterization of the emotion and extraction of the linguistic message. Paralinguistic sounds refer to sounds such as laughter, cry, cough, sneeze and yawn. These sounds are produced involuntarily, and hence are not easy to describe in terms of their production characteristics. They occur mostly in isolation, but may sometimes be interspersed with normal speech, and their production deviates significantly from that of normal speech. Expressive voices are sounds produced by specially trained voices, such as in opera singing or Noh voice, and are mostly artistic. Production of these voices is controlled voluntarily, and involves careful control mostly of the excitation component of the speech production mechanism.

It is indeed a challenging task to analyse the production characteristics of some types of verbal sounds and almost all types of nonverbal sounds. This is because it is difficult to separate the excitation source and the vocal tract system components of the speech production process from the speech signal. Also, the variations in production are due to transient phenomena; the production characteristics are therefore time varying and hence nonstationary. Moreover, complex interaction between the source and the system may occur during production of these sounds. In some cases, the vibration characteristics at the glottis may be affected by the acoustic loading of the vocal tract system. Similarly, the resonance characteristics of the vocal tract system may be affected by the vibration characteristics at the glottis. It is observed that significant changes from normal speech occur in the production characteristics of nonverbal sounds, especially in the excitation source. Hence determining the features characterizing the nonverbal speech sounds is a challenging task. But these features are needed in practice, especially for spotting the nonverbal sounds in continuous speech in particular, and in audio data in general.

Sounds from the following four categories are considered for detailed comparative analysis of production characteristics from speech signals: normal speech, emotional speech, paralinguistic sounds and expressive voices.

Production of normal (voiced) speech involves quasi-periodic vibration of the vocal folds. Coupling of the excitation source and the vocal tract system plays an important role in the production of specific normal speech sounds like trills. Tongue-tip trilling affects the glottal vibration, due to changes in the pressure difference across the glottis. The acoustic loading effects on the glottal vibration due to stricture in the vocal tract are also examined for a few other sounds, to compare the effects of the type and extent of stricture. The voiced sounds considered for this study are: apical trill, apical lateral approximant, alveolar fricative, velar fricative, alveolar nasal and velar nasal. Distinctive features of the production of trills are explored for detecting trill regions in continuous speech.

In the production of emotional speech, the characteristics of normal speech sounds are modified, sometimes to a significant extent. This leads to the challenge of isolating the features corresponding to the linguistic message part and the emotion part. Among the several emotions and affective states, the case of shouted speech is considered for detailed study.
It is expected that shouted speech affects both the excitation source and the vocal tract system components. These effects are studied for four different levels of loudness, to examine the characteristics of shouted speech in relation to soft, normal and loud voices. The production characteristics of shouted speech are also examined to determine methods to extract relevant features from the speech signal. These features not only help in characterizing the deviation of shouted speech from normal speech, but may also help in identifying the regions of shouted speech in continuous speech. This study may help in examining speech mixed with other types of emotions in a similar manner.

Production of paralinguistic sounds involves significant changes in the glottal vibration, with accompanying changes in the vocal tract system characteristics. We consider laughter sounds for detailed study in this thesis. In laughter, significant changes occur mainly in the excitation, due to an involuntary burst of activity. Production characteristics of laughter (nonspeech-laugh) are studied in relation to laughed-speech and normal speech. Analysis of laughter sounds helps in determining the unique features in the production of laughter, which in turn help in spotting regions of laughter in continuous speech or in audio data. Synthesis of laughter helps in understanding the significance of different features in the production of laughter.

Expressive voices like opera singing or Noh voice convey intense emotions, but have little linguistic content. The excitation source seems to play a major role in the production of these sounds. Noh voice is chosen for detailed study of the characteristics of the excitation source. The rapid voluntary changes in the glottal vibration result in aperiodicity in the excitation. These changes also result in the perception of subharmonic components in these artistic voices. Modeling the aperiodicities in the excitation and extracting these characteristics from the signals is a challenging problem in signal processing, as the resulting model should work satisfactorily for all the cases of nonverbal speech sounds. Hence, two synthetic impulse sequences modeling the amplitude modulation and frequency modulation effects are used to examine changes in the excitation source characteristics.

1.2 Signal processing and other issues in the analysis of nonverbal sounds

Various research issues and challenges unique to the analysis of nonverbal speech sounds can be grouped as related to: (a) production-specific nature, (b) databases and ground truth, (c) spotting and classification, and (d) signal processing issues. Production-specific challenges are related to differences in spontaneity, control over production and the extent of pre-meditation required before producing these nonverbal sounds. Database-related issues pertain to the continuum nature of these sounds (i.e., no clear boundaries separating them), the quality of emoting (relevant for data collection), and the absence of ground truth (reference) in most cases. Issues related to classification and spotting of emotions are: how to (i) discriminate between normal and nonverbal speech, (ii) spot nonverbal sounds in continuous speech, (iii) identify their type/category, and (iv) assess the degree of confidence in classifying these emotions.
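The synthetic impulse sequences with amplitude modulation (AM) and frequency modulation (FM) mentioned at the end of Section 1.1 can be constructed very simply. A minimal sketch is given here, before the signal processing issues are taken up in detail; the sampling rate, modulation rates and depths, and function names are illustrative assumptions, not the exact settings used in the studies reported later in this thesis.

# Illustrative sketch only: synthetic impulse trains with amplitude modulation
# (AM) and frequency modulation (FM). All parameter values are assumed.
import numpy as np

def am_impulse_train(fs=8000, dur=1.0, f0=100.0, am_rate=5.0, am_depth=0.5):
    """Impulse train with a fixed period and sinusoidally modulated amplitudes."""
    n = int(dur * fs)
    seq = np.zeros(n)
    period = int(round(fs / f0))
    for i in range(0, n, period):
        t = i / fs
        seq[i] = 1.0 + am_depth * np.sin(2 * np.pi * am_rate * t)
    return seq

def fm_impulse_train(fs=8000, dur=1.0, f0=100.0, fm_rate=5.0, fm_depth=20.0):
    """Impulse train with unit amplitudes and a sinusoidally modulated local F0."""
    n = int(dur * fs)
    seq = np.zeros(n)
    i = 0
    while i < n:
        seq[i] = 1.0
        t = i / fs
        inst_f0 = f0 + fm_depth * np.sin(2 * np.pi * fm_rate * t)  # local F0 in Hz
        i += max(1, int(round(fs / inst_f0)))
    return seq

In the AM train the inter-impulse interval stays fixed while the impulse strengths vary slowly; in the FM train the strengths stay fixed while the intervals vary. These two controlled deviations mimic, in isolation, the amplitude and interval irregularities that occur together in real nonverbal sounds.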
Signal processing issues are mainly related to extracting the characteristics of the excitation source from the acoustic signal, and representing it in terms of a time-domain sequence of impulse-like pulses. Differences between nonverbal and normal speech sounds are reflected in their production characteristics. Production of nonverbal sounds occurs in short bursts of time and involves significant changes in the glottal source of excitation, though changes in the characteristics of the vocal tract system are also likely. Hence, differences in their production characteristics need to be studied in detail. As for normal speech, the production characteristics of these nonverbal sounds can also be derived from the acoustic signal. Their analysis involves examining changes in the glottal vibration characteristics, some of which can be examined better from the EGG signal. The problem is complex, because the signal processing methods that work well for normal speech in modal voicing exhibit limitations in the case of these nonverbal sounds. Thus, the first challenge is: (i) how to derive the excitation source characteristics from the acoustic signal, especially for nonverbal speech sounds? Impulse-sequence representation of the excitation source component of the speech signal has been of considerable interest in speech coding research in the past three decades. Different speech coding methods, such as waveform coders, linear prediction based source coders (vocoders) and analysis-by-synthesis based hybrid codecs, had attempted it in different ways. Representation of the source information was attempted using multi-pulse [8, 6, 174, 26], regular-pulse [94], or stochastically generated code-excited linear predictive [165] excitation impulse sequences. The presence of secondary impulses within a glottal cycle was also indicated in some studies [8, 6, 26, 165]. This representation was aimed at achieving low bit-rates in coding and synthesis of natural-sounding speech in speech coders. But its role and possible merits in the analysis of nonverbal speech sounds have not been explored yet. Hence, the important question is: (ii) can we represent the excitation source information in terms of a time-domain sequence of excitation impulse-like pulses for these nonverbal sounds also? The differences in various categories of sounds are possibly caused by differences in the locations of these excitation impulse-like pulses in the sequence and their relative amplitudes. For example, in the case of fricative sounds, these equivalent impulses may occur at random intervals with amplitudes of low strength. In the case of vowels or vowel-like regions in modal voicing, these impulse-like pulses occur at nearly regular intervals, with smooth changes in their amplitudes. But in the case of nonverbal speech sounds, these impulse-like pulses are likely to occur at rapidly changing or nearly random intervals, with sudden changes in their amplitudes. This is manifested as rapid changes in the pitch-perception. In the case of expressive voices like Noh, the aperiodicity in the excitation component is an important feature, which may be attributed to unequal intervals between successive impulses and unequal strengths of excitation around these. A measure of aperiodicity in the excitation component could possibly be based on the weighted mean-squared error of the reconstructed signal, perceptual impressions, or the saliency of the pitch perception.
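As a concrete illustration of such an impulse-sequence representation, the following is a minimal Python sketch of two synthetic excitation impulse trains, one with amplitude-modulated impulse strengths and one with a frequency-modulated impulse rate, of the kind referred to above as synthetic AM/FM sequences. The function names and parameter values (F0 = 120 Hz, 5 Hz modulation) are illustrative assumptions of this sketch, not the sequences actually used later in the thesis.

import numpy as np

def am_impulse_train(fs=8000, dur=1.0, f0=120.0, am_rate=5.0, am_depth=0.5):
    """Quasi-periodic impulse train with amplitude-modulated impulse strengths."""
    n_pulses = int(dur * f0)
    locs = np.round(np.arange(n_pulses) * fs / f0).astype(int)        # regular epochs (GCI-like)
    amps = 1.0 + am_depth * np.sin(2 * np.pi * am_rate * locs / fs)   # slow AM of the strengths
    seq = np.zeros(int(dur * fs))
    seq[locs] = amps
    return seq

def fm_impulse_train(fs=8000, dur=1.0, f0=120.0, fm_rate=5.0, fm_depth=30.0):
    """Impulse train whose local rate (instantaneous F0) is frequency modulated;
    fm_depth = 0 gives the regular train of modal voicing."""
    seq = np.zeros(int(dur * fs))
    t = 0.0
    while t < dur:
        seq[int(t * fs)] = 1.0
        inst_f0 = f0 + fm_depth * np.sin(2 * np.pi * fm_rate * t)     # time-varying F0
        t += 1.0 / inst_f0                                            # next epoch one local period later
    return seq

Varying am_depth and fm_depth perturbs the amplitudes and locations of the impulses in a controlled way, which is exactly the kind of variation that a pitch-perception or aperiodicity measure of the sort discussed above would need to capture.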
Hence, in order to characterize and measure changes in the perception of pitch, which could be rapid in the case of expressive voices, the important question is: (iii) how to determine the locations and amplitudes of the equivalent impulse-like pulses that characterise and represent the excitation source information, in accordance with some measure of the pitch perception, especially for nonverbal sounds? Related issues here are the extraction of F0 in the regions of aperiodicity, and obtaining the sequence of impulses from the information of pitch-perception, for expressive voices. In the production of different sounds by humans, amplitude modulation (AM) and frequency modulation (FM) play a major role in the voluntary control of the pitch of the sound or in involuntary changes in it, for example in singing or for trill sounds, respectively. These are obviously related to changes in the characteristics of both the excitation source and the vocal tract system, due to source-system coupling. Their effect on the pitch-perception could be significant in the case of these nonverbal sounds. Hence, the effect of differences in the locations of excitation impulses and their relative amplitudes needs to be studied in the AM/FM context for these sounds. Representation of the excitation source characteristics in terms of an impulse sequence would also facilitate its manipulation for diverse applications such as synthesis of emotional speech, spotting of these sounds in continuous speech and classification of the emotion categories. The key question here is: (iv) how to extract production features/parameters from the acoustic signal that help in spotting these sounds in continuous speech? A related issue is the significance of these features, both from the perception and synthesis points of view. The four key challenges in the context of nonverbal speech sounds can be summarised as: how to (a) derive the excitation source characteristics from the acoustic signal, (b) represent this source information in terms of a time-domain sequence of impulse-like pulses, (c) characterize these sounds, and (d) extract features for spotting these in continuous speech, along with the significance of these features from both the synthesis and perception points of view? In this research work, the objective is to investigate these four key questions and possibly answer some of them.

1.3 Objective of the studies in this thesis

Human speech sounds can be broadly divided into voiced and unvoiced sounds [102, 99, 37]. Voicing involves excitation of the vocal tract system by the vibration of the vocal folds at the glottis [48, 45]. Unvoiced sounds have either insignificant, random or no excitation at all by the glottal source. Voiced sounds are further classified into verbal and nonverbal. Verbal speech sounds have a higher degree of periodicity of glottal vibration, in modal voicing [99, 48]. On the other hand, nonverbal sounds are better characterised by aperiodicity and rapid changes in F0 and its harmonics/subharmonics, which are manifested as rapidly changing pitch-perception. It is difficult to measure aperiodicity and pitch-perception. A measure of the rapidness of changes in F0 is an even more difficult problem. In this work, the aim is to examine these nonverbal speech sounds in terms of their production characteristics, predominantly the excitation source characteristics, and to examine the role of aperiodicity in their pitch-perception.
Emotional speech may be produced spontaneously or in a controlled manner (e.g., mimicry). But the production of paralinguistic sounds is mostly spontaneous and uncontrolled. Whereas the human producer of emotions is mostly conscious of their production, the production of paralinguistic sounds is rarely premeditated. Thus, the production of emotional speech or expressive voices involves voluntary control over the production mechanism, and possibly speaker training also. But the production of paralinguistic sounds like laughter involves only involuntary changes in it. The producer of paralinguistic sounds may sometimes not even be conscious of their production. Also, the source characteristics apparently play a larger role in the production of paralinguistic sounds than in emotional speech. Associated changes in the vocal tract system characteristics may also play a role, to different degrees, in both cases. Hence, selecting or developing appropriate signal processing methods that can accommodate this wider range of changes in the speech production mechanism is a major signal processing challenge here. Another major challenge in the study of these sounds is the nonavailability or limited availability of appropriate databases with ground truth. Most of the existing databases are application-specific or designed for a specific purpose. Collection of natural data is yet another research issue. Hence, this thesis is also aimed at addressing these database-related and signal processing challenges. Production of expressive voices like Noh involves voluntarily controlled changes in the speech production mechanism, which occur in producing the trained voice that the artist achieves through years of practice. But the production characteristics of paralinguistic sounds like natural laughter may involve involuntary changes. Hence, the role and effect of voluntary control in the speech production mechanism needs to be studied, first for normal speech and then possibly for expressive voices. The role of involuntary changes in the production characteristics of paralinguistic sounds like laughter needs to be examined in detail. The effect of source-system coupling also needs to be studied for some dynamic sounds like trills in normal speech, and for shouted (emotional) speech. The understanding of shouted speech is expected to help in better understanding of emotions like anger in speech sounds. There are some characteristics that are common to all types of nonverbal speech sounds. All are: (i) nonsustainable, i.e., occur for short bursts of time, (ii) nonnormal, i.e., are deviations from normal, (iii) form a continuum, i.e., are nondiscrete, and (iv) indicate humaneness, i.e., help in distinguishing between a human and a humanoid. Hence, it is possible to distinguish these sounds from normal speech with a better understanding of these production characteristics, which may also help in discriminating between a human and a machine. Characterising the nonverbal speech sounds may also possibly lead to the discovery of some speaker-specific identity or human-specific signature. Better understanding of the production characteristics of these sounds would help in a wide range of applications, such as their spotting in continuous speech, event detection, classification, speaker identification, man/machine discrimination and speech synthesis [161, 127, 63, 83, 101].
1.4 Analysis tools, methodology and scope

Significant changes occur in the excitation source characteristics during the production of nonverbal sounds, for which direct measurement is not feasible. Hence, analysis of these changes is required first, in order to characterise these sounds. Changes in the excitation source characteristics during the production of these sounds may be reflected in the glottal pulse shape characteristics, such as the open/closed phases in a glottal cycle, the rate of vibration of the vocal folds and their relative rate of opening/closing. Representing the source characteristics in terms of a time-domain sequence of impulses, whose relative amplitudes and locations capture the excitation information, has also been a research issue. Extracting some of these features from the acoustic signal may not always be adequate or feasible. But changes in these excitation characteristics are indeed perceived by human beings in the acoustic signal of the sounds produced. Hence, there is a need to explore and analyse these sounds using other signals as well, for example the electroglottograph (EGG) signal, throat microphone signal, magnetic resonance imaging etc. Extracting the excitation information from these signals poses another set of challenges. The focus in the analysis of nonverbal speech sounds is mostly on the changes in the glottal vibration, which is the main source of excitation of the vocal tract system. While some characteristics of the glottal vibration can be studied through auxiliary measurements such as EGG and high-speed video, the main characteristics as produced are present only in the acoustic output signal of the production system. Standard signal processing methods assume quasi-stationarity of the speech signal and extract the characteristics of the excitation source and the vocal tract system using segmental block-processing methods, like the discrete Fourier transform (DFT) and linear prediction (LP) analysis. These analysis tools can help in determining the source as either voiced or unvoiced, and the periodicity in the case of voiced sounds. They also help in describing the response of the vocal tract system in terms of either a smoothed short-time spectral envelope or the resonances (or formants) of the vocal tract. These block-processing methods have severe limitations due to time-frequency resolution issues. Moreover, they give only averaged characteristics of the source and the vocal tract system within the analysis segment, although it is well known that both the excitation and the vocal tract system vary continuously, mainly because of the opening and closing of the vocal folds at the glottis in each glottal cycle. In order to study the changes in glottal vibration, it is necessary to know the open and closed regions in each glottal cycle, besides the many secondary excitations due to the complexity of glottal vibration. Moreover, significant variations also occur in the strengths of major and minor excitations within each glottal cycle as well as across cycles. This results in significant aperiodicities in the glottal vibrations, which are due to either involuntary changes or voluntarily controlled changes in the excitation. In all such cases, deconvolution or inverse filtering of the vocal tract system response for extracting the source characteristics is not possible. Methods are required to extract the excitation source information directly from the signal, without prior estimation of the vocal tract system for inverse filtering.
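For reference, a minimal sketch of the segmental (block-processing) LP analysis mentioned above is given below: frame-wise autocorrelation, the Levinson-Durbin recursion, and the LP residual obtained by inverse filtering each frame. The frame length, hop and prediction order are illustrative choices, and the helper names are not from the thesis.

import numpy as np
from scipy.signal import lfilter

def levinson_durbin(r, order):
    """LP coefficients a = [1, a1, ..., ap] from autocorrelation lags r[0..order]."""
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                                # reflection coefficient
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a

def lp_residual(signal, fs, frame_ms=25.0, hop_ms=10.0, order=12):
    """Frame-wise LP analysis under the quasi-stationarity assumption; the residual
    approximates the excitation, but only as an average within each frame."""
    x = np.asarray(signal, dtype=float)
    flen, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    win = np.hamming(flen)
    residual = np.zeros(len(x))
    for start in range(0, len(x) - flen, hop):
        frame = x[start:start + flen] * win
        r = np.correlate(frame, frame, mode="full")[flen - 1:flen + order]
        if r[0] <= 0:                                 # skip silent frames
            continue
        a = levinson_durbin(r, order)
        residual[start:start + flen] += lfilter(a, [1.0], frame)  # overlap-add of frame residuals
    return residual

Because the LP coefficients are held fixed over each frame, the residual reflects only the average source behaviour within that frame, which is precisely the limitation noted above for rapidly varying nonverbal sounds.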
Several studies have been conducted in roughly the last decade to analyse and characterise emotional speech, paralinguistic sounds and expressive voices. Characteristics of basic emotions were analysed from the speech signal in [19, 20, 209, 169, 179, 149, 89, 154]. Emotion recognition and classification were carried out in [127, 167, 168, 120, 44, 156, 88]. Spotting and identification of emotions in human speech were attempted in [103, 207, 14, 208]. Characteristics of shouted speech and scream were examined in [146, 217, 219, 115]. Automatic detection of shout/scream in continuous speech was also attempted in [70, 199, 160, 132, 131, 202]. Characteristics of paralinguistic sounds like laughter were examined in [161, 83, 194, 195, 139]. Acoustic analysis of laughter was carried out in [11, 16, 134, 187, 111]. Development of systems for automatic detection of laughter in continuous speech was attempted in [83, 194, 195, 198, 84]. Expressive voices like Noh were analysed in [55, 214, 80, 78]. Analysis of Noh voice using a tool named TANDEM-STRAIGHT was carried out in [81, 79, 77]. Other phonetic variations of modal voicing, like trills etc., were also analysed in [40, 116, 98, 178, 155, 41]. The effects of source-system coupling and acoustic loading were studied in [189, 158, 191, 193, 108, 43]. However, the studies carried out so far on nonverbal speech sounds have focused mostly on spectral characteristics [183, 199, 132, 131]. Characteristics of the vocal tract system are examined mostly using spectral features like Mel-frequency cepstral coefficients (MFCCs), linear prediction coefficients (LPCs), perceptual linear prediction (PLP) and short-time spectra derived using the discrete Fourier transform (DFT) [199, 131, 183, 217, 160, 101, 195]. In contrast, in the production of nonverbal sounds, intuitively more changes appear to take place in the characteristics of the glottal source of excitation than in those of the vocal tract system, and these need to be examined in more detail. Representation of the excitation source information in the form of an impulse sequence is also convenient for manipulation and control of its features for different purposes. Limitations of the currently available signal processing methods and tools become evident in the case of acoustic signals that have rapid variations in pitch. Supra-segmental and sub-segmental level studies also indicate the need for specialized tools and signal processing methods for analysing paralinguistic sounds like laughter and expressive voices like Noh voice. Some recent signal processing methods like zero-frequency filtering (ZFF) and zero-time liftering (ZTL) have been proposed to overcome some of the limitations of block-processing and time-frequency resolution issues, and also to take care of the rapidly time-varying characteristics of the excitation and the vocal tract system. But even these techniques are found to be inadequate for the analysis of nonverbal sounds, especially laughter and expressive voices. Hence, new approaches are needed, which may involve refinement of existing methods like DFT, LP analysis, ZFF and ZTL, or entirely different methods which bring out some specific characteristics of excitation. In this research work, the production characteristics of nonverbal speech sounds are examined from both the EGG and acoustic signals, using some relatively new signal processing methods.
These sounds, along with normal speech sounds, are examined in four categories, which differ in the degree of periodicity (or aperiodicity) of glottal excitation and the rapidness of changes in the instantaneous fundamental frequency (F0). These categories are: (a) normal speech in modal voicing, which includes the study of source-system interaction effects for a few sounds like trills and fricatives, (b) emotional speech, which includes loudness level variations in speech, namely soft, normal, loud and shouted speech, (c) paralinguistic sounds like laughter, and (d) expressive voices like Noh, which convey intense emotions. The four categories of sounds examined are in decreasing order of periodicity of glottal excitation, with increasingly rapid and wider-ranging changes in F0 (and hence pitch). Therefore, different signal processing techniques are required for deriving the distinguishing features in each case. In a few cases, the production characteristics can be derived from the EGG signal [53, 50, 121, 47], which can distinguish these sounds from normal speech well and help in classifying them [123, 124]. But for applications such as automatic spotting in continuous speech and classification, deriving these production characteristics from the acoustic signal is preferable, and this is explored in detail in this work. The recently proposed signal processing methods like zero-frequency filtering [130, 216], zero-time liftering [38, 213], the Hilbert transform [170, 137] and the group delay function [128, 129] are used for feature extraction. Existing methods based upon LPCs [112, 114], MFCCs [114] and the DFT spectrum [137] are also used. New signal processing methods such as modified zero-frequency filtering (mZFF), dominant frequencies computation using the LP spectrum/group delay function, and saliency computation, i.e., a measure of pitch perception in expressive voices, are proposed in this work. A representation of the excitation source information in terms of a time-domain sequence of impulses is proposed, which is also related to the pitch perception of aperiodicity in expressive voices. Using this representation and the saliency of pitch perception, a method is proposed for extracting the instantaneous fundamental frequency (F0) of expressive voices, even in the regions of harmonics/subharmonics and aperiodicity, which otherwise is a challenging task. Validation of results is carried out using spectrograms, the saliency measure, perceptual studies and synthesis. Several novel production features are proposed that capture the unique characteristics of each of the four sound categories examined. Parameters capturing the degree of changes and the temporal changes in these features are derived, which help in distinguishing these sounds from normal speech. Three prototype systems, namely an automatic trill detection system, an automatic shout detection system and a laughter detection system, are developed for spotting the regions of trill, shout and laughter, respectively, in continuous speech. Specific databases with ground truth are collected in each case for developing and testing these systems. Performance evaluation is carried out using the saliency measure, perceptual studies and synthesis. This work is aimed at examining the feasibility of representing the excitation source characteristics in terms of a time-domain sequence of impulses, mainly for nonverbal speech sounds.
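A minimal sketch of the zero-frequency filtering idea, as described in the cited literature [130, 216], is given below. The trend-removal window tied to an assumed average pitch period, the number of trend-removal passes and the function names are assumptions of this sketch; the modified ZFF (mZFF) proposed in this thesis is not reproduced here.

import numpy as np

def zero_frequency_filter(speech, fs, avg_pitch_ms=8.0):
    """Zero-frequency filtering sketch: difference the signal, integrate it four times
    (two cascaded 0-Hz resonators), then remove the polynomial trend by repeated local
    mean subtraction over a window comparable to the average pitch period."""
    x = np.diff(np.asarray(speech, dtype=float), prepend=float(speech[0]))
    y = x.copy()
    for _ in range(4):                      # two 0-Hz resonators = four running summations
        y = np.cumsum(y)                    # (process long recordings in blocks to limit growth)
    w = int(fs * avg_pitch_ms / 1000) | 1   # odd trend-removal window, about one pitch period
    kernel = np.ones(w) / w
    for _ in range(3):                      # repeated local-mean (trend) removal
        y = y - np.convolve(y, kernel, mode="same")
    return y

def epoch_locations(zff_out):
    """Positive-going zero crossings of the ZFF output, taken as epoch (GCI) estimates."""
    return np.where((zff_out[:-1] < 0) & (zff_out[1:] >= 0))[0] + 1

The positive zero crossings of the filtered output cluster around the instants of significant excitation, which is why this kind of output is attractive for deriving the impulse-sequence description of the source discussed above.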
The differences in locations and respective amplitudes of impulses relate to the aperiodicity and subharmonics present in highly expressive voices like Noh singing. Only representative sounds of emotional speech and paralinguistic sounds, namely shouted speech and laughter, respectively, are analysed. Other often-considered emotions like happiness or anger, and paralinguistic sounds like cry or cough, are not considered in this work. The objective of the features and parameters is to capture the distinguishing characteristics of these sounds with respect to normal speech. Hence, only prototypes are developed for automatic detection of trill, shout and laughter in continuous speech, in order to validate the efficacy of these parameters. Performance evaluation of these prototype systems is carried out on limited databases collected especially for this study, due to the nonavailability of any standard databases with EGG signals and ground truth. These systems can be further developed into online, fully automated systems for real-life practical applications, which would require testing on large databases and in real-life practical scenarios.

1.5 Organization of thesis

The thesis is organized as follows. Chapter 2 reviews the methods for analysis of nonverbal speech sounds. Basics of human speech production, with glottal source processing, are revisited. The analytic signal and parametric representation of the speech signal are discussed in brief, in the context of the notion of frequency. Earlier studies related to the analysis of each of the four sound categories are reviewed briefly. Studies on source-system interaction in normal speech, for a few special sounds such as trills, and the effects of acoustic loading on the glottal vibration, are reviewed. Following this, the studies analysing shouted speech, laughter and Noh voice are reviewed in brief. A few studies that attempted detection of acoustic events such as trills, shouts and laughter are also discussed. Limitations of these past studies and related issues are discussed. In Chapter 3, some standard and a few recently proposed signal processing methods are described, which are used, modified and refined later in this thesis work. Standard signal processing techniques used in speech coding methods for representing the excitation source information in terms of an impulse sequence are reviewed briefly. The mathematical background of the need for new signal processing techniques is discussed. Then the recently proposed signal processing techniques, mainly the zero-frequency filtering and zero-time liftering methods, are discussed, which are used for extracting the impulse-sequence representation of the source and the spectral characteristics of the acoustic signal, respectively. Methods for computing dominant frequencies of the vocal tract system are also discussed. In Chapter 4, normal speech in modal voicing is examined for dynamic sounds like trills, to study the relative roles of the excitation source and the vocal tract system in their production. The effect of source-system interaction, and the effect of acoustic loading on the excitation source characteristics, are examined for a few consonants. Production characteristics of the apical trill, apical lateral approximant, voiced fricatives and nasal sounds are examined. Both EGG and speech signals are studied in each case. Representation of the excitation source characteristics in terms of a time-domain impulse sequence is used for deriving features for the analysis.
In Chapter 5, emotional speech is examined to distinguish between shouted and normal speech. The production characteristics of speech produced at four different loudness levels, namely soft, normal, loud and shout, are analysed. The effect of glottal dynamics is examined in the production of shouted speech, in particular. A set of distinguishing features and parameters is derived from the impulse-sequence representation of the excitation source characteristics, and from the vocal tract system characteristics. In Chapter 6, the production characteristics of paralinguistic sounds like laughter are studied. Three cases are considered, namely, normal speech, laughed-speech and nonspeech-laugh. A modified zero-frequency filtering method is proposed, to derive the excitation source characteristics from the speech signal consisting of laughter. A set of discriminating features is extracted and parameters are derived, representing the production characteristics, which help in distinguishing these three cases. In Chapter 7, the excitation source characteristics of expressive voices, Noh voice in particular, are examined. The significance of the aperiodic component of the excitation, which contributes to the voice quality of expressive voices, is analysed. The role of amplitude/frequency modulation is examined using synthetic AM/FM sequences. Signal processing methods are proposed for deriving a time-domain representation of the excitation source information, computation of the saliency of pitch perception, and F0 extraction in the regions of aperiodicity. Spectrograms, the saliency measure and signal synthesis are used for testing and validating the results. In Chapter 8, three prototype systems are developed for the automatic detection of trills, shouts and laughter in continuous speech. These prototype systems are helpful in testing the efficacy of the features extracted and parameters derived, in distinguishing between nonverbal speech and normal speech. Limited testing and performance evaluation are carried out on different datasets with ground truth, collected especially for the study. Lastly, in Chapter 9, the contributions of this research work are summarized. Issues arising out of the research work are discussed, along with the scope of further work in this research domain.

Chapter 2

Review of Methods for Analysis of Nonverbal Speech Sounds

2.1 Overview

In the production of nonverbal speech sounds, a significant role is played by the glottal source of excitation. Hence, the basics of speech production and the significance of glottal vibration are revisited in this chapter. Changes in the instantaneous fundamental frequency (F0) are expected to be increasingly more rapid in the four categories of sounds considered for analysis (in order) in this thesis. Hence, the notion of frequency is also revisited, with a brief discussion of the analytic signal and parametric representation of the speech signal. Earlier studies related to analyses of the four sound categories are reviewed briefly. First, studies on the nature of production of some special sounds that involve source-system coupling effects, such as trills, are reviewed. Then studies highlighting the significance of source-system interaction and acoustic loading effects in the production of some specific sounds are reviewed. Earlier studies on emotional speech, paralinguistic sounds and expressive voices in general, and on shouted speech, laughter and Noh voice in particular, are reviewed next.
Studies on the role of aperiodicity in expressive voices and F0 extraction are also reviewed. Analysis of the production characteristics of nonverbal speech sounds has helped in identifying a few of their unique characteristics. Exploiting these, distinctive features and parameters can be derived to discriminate these sounds from normal speech. Hence, studies towards detecting acoustic events such as trills, shouts and laughter in continuous speech are also reviewed. Limitations of these past studies and related issues are also discussed. The chapter is organized as follows. Human speech production, with the role of glottal vibration in speech production, is discussed briefly in Section 2.2. In Section 2.3, the analytic signal and parametric representation of the speech signal are discussed while revisiting the basic concepts of frequency. Earlier studies on the production of trills are reviewed in Section 2.4. Effects of source-system interaction and acoustic loading of the vocal tract system on glottal vibration are also discussed. Studies on emotional speech sounds and shouted speech are reviewed in Section 2.5. Section 2.6 reviews earlier studies on the need for analysing laughter, its different types and the acoustic analysis of laughter production. Studies on the motivation for studying expressive voices, the representation of the excitation source component in terms of a time-domain sequence of impulses, and the issues involved are reviewed in Section 2.7. In Section 2.8, studies on the feasibility of developing systems for automatic detection of trill, shout and laughter regions in continuous speech are reviewed in brief. A summary of this chapter is given in Section 2.9.

Figure 2.1 Schematic diagram of human speech production mechanism (figure taken from [153])

Figure 2.2 Cross-sectional expanded view of vocal folds (figure taken from [157])

2.2 Speech production

Both human speech and other non-normal sounds are produced by the speech production mechanism (Fig. 2.1), which is studied broadly in terms of two components, the excitation source and the vocal tract system [153]. The various parts of this mechanism can be grouped into the lungs, the larynx and the vocal tract (referred to as the system). When the airflow from the lungs is pushed through the larynx, this airflow is modulated by the vibration of the vocal folds, also termed vocal cords, which constitute the source. This is called excitation of the system. The primary source of excitation is the vibration of the vocal folds, shown in a cross-sectional expanded view in Fig. 2.2 [157]. Since the volume velocity waveform that provides excitation to the system is termed the glottal source, this vibration is also referred to as glottal vibration. The regular opening/closing of the vocal folds, and the corresponding changes in airflow in the vocal tract, result in the production of different speech sounds.

Figure 2.3 Illustration of waveforms of (a) speech signal, (b) EGG signal, (c) LP residual and (d) glottal pulse information derived using LP residual

The shape of the vocal tract changes dynamically due to the movement of the upper and lower articulators (lips, teeth, hard/soft palate, tongue and velum), thus also changing the stricture formed along the vocal tract.
The vocal fold vibration converts the airflow passing through the vocal tract into acoustic pulses, thus providing an excitation signal to the vocal tract. The vocal tract consists of the oral, nasal and pharyngeal cavities, the resonances of which are related to the shape (contour) of the short-time spectrum of the modulated airflow signal. Changes in the shape of the vocal tract also change the resonances in it. Hence, it may be inferred that speech sound is produced by the time-varying excitation of the time-varying vocal tract system. The quasi-periodic vibration (opening/closing) of the vocal folds in the case of normal speech causes the pitch-perception. The closure of the vocal folds is relatively abrupt (i.e., faster) as compared to their gradual opening. Hence, there is significant excitation around these time instants, called glottal closure instants (GCIs). The glottal closure instants can be seen better in the electroglottograph (EGG) signal shown in Fig. 2.3(b), as compared to the corresponding speech signal shown in Fig. 2.3(a). The interval between successive GCIs, which is nearly periodic during the production of normal speech in modal voicing (e.g., vowels), is termed the glottal cycle period (T0). The inverse of T0 gives the instantaneous fundamental frequency (F0), which is characteristic of each human being. The F0 is related to pitch-perception and is different for different sounds. Since the excitation is time-varying, the F0 (i.e., pitch) also varies dynamically with time for each sound produced. Changes in pitch (F0) are smaller for normal speech, particularly for vowels. But changes in F0 and pitch are quite rapid in the case of nonverbal speech. Secondary excitation is also present in the production of some particular sounds, due to stricture in the vocal tract. For example, the production of fricative sounds ([s] or [h]) having higher frequency (1-5 kHz) content is related to this secondary excitation. The production of nonverbal speech sounds may also involve secondary excitation, which may cause significant changes in the source characteristics such as F0 (pitch).

Figure 2.4 Schematic view of vibration of vocal folds for different cases: (a) open, (b) open at back side, (c) open at front side and (d) closed state (figure taken from [157])

Production characteristics of normal (verbal) speech can be analysed using standard signal processing techniques such as the short-time Fourier spectrum and linear prediction analysis, and spectral measures such as linear prediction cepstral coefficients (LPCCs) and Mel-frequency cepstral coefficients (MFCCs). The instants of significant excitation (GCIs) can be identified using the linear prediction (LP) residual (Fig. 2.3(c)), the EGG signal or a representation in terms of a time-domain sequence of impulses. The excitation source information is carried in the impulse sequence, in the relative amplitudes of the impulses and their locations, which may correspond to GCIs. Obtaining this impulse-sequence representation of the excitation information has, however, been a research challenge. If it can be achieved for nonverbal speech sounds, it would be immensely useful in speech signal analysis, speech coding, speech synthesis and many other applications. Production of these nonverbal speech sounds involves changes in the characteristics of the vocal tract system, which have been studied in most of the acoustic analyses of these sounds.
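The relation between GCIs, the glottal cycle period T0 and the instantaneous fundamental frequency F0 described above can be illustrated with a small sketch; the epoch locations here are assumed to come from the EGG signal or from a ZFF-like method, and the function name is illustrative.

import numpy as np

def instantaneous_f0(gci_samples, fs):
    """Instantaneous F0 contour from glottal closure instants (sample indices):
    T0 is the interval between successive GCIs, and F0 = 1/T0."""
    t0 = np.diff(np.asarray(gci_samples)) / fs    # glottal cycle periods in seconds
    return 1.0 / t0                               # one F0 value per glottal cycle

# GCIs spaced 8 ms apart at fs = 16 kHz give F0 = 125 Hz for every cycle.
print(instantaneous_f0([0, 128, 256, 384], fs=16000))   # -> [125. 125. 125.]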
However, in the production of these sounds, significant changes seem to occur in the excitation source characteristics, some of which may be reflected in the different characteristics of glottal vibration shown schematically in Fig. 2.4 [157]. Changes could be in the glottal cycle periods T0 (hence F0), the relative durations of the open/closed phases in a glottal cycle, and the rate of opening/closing of the vocal folds. In the production of voiced sounds, the regular opening/closing (i.e., vibration) of the vocal folds periodically interrupts the airflow from the lungs passing through the vocal tract, which causes changes in the air pressure [200, 92]. The vocal folds are usually open when no sound is being phonated. In the production of unvoiced sounds, the vocal folds are held apart. Thus airflow is allowed to pass freely, and noise excitation is generated due to turbulence of the airflow. During the production of voiced sounds, the vocal folds close-open-close regularly. Closing of the vocal folds is controlled by the adductor muscles, which bring the vocal folds together and provide resistance to the airflow from the lungs. The air pressure built up below the closed vocal folds (i.e., the subglottal air pressure) forces the vocal folds to open and allows airflow to pass through the glottis into the vocal tract. The vocal folds are closed again by two factors: (i) the elasticity of the vocal fold tissue, which forces it to regain its original (closed) configuration, and (ii) aerodynamic forces described by the Bernoulli effect, which cause a drop of pressure in the glottis when the airflow velocity increases. After the vocal folds are closed, the subglottal air pressure builds up again, forcing the vocal folds to open, and thus the cycles continue. The period of this cycle is the glottal cycle period T0, and its frequency F0 (= 1/T0) is referred to as the fundamental frequency. Changes in the opening/closing direction (front/rear) of the vocal folds and the rate of their opening/closing are related to three voicing types: (i) modal voicing, (ii) breathy voice and (iii) creaky voice. Unvoiced whisper, voicelessness and glottal stop are also possible. Different phonation types in normal speech are schematically shown in Fig. 2.5 [60].

Figure 2.5 Schematic views of glottal configurations for various phonation types: (a) glottal stop, (b) creak, (c) creaky voice, (d) modal voice, (e) breathy voice, (f) whisper, (g) voicelessness. Parts marked in (g): 1. glottis, 2. arytenoid cartilage, 3. vocal fold and 4. epiglottis. (Figure is taken from [60])

Vibration of the vocal folds can be observed only for creaky, modal and breathy voices, shown by the corrugated line between the vocal folds in Fig. 2.5(c), (d) and (e), respectively. Only the modal voicing phonation type is considered in this study, because it represents a neutral phonation with little variation in the period (T0) over successive glottal cycles. Analysing the changes in these characteristics of glottal vibration from the acoustic signal is still a challenge. The glottal pulse shape characteristics may be derived from the LP residual of the speech signal, as shown in Fig. 2.3(d), but only for a few cases. Therefore, the production characteristics of these sounds are analysed in this thesis from both EGG and acoustic signals. The research issues involved in analysing these sounds are discussed further in Section 2.4.
2.3 Analytic signal and parametric representation of speech signal

(A) Notion of frequency for stationary signals

(i) Notion of frequency in mechanics: Frequency (f) of vibratory motion is defined in mechanics as the number of oscillations per unit time. Here, an oscillation is a complete cycle of to-and-fro motion, starting from the equilibrium position to one end, then to the other end, and then back to the initial position. Harmonic motion is a special type of vibratory motion, in which the acceleration is proportional to the displacement and is always directed towards the equilibrium (or centre) position. For example, if a body is in circular motion, then the projection of this motion on a diameter is in harmonic motion. The displacement, velocity and acceleration of the harmonic motion of this projection at an instant t can be given by (2.1), (2.2) and (2.3), respectively [18].

x(t) = a_0 \cos(\omega t)   (2.1)

x'(t) = -a_0 \omega \sin(\omega t)   (2.2)

x''(t) = -a_0 \omega^2 \cos(\omega t) = -\omega^2 x(t)   (2.3)

where a_0 is the radius (or maximum displacement), and \omega (= 2\pi f) is the uniform angular speed. The frequency f is obtained by solving the differential equation (2.3). The solution is given by

y(t) = \alpha e^{j 2\pi f t}   (2.4)

where \alpha is an arbitrary constant, and the frequency f is given by f = \omega / 2\pi.

(ii) Notion of frequency for signals: A signal conveys information about the state or behaviour of a physical system. The variable representing the signal could be continuous (denoted by '( )') or discrete (denoted by '[ ]'). Discrete-time signals are represented as sequences of numbers [138]. A signal s(t) could be electric, acoustic, wave motion or harmonic motion etc. A stationary signal is usually defined for a causal, linear, time-invariant (LTI), stable system. A linear system is defined by the principle of superposition [138]. The system is linear if and only if

T\{x_1[n] + x_2[n]\} = T\{x_1[n]\} + T\{x_2[n]\}   (2.5)

T\{a x[n]\} = a y[n]   (2.6)

where x_1[n] and x_2[n] are inputs to the system, y_1[n] (= T\{x_1[n]\}) and y_2[n] (= T\{x_2[n]\}) are the responses (outputs) of the system, T denotes the transformation by the system and a is an arbitrary constant. The two properties of the superposition principle represented by (2.5) and (2.6) are the additive property and homogeneity (scaling property), respectively. A time-invariant (or shift-invariant) system is one for which a delay or shift n_0 of the input sequence x[n] causes a corresponding shift in the output sequence y[n] [138], i.e.,

if x_1[n] = x[n - n_0], then y_1[n] = y[n - n_0], \forall n_0   (2.7)

A system is causal (nonanticipative) if, for every choice of n_0, the output sequence value y[n] at n = n_0 depends only on the input sequence values x[n] for n \le n_0. A stable system has bounded output for bounded input, i.e., every bounded input sequence x[n] produces a bounded output sequence y[n] [138]. If h_k[n] represents the response of the system to an input impulse \delta[n - k] occurring at time n = k, then linearity is expressed as [138]

y[n] = T\left\{ \sum_{k=-\infty}^{\infty} x[k]\, \delta[n - k] \right\}   (2.8)

From the principle of superposition, (2.8) gives

y[n] = \sum_{k=-\infty}^{\infty} x[k]\, T\{\delta[n - k]\} = \sum_{k=-\infty}^{\infty} x[k]\, h_k[n]   (2.9)

From the property of time-invariance, if h[n] is the response to \delta[n], then the response to \delta[n - k] is h[n - k], i.e., h_k[n] = h[n - k]. Hence, for a linear time-invariant (LTI) system, the response (output) is given by [138]

y[n] = \sum_{k=-\infty}^{\infty} x[k]\, h[n - k]   (2.10)

The LTI system response y[n] is also called the convolution sum, which can be represented as the convolution of the input sequence x[n] and the system response h[n] to the impulse input \delta[n] [138], as

y[n] = x[n] * h[n]   (2.11)
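The convolution-sum description of an LTI system in (2.10)-(2.11) can be checked numerically; in the short sketch below the input and impulse-response values are arbitrary, and the loop makes the superposition of shifted, scaled impulse responses explicit.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 0.0, -1.0])   # input sequence x[n]
h = np.array([0.5, 0.25, 0.125])           # impulse response h[n]

y = np.convolve(x, h)                      # y[n] = sum_k x[k] h[n-k], as in (2.10)

# Same result built up impulse by impulse (superposition of shifted responses):
y_manual = np.zeros(len(x) + len(h) - 1)
for k, xk in enumerate(x):
    y_manual[k:k + len(h)] += xk * h
print(np.allclose(y, y_manual))            # True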
(iii) Spectral decomposition and reconstruction of the signal: For a linear time-invariant, causal and stable system, a signal s(t) can be represented as a weighted sum of harmonic vibrations [18, 138]. The spectral decomposition of the signal s(t) can be obtained using the Fourier transform (FT) of the signal [138], which is defined as

S(f) = \int_{-\infty}^{\infty} s(t)\, e^{-j 2\pi f t}\, dt   (2.12)

It is called the analysis equation [18]. The signal s(t) can be reconstructed from the spectral decomposition S(f), which completely characterises the signal s(t). The reconstructed signal s(t) can be obtained using the inverse Fourier transform (IFT) [138], which is given by

s(t) = \int_{-\infty}^{\infty} S(f)\, e^{j 2\pi f t}\, df   (2.13)

It is called the synthesis equation [18]. Using (2.12) and (2.13), "any stationary signal can be represented as the weighted sum of sine and cosine waves with particular frequencies, amplitudes and phases" [18]. The digital equivalent representation of the spectral decomposition X[k] and the reconstructed signal x[n] for a periodic input sequence x[n], for one period, is expressed as the discrete Fourier transform (DFT) pair [9], given by

X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \quad 0 \le k \le N-1, \text{ and } 0 \text{ otherwise}   (2.14)

x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, e^{j 2\pi k n / N}, \quad 0 \le n \le N-1, \text{ and } 0 \text{ otherwise}   (2.15)

where N is the number of sample points in one period (or in f_s/2, where f_s is the sampling frequency). Here, X[k] (i.e., the discrete Fourier transform (DFT)) represents a frequency-domain sequence, and x[n] (i.e., the inverse discrete Fourier transform (IDFT)) represents a time-domain sequence.
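The DFT analysis-synthesis pair (2.14)-(2.15) can be verified numerically; the sketch below uses NumPy's FFT routines, which implement the same pair (with the 1/N factor in the inverse), on one period of a 125 Hz sinusoid.

import numpy as np

fs, N = 8000, 64                      # one period of a 125 Hz tone at 8 kHz is 64 samples
n = np.arange(N)
x = np.cos(2 * np.pi * 125 * n / fs)

X = np.fft.fft(x)                     # analysis, as in (2.14)
x_rec = np.fft.ifft(X)                # synthesis, as in (2.15)

print(np.allclose(x, x_rec.real))              # True: exact reconstruction
print(np.argmax(np.abs(X[:N // 2])) * fs / N)  # 125.0: frequency of the dominant bin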
(B) Notion of frequency for nonstationary signals

The frequency can be defined unambiguously for stationary signals, but not for nonstationary signals. Attempts were made to define the frequency of nonstationary signals using the term instantaneous frequency (IF). But "there is an apparent paradox in associating the words instantaneous and frequency. For this reason, the definition of IF is controversial, application-related and empirically assessed" [18]. A summarized view of these different approaches, reviewed by B. Boashash in [18], is as follows:

(i) FM based definition of IF (Carson and Fry, 1937) [25]: In the context of electric circuit theory, a frequency modulated (FM) wave is defined as

\omega(t) = \exp\left[ j \left( \omega_0 t + \lambda \int_0^t m(t)\, dt \right) \right]   (2.16)

where m(t) is a low-frequency signal (|m(t)| \le 1), \omega_0 (= 2\pi f_c) is a constant carrier frequency and \lambda is the modulation index. Using (2.16), the instantaneous angular frequency \Omega and the instantaneous cyclic frequency f_i are given by (2.17) and (2.18), respectively, as

\Omega(t) = \omega_0 + \lambda m(t)   (2.17)

f_i(t) = f_0 + \frac{\lambda m(t)}{2\pi}   (2.18)

where \Omega(t) = 2\pi f_i(t) and \omega_0 = 2\pi f_0.

(ii) AM, PM based definition of IF (Van der Pol, 1946) [201]: For a simple harmonic motion, IF can be defined by analysing the expression

s(t) = a \cos(2\pi f t + \theta)   (2.19)

where a is the amplitude, f is the frequency of oscillation, the phase \phi(t) (= 2\pi f t + \theta) is the argument of the cosine function, and the phase constant \theta is the constant part of the phase \phi(t). The amplitude modulation (AM) and phase modulation (PM) are defined by (2.20) and (2.21), respectively, as

a(t) = a_0 \left(1 + \mu g(t)\right)   (2.20)

\theta(t) = \theta_0 \left(1 + \mu g(t)\right)   (2.21)

where g(t) is the modulating signal and \mu is the amplitude/phase modulation index. Using (2.20) and (2.21), (2.19) can be re-written for a nonstationary signal as

s(t) = a \cos\left( \int_0^t 2\pi f_i(\tau)\, d\tau + \theta \right)   (2.22)

where the phase \phi(t) = \int_0^t 2\pi f_i(\tau)\, d\tau + \theta. Hence, IF is given by

f_i(t) = \frac{1}{2\pi} \frac{d\phi(t)}{dt}   (2.23)

The IF can thus be defined by the derivative of the phase angle, i.e., IF is the rate of change of the phase angle at time t.

(iii) Analytic signal based definition of frequency (D. Gabor, 1946) [56]: Gabor defined the complex analytic signal using the Hilbert transform, as

z(t) = s(t) + j y(t) = a(t)\, e^{j\phi(t)}   (2.24)

where y(t) = H(s(t)) represents the Hilbert transform (HT) of s(t). It is defined as

H(s(t)) = \mathrm{p.v.} \int_{-\infty}^{\infty} \frac{s(t - \tau)}{\pi \tau}\, d\tau   (2.25)

where p.v. denotes the Cauchy principal value of the integral. It satisfies the following properties [18]:

y(t) = H(s(t)), \quad s(t) = -H(y(t)), \quad s(t) = -H^2(s(t)),   (2.26)

and s(t), y(t) contain the same spectral components. The analytic signal in the time domain, i.e., z(\tau), is defined as an analytic function of the complex variable \tau = t + j u in the upper half plane, i.e., \mathrm{Im}(\tau) \ge 0. It means that (2.26) satisfies the Cauchy-Riemann conditions. The real part of z(\tau) equals s(t) on the real axis. The imaginary part of z(\tau), which takes on the value y(t), i.e., H(s(t)), is called the quadrature signal, because s(t) and H(s(t)) are out of phase by \pi/2. Hence,

z(\tau) = s(t, u) + j y(t, u)   (2.27)

When u \to 0, i.e., \tau \to t, we get the following, as in (2.24):

z(t) = s(t) + j y(t) = a(t)\, e^{j\phi(t)}   (2.28)

The analytic signal in the frequency domain, i.e., Z(f), can also be defined [18], as the complex function Z(f) of the real variable f. It is defined for f \ge 0 such that z(\tau) is the inverse Fourier transform (IFT) of Z(f), i.e.,

z(\tau) = \int_0^{\infty} Z(f)\, e^{j 2\pi f \tau}\, df   (2.29)

where the function z(\tau) of the complex variable \tau = t + j u is defined for \mathrm{Im}(\tau) \ge 0. Hence, the analytic signal has its spectrum limited to positive frequencies only, and the sampling frequency can be equal to half of the Nyquist rate. The central moments of frequency of the signal are given by

\langle f^n \rangle = \frac{\int_{-\infty}^{\infty} f^n\, |Z(f)|^2\, df}{\int_{-\infty}^{\infty} |Z(f)|^2\, df}   (2.30)

where Z(f) is the spectrum of the complex signal. It may be noted that if the spectrum of the real signal were used in (2.30), then all odd moments would be zero, since |S(f)|^2 is even, which would not be in line with physical reality.
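The analytic signal of (2.24)/(2.28) and the phase-derivative definition of IF in (2.23) can be computed directly with SciPy's Hilbert transform; the sketch below estimates the IF of a linear chirp and is meant only as an illustration of these definitions, not as one of the analysis tools used later in the thesis.

import numpy as np
from scipy.signal import hilbert

fs = 8000.0
t = np.arange(0, 1.0, 1.0 / fs)
x = np.cos(2 * np.pi * (100 * t + 50 * t ** 2))   # chirp whose IF is 100 + 100 t Hz

z = hilbert(x)                                    # analytic signal z(t) = x(t) + j H{x(t)}
phase = np.unwrap(np.angle(z))                    # instantaneous phase phi(t)
inst_freq = np.diff(phase) * fs / (2 * np.pi)     # IF = (1/(2 pi)) d(phi)/dt, as in (2.23)

print(round(inst_freq[len(inst_freq) // 2], 1))   # approximately 150.0 Hz near t = 0.5 s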
(iv) Unification of analytic signal and IF (Ville, 1948) [204]: Ville unified the analytic signal defined by Gabor [56] with the notion of IF given by Carson and Fry [25]. For a signal expressed by s(t) = a(t) \cos(\phi(t)), he defined IF as

f_i(t) = \frac{1}{2\pi} \frac{d}{dt} \left[ \arg z(t) \right]   (2.31)

where z(t) is the analytic signal given by (2.28). Ville noted that since IF was time-varying, some instantaneous spectrum should be associated with it. The mean value of the frequencies in this instantaneous spectrum is equal to the time average of the IF [204, 18], i.e.,

\langle f \rangle = \langle f_i \rangle   (2.32)

where

\langle f \rangle = \frac{\int_{-\infty}^{\infty} f\, |Z(f)|^2\, df}{\int_{-\infty}^{\infty} |Z(f)|^2\, df}   (2.33)

\langle f_i \rangle = \frac{\int_{-\infty}^{\infty} f_i(t)\, |z(t)|^2\, dt}{\int_{-\infty}^{\infty} |z(t)|^2\, dt}   (2.34)

Here, (2.33) is averaging over frequencies and (2.34) is averaging over time. This led to the Wigner-Ville distribution (WVD) [204], i.e., a distribution of the signal in time and frequency, expressed as

W(t, f) = \int_{-\infty}^{\infty} z(t + \tau/2)\, z^*(t - \tau/2)\, e^{-j 2\pi f \tau}\, d\tau   (2.35)

Hence, W(t, f) is the FT of the product z(t + \tau/2)\, z^*(t - \tau/2) w.r.t. \tau. IF is obtained by the first moment of the WVD w.r.t. frequency, and is given by

f_i(t) = \frac{\int_{-\infty}^{\infty} f\, W(t, f)\, df}{\int_{-\infty}^{\infty} W(t, f)\, df}   (2.36)

(v) Interpretation of instantaneous frequency [18]: For an analytic signal z(t) = a(t) e^{j\phi(t)} as defined in (2.28), its spectrum Z(f) is given by

Z(f) = \int_{-\infty}^{\infty} z(t)\, e^{-j 2\pi f t}\, dt = \int_{-\infty}^{\infty} a(t)\, e^{j(\phi(t) - 2\pi f t)}\, dt   (2.37)

The largest value of this integral is at the frequency f_s for which the phase is stationary. From the stationary phase principle, f_s is such that, at this value,

\frac{d}{dt} \left( \phi(t) - 2\pi f_s t \right) = 0   (2.38)

f_s = \frac{1}{2\pi} \frac{d\phi(t)}{dt}   (2.39)

where f_s(t) (= f_i(t)) is the IF of the signal at time t, i.e., it is a measure of the frequency-domain signal energy concentration as a function of time. It is important to note that this interpretation of IF for an analytic signal is not a unique function of time. Several variations of the interpretation (as in (2.39)) were proposed by different researchers [18]. These variations are related to (i) variations in amplitude (as m(t)\, e^{j 2\pi f t}) due to amplitude modulation, (ii) a bi-component real signal (s(t) = s_1(t) + s_2(t)) that can be expressed as z(t) = a_1 e^{j(\omega_0 - \Delta\omega/2)t} + a_2 e^{j(\omega_0 + \Delta\omega/2)t}, or (iii) a time-varying amplitude of a nonstationary signal that can be represented as y(t) = A_1(t)\, e^{-t^2/\alpha^2} \cos(2\pi f_0 t + \phi_0), whose FT would have two Gaussian functions centred at f_0 and -f_0 (i.e., basic orthogonal functions). Also, as per Ville's approach [204], "the frequency is always defined as the first derivative of the phase", regardless of stationarity [18], as expressed in (2.31) for monocomponent signals.

(vi) Generalized stationary models for nonstationary signals and IF [18]: There are many ways of generalizing the stationary models to nonstationary signals. A model for the FM class of signals s(t) is given by

s(t) = \sum_{k=1}^{N} s_k(t) + n(t)   (2.40)

where n(t) is noise (or an undesirable component) and s_k(t) are N single-component nonstationary signals. These are described by the envelopes a_k(t) and the instantaneous frequency (IF) f_{i_k}(t) at instant t, such that the analytic signal z_k(t) (associated with the signal s_k(t)) is given by

z_k(t) = a_k(t)\, e^{j\phi_k(t)}   (2.41)

where

\phi_k(t) = 2\pi \int_{-\infty}^{t} f_{i_k}(\tau)\, d\tau   (2.42)

Here, if N = 1 the signal is called a monocomponent signal, else (for N \ge 2) a multicomponent signal.

2.4 Review of studies on source-system interaction and a few special sounds

The significance of the changing vocal tract system and the associated changes in the glottal excitation source characteristics in the production of some special sounds in normal speech, such as trills, are examined in this study from the perceptual point of view. The effect of acoustic loading of the vocal tract system on the glottal vibration is also studied. The speech signal along with the electroglottograph (EGG) signal [53, 50] are analyzed for a selected few consonant sounds.

2.4.1 Studies on special sounds such as trills

Speech sounds are produced by the excitation of the time-varying vocal tract system. Excitation of acoustic resonators is possible through different sources, including the glottal vibration, i.e., the laryngeal source [183]. The excitation source can also be extraglottal, such as a strictural source that is involved in the production of sounds such as the voiceless fricative [s]. The major source of acoustic energy is the quasi-periodic vibration of the vocal folds at the glottis [48]. This is referred to as voicing. All the speech sounds that can be generated by the human speech production mechanism are either voiced or voiceless.
Languages make use of different types of voicing, which are called phonation types [102, 99]. Out of the possible types of phonation, modal voice is considered to be the primary source for voicing in a majority of languages [99]. The production of modal voice involves excitation of the vocal tract system by the vibration of the vocal folds at the glottis. Changes in the mode of glottal vibration and in the shape of the vocal tract both contribute to the production of different sounds [49]. In the production of sounds involving specific manners of articulation, such as trills, changes in the vocal tract system affect the glottal vibration significantly [40]. Trilling is a phenomenon where the shape of the vocal tract changes rapidly, with a trilling rate of about 30 cycles per second. Analysis of trill sounds has been limited to the study of production and acoustic characterization in terms of spectral features. For example, the production mechanism of tongue tip trills was described and modeled in [100, 98, 116], from an aerodynamic point of view. The trill cycle and trilling rate were studied in [100, 116]. Acoustic correlates of phonemic trill production are reported in [66]. A recent analysis of trill sounds [40] indicated that changes in the vocal tract system affect the vibration of the vocal folds in the production of tongue tip trilling. In that study, the acoustics of trill sounds was analysed using recently proposed signal processing methods like zero-frequency filtering (ZFF) [130, 216] and the Hilbert envelope of differenced numerator of group delay (HNGD) method [40, 123]. These studies indicated the possibility of source-system coupling in the production of apical trills, due to interaction between aerodynamic (airflow and air-pressure) and articulatory (upper/lower articulators in the mouth) components. The significance of the changing vocal tract system and the associated glottal excitation source characteristics due to trilling are studied in this thesis, from the perception point of view. These studies are made by generating speech signals that retain either the vocal tract system features or the glottal excitation source features of trill sounds. Experiments are conducted to understand the perceptual significance of the excitation source characteristics in the production of different trill sounds. Speech sounds of a sustained trill and approximant pair, and apical trills produced by four different places of articulation, are considered. Details of this study are discussed in Chapter 4.

2.4.2 Studies on source-system interaction and acoustic loading

In the production of sounds involving a specific manner of articulation, such as trills, the changes in the vocal tract system also affect the glottal vibration. This is due to acoustic loading of the vocal tract system, causing a pressure difference across the glottis, i.e., a difference in air pressure between the supraglottal and subglottal regions. Thus the source of excitation is affected due to the source-tract interaction. The interaction of the glottal source and the vocal tract system, also referred to as source-system coupling, has been a subject of study among researchers over the past several years [49, 32, 189, 192, 108]. Mathematical modeling and physical modeling of this interaction were attempted. The source-system interaction has been explored in a significant way for vowel sounds [42, 43, 171, 136].
It was observed that involuntary changes in the vocal fold vibrations, due to the influence of vocal tract resonances, occur during changes in the 'intrinsic pitch' (fundamental frequency F0) of some high vowels [43, 171, 136]. However, source-system coupling has not been explored in a significant way from the angle of the production characteristics of speech sounds other than vowels. In this study, we aim to examine the effect of acoustic loading of the vocal tract system on the glottal vibration in the production characteristics of some consonant sounds, using signal processing methods. The mode of glottal vibration can be controlled voluntarily for producing different phonation types such as modal, breathy and creaky voice. Similarly, the rate of glottal vibration (F0) can also be controlled, giving rise to changes in F0 and pitch. On the other hand, glottal vibration could also be affected by loading of the vocal tract system while producing some types of sounds. This happens due to coupling of the vocal tract system with the glottis. This change in glottal vibration may be viewed as an involuntary change. Such involuntary changes occur during changes in the 'intrinsic pitch' (F0) of some high vowels [43], i.e., when the resonances of the vocal tract influence the nature of vocal cord vibration and the way F0 may be varied [136]. The effect could be due to "acoustic coupling between the vocal tract and the vocal cords similar to that which happens between the resonances of the bugle and the bugler's lips", which occurs when the first formant (F1) is near the F0 [136]. Alternatively, it could be due to tongue-pull [43], i.e., when the "tongue, in producing high vowels, pulls on the larynx and thus increases the tension of the vocal cords and thus the pitch of voice" [136]. The effects of acoustic coupling between the oral and subglottal cavities were examined for vowel formants [144, 176, 32]. Discontinuity in the second formant frequency and attenuation in diphthongs were observed near the second subglottal resonance (SubF2), in the range of 1280-1620 Hz, due to subglottal coupling [32]. Subglottal coupling effects were found to be generally stronger during the open phase of the glottal cycle than in the closed phase [32]. Studies on the source-system interaction were also carried out for other categories of speech sounds, such as fricatives and stops [180]. Subglottal resonances were measured in the case of nasalization [184]. A study of the acoustic interaction between the glottal source and the vocal tract suggests the presence of nonlinear source-filter coupling [158]. Computer simulation experiments were carried out, along with analytical studies, to study the effects of source-filter coupling [191, 30]. The acoustic interaction between the sound source in the larynx and the vocal tract airways can take place in either a linear or a nonlinear way [189]. In linear source-filter coupling, the source frequencies are reproduced without being affected by the pressure in the vocal tract airways [189]. In nonlinear source-filter coupling, the pressure in the vocal tract contributes to the production of different frequencies at the source [189]. The theory of source-tract interaction suggests that the degree of coupling is controlled by the cross-sectional area of the laryngeal vestibule (epilarynx), which raises the inertive reactance of the supraglottal vocal tract [192].
The cooccurrence of acoustically compliant (negative) subglottal reactance and inertive (positive) supraglottal reactance was found to be most favourable for vocal fold vibration in the modal register. Both subglottal and supraglottal reactances increase the driving pressures of the vocal folds and the glottal airflow, which increases the energy level at the source [192]. The theory also attributes instabilities in the vibration modes to harmonics passing through formants during pitch or vowel change. It was hypothesized that vocal fold vibration is maximally destabilized when major changes take place in the acoustic load, which occurs when F0 crosses over F1 [189]. Other studies on source-system interaction have focused mainly on physical aspects, such as nonlinear phenomena and source-system coupling [181, 193, 64, 221]. The physics of laryngeal behaviour and larynx modes, and physical models of the acoustic interaction of the voice source with the subglottal vocal tract system, were studied in [190, 193, 64]. The effect of glottal opening on the acoustic response of the vocal tract system was studied in [49, 12, 13, 163]. It was observed that the first and second formant frequencies increase with increasing glottal width [13]. The effects of source-system coupling were studied in [158, 30, 189]. The nonlinear phenomenon due to this coupling, related to the airflow across the glottis during phonation, was also studied in [221, 192, 108]. The source-system interactions were observed to induce, under certain circumstances, complex voice instabilities such as sudden frequency jumps, subharmonic generation and random changes in frequency, especially during F0 and F1 crossovers [64, 189, 192]. Studies on source-system coupling were also carried out for other categories of speech sounds, such as fricative and stop consonants [180]. Subglottal resonances were measured in the case of nasalization [184]. The effect of acoustic coupling between the oral and subglottal cavities was examined for vowel formants [32]. In that study, a discontinuity in the second formant frequency and attenuation in diphthongs were observed near the second subglottal resonance (SubF2) due to subglottal coupling. The range of SubF2 is 1280-1620 Hz, depending on the speaker. Subglottal coupling effects were found to be generally stronger during the open phase of the glottal cycle than during the closed phase [32]. In another study, a dynamic mechanical model of the vocal folds and the tract was used to study the fluid flow in it [12]. An updated version of this mechanical model, with a more realistically shaped laryngeal section, was used to study the effect of glottal opening on the acoustic response of the vocal tract system [13]. It was observed that the first and second formant frequencies increased with increasing glottal width. The influence of acoustic waveguides on self-sustained oscillations was studied by means of mechanical replicas [163]. The replicas were used to simulate oscillations and to gather data on parameters such as subglottal pressure, glottal aperture and oscillation frequency [163]. The role of the glottal open quotient in relation to laryngeal mechanism, vocal effort and fundamental frequency was studied for the case of singing [65]. In that study, the need for controlling the laryngeal configuration and lung pressure was also highlighted.
In another study, a glottal source estimation method was developed using a joint source-filter optimization technique, which was claimed to be robust to shimmer and jitter in the glottal flow [57]. The technique estimates, for each period, the parameters of the Liljencrants-Fant (LF) model of the glottal source, the amplitudes of the glottal flow, and the coefficients of the vocal tract filter [57]. In another study, the contribution to voiced speech of secondary sources within the glottis (i.e., glottis-interior sources) was investigated [69]. The study analyzed the effects on the acoustic waveform of second-order 'sources' such as the volume velocity of air at the glottis, the pressure of air from the lungs, unsteady reactions due to radiating sound, and vorticity in the airflow from the glottis. In a recent study [108], the effect of source-tract acoustical coupling on the onset of oscillation of the vocal folds was studied using a mechanical replica of the vocal folds and a mathematical model. The model is based on a lumped description of tissue mechanics, quasi-steady flow and one-dimensional acoustics. This study proposed that changes in the vocal tract length and cross section induce fluctuations in the threshold values of both the subglottal pressure and the oscillation frequency. The level of acoustical coupling was observed to be inversely proportional to the cross-sectional area of the vocal tract [108]. It was also shown that the transition from a low to a high frequency oscillation may occur in two ways, either by a frequency jump or by a smooth variation of frequency [108]. A recent analysis of trill sounds [40] indicated that changes in the vocal tract system due to tongue tip trilling affect the vibration of the vocal folds. In that study, the acoustics of trill sounds was analysed using signal processing methods such as zero-frequency filtering (ZFF) [130, 216] and the Hilbert envelope of the differenced numerator of group delay (HNGD) method [40, 123]. In another recent study [122], the significance of the changing vocal tract system and the associated changes in the glottal excitation source characteristics due to tongue tip trilling was studied from the perception (hearing) point of view. Both these studies indicate the possibility of source-system interaction in the production of apical trills. However, the effect of acoustic loading of the system on the glottal source due to source-system interaction has not been explicitly studied in terms of the changes in the characteristics of the speech signal. In the current study, we examine the effect of acoustic loading of the vocal tract system on the glottal vibration for a selected set of six categories of voiced consonant sounds. These categories are distinguished on the basis of the stricture size, and the manner and place of articulation. The voiced sounds considered are: apical trill, alveolar fricative, velar fricative, apical lateral approximant, alveolar nasal and velar nasal. The sounds differ in the size, type and location of the stricture in the vocal tract, and are considered in the context of the vowel [a]. Three types of occurrence, namely single, geminated and prolonged, are examined for each of the six categories of sounds. The speech signal along with the electroglottograph (EGG) signal [53, 50] is used for the analysis of these sounds. Details of the study are discussed in Chapter 4.
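As a simple illustration of how the EGG signal exposes the glottal vibration, the following minimal sketch (not the procedure used in this thesis) reads an assumed single-channel EGG recording, locates glottal closure instants as the prominent positive peaks of the differenced EGG (dEGG), and derives an instantaneous F0 value per glottal cycle. The file name, peak threshold and minimum peak spacing are illustrative assumptions.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import find_peaks

# Illustrative only: 'vowel_egg.wav' is an assumed single-channel EGG recording.
fs, egg = wavfile.read('vowel_egg.wav')
egg = egg.astype(np.float64)
egg = egg / (np.max(np.abs(egg)) + 1e-12)       # amplitude normalisation

# Differenced EGG (dEGG): sharp positive peaks mark glottal closure instants,
# because the vocal-fold contact area rises fastest at closure.
degg = np.diff(egg)

# Peak picking with a refractory interval of ~2 ms (F0 assumed below ~500 Hz here).
min_gap = int(0.002 * fs)
peaks, _ = find_peaks(degg, height=0.1 * np.max(degg), distance=min_gap)

# Instantaneous F0 from successive glottal closure instants.
t_closure = peaks / fs                           # epoch times in seconds
f0 = 1.0 / np.diff(t_closure)                    # one F0 value per glottal cycle

print("number of glottal cycles:", len(f0))
print("median F0 (Hz): %.1f" % np.median(f0))
```

In practice the dEGG peak threshold and the minimum peak spacing would need to be tuned to the speaker, the sound category and the recording conditions.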
2.5 Review of studies on analysis of emotional speech and shouted speech

2.5.1 Studies on emotional speech

Emotion is an outburst or expression of a state of mind, reflected in behaviour of an individual that differs from his/her normal behaviour. It is caused by the reaction of the individual to external events or to conflicts/feelings developed internally within one's own mind. The emotional state also depends on the mental and physical health of the individual, as well as on his/her social living conditions. Thus the emotional state is highly specific to an individual. Different types of emotions arise in different situations, especially in communication with others. The characteristics of emotion are manifested in many ways, including in language usage, besides facial expressions, visual gestures, speech and nonverbal audio gestures. Note that emotions are natural reflections of an individual, and are naturally and easily perceived by other human beings. As long as the generation and perception of emotions are confined to human communication, there is no need to understand the characteristics of emotion in detail. But to exploit the developments of information technology for various applications, it is necessary to incorporate a machine (computer) in the human communication chain. In such a case, there is a need to understand the characteristics of emotions, in order to develop automatic methods to detect and categorize emotions reflected in sensory data such as audio and video signals. It is equally important to generate signals from a machine with desired emotion characteristics, which humans perceive as natural emotions. Characterization and generation of emotions is a technical challenge, as it is very difficult to identify and extract features from sensory data that characterize a particular emotion. Another important issue is that emotions cannot be categorized into distinct non-overlapping classes. An emotional state is a point in a continuous space of feature vectors, and this feature space may be of very high dimension. Emotion is characterized by multimodal features involving video and audio data. In many applications it is more useful to characterize emotions using audio data, as audio is easily captured in most communication situations. The expression of emotion through audio is spontaneous, and is also perceived effortlessly by human listeners. The characteristics of emotion are reflected either in audio gestures consisting of short bursts like ah or oh, or in nonverbal sounds like laughter, cry, grin, etc. Several studies were made to extract features characterizing different emotions from speech signals [21, 119, 35] and from audio-visual signals [82, 205, 206]. These studies hypothesized distinct categories of emotions, such as happiness, sadness, fear, anger, disgust and surprise, or distinct affective states such as interest, distress, frustration and pain [142, 172, 141, 22]. All these studies focused on classifying emotions into one of the distinct categories using acoustic features [73, 143, 210, 71]. The acoustic features derived from the signal were mostly the standard features used to represent speech for various applications, such as Mel-frequency cepstral coefficients (MFCCs) or linear prediction cepstral coefficients (LPCCs). Most of these features represent the characteristics of the vocal tract system.
These were supplemented with additional features representing the waveform characteristics and some source characteristics, such as zero crossing rate (ZCR), harmonic-to-noise ratio (HNR), etc. [95, 68, 177, 203] and [119, 218, 196, 17]. Attempts were made to derive emotion characteristics from audio signals collected in different application scenarios such as cell phone conversations, meeting room speech, interviews, etc. [126, 31, 148, 211] and [76, 117, 220, 33, 90]. What is strikingly missing in these past studies on emotion is an understanding of emotions from the human speech production point of view. Also missing is a recognition of the continuum nature of emotional states and of the individual-specific nature of emotion characteristics. It may therefore be summarized that emotion is an expression of a state of mind, which is normally reflected in a burst of activity. Emotion characteristics normally cannot be sustained for long periods, as emotion is not a sustained human activity. Emotion detection/sensing is easy for human beings because of their multimodal pattern recognition capability. Emotion characteristics are reflected more strongly in nonverbal paralinguistic sounds (discussed in a separate chapter), and represent points in a continuous feature space. In this research work, the study of emotion is audio-focused, involving analysis of the audio signal. The importance of source information in production and perception is highlighted.

2.5.2 Studies on shouted speech

Shout and scream are both deviations from normal human speech. Shout contains linguistic information and may have phonated speech content, whereas scream contains neither. Both are associated with some degree of urgency. Shout, or shouted speech, can also be considered an indicator of verbal aggression, or of a potentially hazardous situation in an otherwise peaceful environment. Naturally, there is a growing demand for techniques that can help in the automatic detection of shout and scream. A range of applications, including health care, elderly care, crime detection, and social behaviour and psychological studies, provides enough motivation for researchers to study the detection of such distinctive audio events. Since shouted speech is produced by the same speech production mechanism as any other speech signal, it can be analysed in terms of the excitation source and the vocal tract system features, as in any speech signal analysis. The excitation source characteristics are significant mostly due to voicing in the shouted speech. Hence, they are investigated by studying the changes in the frequency of the vocal fold vibration, i.e., in terms of changes in the instantaneous fundamental frequency (F0) of the vocal fold vibration at the glottis. The acoustic signal is generally analysed in terms of the characteristics of the short-time spectral envelope. The deviation of the spectral envelope features of shouted speech from those of normal speech is used to characterize shout. Shouted speech is perceived to have increased signal intensity, which is characterized by the instantaneous signal energy [70, 199, 160] and power [131, 132]. Change in F0 as a characteristic feature of shout has been used in different studies such as [131, 132, 202]. Mel-frequency cepstral coefficients (MFCCs) are normally used to represent the short-time spectral features of the shouted speech signal [199, 131].
MFCCs with some variations, and wavelet coefficients, are used in some shout and scream detection applications [146, 70]. In automatic speech recognition (ASR), MFCCs are used for studying the performance of the system on shouted speech [132]. In some cases, the finer variations of the spectral harmonics are superimposed on the spectral envelope features, to include changes in F0 in the spectral representation [146]. Properties of pitch, such as its presence, salience and height, are used along with the signal-to-noise ratio (SNR) and some spectral distortion measures for the detection of verbal aggression [202]. Spectral features such as spectral centroid, spectral spread and spectral flatness are used along with MFCCs for classifying non-speech sounds including scream [104]. Wavelet transformations based on Gabor functions are investigated for the detection of emergency screams [115]. MFCCs together with linear prediction coefficients (LPCs) and perceptual linear prediction coefficients (PLPs) are examined for shout event detection in a public transport vehicle scenario [160]. In a recent study, the impact of vocal effort variability on the performance of an isolated-word recognizer was studied for five vocal modes (i.e., loudness levels) [217]. Analyses of shout signals have focused mostly on representing the information in the audio signal using spectral features. There have not been many attempts to study the characteristics of the excitation component in shouted speech. It is useful to study the excitation source component, especially the nature of vibration of the vocal folds at the glottis, during the production of shout. It is also useful to study the changes in the spectral features caused by the excitation in shouted speech, in comparison with normal speech. Ideally, it is preferable to derive the changes in the excitation component of speech production from the speech signal itself. One way to do this is to derive the glottal pulse shape using inverse filtering of speech [58, 2]. This cannot be done accurately in practice, due to the difficulty in modeling the response of the vocal tract system for deriving the inverse filter. The difficulty is compounded when the speech signal is degraded, as in distant speech. It may be noted that the human perceptual mechanism can easily discriminate shouted speech from normal speech, even when the speech signal is somewhat degraded. In this study, changes in the vocal fold vibration in shouted speech relative to normal speech are examined. Normal speech is also compared with soft and loud speech. It should be noted that although four distinct loudness levels are considered in this study, they form a continuum, and hence it is difficult to mark clear boundaries among them. Electroglottograph (EGG) signals are collected along with close-speaking speech signals for the four loudness levels, namely soft, normal, loud and shout. Differences between shouted speech and normal speech can be seen clearly in the EGG signals. Since collecting the EGG signal along with the speech signal is not always possible in practice, there is a need to explore features of shout that can be derived directly from the speech signal. For this purpose, signal processing methods [40] that can represent the fine temporal variations of the spectral features are explored in this thesis. It is shown that these temporal variations do indeed capture the features of glottal excitation that can discriminate shout from normal speech.
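To make the frame-level intensity and F0 features referred to above concrete, here is a minimal sketch under simplifying assumptions: the file names are hypothetical, the pitch estimator is a crude autocorrelation peak picker, and the voicing threshold is arbitrary. It is not the feature extraction used in this work.

```python
import numpy as np
from scipy.io import wavfile

def frame_features(path, frame_ms=30, hop_ms=10, f0_lo=80, f0_hi=500):
    """Illustrative short-time log energy and autocorrelation F0 per frame."""
    fs, x = wavfile.read(path)
    x = x.astype(np.float64)
    x = x / (np.max(np.abs(x)) + 1e-12)
    N, H = int(frame_ms * 1e-3 * fs), int(hop_ms * 1e-3 * fs)
    feats = []
    for start in range(0, len(x) - N, H):
        frm = x[start:start + N] * np.hamming(N)
        energy = 10.0 * np.log10(np.sum(frm ** 2) + 1e-12)
        r = np.correlate(frm, frm, mode='full')[N - 1:]      # autocorrelation, lags >= 0
        lo, hi = int(fs / f0_hi), int(fs / f0_lo)
        lag = lo + np.argmax(r[lo:hi])
        f0 = fs / lag if r[lag] > 0.3 * r[0] else 0.0        # crude voicing check
        feats.append((energy, f0))
    return np.array(feats)

# Assumed example files; shouted speech typically shows higher energy and a raised F0.
for name in ['normal_utt.wav', 'shout_utt.wav']:
    f = frame_features(name)
    voiced = f[f[:, 1] > 0]
    print(name, 'mean energy (dB): %.1f' % f[:, 0].mean(),
          'median F0 (Hz): %.1f' % np.median(voiced[:, 1]))
```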
The effect of coupling between the excitation source and the vocal tract system during the production of shout is examined in different vowel contexts. Details of this study are discussed in Chapter 5.

2.6 Review of studies on analysis of paralinguistic sounds and laughter

Humans use nonverbal (nonlinguistic) communication to convey representational messages such as emotions or intentions [139]. Some nonlinguistic signals are associated with specific singular emotions, intentions or external referents [139]. Laughter is a paralinguistic sound, like sigh and scream, used to communicate specific emotions [36, 135]. Detection of paralinguistic events can help in the classification of the speaker's emotional state [194], and knowledge of the speaker's emotional state enhances the naturalness of human-machine interaction [194].

2.6.1 Need for studying paralinguistic sounds like laughter

Laughter, a paralinguistic event, is one of the most variable acoustic expressions of a human being [135, 159]. Laughter is a special nonlinguistic vocalization because it induces a positive affective state in listeners, thereby affecting their behaviour [139, 118]. Detection of paralinguistic events like laughter can help in the classification of the emotional states of a speaker [194]. Hence, researchers have been attracted in the last few years towards finding the distinguishing features of laughter and developing systems for detecting laughter in speech [83, 198, 84, 107, 23, 24, 86]. Laughter plays a ubiquitous role in human vocal communication [11] and occurs in diverse social situations [62, 151, 85]. Laughter has a variety of functions, such as an indicator of affection (which could be species-specific), aggressive behaviour (laughing in someone's face), bonding behaviour (in early infancy), play behaviour (interactive playing) or appeasement behaviour (in situations of dominance/subordination) [159]. Laughter can also help in improving the expressive quality of synthesized speech [63, 209, 186]. Laughter is a vocal-expressive communicative signal [161] with variable acoustic features [159, 85]. It is a special human vocalization, mainly because it induces a positive affective state in listeners [118]. Nonlinguistic vocalizations like laughter influence the affective states of listeners, thereby also affecting their behaviour [139]. Finding distinctive features of laughter for its automatic detection in speech has been attracting researchers' attention [107, 23, 186, 24, 195, 86, 84]. Applications like an 'audio visual laughing machine' have also been attempted [198]. The diverse functions and applications of laughter continue to motivate researchers to strive for a better understanding of its characteristics, from different perspectives. Hence, there is a need to study the production characteristics of paralinguistic sounds like laughter in detail.

2.6.2 Different types and classifications of laughter

Laughter characteristics are analyzed at the episode, bout, call and segment levels. An episode consists of two or more laughter bouts, separated by inspirations [161]. A laughter bout is an acoustic event [161] produced during one exhalation (or sometimes inhalation) [11]. Each laugh bout may consist of one or more calls. The period of laughter vocalization contains several laugh cycles or laugh pulses [125], called calls, interspersed with pauses [161]. Calls are also referred to as notes or laugh syllables [11]. Segments reflect a change in the production mode, visible in the spectrogram components, that occurs within a call [11].
A laughter bout may be subdivided into three parts: (i) the onset, in which the laughter is very short and steep, (ii) the apex, the period in which vocalization or forced exhalation occurs, and (iii) the offset, the post-vocalization part, in which a long-lasting smile smoothly fades out [161]. Laughter with one or two calls is called exclamation laughter or chuckle [134]. The upper limit on the number of calls in a laugh bout (3-8) is set by the lung volume [152, 161]. A typical laugh bout consists of up to four calls [152, 159]. Laughter at the bout and call levels is analysed in this study. Laughter sounds have been categorized in several studies in different ways. Laughter was categorized in [161] into three classes: (i) spontaneous laughter, in which there is an urge to laugh without restraining its expression, (ii) voluntary laughter, a kind of faked laughter produced to imitate the sound pattern of natural laughter, and (iii) speaking or singing laughter, in which phonation is based not on forced breathing but on a well-dosed air supply, resulting in less resonance in the trachea, breathiness and aspiration. Three types of laugh bouts were discussed in [11]: (i) the song-like laugh, which involves voiced sounds with pitch (F0) modulation, sounding like a giggle or chuckle, (ii) the snort-like laugh, an unvoiced call with salient turbulence in the nasal cavity, and (iii) the unvoiced grunt-like laugh, which includes breathy pants and harsher cackles. The three classes of vowel quality ('ha', 'he' and 'ho' sounds) appear to vary markedly among laugh bouts [152, 150]. Laughter was also categorized into voiced laughter and unvoiced laughter [139]. Voiced laughter occurs when the energy source is regular vocal fold vibration, as in voiced speech, and includes melodic, song-like bouts, chuckles and giggles [139]. Unvoiced laughter bouts lack the tonality associated with stable or quasi-periodic vocal fold vibration, and include open-mouth, breathy pant-like sounds, closed-mouth grunts and nasal snorts [139]. The continuum from speech to laugh was divided into three categories: speech, speech-laugh and laugh [135, 118]. In speech-laugh, the duration of vocalization was observed to increase, most likely with changes in one or more of the features of vowel elongation, syllabic pulsation, breathiness and pitch [135]. It was shown that voiced laughter induces significantly more positive emotional responses in listeners than unvoiced laughter [10]. Laughter in dialogic interaction was also categorized as speech-smile, speech-laugh and laughter [87], where speech-smile included lip-spreading and palatalization superimposed on speech events. Four phonetic types of laughter, namely voiced, chuckle, breathy (ingression) and nasal-grunt, were studied in [24, 187]. In this study, we examine the voiced speech-laughter continuum in three categories: (i) normal speech (NS), (ii) laughed speech (LS) and (iii) nonspeech laugh (NSL). Laughed speech is assumed to have linguistic vocalization interspersed with nonlinguistic laugh content [135, 118]. Only voiced laughter [139], produced spontaneously, and speaking laughter [161] are considered.

2.6.3 Studies on acoustic analysis of laughter and research issues

Acoustic analysis of laughter was carried out in [16], in which features such as fundamental frequency, root mean square amplitude, time duration and formant structure were used to distinguish laughter from speech. It was observed that laughs have significantly longer unvoiced regions than voiced regions.
The mean fundamental frequency (F0) of laughter sounds was reported as 472 Hz for (Italian/German) females, with an F0 range of 246-1007 Hz [159]. The average F0 of normal speech was reported as 214 Hz and 124 Hz for females and males, respectively. Acoustic features such as F0, the number of calls per bout (3-8), spectrograms and formant clusters (F2 vs F1) were used in [11] to analyse the temporal features of laughter, their production modes and source-filter related effects. That study proposed a sub-classification of F0 contours in each laugh call as flat, rising, falling, arched and sinusoidal. Two acoustic features of laughter series, the specific rhythms (durations) and the changes in fundamental frequency, were investigated in [85] for their role in the evaluation of laughter. The acoustic features of the laugh-speech continuum, such as formant space, pitch range and voice quality, were studied in [118]. Combinations of features such as pitch and energy features, global pitch and voicing features, perceptual linear prediction (PLP) features and modulation spectrum features were used in [194] to model laughter and speech. Acoustic analysis of laughter produced by congenitally deaf and normal-hearing college students was carried out in [111]. That study focused on features such as degree of voicing, mouth position, airflow direction, relative amplitude, temporal features, fundamental frequency and formant frequencies. Differences in the degree of variation of the fundamental frequency, intensity and durational patterning (consisting of onset, main part, pause and offset) were studied in [101] to assess the degree of naturalness of synthesized laughter speech. Acoustic features (mostly perceptual and spectral) have been studied for different purposes and diverse applications, such as distinguishing laughter types [24], speech/laughter classification in meeting audio [84], development of a 'hot-spotter' system [23], detection of laughter events in meetings [83], automatic laughter detection [107, 194, 86], automatic synthesis of human-like laughter [186] and the 'AV Laughter Cycle' project [198]. The production characteristics of a speech signal with laughter can be expected to differ from those of normal speech. The production of laughter was studied from a respiratory dynamics point of view [51]. In that study, laugh calls were characterized by the sudden occurrence of repetitive expiratory effort, a drop in the functional residual capacity of lung volume in all respiratory compartments, and dynamic compression of the airways. "Laughter generally takes place when the lung volume is low" (p. 442) [109]. The production characteristics of laughter can be analyzed from the speech signal in terms of the excitation source and the vocal tract system characteristics, as for normal speech. In the production of laughter, significant changes appear to take place in the characteristics of the excitation source. The acoustic analyses of laughter have mostly been carried out using spectral and perceptual features [16, 11, 195, 111]. The voice source characteristics were investigated in [118] using features such as the glottal open quotient along with the spectral tilt, which were derived (approximately) from the differences in the amplitudes of harmonics in the Discrete Fourier Transform (DFT) spectrum. Source features such as the instantaneous pitch period, the strength of excitation at epochs, and their slopes and ratio were used for the analysis of laughter in [185].
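As an illustration of bout- and call-level acoustic measurement, the sketch below segments the calls of a laugh bout using a simple short-time energy threshold and reports per-call durations. The file name and threshold are assumed for illustration; this is not the analysis procedure of the thesis.

```python
import numpy as np
from scipy.io import wavfile

# Illustrative only: 'laugh_bout.wav' is an assumed recording of one laugh bout.
fs, x = wavfile.read('laugh_bout.wav')
x = x.astype(np.float64)
x = x / (np.max(np.abs(x)) + 1e-12)

# Short-time energy contour (25 ms frames, 10 ms hop).
N, H = int(0.025 * fs), int(0.010 * fs)
energy = np.array([np.sum(x[i:i + N] ** 2) for i in range(0, len(x) - N, H)])
active = energy > 0.1 * np.max(energy)           # assumed threshold for a 'call'

# Group consecutive active frames into calls; calls are typically
# interspersed with brief pauses within a bout.
calls, start = [], None
for i, a in enumerate(active):
    if a and start is None:
        start = i
    elif not a and start is not None:
        calls.append((start * H / fs, (i * H + N) / fs))
        start = None
if start is not None:
    calls.append((start * H / fs, (len(active) * H + N) / fs))

for k, (t0, t1) in enumerate(calls, 1):
    print("call %d: %.0f-%.0f ms (duration %.0f ms)"
          % (k, 1e3 * t0, 1e3 * t1, 1e3 * (t1 - t0)))
```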
In this thesis, we examine the changes in the vibration characteristics of the glottal excitation source, and the associated changes in the vocal tract system characteristics, during the production of laughter, using the EGG and acoustic signals. Details of the study are discussed in Chapter 6.

2.7 Review of studies on analysis of expressive voices and Noh voice

2.7.1 Need for studying expressive voices

Expressive voices are special sounds produced by strong interactions between the vocal tract resonances and the vocal fold vibrations. One of the main features of these sounds is the aperiodicity in the glottal vibration. Decomposition of speech signals into the excitation component and the vocal tract system component was proposed in [214] to derive the aperiodic component of the glottal vibration. In this study, the characteristics of the aperiodic component of the excitation are examined in detail in the context of artistic voices such as singing and Noh (a traditional performing art in Japan [55]), to show the importance of the timing information in the impulse-like sequence in the production of these voices. Normal voiced sounds are produced by quasi-periodic vibration [183] of the vocal folds, which can be approximated by a sequence of impulse-like excitations. The impulse-like excitation in each cycle is due to the relatively sharp closure of the glottis, which takes place after the pressure from the lungs is released by the opening of the closed glottis. In expressive voices the glottal vibrations may also be approximated by a sequence of impulse-like excitations, but the impulses need not be spaced at nearly equal intervals as in modal voicing. Moreover, the strengths of successive impulses need not be the same. In addition, the coupling of the source and the system produces different responses at successive impulses in the sequence. Thus the aperiodicity of the signal produced in expressive voices may be attributed to (a) unequal intervals between successive impulses in the excitation, (b) unequal strengths of excitation around the successive impulses, and (c) differences in the responses of the vocal tract system to successive impulse-like excitations. Besides the differences in the excitation, differences in the characteristics of the vocal tract system due to source-system coupling may also contribute to the quality of expressive voices. But the aperiodicity in the excitation may be considered the important feature of expressive voices, and hence the need to study the characteristics of the excitation source component of expressive voice signals.

2.7.2 Studies on representation of source characteristics and pitch perception

Representation of the excitation source component through a multi-pulse excitation sequence was proposed in [8, 6, 174], for the purpose of speech synthesis. Linear prediction coding (LPC) based methods were used to determine the locations and magnitudes of the pulses in successive stages, either by considering one pulse at a time [8] or by jointly optimizing the magnitudes of all the pulses located up to that stage [173]. The role of multi-pulse excitation in the synthesis of voiced speech was also examined [26]. In that study, it was observed that "even for periodic voiced speech, the secondary pulses in the multi-pulse excitation do not vary systematically from one pitch period to another" [26]. Multi-pulse coding of speech through regular-pulse excitation was also proposed [94], using linear prediction analysis.
In that study, an attempt was made to reduce the perceptual error between the original and the reconstructed signal. In another study, the effect of pitch perception in expressive voices such as the Noh singing voice was examined [55, 81]. In that study, a measure of pitch perception information, termed saliency, was proposed [55]. Subsequently, an approach for F0 extraction and aperiodicity estimation using the saliency information was also proposed [80]. Recently, a method for extracting the epoch intervals and representing the excitation source information through a sequence of impulses, with amplitudes corresponding to the strength of excitation, was proposed for modal voicing, using the zero-frequency filtering method [130, 216]. However, to the best of our knowledge, representing the excitation source information through a sequence of impulses that can be related to the perception of pitch in expressive voices has not yet been explored. The research efforts towards characterising the excitation source information and aperiodicity in expressive voices can be organised around four questions. (i) How can the excitation source component be characterised and represented in terms of a sequence of impulse-like pulses? (ii) How can changes in the perception of pitch, which can be rapid in expressive voices, be characterised? (iii) How can the instantaneous fundamental frequency (F0) be measured in a way that is also guided by the perception of pitch, especially in the regions of aperiodicity? (iv) Can we obtain a sequence of excitation impulses that is related to the perception of pitch? Answers to the first question were attempted through multi-pulse and regular-pulse excitation [8, 6, 94]. A measure of pitch perception, i.e., saliency, and F0 extraction based upon it were proposed [55, 81, 80] as answers to the second and third questions, respectively. But an answer to the last question, along with the inter-linking relations among the answers to the first three questions, has not been explored yet. In this study, we aim to examine each of these four questions afresh, using signal processing techniques, and to address the last and most important question by exploring the representation of the excitation source information through a sequence of impulse-like pulses that is related to the perception of pitch in expressive voices. Normally, the excitation source characteristics of speech are derived by removing or compensating for the response of the vocal tract system [78, 128, 72]. Short-time spectrum analysis is performed, and the envelope information corresponding to the vocal tract system characteristics is extracted [79, 78]. The finer fluctuations in the short-time spectrum are then used to derive the characteristics of the excitation source. The spectral envelope information can be obtained either by cepstrum smoothing or by linear prediction (LP) analysis [164, 114, 112]. More recently, the spectral envelope information has been derived effectively using the TANDEM-STRAIGHT method [81]. Note that in all these cases an analysis segment of two pitch periods or more is used. Moreover, the residual spectrum due to the excitation is derived by first determining the vocal tract system characteristics in the form of the spectral envelope. In the case of voiced speech, however, the excitation component is generated by the vocal fold vibration; the vocal tract system is then excited by this component, under the assumption that the interaction between the source and the system is negligible.
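A minimal sketch of this conventional 'envelope first, excitation second' decomposition is given below: the spectral envelope is estimated by autocorrelation-method LP analysis and the excitation is approximated by the inverse-filtered residual. The file name, pre-emphasis coefficient and LP order are illustrative assumptions.

```python
import numpy as np
from scipy.io import wavfile
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

# Illustrative only: 'voiced_segment.wav' is an assumed short voiced segment.
fs, x = wavfile.read('voiced_segment.wav')
x = x.astype(np.float64)
x = x / (np.max(np.abs(x)) + 1e-12)
x = lfilter([1.0, -0.97], [1.0], x)             # pre-emphasis, a common choice

p = 2 + fs // 1000                              # rule-of-thumb LP order

# Autocorrelation-method LP analysis: solve the normal equations R a = r.
r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + p]
r[0] *= 1.0 + 1e-6                              # slight regularization
a = solve_toeplitz((r[:p], r[:p]), r[1:p + 1])  # predictor coefficients a_1..a_p

# A(z) = 1 - sum_i a_i z^{-i}; the residual approximates the excitation source.
A = np.concatenate(([1.0], -a))
residual = lfilter(A, [1.0], x)

print("LP order:", p)
print("energy ratio residual/signal: %.3f" % (np.sum(residual ** 2) / np.sum(x ** 2)))
```

The residual obtained this way inherits any errors made in estimating the envelope, which is one motivation for the source-first viewpoint discussed next.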
It appears reasonable to extract the characteristics of the excitation source first, and then use the knowledge of the source to derive the vocal tract system characteristics. To study the characteristics of the source, it is preferable to derive these characteristics without reference to the system. The characteristics of the sequence of impulse-like excitation, i.e., the locations (epochs) of the impulses and their relative strengths, can be extracted using a modification of the zero-frequency filtering (ZFF) method [130, 216] proposed in this work. The sequence of impulses and their relative strengths are useful for studying the characteristics of the aperiodic source in expressive voices, provided they can be extracted from the acoustic signal.

2.7.3 Studies on aperiodicity in expressive voices such as Noh singing, and F0 extraction

In this study, a particular type of expressive voice, called Noh, is considered for studying the characteristics of the aperiodic source [55]. "In Noh, a highly theatrical performance art in Japan, a remarkably emotional message is largely conveyed by special voice quality and rhythmic patterns that the site uses in singing". "The term site is the name used for the role that is taken by the principal player in a Noh performance" [55]. The characteristics of Noh voice quality are described in [55]. That paper also discusses the importance of analysing the aperiodicity in voice signals in order to describe the Noh voice quality. The aperiodicity characteristics were examined using the TANDEM-STRAIGHT method to derive the vocal tract system characteristics, and the XSX (excitation structure extraction) method to derive the fundamental frequency, period by period [55, 81, 80]. "The algorithm tries to identify approximate repetitions in the signal as a function of time. By finding multiple candidate patterns for repetition, the method proposes multiple hypotheses for the fundamental period. Using each candidate time interval as the analysis window, the method tries to compare many different fundamental frequencies (F0) to evaluate their consistency for exploring locally acceptable periodicity by finding the best candidate" [55]. "The candidate value is associated with an estimate of 'salience', which may be interpreted as the likelihood of the particular F0 value to represent the effective pitch of the voice signal in the given context" [55]. Saliency, indicating the most prominent candidate for the instantaneous F0 at an instant of time, was proposed [55, 80] as a measure of the effective pitch perceived in Noh singing voice. Computation of saliency using the TANDEM-STRAIGHT method was proposed in [55, 81, 79]. In simple terms, the saliency of an F0 candidate is approximately the value of the peak at T0 = 1/F0 in the autocorrelation function derived from the significant part of the spectrum of the excitation component. The significant part is assumed to lie in the low frequency range below about 800 Hz, and is derived by low-pass filtering (cut-off frequency < 800 Hz) the spectrum of the excitation component [55, 79, 81, 80]. The saliency measure can thus be a means of characterizing the changes in the perception of pitch in expressive voices. Expressive voices like Noh singing carry information about voice quality characteristics, production properties and pathological conditions of phonation [55]. The emotional content in expressive voices is conveyed by two constituents: (a) prosody variation of the excitation source characteristics and (b) changes in voice quality [55, 81].
Prosody changes are reflected in the F0 contour, in phonological variables (tone, stress and rhythm) and in tonal variables. Voice quality changes are reflected in fluctuations of signal parameters related to the resonance characteristics (spectral envelope), rises in pitch (F0) and the stability of the laryngeal configuration (amplitude/intensity) [55]. Hence, the emotional content in expressive voices can be characterised by (i) coarticulation (the sequencing of quasi-stationary signals), (ii) source-system interaction and (iii) aperiodicity (voice fluctuations) [55, 81]. Interestingly, the human auditory system perceives these expressive sounds in terms of the excitation source and resonant filter characteristics [214, 55, 80]. The source characteristics determine the prosodic properties of phonetic signals, which also carry the spectral timbre information [55]. Perception of voice depends on sequential phonetic properties (distinctions among words) and suprasegmental properties (tones, accents or time-varying signal properties). The source-filter theory helps in distinguishing between periodic signals and deviations from periodicity. The periodic speech signals are usually voiced sounds, having quasi-stationary portions of vowel-like acoustic segments [183], with F0 as a physical attribute of voice quality. The deviations from periodicity can be due to random/chaotic changes in the linear acoustic system (due to air turbulence), nonlinearity of the vocal fold tissue, and the temporal variability of voiced signals. The temporal variability of the signal can be due to airflow modulation. It is related to the changes in the F0, amplitude and spectral envelope of the signal waveform within each glottal cycle [55, 79]. In a strict sense, aperiodicity should not be viewed merely as a deviation from periodicity. The aperiodicity of the speech signal reflects the emotional content in expressive voices. Aperiodicity is due to the sudden or gradual introduction of subharmonics, their abrupt appearance/disappearance, and variability in the harmonic frequencies [55, 79]. The nature of aperiodicity is expected to be different in natural conversation, in highly emotional speech and in Noh voice [55]. The distinguishing factors are changes in: (i) F0 (frequency modulation), (ii) amplitude (amplitude modulation), (iii) cycle-to-cycle fluctuation in the energy of the excitation source signal, and (iv) the vocal tract transfer characteristics (due to articulatory movement) [81, 80]. The factors for deviations from periodicity, i.e., those contributing to aperiodicity, can be grouped as (a) F0-dependent factors and (b) residual fluctuations (which have a random effect [81]). Periodicity estimation primarily involves F0 extraction. The F0 in speech may change from one glottal cycle to another, and is hence termed the instantaneous fundamental frequency. Accordingly, the perception of pitch also changes. The intercyclic changes in F0 are very small in the case of modal voicing, for which the temporal resolution, equal to T0 (i.e., one pitch period), is nearly uniform [80]. However, in the case of expressive voices there are rapid changes in F0 in some segments, which makes F0 extraction a challenging task. In the last two decades, several methods for F0 extraction have evolved.
The major methods can be broadly categorized as: (i) instantaneous frequency (IF) based [188, 1, 18], (ii) fixed-point analysis based [77], (iii) the integrated method (involving IF and autocorrelation) [78], (iv) TANDEM-STRAIGHT (involving the excitation structure extractor (XSX) analysis and IF) [79, 81, 80], (v) group delay based [129], (vi) DYPSA [91, 133], (vii) inverse-filtering based [2, 4, 3], and (viii) the zero-frequency filtering method [130, 216]. The challenge, however, still lies in extracting F0 in regions of harmonics/subharmonics and aperiodicity. Estimation of the fluctuation spectrum is involved when either F0 is not constant in time (i.e., T0 is not uniform) or no a priori information about F0 is available. The TANDEM-STRAIGHT based method involves multiple F0 hypotheses and saliency-based estimation of F0 [55, 81, 80]. It is apparently the sole attempt so far to incorporate the information of pitch perception in estimating F0 for expressive voices. In this study, we analyse the characteristics of the aperiodic components in the Noh voice signal. The excitation source characteristics are represented in terms of an impulse sequence in the time domain. The irregular intervals between epochs, i.e., the instants of significant excitation, along with the (varying) relative amplitudes of the impulses, are used to characterize the unique excitation characteristics of Noh voice. The instantaneous fundamental frequency (F0) for expressive voices is obtained from the saliency information, i.e., a measure of pitch perception, which is derived using a few new signal processing methods proposed in this thesis. Details of this study are discussed in Chapter 7.

2.8 Review of studies for spotting acoustic events in continuous speech

2.8.1 Studies towards trill detection

Trill sounds are common in several languages around the world, for example Spanish or Toda (Indian) [66, 178]. Analysis of trills can help in understanding the differences among different dialects of a language [106, 34]. The study of trills can also have sociolinguistic implications [41, 66]. It is envisaged that automatic detection of trills in continuous speech would have a wide range of applications. Depending upon the place of articulation, trills are termed bilabial, dental, alveolar, post-alveolar or uvular trills [96, 122]. Among these types, dental and alveolar apical trills are the more common [116]. Production of apical trills involves satisfying several constraints, categorised as aerodynamic and articulatory. Aerodynamic constraints are related to the factors essential for the initiation and sustenance of apical trill vibrations [175]. Articulatory constraints are related to aspects such as the lingual and vocal tract configuration [155, 100, 40]. The typical rate of trilling of the tongue is in the range of 20 to 40 Hz, as measured from the acoustic waveform or spectrogram [98, 116, 122]. Two to three cycles of apical trills are commonly produced in continuous speech [105, 100, 66]. More than three cycles of apical trills can be produced in isolation, as a steady sustained sound. The phonological aspects of trills were reported in [110, 162, 100]. The production mechanism of tongue tip trills was modelled and described from an aerodynamic point of view [98, 116, 100]. The tongue tip vibrations were described from an articulatory point of view [28, 100, 178]. The articulatory mechanics of tongue tip trilling was modelled in [116]. The aerodynamic characteristics and the phonological pattern of trills across languages were studied in [175].
The trill cycle and trilling rate were studied in [100, 116, 105]. A mean trilling rate of 25 Hz (range 18-33 Hz) was reported [105]. Acoustic correlates of phonemic trill production [34] and an acoustic characterisation of trills [66] were reported for the Spanish language. The characteristics of the voiced apical trill [r] are studied in the context of three different vowels, [a], [i] and [u]. It is observed that the period of the glottal vibration changes in each trill cycle due to tongue tip trilling [40]. Recently, the effect of the changing vocal tract shape and the associated changing excitation source characteristics due to trilling was studied from the perception point of view [122]. It is observed that there is coupling between the excitation source and the vocal tract system. Both contribute to the production and perception of trills, though the role of the excitation source appears to be relatively more dominant [122]. To the best of our knowledge, not much attention has been paid so far to the automatic detection of trills in continuous speech using the characteristics of trill sounds. Hence, it is worth investigating methods for automatic detection of trills, the details of which are discussed in Chapter 8.

2.8.2 Studies on shout detection

Automatic detection of shout or shouted speech regions in continuous speech has applications in domains ranging from security, sociology, behavioural studies and health-care access to crime detection and investigation [132, 70, 160, 202, 199]. Hence, research on acoustic cues to facilitate the detection of shout has been gaining increased attention in recent times. In this thesis, we aim to exploit the changes in the production characteristics of shouted speech, as compared to normal speech, to develop an automatic system for detection of shout regions in continuous speech. Shouted speech consists of linguistic content and voicing in the excitation. The production characteristics of shout, in particular the excitation source characteristics, are likely to deviate from those of normal speech, especially in the regions of voicing. Associated changes also take place in the characteristics of the vocal tract system. In general, changes in the excitation source characteristics are examined by studying the changes in the frequency of vocal fold vibration, i.e., the instantaneous fundamental frequency (F0) [132, 160, 202, 146]. Changes in the vocal tract system characteristics are usually examined in terms of changes in the spectral characteristics of speech, such as Mel-frequency cepstral coefficients (MFCCs) [70, 199, 146, 217]. Studies on the analysis of shout or scream signals have mostly used features like F0, MFCCs and signal energy [132]-[146]. Features such as formant frequencies, F0 and signal power were studied in [132, 202, 146]. Applications of these features included automatic speech recognition (ASR) for shouted speech [132]. MFCCs, frame energy and autocorrelation-based pitch (F0) features were studied in [70, 160, 199]. Applications of these features included scream detection using a support vector machine (SVM) classifier [70]. MFCCs with weighted linear prediction (WLP) features were studied for detection of shout in noisy environments [146]. Spectral tilt and linear predictive coding (LPC) spectrum based features were used in [217] to study the impact of vocal effort variability on the performance of an isolated-word recognizer.
MFCCs and the spectral fine structure (F0 and its harmonics) were used recently in a Gaussian mixture model (GMM) based approach for shout detection [147]. In this study, we develop an experimental automatic shout detection system (ASDS) to detect regions of shout in continuous speech. The major challenge in developing an automatic shout detection system is the vast variability in shouted/normal speech, which could be speaker, language and application specific. Hence, a rule-based approach is used in the ASDS. The design details of the ASDS and the performance evaluation results are discussed in Chapter 8.

2.8.3 Studies on laughter detection

Detection of paralinguistic events like laughter has potentially diverse applications, such as indexing/search of audio-visual databases, health care and biometrics. Detection of laughter regions in continuous speech can also help in the classification of the emotional states of a speaker [194]. Hence, researchers have been attracted in the last few years towards finding the distinguishing features of laughter, to develop systems for detecting laughter regions in continuous speech [107, 23, 83, 24, 86, 84, 198]. Acoustic spectral and perceptual features have been used in applications such as detection of laughter events in meetings [83], distinguishing the four (phonetic) types of laughter [24], and speech/laughter classification in meeting audio [84]. Other diverse applications include the 'hot-spotter' [23], automatic laughter detection [194, 86] and the 'AVLaughterCycle' project [198]. Automatic laughter detection in continuous speech, exploiting changes in the production characteristics (mainly of the source) during laughter production, is attempted in this thesis. Its details and performance evaluation results are discussed in Chapter 8.

2.9 Summary

In this chapter, we have revisited the basic concepts of speech production, the significance of glottal vibration and the notions of frequency (F0), along with the analytic signal and parametric representations of the acoustic signal. Earlier studies related to the analysis of each of the four sound categories are reviewed briefly. Studies on the nature of sounds involving source-system coupling effects, such as trills, and those highlighting the significance of source-system interaction and acoustic loading effects, are reviewed. Earlier studies on emotional speech, paralinguistic sounds and expressive voices are also reviewed, in particular on shouted speech, laughter and Noh voice, respectively. Studies on aperiodicity in Noh voice and on F0 extraction are also reviewed. The feasibility of automatic detection of trills in continuous speech is examined. Since such detection can be developed using the production characteristics of apical trills, earlier studies on trill analysis are reviewed in brief. Automatic detection of shout in continuous speech in real-life practical scenarios is important and is a challenging task. Hence, earlier studies aimed at shout detection are reviewed in brief, prior to developing an experimental automatic shout detection system based on changes in the production features. The feasibility of automatic detection of laughter (nonspeech laugh or laughed speech) in continuous speech is explored, and hence related earlier studies are reviewed briefly. The reviews carried out in this chapter, on the analysis of the four categories of sounds and on attempts towards detecting a few acoustic events, highlight the research issues and challenges involved.
They also highlight the limitations of the signal processing methods used in these studies, and the need for new approaches. In the subsequent chapters of this thesis, each of the four sound categories is analysed separately, exploring the changes in the production characteristics, primarily the source characteristics. The impulse-sequence representation of excitation information in speech coding methods, and the recently proposed signal processing methods that are used in these analyses, are discussed in the next chapter. Using these, the three prototype automatic detection systems developed for spotting trills, shout and laughter are also discussed in a later chapter.

Chapter 3

Signal Processing Methods for Feature Extraction

3.1 Overview

Analysis of normal (verbal) speech is carried out using standard signal processing techniques. But the variations in it, such as those related to source-system interaction, can be analysed better using some recently proposed signal processing methods, which are discussed in this chapter. Mainly, the zero-frequency filtering and zero-time liftering methods are discussed, which are used for deriving the excitation source characteristics in terms of the impulse-sequence representation and the spectral characteristics, respectively. Methods for deriving the resonance characteristics of the vocal tract system in terms of the first two dominant frequencies are also discussed. In general, the tendency is to use the standard or recently proposed signal processing methods for the analysis of nonverbal speech sounds as well. But the analysis of emotional speech, paralinguistic sounds and expressive voices requires further specialized signal processing methods, owing to the peculiarities and range of feature variations involved in the production of these sounds. Hence, there is a need for some modifications/refinements of these methods, along with some new approaches, which are discussed in later chapters in their respective contexts. The impulse-sequence representation of the excitation source component of the acoustic signal has been of considerable interest in speech coding research, mainly for the synthesis of natural-sounding speech. But this representation can also help in the analysis of nonverbal speech sounds. Different speech coding methods focused primarily on achieving low bit-rates and good voice quality of the synthesized speech. These methods can be broadly classified into three categories, namely waveform coders, vocoders and hybrid codecs. Waveform coders aimed at mimicking the speech waveform to the best possible extent. Vocoders, mainly linear prediction (LP) coding or residual-excited LP source-coders, led to the development of stochastically generated codebook-excited LP (CELP) codecs. Hybrid or analysis-by-synthesis codecs, such as multi-pulse/regular-pulse excited and CELP codecs, aimed at achieving intelligible speech at bit-rates ≤ 4 Kbps. The analysis approaches in these methods differed in how they estimated the pulse positions, amplitudes or phases. The presence of secondary pulses within a glottal cycle was also indicated in some studies. The impulse-sequence representation of the excitation source component is used in characterizing the nonverbal speech sounds in this thesis. In this chapter, the methods of estimating the impulse-sequence representation of the excitation information used in various speech coding methods are discussed first. The aim is to gain insight into the mathematical basis of the underlying research issues, some of which are addressed in this thesis.
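As a preview of the zero-frequency filtering idea mentioned above (the method itself is described in Section 3.5), the following is a minimal illustrative sketch that extracts epoch locations and a crude strength-of-excitation measure from a clean voiced recording. The file name, the assumed average pitch period used for trend removal, and the number of trend-removal passes are assumptions; this is not the exact implementation used in this thesis.

```python
import numpy as np
from scipy.io import wavfile

def zff_epochs(path, avg_pitch_ms=8.0):
    """Illustrative zero-frequency filtering: epochs and their relative strengths."""
    fs, s = wavfile.read(path)
    s = s.astype(np.float64)
    s = s / (np.max(np.abs(s)) + 1e-12)

    x = np.diff(s)                               # remove any slowly varying DC offset

    # A cascade of two zero-frequency resonators (poles at z = 1) is equivalent
    # to passing the signal through four cumulative summations.
    y = x.copy()
    for _ in range(4):
        y = np.cumsum(y)

    # Trend removal: subtract a local mean computed over a window of about
    # one to two average pitch periods (an assumed 8 ms here); repeat thrice.
    w = 2 * int(avg_pitch_ms * 1e-3 * fs / 2) + 1
    kernel = np.ones(w) / w
    for _ in range(3):
        y = y - np.convolve(y, kernel, mode='same')

    # Epochs: negative-to-positive zero crossings of the ZFF signal; the local
    # slope at each crossing serves as a rough strength-of-excitation measure.
    zc = np.where((y[:-1] < 0) & (y[1:] >= 0))[0]
    strength = y[zc + 1] - y[zc]
    return zc / fs, strength

# Assumed example file.
epochs, soe = zff_epochs('voiced_segment.wav')
print("epochs found:", len(epochs))
print("median epoch interval (ms): %.2f" % (1e3 * np.median(np.diff(epochs))))
```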
The chapter is organised as follows. Section 3.2 discusses speech coding methods and approaches for representing the excitation source information in terms of an impulse sequence. The standard all-pole model of excitation used in LPC vocoders is discussed in Section 3.3. Methods of estimating the impulse-sequence representation of the excitation information used in speech coding are discussed in Section 3.4. In Section 3.5, the zero-frequency filtering method for extracting the excitation source characteristics is discussed. Section 3.6 discusses the zero-time liftering method for extracting the spectral characteristics. Methods for computing the first two dominant frequencies of the vocal tract system resonances are discussed in Section 3.7. Research issues and challenges involved in this impulse-sequence representation are discussed in Section 3.8. The chapter is summarized in Section 3.9.

3.2 Impulse-sequence representation of excitation in speech coding

Impulse-sequence representation of the excitation source information in the acoustic signal has been attempted in various speech coding methods evolved over roughly the last three decades. Different speech coding methods have been proposed for diverse applications in mobile communication systems, voice response systems, high-speed digital communication networks, etc. [8]. Most of these speech coding methods focused on achieving one of two contrasting objectives: (i) producing high-quality speech that sounds as close to natural speech as possible, and (ii) lower bit-rates of coding. Some methods attempted to achieve both objectives at the same time, though with some compromise. Hence, synthesis of natural-sounding speech was aimed at low bit-rates of coding, i.e., below 4.8 Kbits/sec. Speech coding methods can be categorised into three classes, based on their evolution stages and the approaches adopted. The earlier speech coders, i.e., (a) waveform coders, led to the development of (b) LPC based vocoders, which in turn led to the development of (c) analysis-by-synthesis, i.e., hybrid, codecs.

(A) Waveform coders

Waveform coders [74] focused on reproducing (mimicking) the speech signal waveform as faithfully as possible, with minimum distortion and error [8]. Most waveform coders were either pulse-code modulation systems or their differential generalizations [74], for example adaptive predictive coders [5] and transform coders [113]. Pulse-code modulation (PCM), differential pulse-code modulation (DPCM) and delta modulation (DM) were also attempted [74]. Waveform coders were capable of producing high-quality speech, but their drawback was that the coding bit-rates were relatively high, i.e., above 16 Kbits/sec.

(B) Vocoders

LPC vocoders, i.e., source coders [140], aimed at reducing the coding bit-rate while achieving intelligible speech. Using a parametric model of speech production (i.e., the source-filter model), vocoders synthesize intelligible speech at bit-rates even below 2.4 Kbits/sec [113], but the speech is not natural-sounding. Hence, the cost paid in vocoders is in voice quality (naturalness). In the parametric (source-filter) model [8]: (i) a linear filter models the characteristics and spectral shaping of the vocal tract [8], and (ii) the excitation source provides the excitation to this linear filter. The model assumes classification of the speech signal into two classes, voiced and unvoiced speech.
The excitation for voiced speech is modeled by a quasi-periodic train of delta-function type pulses located at the pitch-period intervals, and for unvoiced speech by White noise [8]. This model has limitation in producing high-quality speech even at high bit-rates, because of the inflexible way in which the excitation is generated. Applications of low bit-rate coding (≤ 4.8 Kbits/sec) in mobile communication network etc. lead to several approaches in vocoders. (i) Linear predictive coding (LPC) techniques [113] for speech analysis and synthesis provided alternative way of representing the spectral information by all-pole filter parameters. But, these used excitation similar to that in channel vocoders [52]. (ii) Voice-excited vocoders [166] involved excitation of vocal tract system in two modes, pulse-sequence (voiced) and noise (unvoiced), both estimated over short segments of speech waveform. But, the problem is - “there are regions where it is not clear whether the signal is voiced or unvoiced, and what the pitch period is...” [8]. (iii) Residual-excited linear predictive (RELP) vocoders [197] used LP residual for the excitation. But these vocoders had problems in excitation signal representation. (iv) Code-excited LPC (CELP) vocoders [165] represented the excitation sequence by a stochastically generated codebook at around 4.8 Kbits/sec. The problem in these CELP vocoders was the large amount of computations required to choose the optimum code from the stored code-book. As a solution to these limitations, the multi-pulse excited (MPE) speech coding method and its few variations were proposed [8, 140], that are discussed next. (C) Hybrid codecs Hybrid codecs [8, 140, 15, 61] are ‘Analysis-by-Synthesis’ (AbS) kind time-domain coders-decoders (i.e., codecs), that aimed at achieving the dual objectives of speech-coding, i.e., good voice quality (intelligibility) of synthesized speech and coding at low bit-rates (≤ 4.8 Kbits/sec). These codecs use the same linear-filter model of the vocal tract system that was used in LPC vocoders. But, instead of a simple two-state voiced/unvoiced model as input to this filter, the excitation signal is dynamically chosen to match the reconstructed speech signal waveform as close as possible to the original speech. The ‘Analysis-by-Synthesis’ (AbS) codecs have two parts, encoder and decoder [140]. The input speech to be coded is divided into frames of size about 20 ms. Then, for each frame the parameters for a synthesis filter are determined in encoder, and the excitation to this filter is determined in decoder. The encoder comprises of a synthesis filter and a feedback path, which consists of error-weighting and errorminimisation blocks. For each frame of speech signal s(n), the synthesis filter gives the reconstructed signal sˆ(n) as output for the excitation signal u(n) as input. The encoder analyses the input speech by synthesizing multiple approximations to it. For each frame, it transmits to decoder the information about two things, the parameters of synthesis filter and the excitation sequence. The decoder then generates the reconstructed signal sˆ(n) by passing the given excitation u(n) through the synthesis filter. The 41 objective is to determine the appropriate excitation signal u(n), for which the error e(n) between the input signal s(n) and the reconstructed signal sˆ(n) after weighting, i.e., ew (n) is minimised. 
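To make the closed-loop selection concrete before the details of the synthesis filter and the pulse searches are taken up, the following Python/NumPy sketch illustrates the AbS principle with a greedy multi-pulse search: candidate impulse positions are tried through the all-pole synthesis filter, and the pulse that most reduces the reconstruction error is retained. The frame-wise processing, the number of pulses, and the use of a plain (unweighted) squared error instead of the perceptually weighted error e_w(n) are simplifying assumptions made here for illustration; this is not the procedure of any particular standardized codec.

```python
import numpy as np
from scipy.signal import lfilter

def abs_excitation_search(s_frame, a, n_pulses=8):
    """Toy analysis-by-synthesis loop: greedily place impulse-like pulses so that the
    output of the all-pole synthesis filter 1/A(z), A(z) = 1 + a1 z^-1 + ... + ap z^-p,
    matches the input frame with minimum (unweighted) squared error."""
    N = len(s_frame)
    delta = np.concatenate(([1.0], np.zeros(N - 1)))
    h = lfilter([1.0], a, delta)                     # impulse response of 1/A(z), truncated
    u = np.zeros(N)                                  # excitation built up pulse by pulse
    target = np.asarray(s_frame, dtype=float).copy()
    for _ in range(n_pulses):
        best_amp, best_pos, best_err = 0.0, 0, np.inf
        for m in range(N):                           # try a pulse at every position m
            hm = np.concatenate((np.zeros(m), h[:N - m]))
            amp = np.dot(target, hm) / np.dot(hm, hm)
            err = np.sum((target - amp * hm) ** 2)
            if err < best_err:
                best_amp, best_pos, best_err = amp, m, err
        u[best_pos] += best_amp                      # keep the pulse that helps most
        target -= best_amp * np.concatenate((np.zeros(best_pos), h[:N - best_pos]))
    s_hat = lfilter([1.0], a, u)                     # reconstructed speech for this frame
    return u, s_hat
```

The same loop structure underlies the multi-pulse and code-excited schemes discussed below; they differ mainly in how the candidate excitations are generated and how the error is weighted.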
The synthesis filter [8, 140] used in encoder/decoder part in Hybrid (AbS) codecs is usually an allP 1 pole short-term linear filter H(z) = A(z) , where A(z) = 1 − pi=1 ai z −i is the prediction error filter that models the correlation introduced into speech, by the action of the vocal tract. Here, the order of LP analysis p is chosen usually as 10. This filter minimizes the energy of the residual signal, while passing an original signal frame through it. The long-term periodicities present in the voiced speech (due to glottal cycles) are modeled using either a pitch-extractor in the synthesis filter, or adaptive code-book in the excitation generator. Thus, the excitation signal is Gu(n − n0 ), where n0 is estimated pitch-period. The error weighting block shapes the error signal spectrum, to reduce the subjective loudness of error between the reconstructed signal and original speech. Minimising the weighted error thus concentrates the energy of error signal in the frequency regions of speech having high energy. The objective is to enhance the subjective quality of the reconstructed speech, by emphasizing the noise in the frequency regions where the speech content is low. The selection of excitation waveform u(n) is important in hybrid codecs. That excitation is chosen, which gives the minimum weighted error between the original speech and the reconstructed signal. However, this closed-loop determination of the excitation sequence to produce good quality speech at low bit-rates, involves computationally intensive operations. Various hybrid codecs were developed, which differ in the way the excitation signal u(n) is represented and used. (A) Multi-pulse excited (MPE) codecs [8] model the ideal excitation by non-zero samples (pulses) in fixed number, separated by relatively long sequences of zero-values samples, for each frame of speech. Typically 8 pulses for every 10 ms were considered sufficient to generate different kind of speech sounds including voiced and unvoiced speech, with little audible distortion [8]. Sub-optimal methods were used to determine the positions of these non-zero pulses within a frame, and their amplitudes. Good quality reconstructed speech at a bit rate of around 10 Kbits/sec was achieved. The advantage in MPE codecs was that no a priori knowledge is required either of voiced/unvoiced decision or of the pitch-period for synthesized speech. But, the bit-rate achieved was still high in these codecs. (B) Regular-pulse excited (RPE) codecs [94] use a number of non-zero pulses to give the excitation sequence u(n), like in MPE codecs. But, here the pulses are regularly spaced at fixed interval, so the encoder needs to determine only the position of the first pulse and amplitudes of all pulses. Since, less information needs to be transmitted about pulse position, the RPE codecs can use more non-zero pulses (around 10 pulses per 5 ms), for a given bit-rate (say 10 Kbits/sec). Hence, RPE codecs give slightly better quality of reconstructed speech than MPE codecs, and are used in GSM mobile telephone systems in Europe [94]. However, computational complexity for RPE codecs is more than MPE codecs. (C) Code-excited linear prediction (CELP) codecs [165] are different from the MPE and RPE codecs, both of which provide good quality of speech only at the bit-rates ≥ 10 Kbits/sec. Both, the MPE and RPE codecs require to transmit the information about both positions and amplitudes of the excitation pulses. 
But in CELP codecs the excitation signal is vector-quantized, i.e., the excitation is given by an entry from a vector-quantized code-book, and a gain term is used to control its power. Since a 1024-entry code-book index can be represented by 10 bits and the gain can be coded with about 5 bits, only about 15 bits are now required, which greatly reduces the bit-rate compared to the 47 bits required in the GSM RPE codec. However, the disadvantage of CELP codecs is their high complexity for any real-time implementation. As a solution, the CELP codecs can be used at bit-rates ≤ 4.8 Kbits/sec by classifying the speech into voiced, unvoiced and transition frames, and then coding each type of speech segment in a different way.

(D) Multi-band excitation (MBE) codecs declare some regions in the frequency domain as voiced or unvoiced, and for each frame transmit the pitch period, spectral magnitudes, phase and voiced/unvoiced decisions for the harmonics of F0.

(E) Prototype waveform interpolation (PWI) codecs transmit the information of a single pitch period every 20-30 ms, and then use interpolation to reproduce a smoothly varying quasi-periodic waveform for voiced speech segments. Thus, good quality speech at bit-rates below 4 Kbits/sec is obtained by such variations of CELP codecs.

3.3 All-pole model of excitation in LPC vocoders

(A) Generic pole-zero model [112]: A continuous-time signal s(t), sampled at sampling interval T, can be represented as a discrete-time signal in the form of a time series s[nT] or s[n] (denoted as s_n) [112], where n is a discrete variable and the sampling frequency is f_s = 1/T. A parametric model of the system can be given by a generic pole-zero model, in which the signal s_n, i.e., the output of the system, is predicted by a linear combination of past outputs, and past and present inputs (hence the name linear prediction) [112]. The predicted output of the system, i.e., s[n] (or s_n), is given by

s[n] = -\sum_{k=1}^{p} a_k s[n-k] + G \sum_{l=0}^{q} b_l u[n-l], \quad b_0 = 1    (3.1)

where a_k (1 ≤ k ≤ p), b_l (1 ≤ l ≤ q) and G (the gain) are parameters of the system. Here, u[n] is the unknown input signal (i.e., a time sequence). By taking the z-transform on both sides of (3.1), we get

S(z) = -\sum_{k=1}^{p} a_k z^{-k} S(z) + G \sum_{l=0}^{q} b_l z^{-l} U(z)

\left(1 + \sum_{k=1}^{p} a_k z^{-k}\right) S(z) = G \left(1 + \sum_{l=1}^{q} b_l z^{-l}\right) U(z)

H(z) = \frac{S(z)}{U(z)} = G \, \frac{1 + \sum_{l=1}^{q} b_l z^{-l}}{1 + \sum_{k=1}^{p} a_k z^{-k}}    (3.2)

where H(z) is the transfer function of the system and U(z) is the z-transform of u[n] (i.e., u_n). Also, S(z) is the z-transform of s[n] (i.e., s_n), given by S(z) = \sum_{n=-\infty}^{\infty} s[n] z^{-n}. Here, H(z) in (3.2) is the general pole-zero model, in which the numerator polynomial gives the zeros and the denominator polynomial the poles. Its two special cases are the all-zero and all-pole models. If a_k = 0 for 1 ≤ k ≤ p, then H(z) in (3.2) is called the all-zero or moving average (MA) model. If b_l = 0 for 1 ≤ l ≤ q, then H(z) in (3.2) is called the all-pole or auto-regressive (AR) model. Hence, the pole-zero model is also called the auto-regressive moving average (ARMA) model [112].

(B) All-pole model [112, 7]: Spectral information can be represented by the all-pole filter parameters, by using linear predictive coding (LPC) techniques in speech analysis and synthesis [7]. In the all-pole model, the signal s[n] is given by a linear combination of past output values and some input u[n], as

s[n] = -\sum_{k=1}^{p} a_k s[n-k] + G \, u[n]    (3.3)

where G is the gain factor and u[n] is the input, i.e., the excitation sequence (or signal). By taking the z-transform on both sides of (3.3), we get

S(z) = -\sum_{k=1}^{p} a_k z^{-k} S(z) + G \, U(z)

\left(1 + \sum_{k=1}^{p} a_k z^{-k}\right) S(z) = G \, U(z)

H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 + \sum_{k=1}^{p} a_k z^{-k}}    (3.4)

where H(z) is the all-pole transfer function, S(z) = \sum_{n=-\infty}^{\infty} s[n] z^{-n} is the z-transform of s[n], and U(z) is the z-transform of u[n]. Here, the presence of some input u[n] is assumed.

(C) Method of least squares [112]: If the input u[n] is completely unknown, then the output \tilde{s}[n] (or \tilde{s}_n) can be predicted from a linearly weighted sum of only the past samples of the signal s[n], as

\tilde{s}[n] = -\sum_{k=1}^{p} a_k s[n-k]    (3.5)

The error e[n] between the actual value s[n] and the predicted value \tilde{s}[n] is then given by

e[n] = s[n] - \tilde{s}[n] = s[n] + \sum_{k=1}^{p} a_k s[n-k]    (3.6)

Here, the error e[n] is also called the residual. The parameters {a_k} are obtained by minimizing the mean or total squared error (i.e., of e[n]) with respect to each of the parameters.
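For the total-squared-error case, this minimization leads to a set of linear normal equations in the autocorrelation coefficients of the frame. The following Python/NumPy sketch is a minimal illustration of estimating {a_k} and the residual e[n] of (3.6) under the sign convention of (3.3)-(3.6); the Hamming analysis window, the LP order p = 10, and solving the normal equations directly with numpy.linalg.solve (rather than the Levinson-Durbin recursion) are choices made here for brevity, not prescriptions from the text.

```python
import numpy as np
from scipy.signal import lfilter

def lp_coefficients(frame, p=10):
    """Estimate the LP coefficients {a_k} of (3.3) for one speech frame by
    minimizing the total squared prediction error (autocorrelation method)."""
    x = np.asarray(frame, dtype=float) * np.hamming(len(frame))   # tapered analysis frame
    r = np.correlate(x, x, mode='full')[len(x) - 1:]              # autocorrelation r[0], r[1], ...
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, -r[1:p + 1])          # normal equations for s[n] = -sum_k a_k s[n-k]
    return np.concatenate(([1.0], a))            # A(z) = 1 + a1 z^-1 + ... + ap z^-p

def lp_residual(frame, A):
    """Inverse filter the frame with A(z) to obtain the residual e[n] of (3.6)."""
    return lfilter(A, [1.0], frame)

# usage (illustrative): A = lp_coefficients(voiced_frame, p=10); e = lp_residual(voiced_frame, A)
```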
(D) LPC all-pole synthesis filter in LPC vocoders and its limitations [15, 140, 61, 8]: In speech synthesis, ideally the excitation should be the linear-prediction residual [112, 7]. Most LPC vocoders use as excitation either (i) a train of periodic non-zero or delta-function pulses separated by the pitch period, for voiced speech, or (ii) a white noise waveform, for unvoiced sounds. But the problem is a reliable separation of speech segments into voiced and unvoiced classes [8]. Also, this rigid idealization of the excitation may be contributing to the unnatural-sounding voice quality of synthesized speech [8]. Another important limitation of LPC vocoders relates to the observation in [8] that "it would be gross simplification to assume that there is only one point of excitation in the entire pitch period" [8, 67]. There exists secondary excitation, apart from the main excitation occurring at glottal closure, which occurs not only at glottal opening and during the open phase, but also after the glottal closure [8, 67]. "These results suggest that the excitation for voiced speech should consist of several pulses in a pitch period, rather than just one pulse at the beginning of the period" [8].

3.4 Methods to estimate the excitation impulse sequence representation

3.4.1 MPE-LPC model of the excitation

A solution to the problem of excitation representation in LPC vocoders was proposed by Atal and Remde [8]. In the proposed multi-pulse excitation (MPE) model [8], the excitation signal is a combination of pulses in a glottal period, hence the name multi-pulse excitation. This model requires a priori knowledge of neither the voiced/unvoiced decision nor the pitch period [8]. "The purpose of the multipulse analysis is to either replace or model the residual signal, by a sequence of pulses" [8]. Various MPE codecs differ in their methods for determining the positions and amplitudes of the pulses in the pulse sequence, within a given interval of time. The pulse positions and amplitudes must be selected such that the difference between the reconstructed and the original speech (computed using some measure) is minimized.

(i) Determining the MPE pulse sequence using an error-weighting filter: The sequence u[n] of pulses (or impulses, i.e., delta functions) is used as input to the LPC all-pole synthesis filter H(z) given by (3.4). The objective is to minimize a performance measure, namely the weighted mean-squared error ε, computed from the difference e[n] between the original speech s[n] and the reconstructed (synthesized) speech s̃[n], as in (3.6) [8, 15].
The weighting here is accomplished by using a filter [8], as W (z) = H(γz) H(z) (3.7) where, γ is the bandwidth expansion factor, W (z) is weighting filter, H(z) is all-pole LPC synthesis filter, and H(γz) is bandwidth-expanded synthesis filter [15]. The weight is chosen such that the SNR is lower in formant regions, since noise in these regions is better masked by the voiced speech signal. The desired multi-pulse excitation d[n] (or dn ) is determined by modeling the LPC residual r[n] (or {rn }), such that weighted error between original and synthesized speech is minimized. Here, desired signal dn is obtained by passing the residual (rn ) through the bandwidth-expanded synthesis filter H(γz). d[n] = ∞ X r[k] h[n − k] (3.8) k=− ∞ where, h[n] (or hn ) is causal impulse response of the bandwidth-expanded synthesis filter H(γz). Different approaches are used for determining the positions and amplitudes of impulse-like pulses. (ii) Finding the transfer function of error-weighting filter [8]: In multi-pulse excitation model, an all-pole LPC synthesizer filter H(z) is excited by an excitation generator that gives a sequence of pulses located at times (positions) t1 , t2 , ..., tn , ... with amplitudes 45 α1 , α2 , ..., αn , ..., respectively. The pulse-sequence, referred to as vn (or v[n]), could possibly be the desired impulse sequence representation of the excitation, denoted as dn (or d[n]) in (3.8). When excited by v[n], the LPC synthesis all-pole filter H(z) produces the output synthesized speech samples s˜[n] (or s˜n ). The sampled output of the all-pole filter, when passed through a low-pass filter, produces a continuous reconstructed speech waveform sˆ(t). Note that (˜) denotes discrete samples of the synthesized speech, and (ˆ) the continuous reconstructed speech. Comparison of synthesized speech samples s˜[n] with the corresponding speech samples s[n] produces the error signal e[n] (or en ). The error signal e[n] needs to be modified to take into account the human ear’s perception of this signal. Human ear’s phenomena like masking and limited frequency resolution are considered, for deemphasizing the error in the formant regions [8]. The error signal e[n] after modifying it is used for estimating the excitation signal, i.e., locations and amplitudes of pulses in the sequence. A linear filter is used for suppressing the error signal energy in the formant regions. Thus error signal is weighted, squared and averaged over 5-10 ms intervals to produce the mean-squared weighted error ǫ [8]. The locations and amplitudes of pulses are chosen such that the error ǫ is minimized. Note that e[n] is the difference between synthesized speech samples s˜[n] and original speech samples s[n], but the ǫ is mean-squared weighted error in frequency-domain. The frequency-weighted error is given by [8] Z fs ˆ )|2 W (f ) df ǫ= |S(f ) − S(f (3.9) 0 ˆ ) are Fourier transform of original speech s(t) and synthesized speech sˆ(t), respecwhere S(f ) and S(f tively. Here, fs is sampling frequency, and W (f ) is a weighting function (in frequency-domain), which is chosen such that formant regions in the error spectrum are de-emphasized. If 1 − P (z) is LPC inverse filter in z-domain, then short-time spectral envelope of the speech (error) signal is given by 2 K Se (f ) = (3.10) 1 − P (e− 2jπf fs ) where, K is the mean-squared prediction error. 
The inverse filter 1 − P (z) is given by 1 − P (z) = 1 − p X ak z −k =1− p X ak e − 2jπf k fs (3.11) k=1 k=1 where {ak } are coefficients of the inverse filter, and transfer function of error-weighting filter W (z) is P 1 − pk=1 ak z −k P (3.12) W (z) = 1 − pk=1 ak γ k z −k where γ parameter controls the error weight in the formant region. The value of γ chosen between 0 and 1, decides the degree of de-emphasis of the formant regions in the error spectrum. For γ = 0 the W (z) = 1 − P (z), and for γ = 1 the W (z) = 1. A typical value of γ = 0.8 is used for fs = 8 KHz. Notice that the expression in (3.12) is in-line with the simplified version in (3.7). 46 The locations and amplitudes of the pulses in the excitation signal are determined by minimizing the weighted error over successive 5 ms intervals (for speech data averaged over 20 ms intervals and speech segments of 0.1 sec [8]). Different methods are used for minimizing the weighted mean-squared error. (iii) Error minimization procedure, to get locations and amplitudes of pulses in excitation sequence: Most of the error minimization procedures [8, 15, 140, 61] aim to minimize the weighted error, and select appropriate locations and amplitudes of pulses in the excitation signal. Pulse locations can be found by computing either one pulse at a time, or by computing the error for all possible pulse locations in a given time interval and then locating the minimum error position [8]. Amplitudes of the pulses appear as a linear factor in the error, and as a quadratic factor in the mean-squared weighted error. Hence, pulse amplitude is obtained either as a closed-form solution by setting the derivative of the mean-squared error to zero, or by determining in single step the amplitudes of all pulses by solving a set of linear equations (assuming the pulse positions are known a priori). A procedure for finding the locations and amplitudes of pulses in a given time-interval [8], is as follows: Step 1: Generate synthetic output speech, entirely from the memory of the all-pole filter from previous synthesis intervals, and without any excitation pulse. Step 2: Determine the location and amplitude of a single pulse, by subtracting the contribution of past memory from the speech signal and also minimizing the mean-squared weighted error. Step 3: Compute new error signal by subtracting the contribution of the pulse just determined. Thus continue these steps till the mean-squared weighted error is reduced below a desired threshold level. The advantage here is that the “multi-pulse excitation is able to follow rapid changes in speech waveform, such as those occurring in rapid transitions” [8]. Different methods of estimating the amplitudes and locations of pulses in the excitation sequence (signal) are discussed later in this section. (iv) Objective in MPE-LPC model [54]: The key objective in the multi-pulse excitation (MPE) LPC model of excitation is to find (i) a pulse sequence u[n], and (ii) a set of filter parameters {ak } used in (3.3). The u[n] and {ak } are found in such a way that a perceptually weighted means-squared error (MSE) e¯2 [n] is minimized w.r.t. the reference signal s[n] [54]. The synthesized signal s˜[n] in MPE-LPC model similar to (3.5),is given by s˜[n] = p X ak s˜[n − k] + u[n] (3.13) k=1 where, p is predictor order. 
The filter coefficients {ak } and the excitation signal u[n] are determined to minimize the error e¯2 [n] (similar to (3.6)) given by X (s[n] − s˜[n]) (3.14) e¯2 [n] = n Finding suitable {ak } and u[n] for minimum e¯2 [n] is a highly nonlinear and difficult problem. Different approaches to solve this problem, that mostly involve first determining the LPC parameters {ak } and then the excitation pulse sequence u[n], are discussed in next two sub-sections. 47 3.4.2 Methods for estimating the amplitudes of pulses Methods to determine pulse amplitude [8, 15, 140, 61] can be categorised as follows. (a) Sequential approach (no re-optimization), that uses correlation type analysis involving sequential pulse placement. (b) Re-optimization after all pulse positions are determined, which involves covariance type analysis and block edge effect. (c) Re-optimization after each new pulse position is determined, that involves jointly optimal approach and Cholesky decomposition technique to avoid square-root operations. (A) Sequential pulse placement method (no re-optimization) In the sequential pulse placement method (involving the ‘correlation’ type analysis) [15], each analysis frame is divided into blocks, in order to determine the multi-pulse excitation. Assume that each block-size is of N samples, and there are Np excitation pulses for each block. Further, assuming that the desired excitation sequence is d[n] (or dn ) as mentioned in (3.8) and the first pulse is placed at position m, then the mean-squared weighted error for the block is given by e¯2 = X (dn − Am hn−m )2 (3.15) n where, hn−m , i.e., h[n − m] is causal impulse response of bandwidth-expanded synthesis all-pole filter H(γz) in (3.7) for the impulse located at m, and Am is amplitude of pulse at position m. Optimal pulse amplitude is obtained by differentiating (3.15) w.r.t. Am and minimizing the error e¯2 (i.e., e¯2 → 0) [15]. e¯2 = X (d2n − 2dn Am hn−m + A2m h2n−m ) (3.16) n By differentiating it w.r.t. Am , we get 2 X dn hn−m = 2 Am X hn−i hn−j n n Am = P d h P n n n−m n hn−i hn−j (3.17) In the (3.17), by denoting the vector of cross-correlation terms in numerator by αm and the matrix of correlation terms in denominator by φij , we get the optimal pulse amplitude Aˆm [15], as follows: αm = X dn hn−m (3.18) hn−i hn−j (3.19) n φij = Aˆm = X n αm φmm 48 (3.20) Now, by substituting the value for optimal amplitude Aˆm [15] at mth location from (3.20), into expression for weighted mean-squared error e¯2 in (3.16), we get X αm 2 X X αm 2 ¯ 2 h2n−m hn−m + e = dn − 2 dn φ φ mm mm n n n 2 X αm αm = d2n − 2 αm + φmm (by using (3.18) and (3.19)) φmm φmm n X α2 (3.21) e¯2 = d2n − m φmm n Hence, the weighted mean-squared error now depends on only the position (m) of the pulse. The best 2 m position (m) for a pulse is given by that value of m, for which φαmm is maximum [15]. The optimal position for the next pulse is given by the expression for new sequence {d′n }, as d′n = dn − Aˆm hn−m (3.22) By putting the new value of d′n in (3.18), we get ′ αm = αm − Aˆm ˆ φmm ˆ (3.23) ′ are new values of d and α , respectively, computed after determining amplitude and Here, d′n and αm n m position of first pulse. Likewise, positions and amplitudes can be found sequentially for all pulses. This cross-correlation maximization approach is computationally more efficient than exhaustive search [15]. (B) Re-optimization after determining ‘all’ pulse positions In the re-optimization after determining ‘all’ pulse positions, covariance type analysis is used [15]. 
In the autocorrelation form of analysis the limits of error sum e¯2 in (3.15) are used from −∞ to +∞. The residual {rn }, referred in (3.8), is assumed to be windowed such that it is zero outside the signal block of N samples. The part of impulse-response of the synthesis filter H(z) that falls outside the N sample block, is taken into account in autocorrelation type analysis. The autocorrelation term φij in (3.19) is of the form of Topliz matrix, in which only the first row of values need to be determined. Here, the optimal pulse amplitude (Am ) does not depend upon the pulse position m. Rather, it depends on best position (values of m), for which |αm | is maximized and φmm is minimized in the error expression in (3.21). The pulse amplitude is determined by jointly-optimal pulse amplitudes. The mean square error (MSE) after getting the pulse positions up to mi for all np (i.e., Np ) pulses, is given by e¯2 = X n (dn − np X Ami hn−mi )2 (3.24) i=1 Expanding this expression, we get e¯2 = X n d2n −2 X n (dn np X i=1 Ami np np XX X hn−mi ) + ( Ami hn−mi . Ami hn−mi ) n 49 i=1 i=1 (3.25) Now, differentiating this expression w.r.t. all pulse amplitudes Ami , we get 2 X (dn n np np X i=1 np np XX X hn−mi ) = 2 ( hn−mi . Ami hn−mi ) n i=1 i=1 np np XX X X X ( hn−mi . hn−mi . Ami ) = (dn hn−mi ) n i=1 n i=1 (3.26) i=1 Replacing in (3.26) now the cross-correlation terms αmi like in (3.18), and correlation terms φmi mi like in (3.19), we get the following set of simultaneous equations: φ m1 m1 φ m2 m1 .. . φ mi m1 .. . φ m np m 1 φ m1 m2 φ m2 m2 .. . ··· ··· .. . φ m1 mj φ m2 mj .. . φ mi m2 .. . ··· .. . φ mi mj .. . φ m np m 2 · · · φ m np m j ··· ··· .. . φ m 1 m np φ m 2 m np .. . Aˆ1 Aˆ2 .. . · · · φ m i m np Aˆmi .. .. .. . . . ˆ Amnp · · · φ m np m np α1 α2 .. . = α mi . .. α m np (3.27) where, Aˆmi is optimal amplitude at position mi , and np (Np ) is number of pulses in block of N samples. Solution to this set of simultaneous equations can be obtained by using Cholesky decomposition of the correlation matrix which consists of elements φij . Two forms of pulse-amplitude re-optimization can be used - (i) after determining ‘all’ pulse positions, (ii) after determining ‘each’ pulse position. In this method [15], the optimal pulse amplitude Aˆ is determined by using φij in (3.19) and (3.20), as a matrix of covariance terms. The effect of the part of impulse response (to all-pole synthesis filter) falling outside the block of N samples (for mean-squared error e¯2 ) is ignored. For closely spaced pulses the successive optimization of individual pulses is inaccurate. Additional pulses are required to compensate for these inaccuracies introduced [173]. Hence, re-optimization of ‘all’ Np pulses is required to solve it. (C) Re-optimization after determining ‘each’ pulse position A variation of above method is to locate a pulse at any stage (mi ) by jointly-optimizing its amplitude, using amplitudes of all pulses located up to that position mi [173]. In this re-optimization, all terms for pulse amplitude, position and error are minimized after determining the location of each pulse [15, 173, 8]. In the sequential method [8], the amplitudes and locations of pulses are determined in successive stages, re-optimizing for one pulse at a time by minimizing the weighted mean-squared error e¯2 [n]. 
But, in the case of closely spaced pulses some additional pulses are required in the same pitch-period to compensate for the inaccuracies introduced, using which the pulse amplitudes can be re-optimized at each step of the sequential procedure instead of all Np pulses together. In this method [173], the amplitude of a pulse at location mi is jointly optimized using amplitudes of all pulses determined up to that stage, keeping amplitudes of all pulses optimal at every (ith ) stage. The result of Cholesky decomposition for Np pulses is computed by adding one row to the triangular 50 matrix determined for np − 1 pulses, without disturbing the amplitude (results) of previously determined pulses [15]. Hence, a new element is computed without disturbing the elements determined up to that stage. This method reportedly gave improvements in the SNR up to 10 dB in some speech segments [173]. 3.4.3 Methods for estimating the positions of pulses The search for the best pulse-location in the sequence of excitation pulses involves some comparisons, each of which has computational complexity or cost equal to an addition operation. Different methods have been attempted to minimize that cost. Major few of these are discussed in this section. (A) Pulse correlation method In pulse correlation method [15], the best pulse-position m for a pulse in the excitation is obtained using the correlation terms. It is the location at which the pulse amplitude Am in (3.17) is optimal, i.e., it is Aˆm (as given by (3.20)), and the weighted mean-squared error e¯2 (as in (3.15) or (3.21)) is minimum. The impulse response of the bandwidth-expanded all-pole synthesis filter dies-off quickly, due to bandwidth expansion factor [15]. Hence, this part of the impulse response can be truncated. In the autocorrelation form of analysis, the correlation term (φij ) is generated by filtering the {hn } (i.e., {h[n]}) using recursive form of bandwidth expanded all-pole synthesis filter and (3.22). In covariance form of multi-pulse analysis, the correlation term {φij } is defined recursively [15], as φi−1, j−i = φij + hN −i hN −j (3.28) The initial cross-correlation φij in this recursion in (3.28) can be computed either: (i) by direct computation using (3.20), or (ii) by filtering the {dn } using all-pole model for bandwidth-expanded synthesis filter [15]. The cross-correlation terms are updated using (3.23). The pulse optimization in (3.27) uses a modified form of Cholesky decomposition, thus avoiding the square root operations. But, the computational cost here is large memory requirement to store full coefficient matrix of size Np x Np elements. (B) Pitch-interpolation method In pitch-interpolation method [140], the pulse-position is obtained by interpolating the pitch-period so as to minimize the weighted mean-squared error. In the multi-pulse excitation model [8], the excitation signal is a combination of pulses that excites a synthesis filter to produce synthesized speech s˜[n]. Hence, sequential sub-optimum pulse search method, i.e., re-optimization after determining each pulse (discussed in section 3.4.2(C)) was considered a better solution [15, 140], than the sequential method or re-optimization after determining all pulses. This method is based on analysis-by-synthesis approach. Synthesis filter parameters {ak } are computed from the original speech using LPC analysis. The error weighting filter H(γz) is used to reduce the perceptual distortion. 
The pulse is determined in such a way that the weighted mean-squared error e¯2 [n] given by (3.14), is minimized. The pulse search methods based upon this error-power minimization require maximum cross-correlation αm , as given in (3.18). A simplified method of maximum cross-correlation αm gives the optimum location mi of the ith pulse, which is determined by searching the maximum absolute point of amplitude Am for the pulse at 51 location mi . In this method [140], the pulse amplitude Am at location m is given by Ami = αhs (mi ) − Pi−1 j=1 Amj . φhh (|mj − mi |) φhh (0) , 1 ≤ mi , mj ≤ N (3.29) where, N is number of samples in block, αh (mi ) is cross-correlation between weighted speech (s[n]) and weighted impulse-response (h[n − m]) of all-pole synthesis-filter, φij is autocorrelation of weighted impulse response (h[n − m]), and Am are amplitudes of previously determined pulses up to ith location. Like in (3.18) and (3.19), the correlation terms αhs and autocorrelation terms φhh are given by αhs (mi ) = X s[n] h[n − mi ] (3.30) hn−mi hn−mj (3.31) n φhh (ij) = X n Since, searching exhaustively for all possible locations of pulses, i.e., for n = 1 to M would be computationally expansive, different computationally efficient methods are used. In this method [140], the excitation signal (u[n]) for the voiced speech segments is represented by a small number of pulses (i.e., multi-pulse) in each pitch-period. Each excitation signal frame consists of several pitch-periods. The original speech (s[n]) in a frame of size 20 ms is divided into several subframes of durations of successive pitch-periods. Synthesis filter parameters are interpolated in a pitchsynchronous manner, to get smooth change in the spectral characteristics of synthesized speech [140]. Several pitch-periods are searched for selecting one suitable pitch-period. Using this chosen duration, the pitch-interpolator reproduces the excitation signal for other pitch-periods in the frame by performing a linear interpolation. Usually 4 pulses are considered in a pitch-period, for sampling frequency (fs ) of 8 KHz. By exciting a synthesis filter with this excitation signal, the synthesized speech (˜ s[n]) is produced that approximates the original speech (s[n]) in the frame. (C) SPE-CELP method In the single-pulse excitation (SPE)-CELP method [61], a single-pulse instead of multi-pulse is used in a pitch-period of voiced speech. The conventional CELP coding [165] does not provide appropriate periodicity of pulses in synthesized speech, especially for bit-rates below 4 Kbits/sec. It is because, both the small code-book size and the coarse quantization of gain factor cause large fluctuations in the spectral characteristics between two periods. In order to achieve smoothness of spectral changes, the excitation consisting of a single-pulse of fixed or slowly varying shape for each period was proposed [61]. First a LP coder classifies speech into periodic and non-periodic intervals. Then, non-periodic speech is synthesized like that in CELP [140] coding, and periodic speech using single-pulse excitation. This coder uses an algorithm for determining the pitch markers within short blocks of periodic speech. The excitation for the all-pole synthesis filter in CELP coding [165] is modeled as a sum of two vectors, an adaptive codebook that contains past excitation, and a fixed stochastic codebook. Selection of both vectors uses the criterion of minimum perceptually-weighted error between original and reconstructed speech [165, 61]. 
Here, repeating the past excitation is necessary for obtaining the periodic 52 excitation. But, the stochastic code-book significantly reduces the ability to produce a periodic excitation. Also, the stochastic code-book vectors of fixed block-size cause large fluctuations in non-smooth spectrum of reconstructed speech, thereby giving rise to noise in this case [61]. Hence, these problems in CELP coding are solved using a pulse-like signal with fixed or slowly time-varying shape, i.e., a pulse defined by a single delta-impulse or a cluster of several delta-impulses is considered for representing the excitation in each period. Since, a single-pulse excitation can be better described by the time-location of each pulse, along with its shape and gain, the parametric representation of a single pulse also enables the interpolation of excitation parameters. This method using both (i) a fixed delta-impulse shape of singlepulse excitation to synthesize periodic speech and (ii) CELP-like stochastic code-book to synthesize non-periodic speech, is referred to as SPE-CELP coder [61]. Determining the pitch-markers in SPE-CELP coding method [61]: In order to determine the pitch-markers, the speech signal is encoded in coding frames of size of 200 samples, i.e., 25 ms for sampling rate of 8 KHz. Each coding frame is subdivided into 4 sub-frames of 50 samples. Using long-term auto-correlation of the pre-processed speech signal within a window around a sub-frame, the periodic/non-periodic classification is made for each sub-frame. Average pitch period p¯ is determined for each periodic sub-frame. After each non-periodic-to-periodic transition, a sequence of up to 5 periodic sub-frames is created, to form an optimization frame. Pitch markers are determined for each optimization frame, that define the optimal locations for excitation by single-pulse of delta-impulse shape using an error-criterion. The error-criterion includes a cost function f (i, j, k), and is implemented efficiently by using dynamic programming in its optimization procedure [61]. Let us assume that an LPC synthesis filter is excited with a single-pulse at location n with amplitude α, for the coding frame m and maximum SNR Sopt (n). The impulse-response vector is ~h0 = [h(0), ..., h(2N − 1)]T , where N is length of a periodic sub-frame. The error between speech vector ~x = [x(0), x(1), ....x(2N − 1)]T and delayed impulse response vector ~hn multiplied by amplitude α, is ~en = ~x − α ~hn (3.32) where ~hn = [0, . . . 0, h(0), . . . h(2N −1−n)]T and n = 0, . . . , N −1. Minimization of the error-energy w.r.t. pulse amplitude α [61], gives min ~eTn ~en = ~xT ~x − α (~xT ~hn )2 ~hT ~hn (3.33) n αopt (n) = ~xT ~hn ~hT ~hn (3.34) n Sopt (n) = max SN R(n) = 10 log10 α ~xT ~x min ~eTn ~en α (3.35) Here, pulse amplitude αopt (n) and SNR Sopt (n) are computed for each time-instant n, in an optimization frame m. A candidate pitch-marker at ni is represented by a 3-tuple zi =< ni , αopt (ni ), Sopt (ni ) >, 53 where Z = {zi , i = 1, . . . , M }, and M is maximum number of optimal sub-frames in the speech block. For optimum sequence sopt the accumulated cost C is minimized using the cost function f (i, j, k). For the indices Q = {q1 , . . . , qk | qt ∈ [1, M ], nqi > nqi−1 }, the values of sopt , C and f (i, j, k) [61] are: sopt = {zq1 , . . . 
, zqk }, K ≥ 2 # " K X 1 f (i=ql , j=ql−1 , k=ql−2 ) fini (i=q1 ) + C = min s K (3.36) (3.37) l=2 a f (i, j, k) = Sopt (ni ) (ni − nj ) αopt (ni ) + c ln + b ln αopt (nj ) (nj − nk ) (ni − nj ) ni > nj > nk + d ln , p¯(ni ) (3.38) The cost function f (i, j, k) in (3.38), which gives the accumulated cost C, has 4 summation terms. Their purpose is to penalize (i) the candidate with low Sopt (ni ), (ii) inconsistency in amplitude of two successive pulse candidates, (iii) inconsistency in two successive pulse-intervals (ni −nj ) and (nj −nk ), and (iv) a deviation in the pulse interval (ni − nj ) from initial estimate of average pitch period p¯(ni ). The initial cost fini (i=q1 ) in (3.38) is computed as a ni fini (i) = + d ln + ff ix , Sopt (ni ) p¯(ni ) a + ff ix , = Sopt (ni ) ni > p¯(ni ) ni ≤ p¯(ni ) (3.39) where, ff ix is a constant and fini (i=q1 ) = f (i=q1 , j=q0 , k=q−1 + ff ix ). The factors a, b, c and d in (3.38) are determined empirically, for proper weighting of all 4 summation terms. The indices Q = {q1 , . . . , qk } of best sopt (optimal sequence) define locations of the pitch-markers within current frame. In this SPE-CELP method [61], the gain factors for encoding/transmission are jointly-optimized by minimizing a perceptually-weighted mean-squared error criterion like in CELP coding method [165]. The pitch-markers, used for detecting individual periods of periodic speech within coding frames, are implemented using dynamic programming. But, the limitation of this method is poor naturalness of synthesized speech, which due to the fixed-pulse shape used, still sounds buzzy for certain speakers [61]. 3.4.4 Variations in MPE model of the excitation (A) Changing pitch and duration in MPE The use of multi-pulse excitation [8] lead to significant improvement in the quality of synthetic speech, but it did not provide appropriate degree of periodicity of the synthesized signal [61]. Coding delays were also long in it. Single-pulse excitation model did reduce the coding delays also. However, the procedures for changing the pitch were not known for multi-pulse excitation. Hence, two methods were proposed for modifying the length of individual pitch periods that in turn caused changes in the 54 pitch [27]. (i) Linear scaling of the time axis of the multi-pulse excitation, which introduced more distortion due to addition/removal of pitch periods that was required to change the pitch period duration. (ii) Adding or deleting zeros in the excitation pulse sequence, which produced little distortion in the synthetic speech [27]. In the cases where additional pitch periods were created, the amplitudes of the major excitation pulse and the LPC parameters were obtained by interpolation [27]. (B) Post-filtering Other than the speech coding methods discussed so far, approaches like adaptive sub-band coding (SBC) or adaptive transform coding (ATC) were also attempted, which represented the frequencydomain coders. Post-filtering, using the auditory masking properties, was used in these coders [145]. The post-filtering scheme is based on long-term and short-term spectral information of synthesized speech. (i) The long-term correlation represented by pitch parameters gives fine spectral information. (ii) The short-term prediction of LPC coefficients gives global spectral shape information [93]. 
The optimal long-term and short-term predictors are expressed as

H_L(z) = 1 - \beta z^{-M}    (3.40)

H_S(z) = 1 - \sum_{i=1}^{M} q_i z^{-i}    (3.41)

where H_L and H_S are the transfer functions of the long-term and short-term predictors, respectively. The corresponding long-term and short-term post-filters are given by

\frac{1}{P'(z)} = \frac{1}{P(\epsilon^{1/M} z)}, \qquad \frac{1}{H'_S(z)} = \frac{1}{H_S(z/\alpha)}    (3.42)

where 0 ≤ ε < 1 and 0 ≤ α < 1 are the parameters that vary the impulse response of the post-filter between the responses of an all-pass filter (α = ε = 0) and a low-pass filter (α = ε = 1.0). Thus, by varying the parameters α and ε, the degree of noise shaping and signal distortion is changed [145]. For a suitable ε, the post-filtering attenuates the valleys of the comb filter, but the disadvantage is a bandwidth increase governed by the factor α.
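As an illustration of the short-term part of (3.42), the sketch below applies a bandwidth-expanded all-pole post-filter 1/H_S(z/α) to a decoded signal. The coefficient vector, the value α = 0.8, and the use of scipy.signal.lfilter are assumptions made here for illustration, not details taken from [145] or [93].

```python
import numpy as np
from scipy.signal import lfilter

def short_term_postfilter(decoded, q, alpha=0.8):
    """Apply the short-term post-filter 1/H_S(z/alpha), where
    H_S(z) = 1 - sum_i q_i z^-i is the short-term predictor of (3.41).
    Scaling the i-th coefficient by alpha**i moves the poles towards the
    origin, trading noise shaping against spectral (bandwidth) distortion."""
    q = np.asarray(q, dtype=float)
    i = np.arange(1, len(q) + 1)
    denom = np.concatenate(([1.0], -q * alpha ** i))   # coefficients of H_S(z/alpha)
    return lfilter([1.0], denom, decoded)

# usage (hypothetical values): y = short_term_postfilter(decoded_frame, q=short_term_predictor_coeffs)
```

Scaling the i-th predictor coefficient by α^i pulls the poles of the post-filter towards the origin, which is the bandwidth increase by the factor α noted above.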
(C) Secondary pulses and phase changes in MPE

The presence of secondary excitation pulses after the glottal closure, apart from the primary excitation pulses due to the glottal opening/open phase, was also indicated in [8]. Additional pulses were introduced within pitch periods to reduce the inaccuracies introduced by the successive optimization of individual pulses, especially in the case of closely spaced pulses [173]. But these secondary pulses in the multi-pulse excitation did not show periodic behaviour anywhere similar to the periodicity of the input speech [6]. Hence, repetition of a single multi-pulse pattern, selected randomly and used across all voiced segments, was considered for producing synthetic speech [6]. Other variations of the multi-pulse excitation, obtained by changing the pitch period and duration, were also attempted [27]. But it was observed that the secondary pulses in the multi-pulse excitation do not vary systematically from one pitch period to another, even for periodic speech input [26]. The multi-pulse excitation is highly periodic in the lower frequency bands and less periodic in the higher frequency bands [26]. Hence, replacing the original multi-pulse pattern in the excitation with a fixed multi-pulse pattern was proposed in [26]. This fixed multi-pulse pattern was selected randomly from the multi-pulse excitation of a voiced speech segment. The spectral and phase characteristics of the pulses were changed to introduce irregularities in the fine structure of the excitation. This did help in removing the buzzing sound effect present in CELP or SPE-CELP coded speech [26]. Six different phase conditions were examined: zero phase, constant phase, time-varying phase, phase based on frequency-dependent group-delay characteristics, time-varying phase of the first harmonic of the LPC residual, and the original phase [26]. It was observed that introducing these period-to-period irregularities is necessary to provide more naturalness to synthesized speech [26].

3.5 Zero-frequency filtering method

The characteristics of the glottal source of excitation are derived from the acoustic signal using the zero-frequency filtering (ZFF) method [130, 216]. In ZFF, the features of the impulse-like excitation at the glottal source are extracted by filtering the differenced acoustic signal through a cascade of two zero-frequency resonators (ZFRs). Steps in the ZFF method are shown in the schematic block diagram in Fig. 3.1.

Figure 3.1: Schematic block diagram of the ZFF method.

The key steps involved [130] are as follows:

(a) A differenced speech signal s[n] is considered. This preprocessing step removes the effect of any slow (low-frequency) variations introduced during recording of the signal, and produces a zero-mean signal.

(b) The differenced signal s[n] is passed through a cascade of two ZFRs, each of which is an all-pole system with two poles located at z = +1 in the z-plane. The output of the cascaded ZFRs is given by

y_1[n] = -\sum_{k=1}^{4} a_k y_1[n-k] + s[n],    (3.43)

where a_1 = -4, a_2 = 6, a_3 = -4 and a_4 = 1. This is equivalent to a sequence of four successive cumulative sum (integration) operations in the time domain, which leads to a polynomial-like growth/decay of the ZFR output signal.

(c) The fluctuations in the ZFR output signal can be highlighted using a trend removal operation, which involves subtracting the local mean from the ZFR output signal at each time instant. The local mean is computed over a moving window of size 2N + 1 samples. The window size is about 1.5 times the average pitch period (in samples), which is computed using the autocorrelation function of a 50 ms segment of the signal. The output of the trend removal operation is given by

y_2[n] = y_1[n] - \frac{1}{2N+1} \sum_{m=-N}^{N} y_1[n+m],    (3.44)

where N is the number of samples in the half window. The resultant local-mean-subtracted signal is called the zero-frequency filtered (ZFF) signal. An illustration of the ZFF signal (y_2[n]) is shown in Fig. 3.2(c), which is derived from the corresponding speech signal (s[n]) shown in Fig. 3.2(a). The trend build-up is shown in Fig. 3.2(b).

(d) The positive-to-negative going zero-crossings of the ZFF signal correspond to the instants of glottal closure (GCIs), which are also referred to as epochs [130]. The interval between successive epochs corresponds to the fundamental period (T0), the inverse of which gives the instantaneous fundamental frequency (F0) [216].

(e) The slope of the ZFF signal around the epochs gives a measure of the strength of the impulse-like excitation (SoE) [130, 216]. The SoE (denoted by ψ) at an epoch thus represents the amplitude of the impulse-like excitation around that instant of significant excitation (i.e., GCI).

Figure 3.2: Results of the ZFF method for different window lengths for trend removal, for a segment of Noh singing voice. Panels show (a) the input voice signal s[n], (b) the ZF resonator output y_1[n], and the ZFF output y_2[n] for trend-removal window lengths of (c) 9 ms, (d) 7 ms, (e) 5 ms and (f) 3 ms. Epoch locations are indicated by inverted arrows.

As can be observed in Fig. 3.2, the effect of the window length used for trend removal is not significant so long as it is chosen between 1-2 times the pitch period, which is adequate for the case of normal speech. But the role of the window length becomes important when the pitch period is small or is changing rapidly, as is likely to be the case for sounds other than normal speech. Reducing the window length for trend removal, as in Fig. 3.2(d)-(f), may not help in such cases. The need for special signal processing methods and for modifications of the ZFF method is discussed in later chapters, in the context of paralinguistic sounds and expressive voices.
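The following Python/NumPy sketch summarizes steps (a)-(e) above. It is a minimal illustration of the ZFF idea, assuming a known average pitch period for the trend-removal window; the implementation of the resonator cascade via cumulative sums and the simple slope-based SoE estimate are implementation choices here, not prescriptions from [130, 216].

```python
import numpy as np

def zff_epochs(x, fs, avg_pitch_period_s):
    """Zero-frequency filtering sketch: returns the ZFF signal, epoch locations
    (positive-to-negative zero-crossings, as described above), SoE values and F0."""
    s = np.diff(x, prepend=x[0])                              # (a) differenced signal
    y1 = np.cumsum(np.cumsum(np.cumsum(np.cumsum(s))))        # (b) cascade of two ZFRs
    N = int(round(1.5 * avg_pitch_period_s * fs / 2))         # half window; 2N+1 ~ 1.5 pitch periods
    kernel = np.ones(2 * N + 1) / (2 * N + 1)
    y2 = y1 - np.convolve(y1, kernel, mode='same')            # (c) trend removal
    # (d) epochs: positive-to-negative going zero-crossings of the ZFF signal
    epochs = np.where((y2[:-1] > 0) & (y2[1:] <= 0))[0]
    # (e) strength of excitation: slope of the ZFF signal around each epoch
    soe = np.abs(y2[epochs + 1] - y2[epochs])
    f0 = fs / np.diff(epochs)                                 # instantaneous F0 from epoch intervals
    return y2, epochs, soe, f0

# usage (illustrative): y2, epochs, soe, f0 = zff_epochs(signal, fs=10000, avg_pitch_period_s=0.005)
```

The trend-removal window length (here 1.5 times an assumed average pitch period) is the parameter whose choice becomes critical when the pitch changes rapidly, as noted above.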
3.6 Zero-time liftering method

The recently proposed zero-time liftering (ZTL) method is used to capture the spectral features of the speech signal with improved temporal resolution [40, 38]. The method involves multiplying a segment of the signal starting at each sampling instant with a tapering window that gives large weight to the samples near the starting sampling instant, which we call the zero time. The effect of the tapering window function w_1[n] in the time domain is approximately equivalent to integration in the frequency domain. This inference is derived from the fact that the operation of double integration in the time domain is equivalent to filtering with the function (1/(1-z^{-1}))^2, which is the same as multiplying the frequency response of the signal with the frequency response of an ideal digital resonator (1/(1-z^{-1}))^2 with resonance frequency at ω = 0, i.e., the zero frequency. In analogy with zero-frequency filtering [130], the time-domain signal is multiplied with w_1[n] to produce a smoothed spectrum in the frequency domain by integration. Steps involved in the ZTL method are shown in the schematic block diagram in Fig. 3.3.

Figure 3.3: Schematic block diagram of the ZTL method.

The following are the steps involved in extracting the instantaneous spectral characteristics using the ZTL method [40, 38].

(a) Consider the differenced speech signal s[n] at a sampling rate of 10000 samples per second. The differenced signal is used to reduce the effects of slowly varying components introduced in the signal during recording.

(b) Consider a 5 ms segment (i.e., M = 50 samples) at each sampling instant, and append it with N - M (= 1024 - 50) zeros. The segment is appended with a sufficient number of zeros before computing the N-point discrete Fourier transform (DFT), to get an adequate number of samples in the frequency domain for observing the spectral characteristics.

(c) Multiply the N-sample segment with the window function w_1^2[n], where

w_1[n] = 0, for n = 0, and w_1[n] = \frac{1}{4\sin^2(\pi n / 2N)}, for n = 1, 2, \ldots, N-1.    (3.45)

(This gives an approximation to four times integration in the frequency domain, since M << N. Strictly, a window function of 1/n^4 results in the integration operation in the frequency domain; but, as mentioned above, the window function w_1[n] is chosen in analogy with zero-frequency filtering, applied in the frequency domain.) Multiplying the signal with the window function w_1[n] is called the zero-time liftering (ZTL) operation. This window function emphasizes the values near the beginning of the window, i.e., near n = 0, due to its heavy tapering effect. This helps in producing a smoothed function in the frequency domain.

(d) The truncation of the signal at the sampling instant M - 1 in the time domain results in ripple in the frequency domain. This ripple is reduced by using another window w_2[n] = 4\cos^2(\pi n / 2M), n = 0, 1, \ldots, M-1, which is the square of a half-cosine window of length M samples. The shape of this window is not critical to the results, except that it should taper towards the end of the segment.

(e) The squared magnitude of the N-point DFT of the double-windowed signal, i.e., of x[n] = w_1^2[n] w_2[n] s[n], is taken. It is a smoothed spectrum, due to the effect of the equivalent four times integration in the frequency domain.

(f) In order to highlight the spectral features, the numerator of the group delay (NGD) function g[k] of the windowed signal (i.e., of x[n] = w_1^2[n] w_2[n] s[n]) is computed [215, 38]. The group delay function τ[k] of a signal x[n] is given by [212, 75]

τ[k] = \frac{X_R[k] Y_R[k] + X_I[k] Y_I[k]}{X_R^2[k] + X_I^2[k]}, \quad k = 0, 1, 2, \ldots, N-1    (3.46)

where X_R[k] and X_I[k] are the real and imaginary parts of the N-point DFT X[k] of x[n], respectively, and Y_R[k] and Y_I[k] are the real and imaginary parts of the N-point DFT Y[k] of y[n] = n x[n], respectively. The numerator of the group delay function g[k] is given by [75]

g[k] = X_R[k] Y_R[k] + X_I[k] Y_I[k], \quad k = 0, 1, 2, \ldots, N-1    (3.47)

The group delay function has higher frequency resolution than the normal spectrum due to its additive property, i.e., the group delay function of a cascade of resonators is approximately the sum of the squared frequency responses of the individual resonators [212]. Moreover, due to the smoothed nature (cumulative effect) of the spectrum, the numerator of the group delay function displays even higher frequency resolution of the resonances (since the spectral peaks are highlighted more) [75].

(g) The resulting NGD function is differenced twice to highlight the spectral features, i.e., the peaks in the spectrum corresponding to resonances (formants) of the vocal tract system. The double differencing is needed to remove the trend and to observe the spectral features. Note that the differencing operation is performed with respect to the discrete frequency variable (k), whereas the liftering operation in the time domain is equivalent to integration with respect to the continuous frequency variable (ω).

(h) Due to the effect of some spectral valleys, some of the low-amplitude peaks do not appear well in the double-differenced NGD plots. Hence, the Hilbert envelope of the double-differenced NGD spectrum is computed to represent the spectral features better visually [137, 40]. The resulting spectrum is called the HNGD spectrum.

Figure 3.4: HNGD plots through ZTL analysis. (a) 3D HNGD spectrum (perspective view). (b) 3D HNGD spectrum plotted at epoch locations (mesh form). The speech segment is for the word 'stop'.

A 3-dimensional (3D) HNGD plot for a segment of a speech signal is shown in Figure 3.4(a). Note that the HNGD plots are shown for every sampling instant. The HNGD spectrum can also be sliced temporally at every glottal closure instant (epoch), to view it in a 3D mesh form, as shown in Figure 3.4(b). It is the temporal variations in the instantaneous HNGD spectra over the open and closed phases of the glottal pulses that are exploited in this study to discriminate between shouted speech and normal speech, as discussed later in Chapter 5.
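A compact sketch of steps (a)-(h) in Python/NumPy is given below, to make the sequence of operations concrete. The segment length, DFT size and sampling rate follow the values quoted above; the use of numpy.fft and scipy.signal.hilbert, and the double differencing via numpy.diff, are implementation choices for illustration only.

```python
import numpy as np
from scipy.signal import hilbert

def hngd_spectrum(s, n0, M=50, N=1024):
    """Zero-time liftering at sample index n0: returns the HNGD spectrum, i.e., the
    Hilbert envelope of the double-differenced numerator of the group delay function."""
    seg = np.zeros(N)
    seg[:M] = s[n0:n0 + M]                                          # (b) 5 ms segment, zero-padded
    n = np.arange(N)
    w1 = np.zeros(N)
    w1[1:] = 1.0 / (4.0 * np.sin(np.pi * n[1:] / (2.0 * N)) ** 2)   # window of (3.45)
    w2 = np.zeros(N)
    w2[:M] = 4.0 * np.cos(np.pi * np.arange(M) / (2.0 * M)) ** 2    # (d) taper at segment end
    x = (w1 ** 2) * w2 * seg                                        # (c)-(d) double windowing
    X = np.fft.fft(x, N)
    Y = np.fft.fft(n * x, N)
    g = X.real * Y.real + X.imag * Y.imag                           # (f) NGD function, (3.47)
    dd = np.diff(g, n=2)                                            # (g) double differencing
    return np.abs(hilbert(dd))                                      # (h) HNGD spectrum

# usage (illustrative): spec = hngd_spectrum(differenced_signal, n0=epoch_index)
```

Evaluating this at every sampling instant, or only at the epochs obtained from the ZFF sketch above, gives the kind of HNGD plots shown in Fig. 3.4.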
3.7 Methods to compute dominant frequencies

3.7.1 Computing dominant frequency from the LP spectrum

The production characteristics of speech reflect the roles of both the excitation source and the vocal tract system. The vocal tract characteristics are derived from the LP spectrum, obtained using the LP analysis method [112]. The key idea in LP analysis is that each sample x_n of the speech signal x[n] at time instant n is predictable as a linear combination of the previous p samples,

\hat{x}_n = \sum_{k=1}^{p} a_k x_{n-k}    (3.48)

where {x_n} are the speech samples in a given frame, {a_k} are the predictor coefficients and {\hat{x}_n} are the predicted samples. The all-pole filter H(z) given by these coefficients {a_k} in the frequency domain is

H(z) = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}    (3.49)

where p is the order of linear prediction. The energy of the prediction error signal (E_p) is

E_p = \sum_{n} \left( x_n - \sum_{k=1}^{p} a_k x_{n-k} \right)^2    (3.50)

This energy (E_p) is minimized by setting the partial derivative of E_p with respect to each coefficient a_k to zero. The LP spectrum in the frequency domain is obtained from the LPCs {a_k} obtained above [112]. The shape of the LP spectrum represents the resonance characteristics of the vocal tract shape for a frame of the speech signal. An illustration of the LP spectrum for a frame of speech signal is shown in Fig. 3.5. The production characteristics of shouted speech are derived from the LP spectrum in Chapter 5.

Figure 3.5: Illustration of the LP spectrum for a frame of speech signal.
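As an illustration of Section 3.7.1, the following Python sketch computes the LP spectrum from a set of predictor coefficients (using the sign convention of (3.49), which is opposite to that of Section 3.3) and picks the frequencies of its two strongest peaks as F_D1 and F_D2. The FFT grid size and the simple local-maximum peak picking are illustrative assumptions, not the exact procedure used in the thesis.

```python
import numpy as np
from scipy.signal import freqz

def dominant_frequencies(a, fs=10000, n_freq=512):
    """LP spectrum of H(z) = 1/(1 - sum_k a_k z^-k), per (3.49), and the frequencies
    of its two strongest peaks (F_D1, F_D2)."""
    denom = np.concatenate(([1.0], -np.asarray(a, dtype=float)))   # 1 - a1 z^-1 - ... - ap z^-p
    freqs, H = freqz([1.0], denom, worN=n_freq, fs=fs)             # H(e^{jw}) over [0, fs/2)
    lp_spec = 20.0 * np.log10(np.abs(H) + 1e-12)                   # LP spectrum in dB
    # simple local-maximum peak picking on the LP spectrum
    peaks = np.where((lp_spec[1:-1] > lp_spec[:-2]) & (lp_spec[1:-1] > lp_spec[2:]))[0] + 1
    if len(peaks) < 2:
        raise ValueError("fewer than two peaks in the LP spectrum for this frame")
    strongest = peaks[np.argsort(lp_spec[peaks])[::-1][:2]]
    fd1, fd2 = sorted(freqs[strongest])
    return fd1, fd2

# usage (illustrative): fd1, fd2 = dominant_frequencies(a_coeffs, fs=10000)
```

Section 3.7.2 below refines this idea by locating the peaks in the group delay function of the same all-pole filter, rather than in its magnitude spectrum.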
3.7.2 Computing dominant frequencies using the group delay method and LP analysis

The effect of system-source coupling is examined using the first two dominant frequencies F_D1 and F_D2 of the short-time spectral envelope. The features F_D1 and F_D2 of the vocal tract system are derived using the group-delay function [128] of the linear prediction (LP) spectrum [112]. The features F_D1 and F_D2 represent the resonances of the vocal tract system. The steps involved are as follows:

(a) The vocal tract system characteristics are derived using LP analysis [112]. Let a_1, a_2, ..., a_p be the p LP coefficients. The corresponding all-pole filter H(z) is given by

H(z) = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}    (3.51)

For a 5th-order filter, there will be at most two peaks in the LP spectrum, corresponding to the two complex-conjugate pole pairs. The frequencies corresponding to these peaks are called the dominant peak frequencies, and are denoted as F_D1 and F_D2. These may correspond to formants, but not necessarily.

(b) The group delay function τ_g(ω) is the negative first derivative of the phase response of the all-pole filter [128, 129], and is given by

\tau_g(\omega) = -\frac{d\theta(\omega)}{d\omega}    (3.52)

where θ is the phase angle of the frequency response H(e^{jω}) of the all-pole filter. The frequency locations of the peaks in the group delay function give the dominant frequencies (F_D1 and F_D2).

The dominant frequencies F_D1 and F_D2 are derived for each signal frame using pitch-synchronous LP analysis, anchored around the GCIs. These are used mainly in the analysis of acoustic loading effects on glottal vibration in normal speech, and in the analysis of laughter signals, which are discussed in later chapters.

3.8 Challenges in the existing methods and need for new approaches

Representing the excitation information in terms of a sequence of impulse-like pulses was attempted in various speech coding methods, with the aim of reducing the bit-rate of speech coding or increasing the voice quality of synthesized speech. This representation was attempted only for modal voicing in normal speech. To the best of our knowledge, the impulse-sequence representation of the excitation information has not yet been used for analysing emotional speech, paralinguistic sounds and expressive voices. Also, the presence of secondary pulses within a pitch period was indicated in a few studies on multi-pulse excitation [8, 6, 173, 27]. But it would be interesting to explore whether these secondary pulses can help in distinguishing the nonverbal speech sounds from normal speech.
3.8 Challenges in the existing methods and need for new approaches

Representing the excitation information in terms of a sequence of impulse-like pulses was attempted in various speech coding methods, with the aim of reducing the bit-rate of speech coding or increasing the voice quality of synthesized speech. This representation was attempted only for modal voicing in normal speech. To the best of our knowledge, the impulse-sequence representation of the excitation information has not yet been used for analysing emotional speech, paralinguistic sounds and expressive voices. Also, the presence of secondary pulses within a pitch period was indicated in a few studies on multi-pulse excitation [8, 6, 173, 27]. It would be interesting to explore whether these secondary pulses can help in distinguishing the nonverbal speech sounds from normal speech.

In order to capture better the rapid changes in the production characteristics, mainly of the excitation source, some modified or new signal processing methods are required that use a time-domain impulse-sequence representation of the excitation component of the acoustic signal. An impulse-sequence representation of the excitation source that is guided by pitch perception is yet another research challenge. Although the perceptual aspect was considered in CELP coding [8], the use of pitch perception in this representation is rarely explored. A good measure of pitch may also need to be incorporated in the impulse-sequence representation of the excitation signal.

Several methods have evolved for estimating the instantaneous fundamental frequency (F0) and pitch of a speech signal. A further challenge is to estimate F0 (pitch) in regions of aperiodicity, or in regions where the signal is apparently random, since this is the likely scenario in paralinguistic sounds and expressive voices. Further, extracting F0 from the excitation impulse sequence, guided by pitch perception, is another challenging problem. In this thesis, a few new methods are proposed to address some of these issues, and they are used for analysing the nonverbal speech sounds.

3.9 Summary

In this chapter, some standard techniques and a few recently proposed signal processing methods that are used in this thesis work are described. The zero-frequency filtering and zero-time liftering methods, used for extracting the excitation source and the spectral characteristics, respectively, are discussed. Methods for computing the first two dominant frequencies of the resonances of the vocal tract system are also discussed. These methods are used for the analysis of source-system interaction in normal (verbal) speech, and also for analysing shouted speech in the emotional speech category of sounds. But the analysis of the other nonverbal sound categories, in particular the paralinguistic sounds and expressive voices, requires some further specialized signal processing methods, which are newly proposed in this research work. These methods are discussed at the relevant places in further chapters of the thesis.

Chapter 4

Analysis of Source-System Interaction in Normal Speech

4.1 Overview

In this chapter, we examine the changes in the production characteristics of variations in normal (verbal) speech sounds. Mainly, the effects of source-system coupling and of acoustic loading due to source-system interaction, involved in the production of some specific sounds such as trills, are examined. First, the significance of the changing vocal tract system and the associated changes in the glottal excitation source characteristics due to trilling are studied from the perception point of view. In these studies, speech is generated (synthesized) by retaining either the features of the vocal tract system or those of the glottal excitation source of trill sounds. Experiments are conducted to understand the perceptual significance of the excitation source characteristics in the production of different trill sounds. Speech sounds of a sustained trill-approximant pair, and of apical trills produced at four different places of articulation, are considered. Features of the vocal tract system are extracted using linear prediction analysis, and those of the source by zero-frequency filtering. The studies indicate that the glottal excitation source apparently contributes relatively more to the perception of apical trill sounds.
Glottal vibration is the primary mode of excitation of the vocal tract system for producing voiced speech. Glottal vibration can be controlled voluntarily for producing different voiced sounds. Involuntary changes in glottal vibration due to source-system coupling are significant in the production of some specific sounds. A set of six selected categories of sounds with different levels of stricture in the vocal tract is examined. The sound categories are: apical trills, apical lateral approximants, alveolar and velar variants of voiced fricatives, and voiced nasals. These sounds are studied in the vowel context [a], in modal voicing. The acoustic loading effect on the glottal vibration, for each of the selected sound categories, is examined through the changes observed in the source and system characteristics. Both the speech signal and the electroglottograph signal are examined in each case. Features such as the instantaneous fundamental frequency, the strength of impulse-like excitation and the dominant resonance frequencies are extracted from the speech signal using the zero-frequency filtering method, linear prediction analysis and the group delay function. It is observed that a high stricture in the vocal tract, which obstructs the free flow of air, produces a significant loading effect on the glottal excitation.

Figure 4.1 Illustration of stricture for (a) an apical trill, (b) a theoretical approximant and (c) an approximant in reality. The relative closure/opening positions of the tongue tip (lower articulator) with respect to the upper articulator are shown.

The chapter is organized as follows. The relative roles of the source and the system in the production of trills are discussed in Section 4.2. The effect of source-system coupling is examined using analysis by synthesis and perceptual evaluation. The effect of acoustic loading of the vocal tract system on the excitation source characteristics is discussed in Section 4.3. Changes in the production characteristics of six consonant sound types are examined qualitatively from the waveforms, and quantitatively from the features derived from the acoustic and EGG signals. Section 4.4 gives a summary of the chapter, along with its key contributions.

4.2 Role of source-system coupling in the production of trills

4.2.1 Production of apical trills

Trills, a stricture type, involve vibration of a (lower) articulator against another (upper) articulator due to aerodynamic constraints. Trills in which the lower articulator is the tip of the tongue are called apical trills [100]. The tongue tip in apical trills vibrates against a contact point in the dental/alveolar region. Production of an apical trill involves several aerodynamic and articulatory constraints. The aerodynamic constraints are related to the tension at the tongue tip and the volume velocity of air flow through the stricture, both essential for the initiation and sustenance of the apical vibration. The articulatory constraints are related to lingual and vocal tract configuration aspects [100, 40]. Production of apical trills can be characterized by three cyclic actions: (i) repeated breaking of the apical stricture due to the interaction between tongue tension and the volume velocity of air flow, (ii) partial falling of the tongue tip to partially release the positive pressure gradient in the oral cavity, and (iii) recoiling of the tongue tip to meet the upper articulator to form the next event of stricture.
One such closure-opening-closure cycle of the stricture, shown in Fig. 4.1(a), is called a trill cycle. The typical rate of tongue tip trilling, as measured from the acoustic waveform or the spectrogram, is about 20-30 Hz [98, 116]. Two to three cycles of apical trills are common in continuous speech, whereas more than three cycles may be produced in sustained production of the sound [40]. When the lower articulator (tongue tip) does not touch (or tap) the upper articulator completely, the production of the trill is as in Fig. 4.1(b). However, due to aerodynamic and articulatory constraints, the production in this case is mostly as shown in Fig. 4.1(c). This sound is called an approximant. Apical trills are common in some (Indian) languages such as Telugu, Malayalam and Punjabi, whereas approximants are more common in languages such as English. The contact point of the upper articulator, against which the tongue tip vibrates, can be in different regions, such as bilabial, dental, alveolar and post-alveolar. These are referred to in this study as 'trill sounds produced at different places of articulation'.

Figure 4.2 (Color online) Illustration of waveforms of (a) input speech and (b) ZFF output signals, and contours of features (c) F0, (d) SoE, (e) FD1 ("•") and FD2 ("◦"), derived from the acoustic signal for the vowel context [a].

4.2.2 Impulse-sequence representation of excitation source

The characteristics of the glottal source of excitation are derived from the acoustic signal using the zero-frequency filtering (ZFF) method [130, 216]. In ZFF, the features of the impulse-like excitation at the glottal source are extracted by filtering the differenced acoustic signal through a cascade of two zero-frequency resonators (ZFRs). Details of the ZFF method are discussed in Section 3.5. An illustration of the ZFF signal (zs[n]) for the vowel context [a] is shown in Fig. 4.2(b), which is derived from the corresponding speech signal (s[n]) shown in Fig. 4.2(a). An illustration of the F0 contour and the SoE impulse sequence is given in Fig. 4.2(c) and (d), respectively. The SoE impulse sequence represents the glottal source excitation, in which the location of each impulse corresponds to an epoch (GCI) and its amplitude indicates the relative strength of excitation around that epoch. The F0 contour reflects the changes in successive epoch intervals.
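A minimal sketch of this epoch, SoE and F0 extraction is given below. It assumes ideal zero-frequency resonators implemented as cumulative sums, a trend-removal window of the order of the average pitch period, and a few trend-removal passes; these choices and the names are illustrative, not a prescription from the thesis.

```python
import numpy as np

def zff_epochs(s, fs, avg_pitch_ms=8.0):
    """Rough zero-frequency filtering sketch: returns epoch locations
    (in samples), SoE values at the epochs, and F0 (Hz) per epoch interval."""
    x = np.diff(s, prepend=s[0])          # difference to remove any DC bias

    # Cascade of two ideal zero-frequency resonators = four cumulative sums
    y = x.copy()
    for _ in range(4):
        y = np.cumsum(y)

    # Trend removal: subtract the local mean over ~one average pitch period,
    # applied a few times (three passes are common in practice)
    win = int(fs * avg_pitch_ms / 1000) | 1          # odd window length
    kernel = np.ones(win) / win
    for _ in range(3):
        y = y - np.convolve(y, kernel, mode='same')

    # Epochs: negative-to-positive zero crossings of the ZFF output
    epochs = np.where((y[:-1] < 0) & (y[1:] >= 0))[0]

    soe = np.abs(y[epochs + 1] - y[epochs])          # slope at the zero crossing
    f0 = fs / np.diff(epochs)                        # instantaneous F0 per interval
    return epochs, soe, f0
```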
Figure 4.3 (a) Signal waveform, (b) F0 contour and (c) SoE contour of the excitation sequence, and (d) synthesized speech waveform (x13), for a sustained apical trill-approximant pair (trill: 0-2 sec, approximant: 3.5-5.5 sec). The source information is changed (system only retained) in the synthesized speech.

4.2.3 Analysis by synthesis of trill and approximant sounds

In order to study the relative significance of the dynamic vocal tract system and the excitation source in the perception of trill sounds, synthetic speech signals are generated by controlling the system and source characteristics of the trill sounds separately. For this, the natural trill sounds are analyzed to extract the source characteristics in terms of the epochs and the strengths of the impulses at the epochs. The dynamic characteristics of the vocal tract shape are captured using LP analysis with a frame size of 20 ms and a frame shift of 5 ms. Four scenarios are considered for the synthesis of trills: the characteristics of (i) both source and system are retained, (ii) only the source is retained, (iii) only the system is retained, and (iv) both source and system are changed. Perceptual evaluation of the synthesized speech in each of these four scenarios of selective retention is carried out.

Changes in the glottal excitation source characteristics are made by first disturbing the fundamental frequency (F0) information and then the amplitude information. For each epoch, the next epoch is located at an interval corresponding to the average of several pitch periods around this epoch, so that the new impulse sequence reflects the averaged pitch period information. The amplitude of each impulse corresponds to the average of the SoE around that epoch. This new impulse sequence is referred to here as the 'changed excitation source' information, and it is used as the excitation for generating speech in scenarios (iii) and (iv) of selective retention. The effect of this averaging is shown in Fig. 4.3 for a trill and for an approximant: Fig. 4.3 shows the changed source characteristics, i.e., the F0 and SoE contours of the excitation sequence, in Fig. 4.3(b) and Fig. 4.3(c), respectively. These changes are more evident for the trill (first) sound than for the approximant (second) sound. This can be contrasted with the corresponding F0 and SoE contours of the excitation sequence for the original trill and approximant sounds, shown in Fig. 4.4.

Figure 4.4 (a) Signal waveform, (b) F0 contour and (c) SoE contour of the excitation sequence, and (d) synthesized speech waveform (x11), for the sustained apical trill-approximant pair (trill: 0-2 sec, approximant: 3.5-5.5 sec). Both the system and source information of the original speech are retained in the synthesized speech.

Since the trill cycle is at 20-30 Hz, the LPCs computed using a frame size of 100 ms can be considered as the 'changed characteristics of the vocal tract system'. The changed system characteristics are used for generating speech in scenarios (ii) and (iv) of selective retention. To establish the significance of the coupling between the system and source characteristics (in scenario (i)), the scenario where both the source and system information are changed (i.e., scenario (iv)) is used for comparison. Perceptual evaluation of the four synthesized speech files (one for each of the four scenarios of selective retention), with reference to the original speech file, is carried out using the similarity score criterion given in Table 4.1.
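The 'changed excitation source' construction described above can be sketched roughly as follows, assuming the epochs and SoE values have already been extracted (e.g., by ZFF); the number of averaged periods k and the epoch-lookup details are assumptions made for illustration.

```python
import numpy as np

def changed_excitation(epochs, soe, n_samples, k=5):
    """Build a 'changed excitation source' impulse sequence: each impulse is
    placed at the locally averaged pitch period and carries the locally
    averaged SoE as its amplitude."""
    epochs = np.asarray(epochs)
    soe = np.asarray(soe, dtype=float)
    periods = np.diff(epochs)
    excitation = np.zeros(n_samples)

    t = int(epochs[0])
    while t < n_samples:
        i = min(int(np.searchsorted(epochs, t)), len(periods) - 1)
        lo, hi = max(0, i - k // 2), min(len(periods), i + k // 2 + 1)
        excitation[t] = soe[lo:hi].mean()                  # averaged SoE
        t += max(int(round(periods[lo:hi].mean())), 1)     # averaged pitch period
    return excitation
```

Exciting the LP synthesis filter with either the original or this changed impulse sequence, together with the original or changed LP coefficients, would then give the four synthesis scenarios.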
4.2.4 Perceptual evaluation of the relative role

Two experiments were conducted in this study, each with the four scenarios of selective retention of source/system information. Since most databases of continuous speech have very limited trill data suitable for this study, the required speech sounds of the sustained trill-approximant pair and of the trills at four different places of articulation were recorded with the help of an expert male phonetician.

Experiment 1 was conducted with a trill-approximant pair speech file (x10.wav). From this reference file (x10), the features of the glottal excitation and the vocal tract system are extracted. Using these source and system features, the four synthesized speech files (x11, x12, x13 and x14) are generated [153], one for each of the four scenarios of selective retention of source/system information. Fig. 4.4 shows the sustained apical trill-approximant pair with the F0 and SoE contours of the excitation sequence, and the synthesized speech for scenario (i) (i.e., retaining both the source and system information). Fig. 4.3 shows the changed source characteristics, as in scenario (iii) (i.e., retaining only the system information), for the trill region. The effect of the 'changed source information' can be observed in the F0 and SoE contours of the excitation sequence in Fig. 4.3, as compared to those in Fig. 4.4.

Table 4.1 Criterion for similarity score for perceptual evaluation of two trill sounds (synthesized and original speech)

  Perceptual similarity                Similarity score
  Both sounds very much similar        5
  Both sounds quite similar            4
  Both sounds somewhat similar         3
  Both sounds quite different          2
  Both sounds very much different      1

Table 4.2 Experiment 1: Results of perceptual evaluation. Average similarity scores between the synthesized speech files (x11, x12, x13 and x14) and the original speech file (x10) are displayed.

  x11 vs x10                  x12 vs x10               x13 vs x10               x14 vs x10
  (source, system retained)   (source only retained)   (system only retained)   (source, system changed)
  3.95                        3.48                     2.82                     1.75

Perceptual evaluation is carried out by comparing each of the synthesized speech files (x11 to x14) with the reference original speech (x10). A total of 20 subjects, all speech researchers from the Speech and Vision Lab at IIIT Hyderabad, participated in this evaluation. The subjects were asked to give similarity scores for each of the four synthesized trill-approximant pairs. The averaged scores of the perceptual evaluation for Experiment 1 are given in Table 4.2.

Another experiment (Experiment 2) was conducted with a speech file (x20.wav) consisting of trill sounds corresponding to four different places of articulation, namely, bilabial, dental, alveolar and post-alveolar. From this reference speech file (x20), the features of the glottal excitation and the vocal tract system are extracted. Four synthesized speech files (x21, x22, x23 and x24) are generated, one for each of the four scenarios of selective retention of source/system information. Perceptual evaluation was carried out by comparing each trill sound in a synthesized speech file (x21 to x24) with the corresponding original trill utterance (x20). A similarity score for each of the four places of articulation with respect to the corresponding original speech was obtained. All 20 subjects gave similarity scores for each of the four synthesized speech files (x21, x22, x23 and x24), for each place of articulation. The results of the perceptual evaluation for Experiment 2 are given in Table 4.3.

In Table 4.2, the high average score in column 1 is due to the fact that both the source and system information are retained in the synthesized speech, which is perceptually close to the original speech. The lower average score in column 4 confirms that when both the source and system are changed, the resulting sound is different from the original trill utterance.
Table 4.3 Experiment 2: Results of perceptual evaluation. Average similarity scores between each place of articulation in the synthesized speech files (x21, x22, x23 and x24) and the corresponding sound in the original speech file (x20) are displayed.

  File name: synthesized    Bilabial   Dental   Alveolar   Post-alveolar   Average score
  vs reference speech       trill      trill    trill      trill           (for all 4 trills)
  x21 vs x20                3.15       3.55     3.58       3.13            3.35
  x22 vs x20                2.55       2.90     2.85       2.85            2.79
  x23 vs x20                2.25       2.45     2.30       2.30            2.33
  x24 vs x20                1.20       1.30     1.40       1.40            1.33

The trill sound is perceptually close to an approximant in this case. The lower average score in column 4, in contrast to the high average score in column 1, indicates that the vocal tract system and the glottal excitation source information jointly contribute to the production and perception of trill sounds. The relatively high average score in column 2 (for x12), where only the source information is retained (system information changed), and the relatively low average score in column 3 (for x13), where only the system information is retained (source information changed), are interesting results. These scores indicate the relatively higher significance of the glottal excitation source information in the perception of apical trills.

In Table 4.3, the last column gives the average similarity scores across all four trill sounds. These results are in line with the results of Experiment 1 (Table 4.2). The average scores in row 3 (for x22) in Table 4.3, where only the source information is retained, are relatively higher than the average scores in row 4 (for x23), where only the system information is retained. This pattern is consistent for each of the four places of articulation. It reconfirms the inference drawn from the results of Experiment 1 (Table 4.2) that the source information contributes relatively more than the system information to the perception of tongue tip trills. The relatively higher average scores in the 3rd and 4th columns of Table 4.3 (for the dental and alveolar trills, respectively), especially in row 2 (for x21) where both source and system are retained, also indicate relatively better perceptual closeness of the synthesized dental and alveolar trills to the corresponding natural trill sounds. The lowest average scores in the last row of Table 4.3 (for x24), when both source and system are changed, and the high average scores in row 2 (for x21), when both source and system are retained, are similar to the results of Experiment 1 (Table 4.2). These average scores highlight the fact that there is some amount of system-source coupling in the production and perception of tongue tip trilling. They also indicate that the production of tongue tip (apical) trilling does affect the glottal excitation source, due to its coupling with the vocal tract system.

4.2.5 Discussion on the results

The effect of tongue tip (apical) trilling on the glottal excitation source is indicated by the fact that the system information alone, or the source information alone, is not sufficient for the production and perception of apical trills. Both the source and the system are involved in some coupled way in the production and perception of apical trills, due to the interaction between the aerodynamic and articulatory components. The glottal excitation source appears to contribute relatively more to the perception of apical trills, as indicated by the perceptual evaluation results of Experiments 1 and 2.
Also, the synthesized apical dental and alveolar trills are perceptually closer to the corresponding natural trill sounds. This study can be useful further in the automatic spotting of trills, in the synthesis and modification of trill sounds, and in trill-based discrimination of different languages and dialects. The study can also be helpful in distinguishing trill sounds from approximants from a signal processing point of view, and in understanding the production and perception of apical trill sounds at different places of articulation.

4.3 Effects of acoustic loading on glottal vibration

Speech sounds are produced by excitation of the time-varying vocal tract system. The major source of excitation is the quasi-periodic vibration of the vocal folds at the glottis [48], referred to as voicing. Languages make use of different types of voicing, called phonation types [102, 99]. Among the phonation types, modal voice is considered to be the primary source of voicing in the majority of languages [99]. Both the mode of glottal vibration and the shape of the vocal tract contribute to the production of different sounds [49]. The mode of glottal vibration can be controlled voluntarily for producing different phonation types such as modal, breathy and creaky voices. Similarly, the rate of glottal vibration (F0) can also be controlled, giving rise to changes in pitch. Glottal vibration may also be affected by the coupling of the vocal tract system with the glottis. This change in the glottal vibration may be viewed as an involuntary change.

In this study, we examine the involuntary changes in the glottal vibration due to the effect of acoustic loading of the vocal tract system, for a selected set of six categories of voiced consonant sounds. These categories are distinguished based upon the size, type and location of the stricture along the vocal tract, which is influenced by the manner and place of articulation. Three types of occurrences, namely, single, geminated and prolonged, are examined for each of the six categories of sounds, in modal voicing, in the context of the vowel [a]. The speech signal along with the electroglottograph (EGG) signal [53, 50] is used for the analysis of these sounds. Changes in the system characteristics are analyzed using two features termed dominant frequencies, FD1 and FD2. The dominant frequencies are derived from the speech signal using linear prediction analysis [112] and group-delay analysis [128]. Source features such as the instantaneous fundamental frequency (F0) and the strength of impulse-like excitation (SoE) are extracted from the speech signal using the zero-frequency filtering method [130, 216].

Figure 4.5 Illustration of strictures for voiced sounds: (a) stop, (b) trill, (c) fricative and (d) approximant. The relative difference in the stricture size between the upper articulator (teeth or alveolar/palatal/velar regions of the palate) and the lower articulator (different areas of the tongue) is shown schematically for each case. Arrows indicate the direction of air flow through the vocal tract.

4.3.1 What is acoustic loading?

Studies have indicated that there exists physical coupling between the glottal source and the vocal tract system, i.e., source-system coupling. During the production of some specific speech sounds, such
as ‘high vowels’ and trills, this coupling leads to acoustic loading of the vocal tract system. The air pressure difference across glottis (i.e., between supraglottal and subglottal regions) affects the vibration of the vocal folds at the glottis. It results in the source-system interaction, which seems to manifest as acoustic loading, i.e., changes in the vocal tract resonances affect the glottal vibration characteristics. According to source-filter theory of speech production, speech wave is the response of the vocal tract filter system to one or more sources of excitation [46]. A number of interconnected filter sections are represented by different vocal tract cavities such as pharynx, mouth and nasal cavities. In general, the first formant is associated with the resonance of pharynx (back) cavity and the second formant with mouth (front) cavity [46]. In this study, we focus mainly on the glottal source of excitation. During the open phase of the glottis, the acoustic excitation at the glottis can be represented by a volume velocity source with a relatively high acoustic impedance [182]. Glottal vibrations are related to the pressure across glottis, the configuration of glottis and the compliance of the vocal folds [182]. The study of the effect of constriction in the vocal tract on glottal vibration had indicated that in the case of voiced fricative or stop consonant sounds, a narrow constriction or closure at some point along the length of the vocal tract, may cause substantial effect on the glottal vibration [182]. Similar effect may intuitively be possible in the production of voiced nasal sounds also. Hence, more investigation is needed in the effect of acoustic loading of the vocal tract system on the excitation source characteristics, in relation to the effect of constriction in the vocal tract, for different sounds. 72 In this study, we examine the effect of acoustic loading of the vocal tract system on glottal vibration characteristics for a set of voiced consonant sounds. The different sound categories selected for this study differ in cross-sectional area of stricture, besides place of articulation (i.e., stricture location) [46, 29] during production. Following [29], differences in the stricture for stop, trill, fricative and approximant sounds are schematically represented in Fig. 4.5(a), (b), (c) and (d), respectively. Both, the location of constriction/closure point along the length of the vocal tract, and the air flow and air pressure between the upper and lower articulators, can influence the extent of acoustic loading effects. In the production of apical trill ([r]) sound, the oral stricture opens and closes periodically (as shown in Fig. 4.5(b)), at the rate of 25-50 Hz [116, 40]. This periodic closing/opening of the oral cavity affects the acoustic loading of the vocal tract in each cycle. In recent studies on the production of apical trills, the loading effect of the system on the source characteristics [40] and the role of source-system coupling [122] were examined. In this study, we examine the excitation source characteristics of apical trills ([r]) using electroglottograph (EGG) signals along with the corresponding speech signal. Production of fricatives involves narrow constriction of the vocal tract at some point along its length (Fig. 4.5(c)), which may cause acoustic loading of the vocal tract, and thereby affect the glottal vibration characteristics. 
Different locations of the constriction point along the vocal tract may cause the glottal vibration characteristics to change differently. Two variants of fricatives are examined, namely, alveolar fricative ([z]) and velar fricative ([È]), which involve two different locations for the points of constriction of the vocal tract. In the production of the apical lateral approximant ([l]) sound, the lateral stricture is relatively wide open for the entire steady-state duration (Fig. 4.5(d)), unlike that for [r] sound (Fig. 4.5(b)). If the glottal vibration characteristics of the trill sounds are changed to normal modal vibration, then trills may sound like approximants [122]. Hence, apical lateral approximant ([l]) sounds are examined to understand the differences in their excitation characteristics from those of trills ([r]). Nasal sounds involve closure at some location in the oral tract, while the nasal tract is kept open. Two variants of nasal sounds are examined, namely, alveolar nasal ([n]) and velar nasal ([N]), to study whether the high stricture (nearly closed constriction) along the vocal tract, concurrent with the open nasal tract, has any effect on the glottal vibration. Production of consonants ([r], [l], [z], [È], [n] and [N]) sounds in the context of vowel [a] are considered in this study. The sounds selected for studying the effect of acoustic loading are only representative of a few sound categories. The single, geminated and prolonged occurrences of sounds are included in each category. The analysis of acoustic loading effects is carried out using the geminated occurrence type for each of the six categories of sounds, as in this case the consonants are produced in a sustained manner. The single cases are considered as these are the cases that occur in normal speech, and prolonged cases are studied to examine whether such prolongation has any considerable deviation. 73 4.3.2 Speech data for analysis In natural production, speech sounds are produced as part of one or more syllables of the structure /CV/, /VCV/ or /VCCV/, consisting of vowels (/V/) and consonants (/C/). If the vowel on both sides is in modal voicing, then it is easier to distinguish the vowel/consonant regions for analysis. Consonants in the context of the open vowel [a] are considered in this study. Sometimes, changes in the production characteristics may not be highlighted in a single occurrence of consonant in the vowel context (/VCV/). Hence, sustained production of the consonants is considered. Sustained production of consonants can be either geminated (double) or prolonged (longer than geminated), i.e., in the form of /VCCV/ or /VCC...CV/ sound units, respectively. The distinctive characteristics of consonants may fade sometimes in their prolonged production. Hence, geminated type is analysed in more detail, although data is collected and studied for each of the three occurrence types. Data was collected for the following six categories of voiced speech sounds: (1) apical trill ([r]), (2) alveolar fricative ([z]), (3) velar fricative ([È]), (4) apical lateral approximant ([l]), (5) alveolar nasal ([n]) and (6) velar nasal ([N]). All these sounds are considered in the context of vowel [a] on both sides, in modal voicing. For each category of sound, three types of occurrences are considered: single, geminated and prolonged occurrence. Utterances for each type of each of the 6 categories were repeated 3 times. Thus the data consists of total 54 (=6x3x3) utterances. 
Sustained production of some sounds, such as the velar fricative ([È]), the trill ([r]) or the velar nasal ([N]), is needed for detailed analysis of the effects of acoustic loading. It is also preferable, in general, that "an international standard of phonetic pronunciation norms could be established by reference to a few selected speakers" [48]. Hence, the data was collected in the voice of a male expert phonetician, so as to have reliable and authentic data on the production of these sounds. The data was also collected in the voice of a (less trained) female phonetics research student. Thus, the total data has 108 (= 54 + 54) utterances.

The data was recorded in a sound-treated recording room. Simultaneous recordings of the speech signal and the electroglottograph (EGG) signal [53, 47, 50] were obtained for each utterance. The speech signal was recorded on a digital sound recorder with a high quality condenser microphone (Zoom H4n), kept at a distance of around 10 cm from the corner of the mouth. The EGG signal was recorded using an EGG recording device [121]. The audio data was acquired at a sampling rate of 44100 samples/sec, with 16 bits/sample, and was downsampled to 10000 samples/sec before analysis. The collected data is available for download at the link: http://speech.iiit.ac.in/svldownloads/ssidata/ .

4.3.3 Features for the analysis

(A) Glottal excitation source features

The features of the glottal source of excitation are derived from the speech signal using the zero-frequency filtering (ZFF) method [130, 216]. In ZFF, the features of the impulse-like excitation of the glottal source are extracted by filtering the differenced speech signal through a cascade of two zero-frequency resonators (ZFRs). Details of the ZFF method are discussed in Section 3.5 and Section 4.2.2. An illustration of the F0 contour and the SoE impulse sequence is given in Fig. 4.2(c) and (d), respectively. The SoE impulse sequence represents the glottal source excitation, in which the location of each impulse corresponds to an epoch and its amplitude indicates the relative strength of excitation around the epoch. The F0 contour reflects the changes in successive epoch intervals.

Figure 4.6 Illustration of open/closed phase durations, using (a) EGG signal and (b) differenced EGG signal for the vowel [a].

(B) Vocal tract system features

The vocal tract system characteristics are studied using the first two dominant frequencies, FD1 and FD2, of the short-time spectral envelope. The features FD1 and FD2 of the vocal tract system are derived from the speech signal using the group-delay function [128] of the linear prediction (LP) spectrum [112]. The dominant frequencies FD1 and FD2 are derived for each signal frame using pitch-synchronous LP analysis, anchored around GCIs. An illustration of the FD1 and FD2 contours for the vowel segment [a] is shown in Fig. 4.2(e), in which the features FD1 and FD2 can be seen to be almost steady.

(C) Closed phase quotient (α) from EGG signal

The electroglottograph (EGG) signal [53, 47] and the differenced EGG (dEGG) signal [50] are used for studying the changes in the characteristics of the glottal pulse during the production of speech. Features of the glottal pulse shape are extracted in terms of the opening/closing phase durations of the glottal slit, as shown in Fig. 4.6.
The closed phase quotient, denoted α, is computed for each closing/opening cycle of the glottis [123] as

\alpha = \frac{T_C}{T_C + T_O}    (4.1)

where TC and TO are the closed and open phase durations (in ms), respectively.

(D) Features used for the analysis

In this study, observations from the EGG/speech signals and the derived features are analysed for the six categories of sounds. Both qualitative and quantitative observations are discussed, using the signals and the feature contours. First, qualitative observations are made using the waveforms of the raw/processed signals. Four waveforms, (i) the speech signal, (ii) the EGG signal, (iii) the dEGG signal and (iv) the ZFF output, are used for visual observations and comparisons in each case. Next, quantitative changes are measured from the features extracted from the speech signals, for the six sound categories, each in three occurrence types. In total, four features, (i) F0, (ii) SoE (ψ), (iii) FD1 and (iv) FD2, are used. It is generally observed that changes in the EGG, dEGG and F0 reflect the effect of glottal vibration, changes in FD1 and FD2 reflect the changes in the vocal tract system, and changes in the speech signal waveform, the ZFF output and the SoE may reflect changes in both the source and the vocal tract system characteristics.

Figure 4.7 Illustration of waveforms of (a) input speech signal, (b) EGG signal and (c) dEGG signal, and the α contour, for geminated occurrence of apical trill ([r]) in vowel context [a]. The sound is for [arra], produced in female voice.

4.3.4 Observations from EGG signal

The effect of acoustic loading of the vocal tract on the vibration characteristics during the production of apical trills [r] can be seen better from the changes in the closed phase quotient (α). An illustration of the EGG signal (e[n]), the differenced EGG (dEGG) signal (de[n]) and the α contour for the apical trill sound [r] is shown in Fig. 4.7(b), (c) and (d), respectively. The corresponding acoustic signal (s[n]) is shown in Fig. 4.7(a). Changes in the stricture between the alveolar/palatal region (upper articulator) and the apical region of the tongue (lower articulator), as shown in Fig. 4.5(b), are also reflected in the changes in the feature α, shown in Fig. 4.7(d). These changes are difficult to observe in the EGG signal itself, shown in Fig. 4.7(b).

In the production of trills, the apical stricture gets periodically broken to release the pressure gradient built up in the oral cavity, due to the interaction between the tongue tension and the volume velocity of the air flow from the glottis [29]. The apical stricture is formed again by the recoiling of the apex of the tongue to meet the upper articulator, due to the Bernoulli effect. The pressure gradient across the glottis reduces during the closed phase in the trill cycle, and increases during the open phase. These changes due to the trilling of the tongue tip are manifested as changes in the rate of vibration of the vocal folds and in the excitation strength [40].

Figure 4.8 Illustration of waveforms of (a) input speech signal, (b) EGG signal and (c) dEGG signal, and the α contour, for geminated occurrence of apical lateral approximant ([l]) in vowel context [a]. The sound is for [alla], produced in female voice.
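The closed phase quotient of Eq. (4.1) can be estimated from the dEGG signal roughly along the following lines; the peak-picking thresholds and the F0 range are assumptions, and the names are illustrative.

```python
import numpy as np
from scipy.signal import find_peaks

def closed_phase_quotient(degg, fs, min_f0=60, max_f0=400):
    """Sketch of alpha = TC / (TC + TO) per glottal cycle, Eq. (4.1).
    Closure instants are taken as the large positive peaks of dEGG and
    opening instants as the negative peaks."""
    degg = np.asarray(degg, dtype=float)
    min_dist = int(fs / max_f0)
    gci, _ = find_peaks(degg, distance=min_dist, height=0.3 * degg.max())
    goi, _ = find_peaks(-degg, distance=min_dist, height=0.3 * (-degg).max())

    alphas = []
    for c0, c1 in zip(gci[:-1], gci[1:]):            # one glottal cycle per GCI pair
        opens = goi[(goi > c0) & (goi < c1)]
        if len(opens) == 0 or (c1 - c0) > fs / min_f0:
            continue                                  # skip cycles without a clear opening
        tc = opens[0] - c0                            # closed phase duration (samples)
        to = c1 - opens[0]                            # open phase duration (samples)
        alphas.append(tc / (tc + to))
    return np.array(alphas)
```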
The effect of changes in the air flow and the transglottal pressure gradient on the rate of glottal vibrations is also reflected as changes in the closed phase quotient (α), that can be seen in Fig. 4.7(d). Changes in the closed phase quotient (α) are also helpful in understanding the difference in production characteristics of apical trill [r] and apical lateral approximant [l] sounds. An illustration of the acoustic signal, EGG and dEGG signals, and α contour for apical lateral approximant sound [l] is shown in Fig. 4.8(a), (b), (c) and (d), respectively. Since, the stricture between alveolar/palatal region (upper articulator) and apical tongue region (lower articulator) as shown in Fig. 4.5(c) is wide enough to allow air flow in the vocal tract to pass smoothly, no cyclic changes in the stricture occur in the case of apical lateral approximant [l], unlike those for trill [r]. This difference in strictures for trill ([r]) and approximant ([l]) (shown in Figure 4.5(b) and (c)) can also be observed from the difference in α contours in Fig. 4.7(d) and Fig. 4.8(d). The closed phase quotient (α for [l] (in Fig. 4.8(d)) does not change like that for [r] (in Fig. 4.7(d)), because there is not much change in the rate of glottal vibrations during the production of lateral approximant [l] sounds. Changes in α contours are useful for the analysis of trills ([r]) and for discriminating between trills ([r]) and approximants ([l]). But, α contours are not useful for the study of other sound categories considered ([z], [È], [n] and [N]). Hence, the waveforms of EGG and dEGG signals themselves are considered for the analyses of these sound categories. 77 4.3.5 Discussion on acoustic loading through EGG and speech signals In this section the effects of acoustic loading in the production of different categories of sounds are examined in terms of observed and derived characteristics from the EGG and speech signals. Acoustic loading effects caused by the size, type and location of the stricture in the vocal tract are discussed in detail. The cross-sectional area of the opening at the stricture determines the size of the stricture, which in turn determines the extent of the (high, low, no) stricture. Very narrow to closed constriction in the vocal tract corresponds to high stricture, which occurs, for example, in trills ([r]) and alveolar fricatives ([z]), as in Fig. 4.5(b) and Fig. 4.5(c), respectively. The intermediate case of relatively wider opening corresponds to low stricture, which occurs in the case of the approximant ([l]) (Fig. 4.5(d)) and the velar fricative ([È]). A completely open vocal tract corresponds to no stricture, as in the case of open vowel [a]. The two different types of strictures considered in this study are cyclic (as in [r], in Fig. 4.5(b)) and steady ([z] and [l], in Fig. 4.5(c) and Fig. 4.5(d), respectively). Two different locations of strictures in the vocal tract are considered, namely alveolar (as in [z]) and velar (as in [È]). In addition, the effects of closed vocal tract (high stricture) during production of nasal sounds are considered for any possible acoustic loading effect in terms of the observed and derived characteristics from EGG and speech signals. Two different locations of the stricture for nasals are considered, namely, alveolar nasal ([n]) and velar nasal ([N]). All these categories of sounds are produced in the context of the open vowel [a], where there is no stricture. 
Only geminated utterances of these different categories of sounds are analysed in this section, as gemination of consonants produces a sustained steady state that facilitates study of the effects of acoustic loading, while also eliminating possible effects of the vowel-consonant transition. Fig. 4.9 to Fig. 4.14 show the waveforms of the speech and EGG signals for the six categories of sounds chosen for analysis in this section. Each figure displays, besides the waveforms of the speech, EGG, differenced EGG (dEGG) and ZFF output signals, the contours of two source features, namely, the instantaneous fundamental frequency (F0) and the strength of excitation (SoE), and two system-related features, namely, the two dominant resonance frequencies of the vocal tract (FD1 and FD2). In the following, the acoustic loading effects caused by the strictures in each of the six categories of sounds are examined in detail.

(A) Apical trill ([r])

In the production of the apical trill ([r]) sound, a high stricture is formed due to the narrow opening between the alveolar/palatal region and the apical region of the tongue, as shown in Fig. 4.5(b). This stricture gets broken, releasing the air pressure built up in the oral cavity, and is formed again due to the Bernoulli effect [29]. Thus this stricture is cyclic in nature, due to the opening and closing of the stricture in each trill cycle (Fig. 4.5(b)). The cyclic high stricture affects the rate of vibration of the vocal folds and the strength of excitation [40]. These effects are reflected in the F0 and SoE contours in Fig. 4.9(e) and Fig. 4.9(f), respectively. This is because the pressure gradient across the glottis reduces during the closed phase of the trill cycle, which in turn reduces F0. Thus the acoustic coupling effect of the system on the source is significant in this case. There are also changes in the resonances of the vocal tract system due to changes in the shape of the tract during each trill cycle. These changes are seen as cyclic variations of FD1 and FD2 (Fig. 4.9(g)), where FD1 is higher during the closed phase of the trill cycle. In the case of the trill, the effects of acoustic loading due to the dynamic vocal tract configuration can be seen in the waveforms of the EGG, dEGG and speech signals, shown in Fig. 4.9. The contrast between the steady vowel region and the trill region can be seen in all the signals and in the features derived from them.

Figure 4.9 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal, (c) dEGG signal and (d) ZFF output, and features (e) F0, (f) SoE, (g) FD1 ("•") and FD2 ("◦"), for geminated occurrence of apical trill ([r]) in vowel context [a]. The sound is for [arra], produced in male voice.

(B) Alveolar fricative ([z])

Production of the alveolar fricative ([z]) also involves a narrow opening of the constriction between the upper articulator (alveolar ridge) and the lower articulator (tongue tip), as shown in Fig. 4.5(c). The constriction is narrow enough to produce frication or turbulence. Thus [z] is produced by a high steady stricture, unlike the high cyclic stricture in the case of [r].
There is a pressure build-up behind the constriction, causing a pressure differential across the glottis. Thus, in this case, the acoustic loading effect can be seen in the signal waveforms, as well as in the source and system features derived from the signals. The amplitudes of the speech signal, EGG, dEGG and ZFF output are low relative to the adjacent vowel in these waveforms (Fig. 4.10). The acoustic loading effect results in a lowering of the F0 and SoE values relative to the adjacent vowel region, as can be seen in Fig. 4.10(e) and Fig. 4.10(f), respectively. Due to frication, both the dominant frequencies (FD1 and FD2) show high values compared to those in the vowel region (Fig. 4.10(g)). The acoustic loading effects are similar to those in the trill case ([r]), except that for the alveolar fricative ([z]) the effects are steady (not cyclic).

Figure 4.10 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal, (c) dEGG signal and (d) ZFF output, and features (e) F0, (f) SoE, (g) FD1 ("•") and FD2 ("◦"), for geminated occurrence of alveolar fricative ([z]) in vowel context [a]. The sound is for [azza], produced in male voice.

(C) Velar fricative ([È])

The production of the velar fricative ([È]) involves a steady but relatively lower stricture, due to a more open constriction between the upper and lower articulators than for the alveolar fricative ([z]). Since the constriction area has to be small enough to produce turbulence, this stricture may be termed a steady high-low stricture, and the acoustic loading effects are expected to be similar to those for the alveolar fricative. While there are no significant changes in the EGG signal waveform relative to the adjacent vowel region (Fig. 4.11(b)), changes can be seen better in the waveform of the dEGG signal (Fig. 4.11(c)). Acoustic loading effects can be seen in the derived source information, i.e., the F0 contour (Fig. 4.11(e)) and the SoE contour (Fig. 4.11(f)). However, the changes are less evident in the ZFF signal (Fig. 4.11(d)). The changes in the speech signal waveform for the velar fricative ([È]), relative to that for the vowel [a], can be attributed to the changes in the vocal tract characteristics. The turbulence generated at the stricture is lower in the case of [È] than in the case of [z], because of the wider constriction in the vocal tract for [È]. As a result, FD1 is lower than for the vowel [a] (Fig. 4.11(g)).

Figure 4.11 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal, (c) dEGG signal and (d) ZFF output, and features (e) F0, (f) SoE, (g) FD1 ("•") and FD2 ("◦"), for geminated occurrence of velar fricative ([È]) in vowel context [a]. The sound is for [aÈÈa], produced in male voice.
Accordingly, the frication effect is not as high as in the case of [z], and hence the behaviour of FD1 and FD2 is more vowel-like, in the sense that they are in the same range as for the vowel [a] (Fig. 4.11(g)), unlike for [z], where both FD1 and FD2 are high (Fig. 4.10(g)).

Figure 4.12 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal, (c) dEGG signal and (d) ZFF output, and features (e) F0, (f) SoE, (g) FD1 ("•") and FD2 ("◦"), for geminated occurrence of apical lateral approximant ([l]) in vowel context [a]. The sound is for [alla], produced in male voice.

(D) Approximant ([l])

An apical lateral approximant ([l]) is formed by a closure between the alveolar/palatal region (upper articulator) and the apical tongue region (lower articulator), along with a simultaneous lateral stricture. In this case, the lateral stricture is wide enough, as shown in Fig. 4.5(d), to allow the free flow of air in the vocal tract. Thus the stricture is low, i.e., relatively more open than in the high stricture cases considered so far ([r] and [z]), and also steady, i.e., not cyclic as in the case of the trill [r]. Since there is no significant pressure gradient built up in this case, the acoustic loading effect is negligible in comparison with the high stricture cases. Hence, one does not notice any significant changes in the amplitudes of the waveforms of the speech signal, EGG, dEGG and ZFF output relative to the adjacent vowel regions, as can be seen in Fig. 4.12(a) to Fig. 4.12(d), respectively. There are no major changes in the excitation features either, such as in the F0 and SoE contours in Fig. 4.12(e) and Fig. 4.12(f), respectively. However, due to the wider lateral opening, the corresponding change in the shape of the vocal tract affects the first two dominant resonance frequencies (Fig. 4.12(g)): FD1 is reduced and FD2 is increased, relative to the values in the neighbouring vowel region. This shows that if the stricture is not high, the acoustic loading effects are negligible.

Figure 4.13 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal, (c) dEGG signal and (d) ZFF output, and features (e) F0, (f) SoE, (g) FD1 ("•") and FD2 ("◦"), for geminated occurrence of alveolar nasal ([n]) in vowel context [a]. The sound is for [anna], produced in male voice.

(E) Alveolar nasal ([n]) and velar nasal ([N])

Nasal sounds are produced with a complete closure of the vocal tract at some location in the oral cavity, and a simultaneous flow of air through the nasal tract, which is facilitated by the velic opening. Here, the constriction along the vocal tract is like a high stricture case, but due to the coupling of the nasal tract there is hardly any obstruction to the egressive flow of air.
Hence, nasal sounds are considered in this study, to examine whether the high stricture in the vocal tract has any acoustic loading effect on the glottal excitation. Two variants of the high stricture along the vocal tract are considered, corresponding to two different locations, namely, the alveolar nasal ([n]) and the velar nasal ([N]).

Figure 4.14 (Color online) Illustration of waveforms of (a) input speech signal, (b) EGG signal, (c) dEGG signal and (d) ZFF output, and features (e) F0, (f) SoE, (g) FD1 ("•") and FD2 ("◦"), for geminated occurrence of velar nasal ([N]) in vowel context [a]. The sound is for [aNNa], produced in male voice.

Fig. 4.13 and Fig. 4.14 show the waveforms and other features for [n] and [N], respectively. Due to the absence of an acoustic loading effect on the glottal vibration, there are no visible changes in the EGG and dEGG waveforms in relation to the adjacent vowel. Also, there is hardly any significant change in the F0 contours (Fig. 4.13(e) and Fig. 4.14(e)), indicating that the glottal vibration is not affected. However, there is a reduction in the amplitude of the speech signal waveform in both the cases of [n] and [N], as shown in Fig. 4.13(a) and Fig. 4.14(a), respectively. This is primarily due to the narrow, constricted (turbinated) path of the nasal tract, especially at the nares. The effect of this constriction can also be seen in the significantly lower amplitudes of the ZFF output (Fig. 4.13(d) and Fig. 4.14(d)) and the SoE contour (Fig. 4.13(f) and Fig. 4.14(f)), as compared to the adjacent vowel [a]. As expected, the resonance frequency due to the nasal tract coupling is significantly lower than for the vowel, as can be seen in the changes in the FD1 contours in Fig. 4.13(g) and Fig. 4.14(g), for [n] and [N], respectively.

Table 4.4 Comparison between sound types based on stricture differences for geminated occurrences. Columns (a)-(d) are qualitative observations (using waveforms) and columns (e)-(h) are quantitative observations (using features). Abbreviations:- alfric: alveolar fricative [z], vefric: velar fricative [È], approx/appx: approximant [l], frics: fricatives ([z], [È]), alnasal: alveolar nasal [n], venas: velar nasal [N], stric: stricture, H/L indicates relative degree of low stricture. Legend:- X: mostly evident, •: sometimes/less evident, ◦: rarely/not evident.

  Sl.  Categories of sounds    Main causes for difference in    (a)   (b)   (c)    (d)    (e)  (f)  (g)  (h)
  #    considered              effects of acoustic loading      s[n]  e[n]  de[n]  zs[n]  F0   ψ    FD1  FD2
  1.   trill vs vowel          cyclic high stric: [r];          •     •     •      •      X    X    X    •
       ([r] vs [a])            steady no stric: [a]
  2.   alfric vs trill         steady high stric: [z];          X     •     X      X      X    X    X    •
       ([z] vs [r])            cyclic high stric: [r]
  3.   vefric vs alfric        steady Hlow stric: [È];          •     •     X      •      •    ◦    X    ◦
       ([È] vs [z])            steady high stric: [z]
  4.   approx vs vefric        steady low stric: [l];           •     ◦     •      ◦      •    X    •    X
       ([l] vs [È])            steady Hlow stric: [È]
  5.   approx vs vowel         steady low stric: [l];           •     ◦     ◦      ◦      ◦    ◦    X    •
       ([l] vs [a])            steady no stric: [a]
  6.   trill vs approx         cyclic high stric: [r];          •     X     X      •      X    X    X    •
       ([r] vs [l])            steady low stric: [l]
  7.   frics vs trill/appx     high strics: [z],[r];            •     •     X      •      •    X    X    ◦
       ([z],[È] vs [r],[l])    H/L low strics: [È],[l]
  8.   nasals vs vowel         nasal low stric: [n],[N];        X     ◦     •      X      ◦    X    X    •
       ([n],[N] vs [a])        steady no stric: [a]
  9.   nasals vs approx        nasal low stric: [n],[N];        •     ◦     ◦      •      ◦    X    •    X
       ([n],[N] vs [l])        steady low stric: [l]
  10.  alnasal vs venas        nasal high stric: [n];           ◦     ◦     •      ◦      ◦    ◦    X    •
       ([n] vs [N])            nasal Hlow stric: [N]
In fact, FD2 is also lower in both cases, but this change is clearly visible in the case of [N] (Fig. 4.14(g)). In summary, in the case of the nasal sounds ([n] and [N]), though there is a complete closure in the oral cavity, the high stricture in the vocal tract does not cause any acoustic loading effect on the glottal excitation. However, there are significant changes in the speech signal, ZFF signal, SoE, FD1 and FD2, relative to the adjacent vowel. These changes are primarily due to the narrow constriction in the nasal tract. There are no significant changes in the source characteristics of the alveolar ([n]) and velar ([N]) nasal sounds.

In Table 4.4, comparisons among the different sound categories and the vowel context ([a]) are made, based upon the level of stricture in the vocal tract. In each case, the difference in the stricture size, type and location in the vocal tract that causes the difference in the acoustic loading effect is highlighted. The signal waveforms and the derived features that are mostly/sometimes/not affected for each sound type are marked, to provide a comprehensive view.

Table 4.5 Changes in the glottal source features F0 and SoE (ψ) for the 6 categories of sounds (in male voice). Column (a) is F0 (Hz) for the vowel [a], (b) and (c) are F0min and F0max for the specific sound, and (d) is ΔF0/F0[a] (%). Column (e) is the SoE (i.e., ψ) for the vowel [a], (f) and (g) are ψmin and ψmax for the specific sound, and (h) is Δψ/ψ[a] (%). Sl.# indexes the 6 categories of sounds; suffixes a, b and c in the first column indicate single, geminated and prolonged occurrences, respectively. Note: 'alfric'/'vefric' denotes alveolar/velar fricative and 'alnasal'/'venasal' denotes alveolar/velar nasal.

  Sl.#  Sound     Sound        (a)    (b)    (c)    (d)     (e)   (f)   (g)   (h)
        category  symbol       F0[a]  F0min  F0max  ΔF0(%)  ψ[a]  ψmin  ψmax  Δψ(%)
  1a    trill     [ara]        112.1  88.5   117.7  26.00   .820  .237  .987  91.34
  1b    trill     [arra]       111.1  85.8   118.3  29.26   .665  .119  .753  95.31
  1c    trill     [arr...ra]   111.4  89.4   118.5  26.06   .734  .146  .634  66.50
  2a    alfric    [aza]        111.4  95.9   117.0  18.91   .617  .074  .751  109.6
  2b    alfric    [azza]       110.6  94.6   116.3  19.59   .509  .057  .566  99.96
  2c    alfric    [azz...za]   111.5  95.6   117.7  19.85   .641  .075  .740  103.8
  3a    vefric    [aÈa]        112.7  111.1  117.7  5.81    .813  .627  .943  38.90
  3b    vefric    [aÈÈa]       112.2  110.9  119.1  7.34    .608  .393  .735  56.23
  3c    vefric    [aÈÈ...Èa]   112.1  111.7  117.7  5.30    .798  .538  .957  52.57
  4a    lateral   [ala]        114.9  117.6  119.1  1.22    .787  .889  .933  5.67
  4b    lateral   [alla]       113.3  114.0  119.8  5.11    .720  .819  .903  11.64
  4c    lateral   [all...la]   114.6  112.8  120.1  6.42    .616  .582  .707  20.24
  5a    alnasal   [ana]        112.7  114.7  118.6  3.48    .744  .299  .814  69.09
  5b    alnasal   [anna]       113.0  113.4  119.1  5.08    .748  .304  .861  74.39
  5c    alnasal   [ann...na]   115.7  113.2  117.7  3.85    .793  .251  .813  70.81
  6a    venasal   [aNa]        112.7  114.3  119.0  4.15    .722  .311  .862  76.30
  6b    venasal   [aNNa]       114.8  115.0  119.8  4.20    .828  .331  .879  66.14
  6c    venasal   [aNN...Na]   115.6  117.1  119.9  2.46    .748  .315  .880  75.59

4.3.6 Quantitative assessment of the effects of acoustic loading

The effects of acoustic loading examined in the previous section are for the geminated cases of the six sound categories, where the production of the specific sound is sustained.
The cross-sectional area of a stricture is not expected to be significantly affected by the duration of the sound, whether it is a single, geminated or prolonged occurrence. It would be interesting to observe changes in the features for single and prolonged occurrences of these six sound categories, relative to the geminated occurrences. The degree and nature of changes in the vibration characteristics of the vocal folds due to acoustic loading of the vocal tract system are examined in this section using the average values of the features for single and prolonged occurrence types, along with geminated occurrences of the six sound categories.

Changes in the features F0, SoE, FD1 and FD2 for the six categories of sounds in the context of the vowel [a] are examined, in terms of the average values computed in two ways. First, the average values of the features are computed over glottal cycles (at GCIs) in the regions of the consonant or the vowel context, demarcated manually, for each of the three occurrence types (single, geminated and prolonged). These are discussed using the average values given in Tables 4.5 and 4.6, for source and system features, respectively. Second, changes in the features are examined by computing the average values over the three types of occurrences (single, geminated and prolonged) for each sound category. These are discussed using the average values given in Tables 4.7 and 4.8.

Table 4.6 Changes in vocal tract system features FD1 and FD2 for 6 categories of sounds (in male voice). Column (a) is FD1 (Hz) for vowel [a], (b) and (c) are FD1min and FD1max for the specific sound, and (d) is ∆FD1/FD1[a] (%). Column (e) is FD2 (Hz) for vowel [a], (f) and (g) are FD2min and FD2max for the specific sound, and (h) is ∆FD2/FD2[a] (%). Sl.# are the 6 categories of sounds. Suffixes a, b and c in the first column indicate single, geminated or prolonged occurrences, respectively. Note: 'alfric'/'vefric' denotes alveolar/velar fricative and 'alnasal'/'venasal' denotes alveolar/velar nasal.

Sl.#  Category  Sound symbol   (a) FD1[a]  (b) FD1min  (c) FD1max  (d) ∆FD1/FD1[a] (%)  (e) FD2[a]  (f) FD2min  (g) FD2max  (h) ∆FD2/FD2[a] (%)
1a    trill     [ara]          761         525         1499        128.1                2022        1377        3506        105.3
1b    trill     [arra]         763         402         1837        188.0                2006        1655        3933        113.6
1c    trill     [arr...ra]     791         397         1882        187.7                2399        1863        3793        80.5
2a    alfric    [aza]          856         278         2723        285.6                2892        3612        4451        29.0
2b    alfric    [azza]         844         227         2873        313.6                3092        3678        4538        27.8
2c    alfric    [azz...za]     886         288         2749        277.7                3250        3875        4536        20.4
3a    vefric    [aÈa]          735         363         913         74.9                 3195        3245        3752        15.9
3b    vefric    [aÈÈa]         867         342         968         72.2                 3131        3062        3884        26.3
3c    vefric    [aÈÈ...Èa]     889         359         1031        75.6                 3114        3182        3855        21.6
4a    lateral   [ala]          804         562         732         21.3                 2301        1748        3725        85.9
4b    lateral   [alla]         862         480         652         20.0                 2361        2349        4229        79.6
4c    lateral   [all...la]     815         361         495         16.4                 2774        2589        3746        41.7
5a    alnasal   [ana]          1137        327         1410        95.3                 3483        2454        4010        44.7
5b    alnasal   [anna]         1103        241         1387        103.9                3440        2491        3523        30.0
5c    alnasal   [ann...na]     1084        267         1240        89.7                 3513        2842        3694        24.3
6a    venasal   [aNa]          1177        261         1316        89.6                 3335        2438        3925        44.6
6b    venasal   [aNNa]         1119        244         1149        80.8                 3277        2778        3650        26.6
6c    venasal   [aNN...Na]     1118        214         1195        87.7                 3329        2713        3689        29.3

Changes in F0 and SoE in comparison to those for the vowel context [a] are given in Table 4.5. The average values of F0 for vowel [a], minimum F0 and maximum F0 (i.e., F0[a], F0min and F0max) for each sound category are given (in Hz) in columns (a), (b) and (c), respectively. The values are rounded off to a single decimal.
The percentage change in F0 relative to F0[a], i.e., ∆F0/F0[a] (= (F0max − F0min)/F0[a] × 100 %), is given in column (d). Likewise, the values of the feature SoE (denoted as ψ) are given in columns (e)-(h). The features are normalized relative to those for the vowel context (i.e., F0[a] and ψ[a]), to facilitate comparison across sound categories. Since the number of points (GCIs) for the feature values is small in some cases, such as single or geminated occurrences of the trill ([r]) sound, the range of deviation in each feature is computed using the minimum and maximum values, rather than the standard deviation.

Table 4.7 Changes in features due to the effect of acoustic loading of the vocal tract system on the glottal vibration, for 6 categories of sounds (in male voice). Columns (a)-(d) show percentage changes in F0, SoE, FD1 and FD2, respectively. The direction of change in a feature in comparison to that for vowel [a] is marked with a +/- sign. Note: 'alfric'/'vefric' denotes alveolar/velar fricative and 'alnasal'/'venasal' denotes alveolar/velar nasal.

Sl.  Sound category  IPA symbol   (a) ∆F0 (%)  (b) ∆ψ (%)  (c) ∆FD1 (%)  (d) ∆FD2 (%)
1    trill           [r]          -27.1        -84.8       +167.9        +99.8
2    alfric          [z]          -19.5        -104.5      +292.3        +25.7
3    vefric          [È]          +6.1         -49.2       -74.2         +21.2
4    lateral         [l]          +4.3         +12.5       -19.2         +69.1
5    alnasal         [n]          +4.1         -71.4       -96.3         -33.0
6    venasal         [N]          +3.6         -72.7       -86.0         -33.5

Significant changes in F0 and SoE in comparison to the vowel context [a] can be observed for apical trill ([r]) and alveolar fricative ([z]), in columns (d) and (h) of Table 4.5. A dip in F0 for the alveolar fricative ([z]) is due to acoustic loading of the nearly closed vocal tract (needed to produce the frication noise for the sound [z]) on the vibration of the vocal folds. The strength of excitation (ψ) is reduced for [z] due to the constriction in the vocal tract. A sharp change (a dip) in SoE for both nasals ([n] and [N]) can be observed from column (h). This is because of the constriction in the nasal tract, and not due to acoustic loading on the glottis, as is the case for [r] or [z]. The absence of acoustic loading for nasals is evident from the negligible changes in the F0 values.

In Table 4.6, the average values of FD1 for the vowel [a] (FD1[a]), minimum FD1 (FD1min) and maximum FD1 (FD1max) for the six sound categories are given in columns (a), (b) and (c), respectively. Percentage changes in FD1 for these sounds relative to FD1[a] (for the vowel context), i.e., ∆FD1/FD1[a] (= (FD1max − FD1min)/FD1[a] × 100 %), are given in column (d). Likewise, the average values of FD2 are given in columns (e)-(h). The values of FD1 and FD2 are rounded off to the nearest integer, and the percentage changes to a single decimal.

Large changes can be observed in FD1 for the trill [r] and the alveolar fricative [z], relative to the vowel [a]. In comparison, the changes in FD1 for the velar fricative ([È]) and the lateral approximant ([l]) are relatively low. The changes in FD1 for the nasals ([n] and [N]) are significant, due to lowering of the first formant by the nasal tract. Changes in FD2 are high mainly for the trill ([r]) sound, as can be seen in column (h).

A summary of the percentage changes in F0, SoE, FD1 and FD2, relative to those for the vowel [a], for the six categories of sounds in the male voice is given in Table 4.7. The average values of the changes in these features, computed across the three types of occurrences, are given for each sound category.
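As a worked check of the deviation measures used in Tables 4.5 and 4.6, the definitions restated above can be applied to the single-occurrence trill [ara] row of Table 4.5 and Table 4.6, using the rounded values listed there:

```latex
\frac{\Delta F_0}{F_{0[a]}}\times 100\%
  = \frac{F_{0\max}-F_{0\min}}{F_{0[a]}}\times 100\%
  = \frac{117.7-88.5}{112.1}\times 100\% \approx 26.0\%,
\qquad
\frac{\Delta F_{D1}}{F_{D1[a]}}\times 100\%
  = \frac{F_{D1\max}-F_{D1\min}}{F_{D1[a]}}\times 100\%
  = \frac{1499-525}{761}\times 100\% \approx 128.0\%.
```

These reproduce the tabulated 26.00 % and 128.1 % up to rounding of the displayed entries; the entries in Tables 4.7 and 4.8 are the corresponding signed averages over the three occurrence types.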
In each case, the relative increase or decrease (i.e., the direction of change) in the average values of these features, in comparison to those for the vowel [a], is marked as (+) or (-), respectively. The significant changes are in F0, due to the acoustic loading effect on the glottal vibration, and in FD1, due to changes in the system characteristics. The summary table clearly illustrates the changes for the different sound categories, as discussed before.

Table 4.8 Changes in features due to the effect of acoustic loading of the vocal tract system on the glottal vibration, for categories of sounds in female voice. Columns (a)-(d) show percentage changes in F0, SoE, FD1 and FD2, respectively. The direction of change in a feature in comparison to that for vowel [a] is marked with a +/- sign. Note: 'alfric'/'vefric' denotes alveolar/velar fricative and 'alnasal'/'venasal' denotes alveolar/velar nasal.

Sl.  Sound category  IPA symbol   (a) ∆F0 (%)  (b) ∆ψ (%)  (c) ∆FD1 (%)  (d) ∆FD2 (%)
1    trill           [r]          +15.5        -49.1       +72.4         +30.9
2    alfric          [z]          -29.4        -116.2      +273.3        +43.8
3    lateral         [l]          +2.9         +35.0       -51.7         +17.1
4    alnasal         [n]          +6.6         +46.4       -108.7        -21.0
5    venasal         [N]          +6.6         +22.5       -72.4         -26.3

The summary of the changes in the features F0, SoE, FD1 and FD2 for a subset of the data collected for a female voice is given in Table 4.8, in columns (a), (b), (c) and (d), respectively. Tables 4.7 and 4.8 show some differences. The extent of the changes in F0, SoE, FD1 and FD2 for [r] seems to be less for the female speaker. The signs of the changes in SoE (∆ψ) for both nasals ([n] and [N]), as compared to the vowel context [a], are also different for the two speakers. This could possibly be related to the higher average pitch of the female voice in comparison to the male voice. Another reason for the differences between Tables 4.7 and 4.8 could be that the data for Table 4.7 was obtained from an expert phonetician, whereas the data for Table 4.8 was obtained from a research scholar with basic training in phonetics.

4.3.7 Discussion on the results

In this preliminary study, we have examined the effect of acoustic loading of the vocal tract system on the vibration characteristics of the vocal folds. Involuntary changes in the glottal vibration in the production of some specific categories of sounds are examined, which are due to acoustic loading of the vocal tract system and system-source interaction. A selected set of six categories of sounds is considered for illustration. The sounds differ in the size, type and location of the stricture in the vocal tract. We have considered features describing the glottal vibration source and the vocal tract system to demonstrate the effect of system-source coupling. Further, this study concentrates only on sounds uttered in the context of the vowel [a]. Single, geminated and prolonged occurrences are examined for each sound category.

The speech signal along with the EGG signal was studied for each case. Changes were examined in the amplitudes of the waveforms of the speech signal, EGG, dEGG and ZFF output, and in the features F0, SoE, FD1 and FD2. The glottal source features F0 and SoE are derived from the speech signal using the zero-frequency filtering method. The vocal tract system characteristics are represented through the two dominant resonance frequencies FD1 and FD2. The acoustic loading effect on the glottal vibration depends on the size, type and location of the stricture in the vocal tract.
The general observation is that the glottal vibration is not affected when there is a relatively free flow of air from the lungs through the vocal tract system, as is seen from the EGG, dEGG and F0 contours for the apical lateral approximant ([l]) and the nasals ([n] and [N]). However, the speech signal waveform and the feature SoE can be affected by changes in both the vocal tract system and the glottal source of excitation. Significant changes occur in the glottal vibration mainly when there is acoustic loading of the vocal tract, as in the case of the apical trill ([r]) and alveolar fricative ([z]) sounds. The stricture in the vocal tract is cyclic/steady high (i.e., the constriction is narrow) in these cases. Glottal vibration is affected less in the case of the velar fricative ([È]), because of the lesser effect of acoustic loading due to the relatively more open constriction in the vocal tract. Associated changes in the dominant resonance frequencies FD1 and FD2 are primarily due to the changing shape of the vocal tract system during the production of these consonant sounds.

4.4 Summary

In this chapter, we have examined the role of source-system interaction in the production of some special sounds in normal (verbal) speech. First, the relative roles of the source and the system, and the source-system coupling effect, in the production of trills are examined using an analysis-by-synthesis approach and perceptual evaluation. Experiments are conducted to understand the perceptual significance of the excitation source characteristics in the production of a sustained trill and approximant pair, and of apical trills produced at four different places of articulation. The glottal excitation source seems to contribute relatively more than the vocal tract system component to the perception of apical trill sounds.

Glottal vibration, which can be controlled voluntarily for producing some sounds, may also undergo significant involuntary changes in the production of some specific sounds. Hence, the effects of acoustic loading on the glottal vibration characteristics are examined for the production of six consonant types of sounds. Apical trills, apical lateral approximants, and alveolar and velar variants of voiced fricatives and voiced nasals are studied in the vowel context [a] in modal voicing. Qualitative observations are made from the waveforms, and quantitative examination is carried out using features derived from both the acoustic and EGG signals. Features such as the instantaneous fundamental frequency, the strength of impulse-like excitation and the dominant resonance frequencies are extracted from the speech signal using the zero-frequency filtering method, linear prediction analysis and the group delay function. Results indicate that a high stricture in the vocal tract, causing obstruction to the free flow of air, produces a significant acoustic loading effect on the glottal excitation, for example in the production of trill ([r]) or alveolar fricative ([z]) sounds.

The study examines the nature of involuntary changes in the glottal vibration characteristics due to the acoustic loading effect, along with associated changes in the vocal tract system characteristics, only for a few sounds. A greater variety of sounds and their variants needs to be studied further. Also, different vowel contexts need to be examined. Different features may also be needed to understand the differences in variations of sounds from the production point of view.
Production of nonverbal speech sounds also involves source-system coupling and involuntary changes in the glottal vibration. Hence, this study should be helpful for the analysis of the nonverbal speech sounds studied in the following chapters.

Chapter 5

Analysis of Shouted Speech

5.1 Overview

Production of shout can be emotional or situational. Shouted speech is normally produced when the speaker is emotionally charged during interaction with other human beings or with a machine. Some emotions, like anger or disgust, may also be related to the production of shouted speech. Production of shouted speech can also be situational, for example to warn, to raise an alarm or to seek help. Irrespective of whether the production of shouted speech is emotional or situational, its production characteristics are expected to deviate significantly from those of the normal speech of a person. Hence, within the emotional speech category of nonverbal speech sounds, shouted speech in particular is considered for detailed analysis.

Most of the changes in the production characteristics of shout are expected to be in the source of excitation. These characteristics are reflected in the speech signal, and they are perceived and discerned well by human listeners. The production characteristics of shout appear to change significantly in the excitation source, mainly due to differences in the vibration of the vocal folds at the glottis. Hence, the production characteristics of shout are examined from both the EGG and speech signals in this study. The effect of coupling of the excitation source with the vocal tract system is examined for four levels of loudness. The source-system coupling leads to significant changes in the spectral energy in the low frequency region relative to that in the high frequency region for shouted speech. Their ratio, as well as the degree of fluctuations in the low frequency spectral energy, reduces for shout as compared to normal speech.

The chapter is organized as follows. The different loudness levels of shouted speech considered for analysis are described in Section 5.2. In Section 5.3, the features related to both the excitation source and the vocal tract system characteristics of shouted speech are described. Section 5.4 gives details of the data collected for analysis in this study. In Section 5.5, the EGG signals for normal and shouted speech are analysed to study changes in the characteristics of glottal vibration during shouting. In Section 5.6, features are derived from the speech signal to discriminate shouted speech from normal speech. Some of these features can be related to the characteristics of the glottal vibration. The analysis results are discussed in Section 5.7. Finally, a summary of this chapter is given in Section 5.8, along with the research contributions.

5.2 Different loudness levels in emotional speech

Production of emotional speech at different volume levels, especially in the case of shout, depends upon the context information. Out of the numerous possible contexts and scenarios, it is relatively easier to investigate the volume level of production in the context of vowel regions, due to the relatively steady behaviour of the vocal tract and excitation features in these regions. Hence, the production characteristics of the shouted speech signal are examined in different vowel contexts in this study. In this study, we examine the production characteristics of shouted speech, also termed shout, which is a kind of extreme deviation from normal speech in terms of loudness level.
In order to understand the characteristics of shout relative to those of normal speech, the loudness levels are further sub-classified into whisper, soft, normal, loud and shout, in increasing order of loudness level [219]. Among these, whisper is mostly unvoiced speech, and the other four, i.e., soft, normal, loud and shout, are voiced speech. Although many subtle variations at intermediate levels of loudness can possibly be produced by people, these four coarse loudness levels in voiced speech are examined in detail from the production point of view, with a focus on discriminating shout from normal speech.

Variation in loudness is used in normal conversation to communicate different meaning, context, semantics, emotion, expression and intent. "We produce differences in loudness by using greater activity of our respiratory muscles so that more air is pushed out of lungs. Very often when we do this we also tense the vocal folds so that pitch goes up" [97]. The vibrations of the vocal folds are caused by the air pressure from the lungs, which pushes them apart and brings them together, repeatedly. In the case of shout it is possible that the vocal folds vibrate faster. This increased rate of vibration perhaps leads to the perception of higher pitch for shouted speech compared to normal speech. Changes in the vibrations of the vocal folds also affect the open and closed phases in each glottal cycle. The degree of these changes in the case of shout, relative to normal speech, is expected to be much larger than the changes in the case of soft or loud speech with respect to normal. There exists coupling between the excitation source and the vocal tract system. This source-system coupling is reflected in the changes in the characteristics of both the excitation source and the vocal tract system.

5.3 Features for analysis of shouted speech

In this study, apart from known features like F0 and signal energy, features like the proportion of the closed phase region within a glottal cycle, the ratio of energies in the low and high frequency regions of the short-time spectrum, and the standard deviation of the temporal fluctuations in the low frequency spectral energy are investigated. Among these, the last feature, namely the temporal fluctuations in the low frequency spectral energy, seems to play a major role in discriminating shout from normal speech. Methods used for extracting these features are described in this section. It is to be noted (as explained in later sections) that while most of these features may be obtained using conventional speech processing methods, such as short-time spectrum analysis and linear prediction analysis, some new methods like ZTL described in this section seem to help in highlighting some of the excitation features better than the conventional methods.

Figure 5.1 (a) Signal waveform (xin[n]), (b) EGG signal (ex[n]), (c) LP residual (rx[n]) and (d) glottal pulse obtained from LP residual (grx[n]) for a segment of normal speech. Note that all plots are normalized to the range of -1 to +1.
Figure 5.2 (a) Signal waveform (xin[n]), (b) EGG signal (ex[n]), (c) LP residual (rx[n]) and (d) glottal pulse obtained from LP residual (grx[n]) for a segment of shouted speech. Note that all plots are normalized to the range of -1 to +1.

(A) Excitation source features

The glottal excitation source features F0 and SoE are extracted from the speech signal using the zero-frequency filtering (ZFF) method [130, 216] discussed in Section 3.5, and are shown in Fig. 4.2.

(B) Glottal pulse shape features

The EGG and the differenced EGG (dEGG) signals are used to study the changes in the features of the glottal pulse for shouted speech. The open and closed phase regions of the glottal cycles of an EGG signal are shown in Figure 4.6. The open and closed phase regions are identified using the differenced EGG signal shown in Figure 4.6(b). Note that the amplitude of the EGG signal is larger during the closed phase region due to the low impedance across the vocal folds.

The features of the glottal pulse are also present in the acoustic signal. But it is not always possible to derive the shape of the glottal pulse from the speech signal using inverse filtering, except in a few cases of clean signals [50, 58, 59]. The linear prediction (LP) residual signal is obtained by passing the speech signal through the inverse filter derived from LP analysis [112]. A 12th order LP analysis on the signal sampled at 10000 samples per second is used. The sequence of glottal pulses is obtained by integrating the LP residual signal twice, as the differenced (preemphasized) signal is used for LP analysis. Figures 5.1 and 5.2 show the speech signal waveform, the EGG signal, the LP residual and the derived glottal pulse, for normal and shouted speech signals, respectively. The glottal pulse shapes obtained from the EGG signals are shown in Figures 5.1(b) and 5.2(b), and those obtained from the LP residual are shown in Figures 5.1(d) and 5.2(d). It can be seen that it is difficult to clearly demarcate the closed and open phase regions of the glottal pulses in Figures 5.1(d) and 5.2(d). On the other hand, the EGG waveforms in Figures 5.1(b) and 5.2(b) are clearer in terms of the closed and open phase regions, as discussed with reference to Figure 4.6(b). Note that improvements in inverse filtering for deriving the glottal pulses do not seem to give the open and closed phase characteristics of the EGG waveform, mainly because of the difficulty in cancelling out the effect of the vocal tract system by inverse filtering [2, 4, 3].
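As a rough illustration of the procedure described in (B) above (LP inverse filtering of the pre-emphasized signal, followed by double integration of the residual), the following is a minimal Python sketch. The frame and hop lengths, the Hamming window, the overlap handling and the crude drift removal after integration are assumptions for illustration, not the exact implementation used in this thesis.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lp_coefficients(frame, order=12):
    """LP coefficients by the autocorrelation method (Toeplitz normal equations)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))           # inverse filter A(z) = 1 - sum_k a_k z^{-k}

def lp_residual(x, fs=10000, order=12, frame_ms=25, hop_ms=10):
    """Frame-wise inverse filtering of the differenced (pre-emphasized) signal."""
    xd = np.append(x[0], np.diff(x))              # differenced signal used for LP analysis
    flen, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    res = np.zeros(len(xd))
    win = np.hamming(flen)
    for start in range(0, len(xd) - flen, hop):
        fr = xd[start:start + flen]
        a = lp_coefficients(fr * win, order)      # estimate A(z) on the windowed frame
        res[start:start + flen] = np.convolve(fr, a, mode="full")[:flen]
        # simple overlap-replace; proper overlap-add would be needed for clean synthesis
    return res

def glottal_pulses(residual):
    """Approximate glottal pulse sequence: integrate the residual twice, since the
    differenced signal was used for LP analysis; linear detrend removes drift crudely."""
    g = np.cumsum(np.cumsum(residual))
    return g - np.linspace(g[0], g[-1], len(g))
```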
(C) Spectral features

The excitation source seems to contribute to the discrimination between shouted and normal speech. The variations due to the excitation source can be captured well if the temporal variations of the spectral features are extracted. Ideally, it is desirable (if possible) to derive the spectral information in the signal around each sampling instant. Spectrograms derived using short-time spectrum analysis of speech are generally used to provide a visual representation of information about both the excitation source and the vocal tract system. The spectrograms do reveal differences between normal and shouted speech, in the energy distributions in the low and high frequency regions, and also in the separation of the pitch harmonics. But it is difficult to isolate the contributions of the excitation source and the vocal tract system effectively in normal spectrograms, due to the trade-off between spectral and temporal resolution. In this study, the ZTL method [40, 38] discussed in Section 3.6 is used to capture the spectral features of the speech signal with improved temporal resolution. A 3-dimensional (3D) HNGD plot for a segment of shouted speech signal is shown in Figure 3.4(a), in which the HNGD values are shown for every sampling instant. The HNGD spectrum sliced temporally at every glottal closure instant (epoch) is shown in Figure 3.4(b), in a 3D mesh form. The temporal variations in the instantaneous HNGD spectra over the open and closed phases of the glottal pulses are exploited in this study to discriminate between shouted and normal speech, as discussed further in Section 5.6.1.

(D) System resonance features

The production characteristics of speech characterise the combined role of both the excitation source and the vocal tract system. The vocal tract resonance characteristics can be derived from the LP spectrum obtained using the LP analysis method, as discussed in Section 3.7. The shape of the LP spectrum represents the resonance characteristics of the vocal tract shape for a frame of speech signal, as illustrated in Fig. 3.5. These resonance features extracted for shouted and normal speech are used further in the analysis in Section 5.6.3.

5.4 Data for analysis

Speech data for this study was collected from a total of 17 speakers (10 males and 7 females), each a researcher in the Speech and Vision Lab at IIIT, Hyderabad. The data was collected in an ordinary laboratory environment, where other members of the laboratory were working during the collection of data. The ambient noise level was about 50 dB. Simultaneous recordings were made using a close speaking microphone (Sennheiser ME-03) and an electroglottograph (EGGs for Singers [121]). The data was recorded at a sampling rate of 48000 samples per second and 16 bits per sample. The data was downsampled to 10000 samples per second for analysis using ZFF, LP and ZTL to derive the HNGD spectra.

Each of the 17 speakers was asked to speak 3 sentences, each in 5 distinct volume (loudness) levels. The 3 sentences are: (i) "Where were you?" (ii) "Don't stop, go on." (iii) "Please help!" The 5 different loudness levels in which each speaker produced utterances for each text are: (a) normal, (b) soft, (c) whisper, (d) loud and (e) shout. Each speaker repeated the utterances twice, and the utterances with the best (judged by listening) 5 distinct levels of loudness were chosen for each speaker for each text. The reason for this choice is that some speakers found it difficult to produce speech at some loudness levels; in spite of the repetition, some speakers still found it difficult to produce soft speech. This is reflected in the objective evaluations described in Sections 5.5 and 5.6. The data in whisper mode was collected only for differentiating it from soft voice while recording. Since whispered speech is unvoiced and/or unaspirated most of the time, that data is not used in this study. Thus the database consists of a total of 204 utterances (17 speakers, 3 sentences and 4 levels of loudness). The text of the utterances was chosen to encompass different vowel contexts.
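A minimal sketch of the data preparation step described above (downsampling the 48 kHz recordings to 10000 samples per second before ZFF, LP and ZTL analysis); the helper name, the file path argument and the amplitude normalization are assumptions for illustration:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

def load_and_downsample(path, target_fs=10000):
    """Read a 48 kHz, 16-bit recording and downsample it to 10 kHz.
    resample_poly applies an anti-aliasing filter internally."""
    fs, x = wavfile.read(path)                  # e.g. fs == 48000
    x = x.astype(np.float64)
    x = x / (np.max(np.abs(x)) + 1e-12)         # normalize to roughly [-1, 1]
    g = np.gcd(int(fs), int(target_fs))         # 48000, 10000 -> up = 5, down = 24
    y = resample_poly(x, target_fs // g, fs // g)
    return target_fs, y
```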
5.5 Production characteristics of shout from the EGG signal

The EGG signal is directly proportional to the current flow across the two electrodes placed on either side of the glottis. Since the resistance encountered is lower during the closed phase of the vocal folds (shown in Fig. 4.6), the current flow is higher than in the open phase of each glottal cycle. The same can also be verified from the dEGG signal, where the location of the positive peak corresponds approximately to the instant of glottal closure.

Shout signals are produced by the vibration of the vocal folds at the glottis. It is possible that the vocal folds vibrate faster, with a changed rate of opening/closing, in the case of shout. This increased rate of vibration, manifested as a shorter period compared to normal, perhaps leads to the perception of higher pitch. It is also possible that, if the vocal folds at the glottis are opening/closing at a changed rate, this is reflected in a corresponding change in the ratio of the open/closed phase region to the period of the glottal cycle for shouted speech. Hence, a comparative study was carried out to examine the glottal pulse characteristics, especially the relative durations of the open/closed phase regions in each glottal cycle, for normal and shouted speech.

Differences in the production characteristics of normal and shouted speech are clearly visible in the EGG/dEGG signals, as shown in Figures 5.1(b) and 5.2(b), respectively. An increase in the instantaneous fundamental frequency (F0) is evident in Fig. 5.2(b). In the EGG waveforms, the closed phase regions are identified automatically as shown in Fig. 4.6. That is, the region between the positive peak and the negative peak for each glottal cycle in the differenced EGG is marked as the closed phase region. The average values of the ratio (α), in percentage (%), of the closed phase region to the period of the glottal cycle for different utterances and by different speakers are computed for soft, normal, loud and shouted speech. These α values are computed over only the voiced regions of the utterance in each case. Then the average α value is computed for each speaker over all 3 utterances (of the 3 texts) at a specific loudness level. The percentage changes (∆α) in the average values of α for soft, loud and shout with respect to normal speech are given in columns (a), (b) and (c) of Table 5.1, respectively, for each speaker.

From these results, it is evident that the change in the ratio (α), i.e., ∆α, generally increases in the case of loud or shouted speech, and is mostly reduced (i.e., negative) in the case of soft voice. In a few cases, such as for speaker 6 (S6(F)), speaker 16 (S16(F)) and speaker 4 (S4(F)) (particularly female speakers), the ∆α value is observed to be positive for soft as compared to normal speech. This may possibly be due to difficulty in the production of soft speech by the speaker. However, for two of these three speakers the values of ∆α for shout are observed to be higher. For one speaker (S16(F)), the ∆α values are lower for shout in comparison with soft voice, as the speaker was not able to control the voice at different loudness levels. In general, it was observed that speakers expressed difficulty in producing speech in soft voice, whereas most speakers felt that it is easier to produce shouted speech.
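The closed-phase ratio α described above can be estimated from the EGG/dEGG signal roughly as follows; the peak-picking thresholds and the cycle-validity checks are assumptions for illustration, not the exact procedure used in this study:

```python
import numpy as np
from scipy.signal import find_peaks

def closed_phase_ratio(egg, fs, f0_min=80.0, f0_max=500.0):
    """Estimate alpha, the ratio of the closed-phase duration to the glottal period,
    from the differenced EGG. Closed phase is taken as the span from the positive
    peak (glottal closure) to the following negative peak (glottal opening)."""
    degg = np.diff(egg)
    min_period = int(fs / f0_max)
    pos_peaks, _ = find_peaks(degg, distance=min_period,
                              height=0.3 * np.max(degg))        # assumed threshold
    neg_peaks, _ = find_peaks(-degg, distance=min_period,
                              height=0.3 * np.max(-degg))
    alphas = []
    for c0, c1 in zip(pos_peaks[:-1], pos_peaks[1:]):            # one glottal cycle per pair
        period = c1 - c0
        if not (fs / f0_max <= period <= fs / f0_min):
            continue                                             # skip implausible cycles
        opening = neg_peaks[(neg_peaks > c0) & (neg_peaks < c1)]
        if opening.size:
            alphas.append((opening[0] - c0) / period)
    return 100.0 * np.mean(alphas) if alphas else np.nan         # alpha in percent
```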
The values for individual speakers and utterances were computed to observe the variability, and then the average values of the features were computed over all utterances for each speaker. The average values across all speakers and utterances for each loudness level indeed show that for shouted speech the proportion of the closed phase in a glottal cycle is much higher than for normal speech, and this and related features are exploited in discriminating shouted speech from normal speech.

5.6 Analysis of shout from the speech signal

As mentioned earlier, it is difficult to derive the features of the glottal pulse shape from the speech signal itself by using an inverse filter. Non-availability of the EGG signal may restrict the application of the α feature in practice. Moreover, the speech signal carries much more information about the excitation than the EGG, due to the effect of the glottal vibration on the pressure of the air from the lungs. In this section we examine methods of deriving features related to the excitation information, and some other features, from the speech signal in order to discriminate shouted speech from normal speech.

Table 5.1 The percentage change (∆α) in the average values of α for soft, loud and shout with respect to that of normal speech, given in columns (a), (b) and (c), respectively. Note: Si below means speaker number i (i = 1 to 17).

Speaker # (M/F)   (a) ∆αSoft (%)   (b) ∆αLoud (%)   (c) ∆αShout (%)
S1 (M)            -13.0            10.1             14.9
S2 (M)            2.0              16.3             29.9
S3 (F)            1.7              9.4              6.1
S4 (F)            9.7              10.7             25.7
S5 (M)            -15.0            7.2              8.1
S6 (F)            18.5             2.2              25.8
S7 (F)            -7.5             13.8             20.3
S8 (M)            -7.2             0.6              8.5
S9 (M)            -2.6             6.3              6.4
S10 (M)           -10.4            11.2             24.1
S11 (F)           -0.7             4.9              1.6
S12 (M)           -12.6            7.0              13.7
S13 (M)           -16.8            -1.0             7.4
S14 (F)           -14.9            9.2              16.9
S15 (M)           -10.6            4.4              5.5
S16 (F)           13.4             5.5              2.2
S17 (M)           -13.4            11.5             25.5
Average           -4.71 %          7.53 %           14.29 %

5.6.1 Analysis from spectral characteristics

Variation in the value of α for speech at different loudness levels does have an effect on the spectral characteristics. Coupling of the sub-glottal cavities with the supra-glottal cavities during the open phase of the vocal folds results in a resonance in the low frequency (< 400 Hz) region, due to the increase in the effective length of the vocal tract. This in turn results in higher energy in the low frequency region for soft and normal speech, compared to loud and shouted speech. The lower energy in the low frequency region for loud and shouted speech is also because of the smaller open region in the glottal cycles in these cases, compared to soft and normal speech.

To examine the effects of excitation from the speech signal, spectral features with good temporal resolution are derived. For this purpose, the HNGD spectrum (as described in Section 3.6) is computed at each sampling instant of time using a window of 5 msec. The HNGD spectrum is normalized by dividing the spectrum values for the speech segment at every instant of time by their sum. The normalized HNGD spectra at every sampling instant of time are shown in Fig. 5.3, for segments of speech in soft, normal, loud and shout modes. The HNGD spectrum plots clearly bring out the distribution of energy in the low frequency region of the speech signal. The higher temporal resolution of the spectral features helps in observing the effects of the open and closed regions of the glottal cycles on the spectrum.
The presence of low frequency (< 400 Hz) spectral energy due to the increased open phase of the glottal cycle can be seen clearly for soft speech (pointed to by arrows in Fig. 5.3).

Figure 5.3 HNGD spectra along with signal waveforms for a segment of (a) soft, (b) normal, (c) loud and (d) shouted speech. The segment is for the word 'help' in the utterance of the text 'Please help!'. Arrows point to the low frequency regions.

The low frequency region has higher energy (darker regions) and larger fluctuations (less uniform distribution of energy) in the case of soft mode in comparison with shout mode. The diminishing low frequency spectral energy is due to the gradual reduction in the open phase for the loud and shout cases, as compared to that for normal speech. Examining these source features through the HNGD spectra of the speech signal appears to be a promising proposition.

The energies in the low (0-400 Hz) and high (800-5000 Hz) frequency bands of the normalized HNGD spectra are averaged over a moving window of 10 msec duration for every sampling instant. The smoothing effect of this window helps to highlight the gross temporal variations in the energies for the different loudness levels. The temporal variations in the low frequency spectral energy (LFSE) and the high frequency spectral energy (HFSE) over the duration of a word, for soft, normal, loud and shouted speech, are shown in Figures 5.4 and 5.5, respectively. It may be observed from these figures that there is a gradual decrease in the average level of the low frequency HNGD energy (the dotted line region in Fig. 5.4), and an increase in the average level of the high frequency HNGD energy (the dotted line region in Fig. 5.5), in the vowel regions for the 4 cases of speech with increasing loudness levels. There is a gradual reduction in the amplitudes of the fluctuations in the LFSE and a gradual increase in the frequency of these fluctuations for the 4 loudness levels, although the spread of these fluctuations is not seen clearly in Figures 5.4 and 5.5 due to the smoothing of the LFSE and HFSE contours.

Figure 5.4 Energy of the HNGD spectrum in the low frequency (0-400 Hz) region (LFSE) for a segment of (a) soft, (b) normal, (c) loud and (d) shouted speech. The segment is for the word 'help' in utterances of the text 'Please help!'. The vowel regions (V) are marked in these figures.

Figure 5.5 Energy of the HNGD spectrum in the high frequency (800-5000 Hz) region (HFSE) for a segment of (a) soft, (b) normal, (c) loud and (d) shouted speech. The segment is for the word 'help' in utterances of the text 'Please help!'. The vowel regions (V) are marked in these figures.

The ratio (β) of the average levels of LFSE and HFSE of the HNGD spectra is computed over the vowel region (marked in Figure 5.4) for different texts and for different speakers. As an illustration, the values of β computed for 2 different vowel contexts (/6/ in the word 'stop' and /e/ in the word 'help') are given for 8 different speakers in Table 5.2, for soft, normal, loud and shouted speech.
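A minimal sketch of the band-energy computation described above, assuming an instantaneous spectral representation (e.g., HNGD spectra from ZTL) is already available as a matrix with one spectrum per sampling instant; the helper name and the smoothing implementation are illustrative assumptions:

```python
import numpy as np

def band_energy_features(spec, freqs, fs, low=(0, 400), high=(800, 5000), smooth_ms=10):
    """spec: array [n_samples, n_bins] of instantaneous magnitude spectra,
    freqs: bin centre frequencies in Hz. Returns smoothed LFSE and HFSE contours,
    their ratio beta (average levels) and sigma, the temporal fluctuation of LFSE."""
    spec = spec / (spec.sum(axis=1, keepdims=True) + 1e-12)    # normalize each spectrum
    lo_band = (freqs >= low[0]) & (freqs < low[1])
    hi_band = (freqs >= high[0]) & (freqs < high[1])
    lfse = spec[:, lo_band].sum(axis=1)
    hfse = spec[:, hi_band].sum(axis=1)
    win = np.ones(max(1, int(fs * smooth_ms / 1000.0)))
    win /= win.size
    lfse_s = np.convolve(lfse, win, mode="same")               # 10 ms moving average
    hfse_s = np.convolve(hfse, win, mode="same")
    beta = lfse_s.mean() / (hfse_s.mean() + 1e-12)             # ratio of average levels
    sigma = lfse_s.std()                                       # fluctuation of LFSE
    return lfse_s, hfse_s, beta, sigma
```

In practice, spec would be restricted to a manually marked vowel region before computing beta and sigma, as done for the values reported in Tables 5.2 and 5.3.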
It may be observed from Table 5.2 that, since the low frequency energy of the HNGD spectrum decreases and the high frequency energy increases, their ratio (β) decreases fairly consistently with increasing levels of loudness. A visual representation of the feature β can be seen in the distribution of HFSE vs LFSE computed over the vowel region (/e/ in 'help'), as shown in Fig. 5.6(a). The distributions of HFSE vs LFSE computed for different vowels (/6/ in 'stop', /u/ in 'you' and /o:/ in 'go') are shown in Figures 5.6(b), (c) and (d). All these figures show distinct clusters for the 4 different levels of loudness.

Figure 5.6 (Color online) Distribution of high frequency spectral energy (HFSE) vs low frequency spectral energy (LFSE) of the HNGD spectral energy computed in 4 different vowel contexts for the 4 loudness levels. The 4 vowel region contexts are: (a) vowel /e/ in the word 'help', (b) vowel /6/ in the word 'stop', (c) vowel /u/ in the word 'you' and (d) vowel /o:/ in the word 'go'. The segments are taken from the utterances by the same speaker (S4).

Standard deviation (σ) values computed for the LFSE are given in Table 5.3, for the 4 different loudness levels and for 2 different vowel contexts (/6/ in the word 'stop' and /e/ in the word 'help'), for 8 different speakers. It can be observed from Table 5.3 that the fluctuations in the LF energy (LFSE) are far smaller in the case of shout than for normal speech. A similar trend is observed for the fluctuations in the HF energy (HFSE) of the HNGD spectrum, although these are not as prominent as in the LFSE. As an illustration, the values of the β and σ features are given in Tables 5.2 and 5.3 for 2 vowel contexts (/6/ in 'stop' and /e/ in 'help') for 8 speakers. Similar trends are observed for other vowel contexts in the data, such as for the vowel /u/ in the word 'you', /i/ in the word 'please' and /o:/ in the word 'go', across all 17 speakers. Thus, these features (β, σ) are useful for discriminating shout from normal speech.

Table 5.2 The ratio (β) of the average levels of LFSE and HFSE computed over a vowel segment of a speech utterance in (a) soft, (b) normal, (c) loud and (d) shout modes. Note: The numbers given in columns (a) to (d) are β values multiplied by 100 for ease of comparison.
Vowel context   Speaker # (M/F)   (a) βSoft x100   (b) βNormal x100   (c) βLoud x100   (d) βShout x100
/6/             S1 (M)            62.9             22.8               10.8             6.7
/6/             S2 (M)            63.8             22.5               11.6             11.4
/6/             S3 (F)            19.4             12.0               5.2              5.0
/6/             S4 (F)            92.8             33.3               11.3             5.7
/6/             S5 (M)            62.6             35.5               26.2             8.6
/6/             S6 (F)            17.4             10.1               5.1              3.7
/6/             S7 (F)            17.6             11.7               6.0              8.5
/6/             S8 (M)            29.9             13.3               17.5             17.9
/e/             S1 (M)            43.9             15.1               6.0              3.6
/e/             S2 (M)            46.2             10.8               11.6             5.6
/e/             S3 (F)            15.1             11.9               8.0              6.3
/e/             S4 (F)            72.8             33.2               10.8             5.5
/e/             S5 (M)            83.2             42.4               30.4             16.6
/e/             S6 (F)            16.0             7.8                3.6              2.0
/e/             S7 (F)            18.1             17.1               5.4              3.0
/e/             S8 (M)            57.9             41.7               26.7             32.2

Table 5.3 Average values of the standard deviation (σ), capturing temporal fluctuations in LFSE, computed over a vowel segment of a speech utterance in (a) soft, (b) normal, (c) loud and (d) shout modes. Note: The numbers given in columns (a) to (d) are σ values multiplied by 1000 for ease of comparison.

Vowel context   Speaker # (M/F)   (a) σSoft x1000   (b) σNormal x1000   (c) σLoud x1000   (d) σShout x1000
/6/             S1 (M)            88.6              35.1                16.7              13.5
/6/             S2 (M)            95.9              44.4                18.0              23.5
/6/             S3 (F)            35.2              16.4                11.3              10.7
/6/             S4 (F)            67.6              27.5                11.9              9.2
/6/             S5 (M)            96.3              42.3                40.5              13.3
/6/             S6 (F)            25.3              15.2                16.4              11.2
/6/             S7 (F)            14.2              9.6                 9.5               9.1
/6/             S8 (M)            72.8              36.2                42.6              32.6
/e/             S1 (M)            86.1              23.6                17.5              5.4
/e/             S2 (M)            71.8              16.0                15.6              8.1
/e/             S3 (F)            30.7              22.9                11.2              15.2
/e/             S4 (F)            66.5              34.1                9.7               6.0
/e/             S5 (M)            93.4              50.4                51.4              29.3
/e/             S6 (F)            31.2              13.2                11.0              4.6
/e/             S7 (F)            20.6              21.9                14.8              6.7
/e/             S8 (M)            69.7              67.3                66.3              72.0

It is tempting to infer that similar discriminating characteristics can be observed in the usual short-time spectra computed using a 5 msec window at every sampling instant. The normalized short-time spectra are computed for the same segments considered in Fig. 5.3. The gross features, such as the increase in the high frequency energy and the pitch frequency with increasing loudness, can be observed in the normalized short-time spectra also. But the finer details of the spectral variations caused by the glottal vibrations, especially in the low frequency region, are not as prominently visible in the short-time spectra as in the HNGD spectra.

The difference in one such feature, namely the temporal variation of the energy in the low frequency band (0-400 Hz) of the normalized HNGD spectra, is considered here for illustration. The low frequency spectral energy (LFSE) is computed at every sampling instant. The spread of the LFSE computed from the HNGD spectra is shown in Fig. 5.7, by plotting the histogram of the LFSE values in the vowel region for each of the four loudness levels. The relative spread of the LFSE is smallest for shouted speech and largest for soft speech. This can also be observed from the spread of the LFSE values in the cluster plots shown in Fig. 5.6. Fig. 5.8 gives the relative spread of the LFSE computed for the same segments using short-time spectra. The square root of the short-time spectra is used to reduce the dynamic range of the spectra. In this case the discrimination among the four loudness levels is not as clear as in Fig. 5.7. A similar difficulty in discrimination was observed in the cluster plots (as in Fig. 5.6) when they are obtained from short-time spectra for the different loudness levels.
Thus, these comparative studies between the HNGD spectrum and the short-time spectrum bring out the importance of the temporal resolution of the spectral features in highlighting the features of excitation, especially the effect of coupling of the excitation source and the vocal tract system.

5.6.2 Analysis from excitation source characteristics

The instantaneous fundamental frequency (F0), representing pitch, is computed from the speech signal using the ZFF method. The average F0 values for each speaker are computed over the voiced regions of the utterances in soft, normal, loud and shout modes. In Table 5.4, the percentage changes in the average pitch frequency (∆F0), relative to that for normal speech, are given in columns (a), (b) and (c) for soft, loud and shouted speech, respectively. Note that while shouted speech is associated with a rise in pitch with respect to the pitch of normal speech, the opposite is not true. A rise in pitch does not necessarily increase the loudness level, nor would audio with raised pitch alone sound like shouted speech. To verify this assertion, speech from 5 speakers (3 males, 2 females) for utterances of 3 different texts was recorded with the pitch raised intentionally, but without letting it sound loud or like a shout. For comparison, the speech was also recorded in normal and shout modes for each speaker. The EGG output was also collected along with the acoustic signal.

Table 5.5 gives the average values of the ratio (α) of the closed phase region to the period of the glottal cycle for normal, high-pitch (non-shout) and shouted speech, in columns (a), (b) and (c), respectively, for the 5 speakers. The corresponding average values of the instantaneous fundamental frequency (F0) are given in columns (d), (e) and (f). It can be observed from Table 5.5 that utterances with raised pitch (F0 in column (e)) result in reduced average values of α (column (b)), as compared to those for normal speech (column (a)). These α values for normal and high-pitched voices are in sharp contrast with the corresponding values for shouted speech, given in column (c) of Table 5.5.

Figure 5.7 Relative spread of the low frequency spectral energy (LFSE) of 'HNGD spectra' computed over a vowel region segment (LFSE(vowel)), for the 4 loudness levels, i.e., soft, normal, loud and shout. The segment is for the vowel /e/ in the word 'help' in the utterance of the text 'Please help!'.

Figure 5.8 Relative spread of the low frequency spectral energy (LFSE) of 'short-time spectra' computed over a vowel region segment (LFSE(vowel)), for the 4 loudness levels, i.e., soft, normal, loud and shout. The segment is for the vowel /e/ in the word 'help' in the utterance of the text 'Please help!'.

The average F0 and SoE values for normal and shouted speech are given in Table 5.6, in columns (d) and (e), and (f) and (g), respectively. These values are given for 5 different vowel contexts from the dataset, as representative cases. The changes in F0 and SoE for shouted speech, i.e., ∆F0 (= F0Shout − F0Normal) and ∆SoE (= SoEShout − SoENormal), are given in columns (h) and (j), respectively. The changes in F0 and SoE, in terms of the percentage of their values for normal speech, are given in columns (i) and (k), respectively.
It may be observed from these results that F0 always increases in the case of shouted speech, with respect to the F0 of normal speech. However, SoE may increase in some cases and may decrease in others, though a significant amount of change in SoE, i.e., |∆SoE|, is another indication of shouted speech.

Table 5.4 The percentage change (∆F0) in the average F0 for soft, loud and shout with respect to that of normal speech, given in columns (a), (b) and (c), respectively. Note: Si below means speaker number i (i = 1 to 17).

Speaker # (M/F)   (a) ∆F0Soft (%)   (b) ∆F0Loud (%)   (c) ∆F0Shout (%)
S1 (M)            -16.7             29.9              84.6
S2 (M)            -8.0              26.2              82.4
S3 (F)            3.6               10.6              17.5
S4 (F)            -6.9              28.4              49.1
S5 (M)            -10.5             10.6              58.4
S6 (F)            -3.8              22.2              53.2
S7 (F)            -11.4             7.6               11.1
S8 (M)            -3.4              16.7              27.6
S9 (M)            -3.6              30.4              68.3
S10 (M)           -2.3              8.5               82.9
S11 (F)           -5.9              3.1               1.9
S12 (M)           -28.5             20.3              38.5
S13 (M)           -3.6              13.1              57.9
S14 (F)           -3.9              17.0              69.7
S15 (M)           -4.0              15.0              43.8
S16 (F)           -7.4              -0.7              1.4
S17 (M)           -1.0              18.0              36.6
Average           -6.94 %           16.29 %           46.24 %

5.6.3 Analysis using the dominant frequency feature

It is generally difficult to see the changes in the characteristics of excitation through spectral features, as these features represent the combined effect of the characteristics of both the vocal tract system and the excitation, and the vocal tract characteristics dominate. It is also difficult to derive the spectral characteristics with temporal resolution good enough to provide discrimination between the closed and open phase regions of glottal vibration [40]. However, the manifestation of the dominating vocal tract characteristics, along with the changes in the excitation characteristics, can be observed in spectral characteristics like the LP spectrum. But the challenge remains how to achieve good temporal resolution, since the LP spectrum is computed only over a frame of the speech signal.

It is observed from the comparison of the LP spectra of corresponding frames of the speech signal for normal and shout modes that there are changes in the locations and magnitudes of the spectral peaks in the case of shouted speech. The locations of the spectral peaks in the LP spectrum, represented in terms of frequencies, indicate the combined effect of the resonance characteristics of the vocal tract as well as of the excitation source characteristics. The locations of these spectral peaks may sometimes be closely related to the formants, but not necessarily always.

Table 5.5 The average values of the ratio (α) of the closed phase to the glottal period for (a) normal, (b) raised pitch (non-shout) and (c) shouted speech, respectively. Columns (d), (e) and (f) are the corresponding average fundamental frequency (F0) values in Hz. The values are averaged over 3 utterances (for 3 texts) for each speaker. Note: Si below means speaker number i (i = 1 to 5). F0 values are rounded to the nearest integer.

Speaker # (M/F)   (a) αNormal   (b) αHighPitch   (c) αShout   (d) F0Normal (Hz)   (e) F0HighPitch (Hz)   (f) F0Shout (Hz)
S1 (M)            0.445         0.368            0.555        151                 416                    264
S2 (F)            0.368         0.337            0.385        219                 430                    313
S3 (M)            0.515         0.303            0.535        160                 502                    293
S4 (F)            0.496         0.461            0.540        289                 622                    395
S5 (M)            0.447         0.397            0.499        116                 336                    211

Table 5.6 Results showing changes in the average F0 and SoE values for normal and shouted speech, for 5 different vowel contexts. Notation: Nm indicates Normal, Sh indicates Shout, S# indicates speaker number, T# indicates text number and M/F indicates Male/Female.
Note: IPA symbols for the vowels in English phonetics are shown for the vowels used in this study.

(a) Vowel context   (b) Word    (c) Speaker, Text (M/F)   (d) F0Nm (Hz)   (e) SoENm   (f) F0Sh (Hz)   (g) SoESh   (h) ∆F0 (Hz)   (i) ∆F0/F0Nm (%)   (j) ∆SoE   (k) ∆SoE/SoENm (%)
/6/                 /stop/      S1,T2 (M)                 115             .5250       213             .2744       98             85.22              -.2506     -47.73
/e/                 /help/      S4,T3 (F)                 230             .7592       378             .2284       148            64.40              -.5308     -69.92
/u/                 /you/       S2,T1 (M)                 181             .4066       278             .8556       97             53.86              .4490      110.43
/oU/                /go/        S6,T2 (F)                 204             .1788       292             .3959       88             43.23              .2171      121.42
/i:/                /please/    S5,T3 (M)                 149             .1790       241             .6472       92             61.84              .4682      261.56

It is also observed from the comparison of the LP spectra that the location of the highest peak of the LP spectrum changes for shouted speech as compared to that for normal speech. This change is more prominent for certain frames, taken at different instants of time, from speech signals in shout and normal modes. The location of the highest peak in the LP spectrum seems to have the dominating effect and characterises the frame of the signal at that particular instant of time. Hence, we have termed it the dominant frequency (FD) in this study. The dominant frequency (FD) at some time instant t in the speech signal can be obtained from the location of the highest peak in the LP spectrum of a signal frame at t, as shown in Fig. 3.5. The FD value derived from the speech signal for each sampling instant provides a production characteristic of the speech signal with good temporal resolution.

Illustrations of the changes in FD for normal and shouted speech are shown in Figures 5.9(d) and 5.10(d), respectively. The changes in the corresponding F0 and SoE contours are also shown in each figure. It may be observed that the changes in the characteristics of the excitation source (F0, SoE), and of the system and the source combined (FD), are significant in certain segments of normal and shouted speech. These changes are more prominent around the vowel regions. The mean and standard deviation of these temporally changing FD values are computed, to help in the detection of shouted speech.

Figure 5.9 Illustration of (a) signal waveform, (b) F0 contour, (c) SoE contour and (d) FD contour for a segment "you" of normal speech in male voice.

Figure 5.10 Illustration of (a) signal waveform, (b) F0 contour, (c) SoE contour and (d) FD contour for a segment "you" of shouted speech in male voice.

The changes in the FD values for corresponding segments of normal and shouted speech are given in Table 5.7, for words in the context of 5 different vowels in English phonetics. The mean and standard deviation (σ) of the FD values for normal speech are given in columns (d) and (e), and for shouted speech in columns (f) and (g). The change in the mean of the FD values, i.e., ∆FD = FDShout − FDNormal, and the percentage change in FD with respect to FDNormal, i.e., ∆FD/FDNormal, are given in columns (h) and (i), respectively. The changes in the ratio of the standard deviation to the mean FD value from normal to shouted speech, i.e., σFD,Shout/FDShout − σFD,Normal/FDNormal, are given in column (j). A 5th order LP analysis is used to obtain the LP spectrum at each time instant.
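A minimal sketch of the FD computation described above (the frequency of the highest peak of a 5th-order LP spectrum computed around each instant); the FFT size, the windowing and the hop handling are assumptions for illustration:

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz

def dominant_frequency(frame, fs=10000, order=5, n_fft=1024):
    """FD of one frame: the frequency of the highest peak of the LP magnitude spectrum."""
    x = frame * np.hamming(len(frame))
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    A = np.concatenate(([1.0], -a))                   # inverse filter A(z)
    w, h = freqz([1.0], A, worN=n_fft, fs=fs)         # LP spectrum = 1 / A(e^{jw})
    return w[np.argmax(np.abs(h))]                    # location of the highest spectral peak

def fd_contour(x, fs=10000, frame_ms=10, hop_samples=1):
    """FD value around every sampling instant; a hop of 1 sample is costly, so a
    larger hop is a practical approximation of a per-sample contour."""
    flen = int(fs * frame_ms / 1000)
    return np.array([dominant_frequency(x[i:i + flen], fs)
                     for i in range(0, len(x) - flen, hop_samples)])
```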
The FD values are derived from the LP spectra for different frame sizes (5, 10 and 20 msec) taken at each instant of time. A frame size of 10 msec is considered the best choice. It may be observed from Table 5.7 that the changes in the mean FD values are consistent across all the vowel contexts. Similar changes are also observed for words in other vowel contexts in the dataset.

Table 5.7 Results showing changes in the dominant frequency (FD) values for normal and shouted speech, for 5 different vowel contexts. Notation: Nm indicates Normal, Sh indicates Shout, S# indicates speaker number, T# indicates text number and M/F indicates Male/Female. Note: IPA symbols for the vowels in English phonetics are shown for the vowels used in this study.

(a) Vowel context   (b) Word    (c) Speaker, Text (M/F)   (d) FDNm (Hz)   (e) σFD,Nm (Hz)   (f) FDSh (Hz)   (g) σFD,Sh (Hz)   (h) ∆FD (Hz)   (i) ∆FD/FDNm (%)   (j) σFD,Sh/FDSh − σFD,Nm/FDNm (%)
/6/                 /stop/      S1,T2 (M)                 775             295               945             193               171            22.04              -17.73
/e/                 /help/      S4,T3 (F)                 538             224               1689            999               1150           213.77             17.61
/u/                 /you/       S2,T1 (M)                 424             230               1448            658               1024           241.28             -8.72
/oU/                /go/        S6,T2 (F)                 859             331               1162            962               303            35.29              44.24
/i:/                /please/    S5,T3 (M)                 595             337               1122            758               526            88.40              10.90

The changes in the range of fluctuations in the FD values, represented by the standard deviation (σ) of the FD values, do not exhibit any definite trend. However, the changes in the mean FD values for shouted speech, as compared to those for normal speech, indicate that a change in the mean FD value could be a key characteristic of shouted speech. Combined evidence of changes in the mean FD along with F0 and SoE may help in the detection of shouts in continuous speech. It may be noted that these discriminating features are relatively simple to derive, and are computationally inexpensive. A decision logic can be devised using these production features for automatic detection of shouted speech, as sketched at the end of this section.

5.7 Discussion on the results

From the above discussion, the following features are useful for discriminating shouted speech from normal speech:

• Instantaneous fundamental frequency (F0)
• Ratio (α) of the closed phase region to the period of the glottal pulse cycle
• Ratio (β) of LFSE to HFSE computed over a vowel region, i.e., βvowel, where LFSE and HFSE are the energies of the normalized HNGD spectra in the low (0-400 Hz) and high (800-5000 Hz) frequency regions of the vowel segments, respectively
• Standard deviation (σ) of the low frequency energy of the normalized HNGD spectrum, computed over a vowel region, i.e., σ of LFSE(vowel)
• Dominant frequency (FD) of the resonances of the vocal tract system

Features of the speech production mechanism that contribute to the characteristics of shout are examined in this study. The production characteristics of shout appear to change significantly in the excitation source, mainly due to differences in the vibration of the vocal folds at the glottis. The consistent increase in the average ratio (α) of the closed phase region to the glottal cycle period for increasing levels of loudness confirms that the vocal folds at the glottis indeed have a longer closed phase at higher levels of loudness. The vocal folds are also more tensed and vibrate faster in the case of shout, thereby leading to the perception of higher pitch. However, a mere increase in pitch does not necessarily indicate that it is due to shout.
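As an illustration of the kind of decision logic mentioned above, a simple voting scheme over the listed features could look as follows; the thresholds and the voting rule are assumptions for illustration only, not a system evaluated in this thesis:

```python
import numpy as np

def is_shout(f0, fd, f0_ref, fd_ref, soe=None, soe_ref=None, beta=None, beta_ref=None,
             f0_rise=1.3, fd_rise=1.3, soe_change=0.3, beta_drop=0.5):
    """Flag a voiced segment as shout when enough of the cues agree: raised mean F0,
    raised mean FD, a large |dSoE| relative to a reference value, and (if available)
    a reduced LFSE/HFSE ratio beta. Reference values come from normal speech of the
    same speaker; all threshold factors here are assumed, not tuned on the database."""
    votes, total = 0, 2
    votes += np.mean(f0) > f0_rise * f0_ref           # raised pitch
    votes += np.mean(fd) > fd_rise * fd_ref           # raised dominant frequency
    if soe is not None and soe_ref is not None:
        total += 1
        votes += abs(np.mean(soe) - soe_ref) > soe_change * abs(soe_ref)   # large |dSoE|
    if beta is not None and beta_ref is not None:
        total += 1
        votes += beta < beta_drop * beta_ref          # reduced low/high band energy ratio
    return votes / total >= 0.5
```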
The low frequency spectral energy (LFSE) decreases and the high frequency spectral energy (HFSE) increases with an increase in the level of loudness, due to the increased proportion of the closed phase within each glottal cycle. The ratio (β) of LFSE to HFSE is therefore lower for shouted speech as compared to that for normal speech. The degree of temporal fluctuations (σ) in the low frequency spectral energy (LFSE) also reduces with increasing loudness level.

5.8 Summary

Features of the speech production mechanism that contribute to the characteristics of shout are examined in this study. Production characteristics of shout appear to change significantly in the excitation source, mainly due to differences in the vibration of the vocal folds at the glottis. The consistent increase in the average ratio (α) of the closed phase region to the glottal cycle period for increasing levels of loudness confirms that the vocal folds at the glottis indeed have a longer closed phase for higher levels of loudness. The vocal folds are also more tense and vibrate faster in the case of shout, thereby leading to the perception of higher pitch. However, a mere increase in pitch does not necessarily indicate that it is due to shout.

The effect of coupling of the excitation source with the vocal tract system is examined for 4 levels of loudness. This coupling leads to a significant change in the spectral energy in the low frequency region relative to that in the high frequency region for shouted speech. The change is consistent in all the vowel regions. The low frequency spectral energy decreases and the high frequency spectral energy increases with increasing loudness level, due to the increased closed phase quotient in each glottal cycle. Hence, the ratio (β) of LFSE to HFSE is lower for shouted than for normal speech. The degree of temporal fluctuations (σ) in the LFSE also reduces with increasing loudness level.

To study the effect of coupling between the system and the source characteristics, it is necessary to extract the spectral characteristics of the speech production mechanism with high temporal resolution, which is still a challenging task. At present the computational complexity of the HNGD spectrum is a limiting factor for practical applications of the proposed features. As a step towards developing a system for automatic shout detection in continuous speech, computing the dominant frequency (FD), which captures the resonances of the vocal tract system, is proposed. The features proposed in this study, along with features like F0 and signal energy, may be useful for detection of shout in continuous speech. These studies may also help in understanding the role of the excitation component in the production of shout-related emotions like anger or disgust. In the next chapter, we examine the role of these characteristics in the production of paralinguistic nonverbal speech sounds such as laughter.

Chapter 6

Analysis of Laughter Sounds

6.1 Overview

Production of variations in normal speech and emotional speech sounds involves the effects of source-system interaction. Also, changes in the characteristics of the excitation source in particular can be controlled voluntarily in these cases. But in the case of paralinguistic sounds like laughter, rapid changes occur in the source characteristics, and the changes are produced involuntarily. Hence, different signal processing techniques are needed for the analysis of this type of nonverbal speech sounds.
Production characteristics of laughter sounds are expected to be different from those of normal speech. In this study, we examine changes in the characteristics of the glottal excitation source and the vocal tract system during the production of laughter. Laughter is analysed at the bout and call levels. Production characteristics of the laughter-speech continuum are analysed in three categories, two of laughter, namely (i) nonspeech-laugh (NSL) and (ii) laughed-speech (LS), and a third, (iii) normal speech (NS), as reference. Only voiced nonspeech-laugh, produced spontaneously, is considered. Data was recorded for natural laugh responses. In each case, both EGG [50] and acoustic signals are examined. Changes in the glottal vibration are examined using features such as the closed phase quotient in glottal cycles [123] and F0, both derived using the differenced EGG signal [58]. Excitation source features are also extracted from the acoustic signal, using a modification of the ZFF method [130] proposed in this study. First, the excitation source is represented in terms of a time-domain sequence of impulse-like excitation pulses. Then, features such as the instantaneous fundamental frequency (F0), strength of impulse-like excitation (SoE) around glottal closure instants, and density of excitation impulses (dI) are derived. Associated changes in the vocal tract system characteristics are examined using the first two dominant frequencies (FD1 and FD2) [124], derived from the acoustic signal using LP analysis [112] and the group delay function [128]. Production features are also derived in terms of the sharpness of the peaks in the Hilbert envelope [170] of the LP residual [114] of the acoustic signal. The decision of voiced/nonvoiced regions [39] uses the framewise energy of the modified ZFF output signal. Parameters are derived to measure the degree of changes and the temporal changes in the production features, which discriminate well the three cases NS, LS and NSL. Performance evaluation is carried out on a database with ground truth.

The chapter is organized as follows. Section 6.2 discusses the details of the data collected for this study. Changes in the glottal source characteristics during production of laughter are examined from the EGG signal in Section 6.3. A modified zero-frequency filtering method to study the excitation source characteristics of laughter sounds is proposed in Section 6.4. Section 6.5 discusses changes in the source and the system characteristics examined from the acoustic signal: excitation source characteristics are derived using the modified zero-frequency filtering method, associated changes in the vocal tract system using LP and group delay analysis, and some production characteristics using the sharpness of peaks in the Hilbert envelope of the LP residual of the acoustic signal. Results of the study are discussed in Section 6.6. A summary and the contributions of this chapter are given in Section 6.7.

6.2 Data for analysis

Laughter consists of both nonspeech-laugh and laughed-speech. Nonspeech-laugh (NSL), referred to as ‘pure-laugh’ in [135], may have both voiced and unvoiced regions, but only voiced regions are considered in this study. Laughed-speech (LS), referred to as ‘laugh-speech’ in [118], has regions of laughter interspersed with speech, the degree of which is difficult to predict or quantify. Hence, wide variations in the acoustic features of a laughed-speech bout are probable.
The data was collected by eliciting natural laughter responses of the subjects, playing hilarious/comedy audio-visual clips or audio clips of jokes, sourced from online media resources. The subjects were asked to express their responses, in case they really liked the comedy or joke, as one of the following three texts: (i) “It is a good joke.” (ii) “It is really funny.” (iii) “I have enjoyed.” The idea of using predefined texts was to help subjects express their natural responses with minimal text-related variability in acoustic features. The laughter (LS and NSL) responses and normal speech (NS) were recorded in each case. Normal speech of the speakers was also recorded for the utterance of a fourth (neutral) text: (iv) “This is my normal voice.” Both acoustic and EGG signals were recorded in parallel, in each case.

Data was recorded for 11 speakers (7 males and 4 females), all research students at IIIT, Hyderabad. The data has a total of 191 (NSL/LS) laugh calls in 32 utterances, by 11 speakers. The nonspeech-laugh calls occur mostly prior to (or sometimes after) the laughed-speech calls for a text. The data also consists of 130 voiced segments of normal speech in 35 utterances, by 10 speakers. Thus the database has 191 natural laugh (NSL/LS) calls and 130 (NS) voiced segments in a total of 67 utterances, by 11 speakers. The EGG signal was recorded using an EGG recording device [EGGs for Singers [121]]. The device records the current flow between two disc-shaped electrodes, across the glottis. Thus changes in the glottal impedance during vocal fold vibration are captured in the EGG signal, but not the changes in air pressure in the glottal region. The corresponding acoustic signal was recorded using a close-speaking, stand-mounted microphone (Sennheiser ME-03), kept at a distance of around 15 cm from the speaker. The data was collected in normal room conditions, with ambient noise at about 50 dB, using a sampling rate of 48 kHz with 16 bits/sample. The data was downsampled to 10 kHz for the analysis. The ground truth for the data of nonspeech-laugh and laughed-speech was established by listening to each audio file.

Figure 6.1 Illustration of (a) speech signal, and (b) α, (c) F0 and (d) SoE contours for three calls in a nonspeech-laugh bout after utterance of the text “it is a good joke”, by a female speaker.

Figure 6.2 Illustration of (a) speech signal, and (b) α, (c) F0 and (d) SoE contours for a laughed-speech segment of the utterance of the text “it is a good joke”, by a female speaker.

6.3 Analysis of laughter from EGG signal

In the production of laughter, the vocal folds vibrate in a manner similar to normal speech. Hence, the glottal vibration characteristics of laughter are examined from the EGG signal [50]. Changes are examined in the closed phase quotient (α) in each glottal cycle for both cases of laughter, i.e., nonspeech-laugh and laughed-speech, with reference to normal speech. The open/closed phase durations are computed using the differenced EGG (dEGG) signal [58], as illustrated in Fig. 4.6.
Peaks and valleys in the dEGG signal (dex[n]) correspond closely to the positive-going and negative-going zero-crossings in the EGG signal (ex[n]), respectively. Peaks in the dEGG indicate glottal closure instants (GCIs). The region in a glottal cycle from a peak to the following valley in the dEGG is considered the closed phase of duration Tc, and the region from a valley to the next peak in the dEGG the open phase of duration To. The proportion α is computed as α = Tc/(To + Tc) (a code sketch of this computation is given at the end of Section 6.3.2). An illustration of the α contours (obtained from the dEGG signal dex[n]) for the calls in nonspeech-laugh and laughed-speech (voiced) segments produced by a female speaker is shown in Fig. 6.1(b) and Fig. 6.2(b), respectively.

Figure 6.3 Illustration of (a) signal waveform (xin[n]), and (b) EGG signal ex[n], (c) dEGG signal dex[n] and (d) αdex contours, along with V/NV regions (dashed lines). The segment consists of calls in a nonspeech-laugh bout (marked 1-4 in (d)) and a laughed-speech bout (marked 5-8 in (d)) for the text “it is really funny”, produced by a male speaker.

A comparative study is carried out for the calls in nonspeech-laugh and laughed-speech bouts. An illustration of the acoustic signal (xin[n]), EGG signal (ex[n]), dEGG signal (dex[n]) and the α contour is given in Fig. 6.3(a), Fig. 6.3(b), Fig. 6.3(c) and Fig. 6.3(d), respectively, for calls in an NSL bout and an LS bout. The NSL calls are marked in Fig. 6.3(d) as regions 1, 2, 3, 4 and the LS calls as 5, 6, 7, 8. It may be observed in Fig. 6.3(d) that the spread (fluctuation) of α is distinctly larger for nonspeech-laugh calls, in comparison with laughed-speech calls, whose α contour is relatively smoother.

6.3.1 Analysis using the closed phase quotient (α)

In Table 6.1, the average α values (αave) computed for each speaker are given in columns (a), (b) and (c), for normal speech, laughed-speech and nonspeech-laugh, respectively. The corresponding standard deviations in α (σα) are given in columns (d), (e) and (f). In general, αave is observed to be lower and σα higher for laughter (NSL/LS) calls, in comparison with normal speech. Hence, changes in α are better represented by a parameter βα (i.e., β), computed as

βα = (σα / αave) × 100    (6.1)

Table 6.1 Changes in α and F0EGG for laughed-speech and nonspeech-laugh, with reference to normal speech. Columns (a)-(c) are average α, (d)-(f) are σα, (g)-(i) are average βα and (l)-(n) are average F0EGG (Hz) for NS, LS and NSL. Columns (j), (k) are ∆βα (%) and (o), (p) are ∆F0EGG (%) for LS and NSL cases. Note: Si indicates speaker number (i = 1 to 11) and M/F male/female.
(p) (l) (m) (n) (o) (j) (k) Speaker (a) (b) (c) (d) (e) (f) (g) (h) (i) ∆F F ∆F F F σ σ σ 0N SL 0 0 0 0 α α α (M/F) αN SαLS αN SL N S LS N SL βN SβLS βN SL ∆βLS ∆βN SL N S LS N SL LS S1 (M) S2 (M) S3 (F) S4 (M) S5 (M) S6 (F) S7 (M) S8 (M) S9 (F) S10(F) S11(M) .50 .47 .44 .51 .44 .48 .52 .15 .38 .49 .41 .48 .41 .49 .49 .37 .48 .47 .53 .49 .44 .43 .39 .36 .38 .47 .34 .46 .42 .46 .42 .42 .35 .113 .113 .076 .136 .067 .109 .037 .044 .051 .087 .123 .120 .047 .066 .034 .131 .084 .174 .186 .077 .090 .092 Average .116 .162 .206 .115 .129 .153 .106 .138 .197 .138 .185 22 16 15 7 12 26 9 22 22 38 22 19 23 33 22 9 23 25 14 25 35 17 22 23 29 45 54 24 38 33 25 30 47 33 53 38 5 105 46 25 101 -3 57 13 60 -55 -1 31 178 257 237 227 28 179 36 111 -13 142 168 148 249 147 152 235 138 133 212 208 143 168 179 280 149 218 282 126 194 256 355 176 256 183 281 183 235 378 187 221 291 384 182 0 20 12 2 43 20 -9 46 21 71 23 23 52 24 13 25 54 61 36 66 37 85 27 44 The βα values for the three cases of NS, LS and NSL are given in Table 6.1 in columns (g), (h) β −β and (i), respectively. Changes in βα for LS and NSL from NS, i.e., ∆βLS (%) = LSβ N S × 100 β NS −β NS × 100, are given in columns (j) and (k), respectively. The βα and ∆βα and ∆βN SL (%) = N SL βN S values are rounded to integers. The average βα values for NS, LS and NSL cases are 19, 23 and 38, respectively. Across speakers, the increase in βα is higher for nonspeech-laugh than laughed-speech, with reference to normal speech. It means, across speakers the closed phase quotient (αα ) reduces more for nonspeech-laugh than laughed-speech. It is possible that this reduction in the closed phase quotient, is related to the abrupt closure of the vocal folds in each glottal cycle, during production of laughter. This possibility is examined later, in Section 6.5.5. 6.3.2 Analysis using F0 derived from the EGG signal The observation that the instantaneous fundamental frequency (F0 ) rises in a laughter call [16, 11, 85, 118], is confirmed in this study by the analysis from EGG signal. The F0 values (i.e., F0EGG ) are computed from the inverse of the glottal cycle periods (T0EGG ) obtained using the dEGG signal, as illustrated in Fig. 4.6. In Table 6.1, average F0 (in Hz) for the three cases NS, LS and NSL, are given in columns (l), (m) and (n), respectively. Changes in average F0 for LS/NSL from NS, i.e., F −F0N S F −F × 100 are given in columns (o) ∆F0LS (%) = 0LSF0 0N S × 100 and ∆F0N SL (%) = 0N SL F 0 NS NS and (p), respectively. The values are rounded to integers. Across speakers, the changes in average F0 are larger for nonspeech-laugh than for laughed-speech. The average rise in F0 for nonspeech-laugh and laughed-speech, with reference to normal speech, is 44% and 23%, respectively. 113 (b) Average α for female calls 0.5 0.45 0.45 ave 0.5 0.4 α αave (a) Average α for male calls 0.35 0.3 1 0.4 0.35 0.3 2 3 4 1 (c) Average F for male calls 350 (Hz) 350 ave 300 F0 (Hz) ave 400 0 F 4 0 400 250 2 3 Call number 3 (d) Average F for female calls 0 200 1 2 4 300 250 200 1 2 3 Call number 4 Figure 6.4 (Color online) Illustration of inter-calls changes in the average values of ratio α and F0 , for 4 calls each in a nonspeech-laugh bout (solid line) and a laughed-speech bout (dashed line), produced by a male speaker and by a female speaker: (a) αave for NSL/LS male calls, (b) αave for NSL/LS female calls, (c) F0ave for NSL/LS male calls, and (d) F0ave for NSL/LS female calls. 
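The computation of α and βα described in Sections 6.3 and 6.3.1 can be summarised in a short sketch. The following is a minimal numpy/scipy illustration under simplifying assumptions: GCIs and GOIs are taken as peaks and valleys of the dEGG obtained by plain peak picking rather than by the exact procedure of Fig. 4.6, and the F0 bounds used for the minimum peak distance are arbitrary.

import numpy as np
from scipy.signal import find_peaks

def alpha_from_egg(egg, fs, max_f0=500):
    # dEGG peaks approximate GCIs, valleys approximate GOIs.
    degg = np.diff(egg)
    min_dist = int(fs / max_f0)
    gcis, _ = find_peaks(degg, distance=min_dist)
    gois, _ = find_peaks(-degg, distance=min_dist)
    alphas = []
    for gci, next_gci in zip(gcis[:-1], gcis[1:]):
        v = gois[(gois > gci) & (gois < next_gci)]
        if len(v) == 0:
            continue
        tc = v[0] - gci           # closed phase: GCI -> GOI (samples)
        to = next_gci - v[0]      # open phase: GOI -> next GCI (samples)
        alphas.append(tc / (tc + to))
    alphas = np.asarray(alphas)
    if len(alphas) == 0:
        return alphas, float("nan")
    beta_alpha = 100.0 * alphas.std() / alphas.mean()   # eq. (6.1)
    return alphas, beta_alpha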
6.3.3 Inter-call changes in α and F0

It is interesting to observe that the average α values (αave) for calls within an NSL/LS laugh bout also show an inter-call increasing/decreasing trend. An illustration of the inter-call changes in αave for the calls in NSL and LS bouts is given in the voice of a male speaker and a female speaker, in Fig. 6.4(a) and Fig. 6.4(b), respectively. For nonspeech-laugh there is a decrease in αave for successive calls in a bout. An inter-call trend in the average F0 values (F0ave) is also observed for the calls in NSL and LS bouts of a male speaker and a female speaker, as illustrated in Fig. 6.4(c) and Fig. 6.4(d), respectively. The trends of inter-call changes in αave and F0ave are indicative of the (inter-call) growth/decay characteristics of natural laughter bouts. Higher F0 and lower α for NSL calls, compared to LS calls, can also be observed for both the speakers.

The analysis from the EGG/dEGG signal highlights some interesting changes in the glottal vibration characteristics during the production of laughter. But two practical limitations in using the EGG signal are: (i) EGG captures changes in the glottal impedance (related to air flow), not the air pressure in the glottal region, and (ii) collecting the EGG signal may not be practically feasible at all times. Hence, the excitation source characteristics are examined from the acoustic signal in the following sections.

6.4 Modified zero-frequency filtering method for the analysis of laughter

Production characteristics of laughter are analysed in terms of acoustic features derived from both the EGG and acoustic signals. Since there are some practical limitations in collecting the EGG signal, the production characteristics are also derived from the acoustic signal. Excitation source characteristics of modal voicing can be derived from the speech signal using the zero-frequency filtering (ZFF) method [130, 216], but laughter involves rapid changes in the glottal vibration characteristics such as F0 [159]. Hence, in order to capture these changes better, a modified ZFF method is proposed. The steps involved in the proposed modified zero-frequency filtering (modZFF) method for deriving the source characteristics from the acoustic signal of paralinguistic sounds like laughter are as follows (a compact code sketch of these steps is given in Section 6.5):

Figure 6.5 Illustration of (a) acoustic signal waveform (xin[n]), (b) the output (y2[n]) of cascaded pairs of ZFRs, (c) modified ZFF (modZFF) output (zx[n]), and (d) voiced/nonvoiced (V/NV) regions (voiced marked as ‘V’), for calls in a nonspeech-laugh bout of a female speaker.

(a) Preprocess the input acoustic signal xin[n] to obtain a differenced signal s[n], in order to minimize the effect of any slowly varying component in the recording and produce a zero-mean signal.

(b) Pass the differenced speech signal s[n] through a cascade of two ideal digital filters, called zero-frequency resonators (ZFRs) [130]. Each ZFR has a pair of poles located on the unit circle in the z-plane. The output of the two ZFRs is given by

y1[n] = − Σ_{k=1}^{2} a_k y1[n−k] + s[n]
y2[n] = − Σ_{k=1}^{2} a_k y2[n−k] + y1[n]    (6.2)

where a1 = −2, a2 = 1. The successive integration-like operations in the cascaded ZFRs result in a polynomial growth/decay type of trend building up in the output y2[n], as illustrated in Fig. 6.5(b), for the nonspeech-laugh bout signal in Fig. 6.5(a).

(c) The trend removal for modal voicing [130] is normally carried out by subtracting the local mean computed over a window of 2N + 1 samples, whose duration is between 1-2 times the average pitch period (T0). But for nonmodal voices such as laughter, which have rapid changes in the glottal vibration, the trend removal operation is proposed in stages, using gradually reducing window sizes. Window sizes of 20 ms, 10 ms and 5 ms, and then 3 ms, 2 ms and 1 ms, are used for computing the local mean (for the gross trend first, then for finer changes). This is the key step that differs from the ZFF method discussed in Section 3.5(c). The resultant output after each stage is given by

ỹ2[n] = y2[n] − (1/(2N+1)) Σ_{l=−N}^{N} y2[n+l]    (6.3)

where each gradually reducing window consists of 2N + 1 samples. The final trend-removed output is called the modified zero-frequency filtered (modZFF) output (zx[n]). An illustration of the modZFF output signal is shown in Fig. 6.5(c), for the acoustic signal in Fig. 6.5(a). The detection of voiced/nonvoiced (V/NV) regions [39] is based upon the framewise energy of the modZFF signal (zx[n]). An illustration of V/NV regions in a laughter acoustic signal is shown in Fig. 6.5(d).

(d) The modZFF signal (zx[n]) carries information of the glottal vibrations within each glottal cycle due to the opening/closing of the vocal folds, and also information related to the high pitch frequency. In order to highlight the glottal cycle characteristics better, in particular the locations of the glottal closure instants (GCIs), i.e., epochs, the Hilbert envelope is obtained from the modZFF output signal zx[n]. The Hilbert envelope (hz[n]) of the signal z[n] (i.e., zx[n]) is given by [170, 137]

hz[n] = √( z²[n] + zH²[n] )    (6.4)

where zH[n] denotes the Hilbert transform of z[n]. The Hilbert transform zH[n] of the signal z[n] is given by [170, 137]

zH[n] = IFT( ZH(ω) )    (6.5)

where IFT denotes the inverse Fourier transform, and ZH(ω) is given by [137]

ZH(ω) = +jZ(ω) for ω ≤ 0, and ZH(ω) = −jZ(ω) for ω > 0    (6.6)

Here Z(ω) denotes the Fourier transform of the signal z[n]. An illustration of the Hilbert envelope (hz[n]) of a few cycles of the modZFF output is shown in Fig. 6.6(c), along with the modZFF output signal (zx[n]) in Fig. 6.6(b) and the acoustic signal (xin[n]) in Fig. 6.6(a). The corresponding EGG signal (ex[n]) is shown in Fig. 6.6(d), for comparison with the glottal pulse characteristics.

Figure 6.6 Illustration of (a) signal (xin[n]), and (b) modZFF output (zx[n]), (c) Hilbert envelope of modZFF output (hz[n]), and (d) EGG signal (ex[n]), for a nonspeech-laugh call by a female speaker.

(e) In the case of modal voicing, the locations of the (negative to positive going) zero-crossings of the ZFF signal correspond to the GCIs (epochs) [130]. The instantaneous fundamental frequency (F0) is computed from the inverse of the period (T0), i.e., the interval between successive epochs [216]. But in the case of a nonmodal voice like laughter, the epochs (i.e., GCIs) are obtained using the Hilbert envelope of the modZFF signal (hz[n]). Epochs for a nonspeech-laugh call are shown marked with arrows in Fig. 6.6(c). The interval between successive epochs gives the period (T0), and the inverse of T0 gives the F0 (i.e., F0ZFF) for the nonmodal voices.
The relative strength of the impulse-like excitation (SoE) is derived from the slope of the Hilbert envelope of the ZFF signal (hz[n]) at these epoch locations. An illustration of the changes in the F0 and SoE contours for the calls in nonspeech-laugh and laughed-speech bouts of a female speaker is shown in Fig. 6.1(c) and (d), and Fig. 6.2(c) and (d), respectively. For a better demonstration of the relative changes in nonspeech-laugh vs normal speech, an illustration of the F0 and SoE contours for laughter derived from the acoustic signal is given in Fig. 6.7(b) and (c), respectively, for the speech signal shown in Fig. 6.7(a).

(f) The intervals between all successive (negative to positive going) zero-crossings of the modZFF output signal (zx[n]) for laughter are used to compute the density of excitation impulses (dI). Since all negative to positive going zero-crossings (not necessarily epochs alone) indicate significant excitation at those instants, these are considered useful in examining the characteristics of laughter. Changes in the excitation impulse density (dI) are observed to be significantly higher for laughter than for normal speech, the details of which are discussed in the next section.

6.5 Analysis of source and system characteristics from acoustic signal

The excitation source characteristics of laughter are derived from the acoustic signal using the modified ZFF (modZFF) method discussed in Section 6.4. Three features, (i) the instantaneous fundamental frequency (F0ZFF), (ii) the density of excitation impulses (dI) and (iii) the strength of impulse-like excitation (SoE), are extracted from the modZFF signal (zx[n]). Changes in the source characteristics are analysed in two ways: (i) by measuring the degree of changes, and (ii) by capturing the temporal changes in these features. Parameters capturing the degree of changes in the features are computed using the average values and standard deviations, whereas the temporal parameters capture the temporal changes in the features. Changes in the features and the parameters derived from them are compared for laughed-speech and nonspeech-laugh, with reference to normal speech.

Figure 6.7 Illustration of (a) signal (xin[n]), and contours of (b) F0, (c) SoE (ψ), and (d) FD1 (“•”) and FD2 (“◦”) with V/NV regions (dashed lines), for calls in a nonspeech-laugh bout of a male speaker.

The negative to positive going zero-crossings of the modZFF signal (zx[n]) indicate impulse-like excitation at those time instants. Some of these zero-crossings correspond to epochs (i.e., GCIs), but all zero-crossings are important in the production of laughter. Hence, a feature ‘density of the excitation impulses’ (dI), computed at the instants of all zero-crossings of the modZFF signal (zx[n]), is used for examining changes in the excitation around these instants. Changes in dI are observed to be higher for laughter, in comparison to normal speech. These additional time instants are captured by using the modified ZFF method; they are otherwise neither captured in the EGG/dEGG signal nor highlighted by the ZFF method [130, 216]. An illustration of a few glottal cycles of the acoustic signal (xin[n]), EGG signal (ex[n]) and modZFF output (zx[n]) is given in Fig. 6.8(a), Fig. 6.8(b) and Fig. 6.8(c), respectively.
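A compact sketch of the modZFF processing of Section 6.4 and of the extraction of F0, SoE and dI is given below. It is only an illustration under stated assumptions: the cascade of two ZFRs is implemented as four cumulative sums (equivalent to the recursion in eq. 6.2), the staged window schedule follows step (c), and the epoch picking via Hilbert-envelope peaks, the SoE proxy based on the envelope slope, and the maximum-F0 bound are simplifying assumptions rather than the exact procedure of the thesis.

import numpy as np
from scipy.signal import hilbert, find_peaks

def modzff(x, fs, windows_ms=(20, 10, 5, 3, 2, 1)):
    # Step (a): differenced, approximately zero-mean signal.
    s = np.diff(x.astype(float), prepend=x[0])
    # Step (b): cascade of two zero-frequency resonators = four integrations.
    y = s.copy()
    for _ in range(4):
        y = np.cumsum(y)
    # Step (c): staged trend removal with gradually reducing mean windows.
    for w_ms in windows_ms:
        N = max(1, int(round(w_ms * 1e-3 * fs / 2)))
        kernel = np.ones(2 * N + 1) / (2 * N + 1)
        y = y - np.convolve(y, kernel, mode="same")
    return y / (np.max(np.abs(y)) + 1e-12)          # modZFF output z_x[n]

def source_features(x, fs, max_f0=500.0):
    zx = modzff(x, fs)
    hz = np.abs(hilbert(zx))                         # Hilbert envelope h_z[n] (step (d))
    # Step (e): epochs approximated by envelope peaks (assumption).
    epochs, _ = find_peaks(hz, distance=int(fs / max_f0))
    t0 = np.diff(epochs) / fs                        # glottal cycle periods
    f0 = 1.0 / t0                                    # instantaneous F0 (Hz)
    soe = np.abs(np.gradient(hz))[epochs]            # SoE proxy: envelope slope at epochs
    # Step (f): density of -ve to +ve zero crossings of the modZFF signal.
    zc = np.where((zx[:-1] < 0) & (zx[1:] >= 0))[0]
    d_i = len(zc) * fs / len(x)                      # impulses per second
    return f0, soe, d_i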
The presence of impulse-like excitation in between some successive epochs can be observed in the acoustic signal and in the modZFF signal in Fig. 6.8(a) and Fig. 6.8(c), respectively. However, the same is visible neither in the EGG signal in Fig. 6.8(b) nor in the dEGG. This presence of more than one pulse in a pitch period is possibly related to the secondary excitation also observed in the context of LPC vocoders [67, 8]: “These results suggest that the excitation for voiced speech should consist of several pulses in a pitch period, rather than just one pulse at the beginning of the period” [8]. The role of these secondary excitation pulses is examined later in this section.

Figure 6.8 Illustration of a few glottal cycles of (a) the acoustic signal (xin[n]), (b) the EGG signal ex[n] and (c) the modified ZFF output signal zx[n], for a nonspeech-laugh call produced by a male speaker.

6.5.1 Analysis using F0 derived by the modZFF method

The instantaneous fundamental frequency (F0, i.e., F0ZFF) is also derived from the acoustic signal using the modified ZFF method proposed in Section 6.4. The F0 value is computed from the inverse of the interval (T0) between successive epochs, which are derived using the Hilbert envelope (hz[n]) of the modZFF output signal (zx[n]). A trend-removed and smoothed Hilbert envelope is used. In order to distinguish it from F0EGG, we refer to it as F0ZFF. The intra-call changes in F0ZFF in each laugh call show an increasing (or decreasing) trend, as can be observed in Fig. 6.7(b). The observations are in line with the changes in T0 for laughter reported earlier [185]. Changes in F0ZFF are analysed by measuring the degree of changes and the temporal (intra-call) changes in the F0ZFF contour.

[A.] Measuring the degree of changes in F0

In Table 6.2, the average F0ZFF values (F0ave) for each speaker are given in columns (a), (b) and (c) for NS, LS and NSL, respectively. The corresponding standard deviations in F0ZFF (σF0) are given in columns (d), (e) and (f). The values are rounded to integers. Two observations can be made. (i) The trend in average F0ZFF for the NS, LS and NSL cases in columns (a)-(c) of Table 6.2 is mostly in line with the trend in average F0EGG for the three cases in columns (l)-(n) of Table 6.1. (ii) In columns (a)-(c) and (d)-(f) of Table 6.2, F0ave and σF0 (the spread in F0) are higher for laughter (LS/NSL) in comparison with normal speech. Although F0EGG is expected to be closer to the true F0 than F0ZFF, since F0ZFF and the other features are derived from the acoustic signal, which can be collected more easily, and since F0EGG and F0ZFF are mostly comparable, F0ZFF is termed F0 to represent the instantaneous fundamental frequency in the rest of this study.

Table 6.2 Changes in F0ZFF and the temporal measure for F0 (i.e., θ) for laughed-speech and nonspeech-laugh, with reference to normal speech. Columns (a)-(c) are average F0ZFF (Hz), (d)-(f) are σF0 (Hz), (g)-(i) are average γ1 and (l)-(n) are average θ values for NS, LS and NSL. Columns (j), (k) are ∆γ1 (%) and (o), (p) are ∆θ (%) for LS and NSL cases. Note: Si indicates speaker number (i = 1 to 11) and M/F male/female.
(p) (j) (k) (l) (m) (n) (o) Speaker (a) (b) (c) (d) (e) (f) (g) (h) (i) (M/F) F0N S F0LS F0N SL σF0N SσF0LSσF0N SLγ1N Sγ1LSγ1N SL∆γ1LS∆γ1N SLθN S θLS θN SL ∆θLS ∆θN SL S1 (M) S2 (M) S3 (F) S4 (M) S5 (M) S6 (F) S7 (M) S8 (M) S9 (F) S10(F) S11(M) 184 187 273 146 158 261 157 142 240 262 169 193 217 279 150 219 276 152 211 259 339 186 269 223 308 236 259 378 294 260 298 344 209 57 58 76 19 26 69 40 30 63 43 51 85 66 89 31 58 83 46 85 60 79 48 Average 98 73 144 100 97 72 129 97 117 160 88 10 11 21 3 4 18 6 4 15 11 9 10 16 14 25 5 13 23 7 18 16 27 9 16 27 16 44 24 25 27 38 25 35 55 18 30 56 33 19 71 208 27 10 317 4 140 4 153 52 113 762 513 50 502 487 133 392 113 .25 .40 .32 .31 .29 .46 .34 .34 .71 .74 .60 .43 .53 1.94 .51 1.40 1.95 7.05 .60 1.87 .63 1.89 .30 .67 .32 .51 .33 1.29 .64 1.02 .95 .99 .56 .40 .67 1.73 114 28 516 91 116 -35 -5 -4 -11 29 -6 689 252 2126 498 545 46 49 275 42 34 -33 Since both F0ave and σF0 increase for laughter, a parameter γ1 that reflects changes in the F0ZF F is γ1 = F0ave × σF0 (6.7) 1000 Here F0ave and σF0 are computed over each laughter (LS or NSL) call or NS voiced segment. In Table 6.2, average γ1 for each speaker are given in columns (g), (h) and (i) for the three cases NS, LS γ −γ and NSL, respectively. Changes in average γ1 for LS and NSL from NS, i.e., ∆γ1LS (%) = 1LSγ1 1N S × γ NS −γ 1N S 100 and ∆γ1N SL (%) = 1N SL × 100, are given in columns (j) and (k), respectively. The values γ 1N S are rounded to integers. The average γ1 values for NS, LS and NSL are 10, 16 and 30, respectively. Across speakers, the ∆γ1 are higher for NSL than LS (columns (j), (k)). The changes in γ1 and ∆γ1 for the three cases indicate larger degree of changes in F0ZF F for nonspeech-laugh than laughed-speech. [B.] Measuring temporal changes in F0 Temporal gradient-related changes in F0 (i.e., F0ZF F ) contour are captured through a parameter θ, computed for each laughter (LS and NSL) call and (NS) voiced segment. Temporal parameter θ has two constituents, a monotonicity factor and a duration factor. (i) The monotonicity factor (mF0 ) captures the monotonically increasing (or decreasing for some speakers) trend of F0 within a call. It is the sum of ∆F0 of similar signs, computed over each window of size of 5 successive pitch periods. Here, ∆F0 is the change in F0 over successive epochs in a laugh call or NS voiced segment. The factor mF0 is: mF0 = n X 5 X ∆F0 |+(or −) , i = 1, 2, ..., n, j = 1, 2, ...5 (6.8) i=1 j=1 where n is number of such windows in a call and j is index of ∆F0 values of same sign within each window. (ii) The duration factor (δtF0 ) captures the percentage duration of a call that has similar signs 120 of ∆F0 in each window. It is given by δtF0 = NGC+(or −) NGCseg × 1 tdseg (6.9) where NGC+(or −) is number of glottal cycles (epochs) having ∆F0 of same sign (+ or −, whichever has larger count), and NGCseg is total number of epochs in a laugh call. The tdseg is call duration (ms), used in denominator for normalization. The temporal parameter θ = |mF0 × δtF0 | for a call is given by X 5 n X NGC+(or −) 1 θ= (6.10) × ∆F0 |+(or −) × NGCseg tdseg i=1 j=1 where i is index of the window within a call and j is index of ∆F0 of same signs in each window. In Table 6.2, average θ values for each speaker are given in columns (l), (m) and (n) for NS, LS and NSL, respectively. Changes in average θ for LS and NSL bouts from NS, i.e., ∆θLS (%) = θ −θN S θLS −θN S × 100 and ∆θN SL (%) = N SL × 100, are given in columns (o) and (p), respectively. 
θN S θN S Average θ for NS, LS and NSL are 0.43, 0.67 and 1.73, respectively. The higher value of θ (e.g., above 1.0) indicates larger content of laughter. Larger changes in temporal parameter θ are observed for nonspeech-laugh, than for laughed-speech. For laughter, the temporal changes in F0ZF F observed using the parameter θ (in columns (l)-(p)), are in line with degree of changes in F0ZF F observed using the parameter γ1 (in columns (g)-(k)). 6.5.2 Analysis using the density of excitation impulses (dI ) It may be observed in Fig. 6.8(c) that the modified ZFF signal (zx [n]) has some extra zero-crossings within almost each glottal cycle, in addition to those corresponding to the epochs. Whether these additional zero-crossings are related to the characteristics of excitation source or vocal tract system, can be ascertained by observing the spectrograms in Fig. 6.9, for a few nonspeech-laugh calls. The spectrogram of a epoch sequence that has impulses located at epochs with amplitude as SoE, in Fig. 6.9(c), shows only the source characteristics. It can be compared with the spectrogram shown in Fig. 6.9(d), for a sequence of impulses located at all negative to positive going zero-crossings of zx [n] with amplitude representing the strength of excitation at these instants. Both spectrograms in Fig. 6.9(c) and Fig. 6.9(d) for are quite similar. It may also be noticed that the spectrogram in Fig. 6.9(d) does not show any formants-like system characteristics that can be observed in the spectrogram in Fig. 6.9(b) for the acoustic signal. Similar observations are made from spectrograms of laughter calls of other speakers. Hence, it may be inferred that the additional zero-crossings in zx [n] are related to the characteristics of the excitation source only, and not of the vocal tract system. In order to highlight the glottal cycle characteristics (at epochs) shown in Fig. 6.6(c), the additional zero-crossings can be suppressed by using the Hilbert envelope (hz [n]) of the modZFF output, as discussed in Section 6.4. But, these additional zero-crossings can also help in discriminating the three cases (NS, LS and NSL). A feature ‘density of excitation impulses’ (dI ) is obtained from all successive 121 xin[n] (a) Input acoustic signal 1 0 −1 Frequency (Hz) Frequency (Hz) 4000 Frequency (Hz) (b) Spectrogram of signal 4000 4000 3000 2000 1000 (c) Spectrogram of epoch sequence 3000 2000 1000 (d) Spectrogram of impulse sequence 3000 2000 1000 0.1 0.15 0.2 0.25 0.3 0.35 0.4 Time (sec) 0.45 0.5 0.55 0.6 Figure 6.9 Illustration of (a) acoustic signal (xin [n]), and spectrograms of (b) signal, (c) epoch sequence (using the modified ZFF method) and (d) sequence of impulses at all (negative to positive going) zerocrossings of zx [n] signal, for few nonspeech-laugh calls produced by a male speaker. negative to positive going zero-crossings of the modZFF signal (zx [n]). Feature dI represents instantaneous density of the impulse-like excitation at such zero-crossings in the unit ‘number of impulses per sec’. Like for F0 , the changes in dI are also analysed in two ways, by measuring the degree of changes and temporal (intra-call) changes in the dI contour. [A.] Measuring the degree of changes in feature dI In Table 6.3, the average dI values (dIave ) for each speaker are given in columns (a), (b) and (c) for the three cases NS, LS and NSL, respectively. Corresponding standard deviations in dI (σdI ) are given in columns (d), (e) and (f). The values are rounded to integers. 
It may be observed from columns (a)-(c) and (d)-(f) that average values of dI and its spread (σdI ) are in general higher for NSL than for LS. Since, both dIave and σdI increase for laughter, a parameter γ2 that reflects changes in dI is γ2 = dIave × σdI (6.11) 1000 where dIave and σdI are computed over each LS/NSL call or a NS voiced region. For each speaker, the average γ2 values for the three cases NS, LS and NSL are given in columns (g), (h) and (i), respectively. γ −γ Changes in γ2 for LS and NSL from NS, i.e., ∆γ2LS (%) = 2LSγ2 2N S × 100 and ∆γ2N SL (%) = NS γ2N SL −γ2N S γ 2N S × 100, are given in columns (j) and (k), respectively. The values are rounded to integers. The average γ2 values for NS, LS and NSL, are 54, 48 and 69, respectively. For most of the speakers, |∆γ2 |N SL > |∆γ2 |LS , i.e., the degree of changes in γ2 is more for nonspeech-laugh than for laughedspeech, with reference to normal speech. 122 Table 6.3 Changes in dI and temporal measure for dI (i.e., φ) for laughed-speech and nonspeech-laugh, with reference to normal speech. Columns (a)-(c) are average dI (Imps/sec), (d)-(f) are σdI (Imps/sec), (g)-(i) are average γ2 and (l)-(n) are average φ values for NS, LS and NSL. Columns (j), (k) are ∆γ2 (%) and (o), (p) are ∆φ (%) for LS and NSL cases. Note: Si indicates speaker number (i = 1 to 11) and M/F male/female. (p) (j) (o) (k) (l) (m) (n) Speaker (a) (b) (c) (d) (e) (f) (g) (h) (i) (M/F) dIN S dILS dIN SL σdIN SσdILSσdIN SL γ2N Sγ2LSγ2N SL∆γ2LS ∆γ2N SLφN S φLS φN SL ∆φLS ∆φN SL S1 (M) S2 (M) S3 (F) S4 (M) S5 (M) S6 (F) S7 (M) S8 (M) S9 (F) S10(F) S11(M) 436 461 424 378 423 486 428 425 458 427 428 426 486 495 382 445 482 455 453 469 408 402 581 520 540 532 486 557 557 489 498 467 376 106 132 142 78 100 162 129 118 136 125 136 99 134 90 83 74 137 143 89 136 84 93 137 165 70 156 125 162 199 98 180 59 134 Average 46 61 60 29 42 79 55 49 62 53 58 54 42 65 45 32 33 66 65 41 64 34 38 48 79 86 38 83 61 90 111 48 90 28 50 69 -9 7 -26 7 -22 -16 17 -19 2 -36 -36 71 40 -37 181 44 15 100 -4 44 -48 -14 89 109 122 46 78 126 95 89 100 91 109 96 75 114 57 47 60 102 103 60 114 44 77 78 147 193 84 178 152 146 220 119 146 31 101 138 -16 4 -54 3 -22 -19 9 -32 14 -52 -29 66 77 -31 288 96 16 133 33 46 -66 -8 [B.] Measuring temporal changes in feature dI Changes in the dI contours are observed to be more rapid than those in F0ZF F contours. Hence, temporal changes in the dI contour are captured by a parameter φ, that is computed using ∆dI between all successive negative to positive going zero-crossings of the modZFF signal (zx [n]). The temporal measure for dI , i.e., parameter φ is given by N 1 X ∆ ∆dI φ= ∆t ∆t , N i = 1, 2, ..., N (6.12) i=1 where N is number of (negative to positive going) zero-crossings of zx [n]. Hence, parameter φ captures the rate of temporal change in dI for a call, computed per second. An illustration of the temporal measure φ is given in Fig. 6.10. Larger changes in ∆dI can be observed for NSL than LS calls. Also, parameter φ is helpful in discriminating between regions of nonspeech-laugh and laughed-speech. In Table 6.3, average φ values for each speaker are given in columns (l), (m) and (n), for NS, LS and NSL, respectively. Changes in φ for LS and NSL in comparison with NS, i.e., ∆φLS (%) = φLS −φN S φ −φN S × 100 and ∆φN SL (%) = N SL × 100, are given in columns (o) and (p), respectively. φN S φN S Average φ values for NS, LS and NSL, are 96, 78 and 138, respectively. In general, higher changes in average φ for nonspeech-laugh can be observed in column (p). 
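The temporal parameters θ (for F0, eqs. 6.8-6.10) and φ (for dI, eq. 6.12) can be sketched as below. This is only one plausible reading of the definitions given above, and the edge-case handling, the use of the number of F0 differences in place of the exact epoch count NGCseg, and the function names are assumptions.

import numpy as np

def theta_parameter(f0, epoch_times_ms):
    # Monotonicity factor: sum of same-signed F0 changes over windows of five
    # pitch periods; duration factor: fraction of cycles with that sign,
    # normalised by the call duration in ms. theta = |m_F0 * dt_F0|.
    df0 = np.diff(np.asarray(f0, dtype=float))
    duration_ms = float(epoch_times_ms[-1] - epoch_times_ms[0])
    if len(df0) == 0 or duration_ms <= 0:
        return 0.0
    sign = 1.0 if (df0 > 0).sum() >= (df0 < 0).sum() else -1.0   # dominant sign
    m = 0.0
    for start in range(0, len(df0), 5):
        w = df0[start:start + 5]
        m += w[np.sign(w) == sign].sum()
    n_same = (np.sign(df0) == sign).sum()
    dt = (n_same / len(df0)) / duration_ms
    return abs(m * dt)

def phi_parameter(zc_times_s):
    # Mean absolute second-order rate of change of the instantaneous impulse
    # density, computed at the -ve to +ve zero crossings of the modZFF signal.
    t = np.asarray(zc_times_s, dtype=float)
    if len(t) < 4:
        return 0.0
    dt = np.diff(t)
    d_i = 1.0 / dt                      # local impulse density (impulses/s)
    first = np.diff(d_i) / dt[1:]       # d(dI)/dt
    second = np.diff(first) / dt[2:]    # d/dt of the above
    return float(np.mean(np.abs(second)))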
Interestingly, the changes in the temporal measure φ (i.e., ∆φ) in columns (o) and (p), are mostly in line with changes in the parameter γ2 (i.e., ∆γ2 ) that captures degree of changes in dI , in columns (j), (k). Both parameters γ2 and φ, derived from the feature dI (impulse density), are helpful in discriminating between laughter and normal speech. 123 (a) Input acoustic signal xin[n] 1 0 −1 (b) Changes in temporal parameter for dI (φ) 600 1 2 3 0.1 0.3 0.5 5 4 6 8 7 |φ| 400 200 0 0.7 0.9 1.1 1.3 Time (sec) 1.5 1.7 1.9 2.1 Figure 6.10 Illustration of changes in the temporal measure for dI , i.e., φ, for NSL and LS calls. (a) Acoustic signal (xin [n]). (b) φ for NSL and LS calls, i.e., regions 1-4 and 5-8, respectively. The signal segment is for the text “it is really funny” produced by a male speaker. 6.5.3 Analysis using the strength of excitation (SoE) In the production of laughter, changes in the characteristics of the glottal excitation source are reflected in two ways: (i) in the locations of epochs (GCIs) or all positive going zero-crossings of the modZFF signal (zx [n]), that is manifested as changes in the features F0ZF F and dI , respectively, and (ii) in the amplitude represented as the strength of impulse-like excitation (SoE, i.e., ψ) at the epochs. Similar to F0ZF F and dI , changes in the SoE for laughter (LS/NSL) calls, with reference to normal speech (NS), are also analysed in two ways, by measuring the degree of changes and temporal (intracall) changes in the SoE contour. [A.] Measuring the degree of changes in SoE (ψ) In Table 6.4, the average SoE values (ψave ) for each speaker are given in columns (a), (b) and (c) for NS, LS and NSL, respectively. Corresponding standard deviations in ψ (i.e., σψ ) are given in columns d), (e) and (f). Changes in both ψave and σψ are captured by a parameter γ3 , given by γ3 = σψ × 100 ψave (6.13) where σψ and ψave are computed for each LS/NSL call and NS voiced region. The average γ3 computed for each speaker are given in Table 6.4 in columns (g), (h) and (i), for the three cases NS, LS γ −γ and NSL, respectively. Changes in γ3 for LS and NSL from NS, i.e., ∆γ3LS (%) = 3LSγ3 3N S × 100 γ NS −γ 3N S and ∆γ3N SL (%) = 3N SL × 100, are given in columns (j) and (k), respectively. The values are γ 3N S rounded to integers. The average γ3 values for NS, LS and NSL are 57, 60 and 65, respectively. For most speakers, the changes in parameter γ3 , with reference to normal speech, are larger for nonspeech-laugh than for laughed-speech. 124 Table 6.4 Changes in SoE (i.e., ψ) and temporal measure for SoE (i.e., ρ) for laughed-speech and nonspeech-laugh, with reference to normal speech. Columns (a)-(c) are average ψ, (d)-(f) are σψ , (g)-(i) are average γ3 and (l)-(n) are average ρ values for NS, LS and NSL. Columns (j), (k) are ∆γ3 (%) and (o), (p) are ∆ρ (%) for LS and NSL cases. Note: Si indicates speaker number (i = 1 to 11) and M/F male/female. 
(j) (p) (k) (l) (m) (n) (o) Speaker (a) (b) (c) (f) (d) (e) (g) (h) (i) (M/F) ψN S ψLS ψN SL σψN S σψLS σψN SL γ3N Sγ3LS γ3N SL∆γ3LS ∆γ3N SLρN S ρLS ρN SL ∆ρLS ∆ρN SL S1 (M) S2 (M) S3 (F) S4 (M) S5 (M) S6 (F) S7 (M) S8 (M) S9 (F) S10(F) S11(M) .262 .289 .282 .289 .321 .303 .327 .363 .299 .278 .273 .238 .318 .278 .233 .278 .207 .449 .294 .283 .132 .251 .222 .260 .178 .320 .341 .316 .204 .310 .332 .162 .192 .126 .181 .172 .152 .158 .195 .184 .223 .176 .164 .163 .137 .199 .140 .120 .131 .149 .258 .167 .162 .079 .149 .161 .128 .084 .175 .176 .256 .152 .184 .210 .099 .126 Average 48 60 60 49 48 65 54 62 59 59 60 57 58 63 51 49 45 95 61 57 58 62 57 60 71 52 63 55 55 81 80 58 66 64 69 65 21 4 -15 0 -6 48 13 -7 -3 5 -5 47 -14 5 13 16 25 47 -5 11 7 14 16 36 39 41 37 39 36 59 41 54 38 39 35 35 46 25 27 30 37 29 50 19 53 35 60 27 75 43 73 119 26 10 124 17 27 54 116 -3 15 -39 -28 -21 5 -51 20 -65 39 265 -25 89 5 94 208 -28 -84 199 -68 -29 [B.] Measuring temporal changes in SoE (ψ) Temporal changes in SoE (ψ) contour are captured through a parameter ρ, similar to parameter θ for F0ZF F . Like for θ, the parameter ρ also comprises of two factors, a monotonicity factor (mψ ) and a duration factor (δtψ ). (i) The monotonicity factor (mψ ) is computed in a similar way as for mF0 using (6.8). The only difference is that here only the (+)ve signed ∆ψ values are considered in each window. It is because the ψ contour is expected to have more regions of monotonically increasing SoE within a laugh call. (ii) The duration factor (δtψ ) is computed in the same way as δtF0 using (6.9). The temporal measure for ψ, i.e., parameter ρ = mψ × δtψ , is computed as: mψ = 5 n X X ∆ψ|+ , i = 1, 2, ..., n, j = 1, 2, ...5 (6.14) i=1 j=1 NGC+ 1 × NGC tdseg seg X 5 n X N 1 GC + ρ = × ∆ψ|+ × NGCseg tdseg i=1 j=1 δtψ = (6.15) (6.16) where i is index of window (each of size of 5 successive pitch periods) in a call, n is number of such windows in a laugh call, and j is index of ∆ψ values of same (+ve) sign within each window. Here, NGC+ is number of glottal cycles (epochs) having ∆ψ of same (+)ve sign, and NGCseg is total number of epochs within a laugh call or NS voiced segment of duration tdseg (in ms). In Table 6.4, the average values of parameter ρ for each speaker are given in columns (l), (m) and (n) for NS, LS and NSL, respectively. Changes in average ρ for LS and NSL from NS, i.e., ∆ρLS (%) = 125 Table 6.5 Changes in FD1ave and σFD1 for laughed-speech (LS) and non-speech laugh (NSL) in comparison to those for normal speech (NS). Columns (a),(b),(c) are FD1ave (Hz) and (d),(e),(f) are σFD1 (Hz) for the three cases NS, LS and NSL. Columns (g),(h),(i) are the average ν1 values computed for NS, LS and NSL, respectively. Columns (j) and (k) are ∆ν1 (%) for LS and NSL, respectively. Note: Si below means speaker number i (i = 1 to 11), and M/F indicates male/female. 
Speaker# (M/F) | (a) FD1ave,NS (Hz) | (b) FD1ave,LS (Hz) | (c) FD1ave,NSL (Hz) | (d) σFD1,NS (Hz) | (e) σFD1,LS (Hz) | (f) σFD1,NSL (Hz) | (g) ν1,NS | (h) ν1,LS | (i) ν1,NSL | (j) ∆ν1,LS (%) | (k) ∆ν1,NSL (%)
S1 (M) | 1604 | 1637 | 1296 | 413 | 378 | 355 | 66.3 | 62.0 | 46.1 | -6.49 | -30.53
S2 (M) | 1689 | 1700 | 1738 | 309 | 242 | 228 | 52.2 | 41.1 | 39.6 | -21.31 | -24.22
S3 (F) | 1324 | 1555 | 1659 | 675 | 449 | 442 | 89.4 | 69.8 | 73.4 | -21.97 | -17.94
S4 (M) | 1503 | 1505 | 1346 | 458 | 486 | 369 | 68.8 | 73.2 | 49.7 | 6.37 | -27.80
S5 (M) | 1244 | 1251 | 1173 | 501 | 457 | 512 | 62.3 | 57.2 | 60.1 | -8.24 | -3.64
S6 (F) | 1088 | 1175 | 1242 | 671 | 735 | 275 | 73.0 | 86.3 | 34.1 | 18.28 | -53.23
S7 (M) | 1588 | 1326 | 1422 | 383 | 637 | 318 | 60.9 | 84.5 | 45.1 | 38.79 | -25.79
S8 (M) | 1429 | 1483 | 1502 | 499 | 374 | 301 | 71.3 | 55.5 | 45.2 | -22.10 | -36.57
S9 (F) | 1279 | 1379 | 1537 | 538 | 781 | 623 | 68.9 | 107.7 | 95.8 | 56.39 | 39.06
S10 (F) | 1265 | 1314 | 939 | 747 | 508 | 267 | 94.5 | 66.8 | 25.1 | -29.36 | -73.49
S11 (M) | 1759 | 1978 | 1869 | 406 | 336 | 488 | 71.3 | 66.5 | 91.2 | -6.72 | 27.93
Average | | | | | | | 70.8 | 70.1 | 55.0 | |

(ρLS − ρNS)/ρNS × 100 and ∆ρNSL (%) = (ρNSL − ρNS)/ρNS × 100, are given in columns (o) and (p), respectively. The values are rounded to integers. The average ρ values for NS, LS and NSL are 39, 35 and 54, respectively. In general, the degree of change in the parameter ρ (i.e., ∆ρ) is larger for nonspeech-laugh than for laughed-speech, with reference to normal speech. Also, the temporal changes in SoE (ψ) measured using the parameter ρ are mostly in line with the degree of changes in SoE captured using the parameter γ3.

6.5.4 Analysis of vocal tract system characteristics of laughter

Since rapid changes occur in the excitation source characteristics during the production of laughter, it is possible that associated changes also occur in the vocal tract system characteristics. Hence, the system characteristics are also examined. Features such as the first two dominant frequencies (FD1 and FD2) are derived from the speech signal using LP analysis [112] and the group delay method [128, 129], discussed in Section 3.7.

In Table 6.5, the average FD1 values (FD1ave) are given in columns (a), (b) and (c), for the three cases NS, LS and NSL, respectively. The corresponding average values of the standard deviation in FD1 (σFD1) for all speakers are given in columns (d), (e) and (f). All the values are rounded to the nearest integer. In general, the average FD1 (FD1ave) and its spread (σFD1) are observed to be lower for laughter (LS/NSL) in comparison to those for normal speech, as can be seen from columns (a)-(c) and (d)-(f). Hence, a single parameter ν1 representing both FD1ave and σFD1 is computed as

ν1 = (FD1ave × σFD1) / 10000    (6.17)

where FD1ave and σFD1 are computed over a voiced region of NS or an LS/NSL call. The values of ν1 for NS, LS and NSL are given in columns (g), (h) and (i), respectively. The average ν1 values across speakers are 70.8, 70.1 and 55.0 for NS, LS and NSL, respectively. The percentage changes in ν1 for LS/NSL bouts in comparison to that for NS, i.e., ∆ν1LS = (ν1LS − ν1NS)/ν1NS × 100 and ∆ν1NSL = (ν1NSL − ν1NS)/ν1NS × 100, are given in columns (j) and (k), respectively.

Figure 6.11 (Color online) Illustration of the distribution of FD2 vs FD1 for nonspeech-laugh (“•”) and laughed-speech (“◦”) bouts of a male speaker. The points are taken at GCIs in the respective calls.

Similarly, the average values of FD2 (FD2ave) and the standard deviation in FD2 (σFD2) are obtained for the three cases.
The single parameter ν2, representing both FD2ave and σFD2, and the percentage changes in ν2, i.e., ∆ν2LS and ∆ν2NSL, are computed in a way similar to that for FD1. The average values of FD1 and FD2 (FD1ave and FD2ave) are computed for each speaker, for the three cases (NS, LS and NSL). The corresponding values of the standard deviation in FD1 (σFD1) and in FD2 (σFD2) are also computed.

An illustration of the distribution of FD2 vs FD1 for LS and NSL bouts produced by a male speaker is given in Fig. 6.11. The points (FD1ave, FD2ave) are marked as centroids for the LS and NSL bouts. It may be observed that distinct clusters are formed by the relative distribution of FD1 and FD2 for nonspeech-laugh and laughed-speech, discriminating the two. Also, for some speakers, the average FD1 and FD2 values (FD1ave and FD2ave) and their respective spreads (σFD1 and σFD2) are observed to be lower for laughter (NSL and LS) than for normal speech. However, the observation is not consistent across all speakers. The average values of the parameter ν1, representing changes in both FD1ave and σFD1, for NS, LS and NSL are 70.8, 70.1 and 55.0, respectively. The average values of the parameter ν2, representing changes in FD2ave and σFD2, for NS, LS and NSL are 195.4, 169.4 and 173.8, respectively. Although for some speakers the reduction in the parameters ν1 and ν2 is observed to be larger for nonspeech-laugh than for laughed-speech, the changes in ν1 and ν2 are not consistent across all speakers.

Figure 6.12 Illustration of (i) the input acoustic signal (xin[n]) and (ii) a few peaks of the Hilbert envelope of the LP residual (hp), for three cases: (a) normal speech, (b) laughed-speech and (c) nonspeech-laugh.

6.5.5 Analysis of other production characteristics of laughter

Apart from the glottal excitation source and vocal tract system characteristics examined earlier in this section, the acoustic signal of laughter seems to carry some additional information that humans can perceive easily. This information may possibly be extracted from the LP residual [114] of the acoustic signal. This additional information about the production characteristics of laughter is derived from the Hilbert envelope [137] of the LP residual of the acoustic signal. Two features are extracted: the amplitude (hp) and the sharpness measure (η) of peaks in the Hilbert envelope (HE) of the LP residual at GCIs [170]. The sharpness measure (η) is observed to be useful in discriminating the NS, LS and NSL cases (a rough code sketch of this measure is given below). An illustration of peaks in the Hilbert envelope of the LP residual at GCIs is given in Fig. 6.12(a), Fig. 6.12(b) and Fig. 6.12(c), for the three cases NS, LS and NSL, respectively. Normalized values of the HE peaks (hp) are used. The peaks are narrower and sharper for nonspeech-laugh in comparison with laughed-speech calls. Also, the width of the peaks (near the half-height level) is relatively larger for NS than for LS/NSL calls.
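Anticipating the definition of η given in the next subsection (the ratio of standard deviation to mean of the normalized Hilbert-envelope values in a short window around each epoch, averaged over epochs, eq. 6.18), a rough sketch is given below. The LP order, the window half-lengths and the use of librosa.lpc for the LP coefficients are assumptions made only for this illustration.

import numpy as np
import librosa
from scipy.signal import hilbert, lfilter

def sharpness_eta(frame, epochs, fs, lp_order=10, l1_ms=1.0, l2_ms=1.0):
    # Sharpness measure of Hilbert-envelope peaks of the LP residual at epochs.
    # A lower eta indicates narrower (sharper) peaks.
    f = frame.astype(float)
    a = librosa.lpc(f, order=lp_order)        # LP coefficients, a[0] = 1
    residual = lfilter(a, [1.0], f)           # LP residual e[n]
    he = np.abs(hilbert(residual))            # Hilbert envelope of the residual
    l1, l2 = int(l1_ms * 1e-3 * fs), int(l2_ms * 1e-3 * fs)
    ratios = []
    for e in epochs:
        if he[e] == 0:
            continue
        win = he[max(0, e - l1): e + l2 + 1]
        hn = win / he[e]                      # normalise by the peak amplitude h_p
        ratios.append(hn.std() / hn.mean())
    return float(np.mean(ratios)) if ratios else 0.0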
The degree of sharpness of these peaks can be compared in terms of the sharpness measure η [170] N 1 X σhn (xw ) , i = 1, 2, ..., N (6.18) η= N µhn (xw ) xw =xi −l1 to xi +l2 i=1 where i is index of epoch within a laugh (NSL/LS) call or NS voiced region, and N is total number of epochs in the segment. Here, σhn and µhn are standard deviation and mean of the normalized values of 128 Table 6.6 Changes in average η and ση for laughed speech (LS) and nonspeech laugh (NSL) with reference to normal speech (NS). Columns (a)-(c) are average η, (d)-(f) are ση and (g)-(i) are average ξ values for NS, LS and NSL. Columns (j) and (k) are ∆ξ (%) for LS and NSL, respectively. Note: Si indicates speaker number (i = 1 to 11) and M/F male/female. (g) (h) (j) (k) (i) Speaker (a) (b) (c) (d) (e) (f) (M/F) ηN S ηLS ηN SL σηN S σηLS σηN SL ξN S ξLS ξN SL ∆ξLS ∆ξN SL S1 (M) S2 (M) S3 (F) S4 (M) S5 (M) S6 (F) S7 (M) S8 (M) S9 (F) S10(F) S11(M) .588 .574 .514 .575 .507 .530 .487 .579 .538 .533 .522 .559 .547 .511 .572 .505 .527 .518 .592 .523 .577 .569 .509 .516 .537 .499 .505 .546 .492 .513 .521 .539 .492 .137 .194 .124 .150 .139 .141 .117 .176 .153 .124 .139 .142 .169 .135 .154 .127 .128 .154 .155 .131 .147 .164 .160 .158 .139 .137 .110 .129 .129 .131 .117 .099 .135 Average 239 333 242 262 279 265 241 302 283 233 266 268 260 306 260 271 252 244 298 262 249 255 234 263 308 308 266 276 219 237 264 256 226 183 204 250 9 -8 7 3 -9 -8 24 -13 -12 9 -12 29 -7 10 5 -21 -10 10 -15 -20 -22 -23 Hilbert envelope (hn ), computed over a window (xw ) of size xi −l1 to xi +l2 located at xi for ith epoch. Normalized (hn ) values are obtained by dividing the hp values in the window at xi , by the amplitude at xi . A lower value of η indicates a comparatively sharper (i.e., less spread) peak. In Table 6.6, for each speaker, the average η (ηave ) values are given in columns (a), (b) and (c), for NS, LS and NSL cases, respectively, with corresponding standard deviation (ση ) in columns (d)-(f). Since ση reduces and ηave increases more for NSL than LS calls, a parameter ξ is computed as ξ= ση × 1000 ηave (6.19) where ηave and ση are average and standard deviation of η, computed for a call. The average values of parameter ξ for each speaker are given in columns (g), (h) and (i), for NS, LS and NSL, respectively. ξ −ξN S ξ −ξ × 100, Changes in ξ for LS and NSL cases, i.e., ∆ξLS = LSξ N S × 100 and ∆ξN SL = N SL ξN S NS are given (in %) in columns (j) and (k), respectively. The values are rounded to integers. The average ξ for NS, LS and NSL, are 268, 263 and 250, respectively. For most speakers, larger changes (mostly reduction) in ξ occur for nonspeech-laugh than for laughed-speech, with reference to normal speech. It indicates that peaks of Hilbert envelope of LP residual (at GCIs) are generally more sharp (less spread) for laughter (NSL/LS) calls than for normal speech. It is possible that this increased sharpness of HE peaks is related to faster rate of closing (abrupt closure) of the vocal folds, during production of laughter. 6.6 Discussion on the results Production characteristics of laughter are examined in this study using the EGG and acoustic signals. Following features are extracted: (i) source features α, F0 (i.e., F0ZF F ), dI and SoE (i.e., ψ), (ii) system 129 features FD1 and FD2 , and (iii) other production features hp and η. Parameters are derived from these features, that distinguish laughter (LS/NSL) calls and NS voiced regions. 
Parameters (βα, γ1, γ2, γ3, ν1, ν2 and ξ) capturing the degree of changes use the average values and standard deviations of all these features, whereas the temporal parameters (θ, φ and ρ) capture the intra-call temporal changes in the source features (F0, dI and SoE). The parameters derived in this study can be summarized as follows:

(i) parameter βα, derived from the closed phase quotient (α) using the EGG signal
(ii) parameters γ1 and θ, derived from F0 (i.e., F0ZFF) using the acoustic signal
(iii) parameters γ2 and φ, derived from the source feature dI (impulse density)
(iv) parameters γ3 and ρ, derived from the source feature SoE (i.e., ψ)
(v) parameters ν1 and ν2, derived from the system features FD1 and FD2
(vi) parameter ξ, derived from the other production feature η

Analysis of the EGG signal indicated that the closed phase quotient (α) within each glottal cycle is reduced for laughter, in comparison with normal speech (Table 6.1). The feature α is reduced more for nonspeech laugh than for laughed speech. Changes in the closed phase quotient (α) are reflected better in the parameter βα. Across all speakers, the increase in βα from normal speech (i.e., ∆βα) is larger for nonspeech laugh (Table 6.1). The reduction of the closed phase quotient (α) for laughter (NSL/LS) calls is possibly related to abrupt closure of the vocal folds, which is perhaps also reflected in the sharper HE peaks examined using the features hp and η in Section 6.5.5. Also, the glottal cycle period (T0) is reduced more for nonspeech laugh than for laughed speech, which causes the average F0EGG to increase more for nonspeech laugh (Table 6.1). Thus, analysis of the EGG signal highlights that significant changes in the characteristics of the glottal source of excitation indeed take place during the production of laughter.

Due to limitations in collecting the EGG signal, the production characteristics are also analysed from the acoustic signal. The excitation source characteristics are analysed in terms of the features F0, dI and SoE, derived from the acoustic signal using the modified ZFF method proposed in Section 6.4. The trends of changes in average F0EGG and F0ZFF for LS/NSL laughter relative to NS are quite similar in Tables 6.1 and 6.2. In a few cases the average values of F0ZFF are marginally higher than F0EGG. This may be due to the occasional presence of secondary excitation pulses that are otherwise suppressed well in the Hilbert envelope of the modified ZFF output signal used for computing F0ZFF. Theoretically, F0EGG should be more reliable than F0ZFF, and it is used as the ground truth reference. However, because the acoustic signal is easier to collect than the EGG signal, and because the other features are also derived from the same signal, F0ZFF is used as F0 in this study. Changes in F0 (i.e., F0ZFF), which are captured better by the parameter γ1, are larger for nonspeech laugh than for laughed speech, with reference to normal speech (Table 6.2). Interestingly, there is a gradual inter-call increasing/decreasing trend in the average α over successive calls in a (LS/NSL) laugh bout (Fig. 6.4). This inter-call rising/falling trend is also observed in the average F0 values for calls in a bout, which is in line with an earlier study [11].

In the production of laughter, there possibly exists amplitude modulation of some higher frequency content (around 500-1000 Hz) [159], as can be observed in the acoustic signal shown in Fig. 6.8(a). This higher frequency content is not noticeable in the EGG signal (Fig. 6.8(b)).
It is possibly related to the presence of secondary excitation pulses in each pitch period [67, 8]. The instants of this secondary impulse-like excitation in the case of laughter seem to appear as negative-to-positive going zero-crossings that are additional to the regular GCIs (epochs). These instants can be captured better by using a special signal processing technique such as the modified ZFF method (Fig. 6.8(c)). These additional zero-crossing instants can be exploited for discriminating laughter from normal speech, by using a feature dI that represents the density of excitation impulses located at all positive-going zero-crossings of the modZFF signal (zx[n]). Changes in the feature dI are examined for the three cases NS, LS and NSL. In general, the average dI (dIave), the intra-call fluctuations in dI (σdI) and a parameter γ2 representing changes in dI are observed to be higher for nonspeech laugh than for laughed speech (Table 6.3). Changes in the source characteristics are also examined in terms of the strength of excitation (SoE), i.e., the feature ψ. The parameter γ3, representing changes in SoE, shows larger changes for nonspeech laugh than for laughed speech, for most speakers (Table 6.4).

Temporal changes in the source characteristics are examined to validate the results. The temporal parameters θ, φ and ρ capture temporal changes in the features F0, dI and SoE (ψ), respectively. Parameter θ captures changes in the intra-call rising (or, in some cases, falling) gradient of the F0 contour, parameter φ the absolute rate of change in the density of excitation impulses (dI), and parameter ρ the (mostly rising) gradient of SoE, within each laugh call. Larger changes in the parameters θ, φ and ρ can be observed for nonspeech laugh than for laughed speech, with reference to normal speech, in Tables 6.2, 6.3 and 6.4, respectively. The similarity of the results in discriminating laughter and normal speech by two different approaches, i.e., using the parameters γ1, γ2 and γ3 and the temporal parameters θ, φ and ρ, validates the utility of these parameters as well as the results.

Associated changes in the vocal tract system characteristics during the production of laughter are examined in terms of features such as the dominant frequencies FD1 and FD2, derived from the acoustic signal using LP analysis and the group delay function. The distribution of FD2 vs FD1 (Fig. 6.11) highlights the ability of FD1 and FD2 to discriminate between NSL and LS bouts in some cases, similar to the formant clusters used in [11]. The parameters ν1 and ν2, derived using the averages and fluctuations of these features, show larger changes for nonspeech laugh than for laughed speech, for some speakers. However, the observations are not consistent across all speakers. Hence, there is scope for better features of the vocal tract system to help distinguish laughter from normal speech.

The additional information present in the acoustic signal consisting of laughter is examined using the Hilbert envelope (HE) of the LP residual of the signal. Two features, hp and η, are extracted, which measure the amplitude and sharpness of the HE peaks, respectively. The parameter ξ captures changes in the feature η. The larger changes in the parameter ξ observed for nonspeech laugh than for laughed speech (Table 6.6) are mostly in line with the changes in the parameters (βα, γ1, γ2 and γ3) for the source characteristics and the parameters (ν1 and ν2) for the system characteristics.
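As a concrete illustration of the impulse-density feature dI discussed above, a minimal numpy sketch is given below. It simply counts the positive-going zero-crossings of a modZFF signal per unit time for one voiced segment or call; the function names and the per-second normalization are assumptions for illustration only.

```python
import numpy as np

def impulse_density(zx, fs):
    """Impulse density d_I: number of negative-to-positive (positive-going)
    zero crossings of the modZFF signal zx[n], expressed per second."""
    crossings = np.where((zx[:-1] < 0) & (zx[1:] >= 0))[0]
    return len(crossings) * fs / len(zx)

def dI_stats(zx_calls, fs):
    """Average d_I and its fluctuation (sigma_dI) over a list of call segments."""
    d = np.array([impulse_density(z, fs) for z in zx_calls])
    return d.mean(), d.std()
```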
From all the parameters derived, it is observed that, in general, larger changes take place in the production characteristics for nonspeech laugh than for laughed speech, with reference to normal speech.

6.7 Summary

In this study, the production characteristics of laughter are examined from both EGG and acoustic signals. The speech-laugh continuum is analysed in three categories, namely, normal speech, laughed-speech and nonspeech-laugh. Data was collected by eliciting natural laughter responses. Three texts were used for comparing laughed speech and normal speech. Laughter data is examined at the call and bout levels. Only voiced cases of spontaneous laughter are considered. The excitation source features are extracted from both the EGG and acoustic signals. The vocal tract system features are extracted from the acoustic signal, along with some production characteristics related to both source and system. Parameters representing changes in these features are derived to distinguish the three cases.

The closed phase quotient (α) of glottal cycles and the instantaneous fundamental frequency (F0) are obtained from the analysis of the EGG signal. The average α reduces and F0 increases more for nonspeech laugh than for laughed speech, with reference to normal speech. The average values of α and F0 also exhibit some inter-call decreasing/increasing trend over (LS/NSL) laughter bouts. The excitation source characteristics are derived from the acoustic signal using the proposed modZFF method. In the acoustic signal of laughter, the likely presence of secondary impulse-like excitation is examined in terms of the density of impulses (dI) and the strength of excitation (SoE), derived from the modZFF signal (zmx[n]). Parameters βα, γ2 and γ3 represent the degree of changes in the source features α, dI and SoE, respectively. The results are validated using two temporal parameters, φ and ρ, derived from the source features dI and SoE, respectively. These parameters also discriminate well among the three cases (NS, LS and NSL). Changes in the vocal tract system characteristics are examined in terms of the first two dominant frequencies, FD1 and FD2, derived from the acoustic signal using LP analysis. These features discriminate between laughter and normal speech in some cases. Additional excitation information present in the acoustic signal of laughter is examined using the Hilbert envelope of the LP residual around epochs, in terms of the sharpness (η) of the HE peaks. The parameter ξ derived from the feature η helps in further discriminating the three cases.

In this study, unvoiced grunt-like or snort-like laughs are not examined. Also, changes in the vowel-like nature of different laughter types may be studied further. It would also be interesting to examine expressive voices, which are trained voices and involve voluntary control over the speech production mechanism. The production source characteristics of laughter examined in this study, and the parameters derived, would be useful in further developing systems for automatic detection of laughter in continuous speech.

Chapter 7

Analysis of Noh Voices

7.1 Overview

Production characteristics are changed under voluntary control of the speech production mechanism in the case of nonverbal emotional speech sounds, and involuntarily in the case of paralinguistic sounds. In the case of expressive voices such as opera or Noh singing, however, the trained voice is produced through voluntary control exercised by the artist, which is achieved after years of training and practice.
This chapter analyses the significance of the aperiodic component of excitation in contributing to the voice quality of expressive voices, in particular the Noh voice. The study highlights the feasibility of representing the excitation source characteristics in expressive voice signals through a time-domain sequence of excitation impulses, which is related to the pitch perception of aperiodicity. The aperiodic component is represented in terms of a sequence of impulses, with amplitudes representing the relative strengths of the impulses. The frequency characteristics of the impulse sequence explain the perception of pitch and its subharmonics. The availability of the aperiodic component in the form of a sequence of impulses with relative amplitudes helps in studying the significance of excitation in contributing to the quality of expressive voices, both by analysis and by synthesis. The role of amplitude/frequency modulation (AM/FM) in the excitation component of expressive voice signals is examined using synthetic AM/FM sequences. A signal processing method is proposed for deriving the impulse sequence representation of the excitation source information in expressive voices. Validation of the results is carried out using spectrograms, a pitch perception measure and signal synthesis. A method is also proposed for F0 extraction from expressive voices in the regions of harmonics/subharmonics and aperiodicity.

The chapter is organised as follows. Section 7.2 discusses issues in representing the excitation source component in expressive voices in terms of a time-domain sequence of impulses having amplitudes corresponding to the strength of excitation around the impulse locations. The analysis approach adopted in this study is discussed in Section 7.3. Section 7.4 discusses the proposed method of analysing the aperiodic components in terms of the saliency of pitch perception. The effect of rapid changes in pitch periods in expressive voices on saliency, and the effect of different window lengths for the trend removal operation in the ZFF method, are analysed for synthetic AM and FM sequences. Section 7.5 discusses the excitation impulse sequence representation for expressive voices, using the proposed modified zero-frequency filtering (modZFF) method, which minimizes the effect of the window length for the trend removal operation in the ZFF method. In Section 7.6, the characteristics of the aperiodic excitation for the segments of Noh voice studied by the XSX method [55] are examined using the proposed modZFF method. The perception of subharmonic characteristics and rapid variations in the source characteristics are examined using spectrograms of the source, represented in terms of an aperiodic sequence of excitation impulses. That the derived SoE impulse sequence represents the source characteristics only is ascertained by decomposing the speech signal into source and system characteristics. The results are validated by comparing the saliency plots with the results of the XSX-based method [55], and with the ground truth obtained from the LP residual. Section 7.7 discusses the significance of aperiodicity in expressive voices. Synthesis of the speech signal and F0 extraction, using the information of pitch perception in the case of expressive voices, are also demonstrated. Section 7.8 gives a summary and the research contributions of this chapter.
7.2 Issues in representing the excitation source in expressive voices

The zero-frequency filtering (ZFF) method is a simple and effective method for deriving the sequence of epochs and the relative strengths of the impulse-like excitation at the epochs [130, 216]. The method involves passing the differenced speech signal through a cascade of two zero-frequency resonators (ZFRs). Each ZFR is an ideal digital resonator with a pair of poles on the unit circle in the z-plane at 0 Hz. The effect of passing the signal through the cascade of ZFRs is equivalent to a successive integration operation, as shown in Fig. 7.1(b). The trend in the output is removed by subtracting the local mean computed over a window of length in the range of one to two pitch periods. The resulting signal is called the zero-frequency filtered (ZFF) signal. The instants of negative-to-positive zero crossings in the ZFF signal correspond to the glottal closure instants (GCIs), termed epochs [130]. The slope of the ZFF signal around the epochs is used to represent the relative strengths of the impulses at the epochs [130, 216], termed the relative strength of significant excitation (SoE). The steps involved in deriving the ZFF signal, epochs and strength of excitation [130] from the speech signal, for modal voicing, are discussed in Section 4.2.2.

The results of the ZFF method for three different window lengths, in terms of the extracted epochs and their strengths, are shown in Fig. 7.1 for a segment of voiced speech whose average pitch period is about 9 ms. Note that the epoch locations and their strengths remain the same for a range of window lengths within about one to two pitch periods. For smaller window lengths more epoch locations are identified, and some of them may be attributed either to excitation impulses with lower strengths or to spurious epochs. The choice of the window length is not critical if the changes in the pitch period are not rapid, which is the case for modal voicing. In cases where the pitch period changes rapidly, the ZFF method [130, 216] needs to be modified in order to capture the variations in the pitch period, as in the case of laughter [185]. In expressive voices also, there could be significant changes in the intervals of successive pitch periods.

An illustration of the effect of shorter window lengths for trend removal is shown in Fig. 7.1. The ZFF output signal along with the epochs (marked with downward arrows) is shown in Fig. 7.1(c), (e) and (g), for window lengths of 12 ms, 8 ms and 4 ms, respectively. The impulses at the epochs, with their respective strengths of excitation (SoE), are shown in Fig. 7.1(d), (f) and (h). It may be observed in Fig. 7.1(h) that shorter window lengths may highlight relatively more information, which may be useful for signals having rapid pitch variations, e.g., expressive voices.

Figure 7.1 Results of the ZFF method for different window lengths for trend removal, for a segment of voiced speech: (a) input voice signal, (b) ZF resonator output, (c), (e), (g) ZFF output and (d), (f), (h) SoE impulse sequence for window lengths of 12 ms, 8 ms and 4 ms, respectively. Epoch locations are indicated by inverted arrows.
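For reference, a minimal numpy/scipy sketch of the basic ZFF computation described above (for modal voicing) is given below. It is only an illustrative implementation under stated assumptions: the function names, the single trend-removal pass, and the simple two-sample slope estimate for SoE are choices made here, not the exact procedure of [130, 216].

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def zff(signal, fs, win_ms=10.0):
    """Zero-frequency filtering sketch: difference the signal, pass it through a
    cascade of two zero-frequency resonators (each a double integrator, i.e. four
    cumulative sums in total), then subtract a local mean over ~1-2 pitch periods."""
    x = np.diff(signal, prepend=signal[0])
    y = x.astype(float)
    for _ in range(4):                      # cascade of two ZFRs
        y = np.cumsum(y)
    n = max(1, int(round(win_ms * 1e-3 * fs)))
    return y - uniform_filter1d(y, size=2 * n + 1)   # trend removal

def epochs_and_soe(zff_out):
    """Epochs = negative-to-positive zero crossings of the ZFF signal;
    SoE = slope of the ZFF signal around each epoch."""
    z = zff_out
    idx = np.where((z[:-1] < 0) & (z[1:] >= 0))[0] + 1
    idx = idx[(idx > 0) & (idx < len(z) - 1)]
    soe = np.abs(z[idx + 1] - z[idx - 1]) / 2.0      # two-sample slope estimate
    return idx, soe
```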
At the same time, epochs derived using a small window length (less than one pitch period) are sometimes difficult to interpret, especially in the case of modal voicing, where the pitch does not vary rapidly. Some of these epochs could also be spurious, i.e., they may not actually correspond to instants of impulse-like excitation. Hence, there is a need to reduce the effect of spurious epochs that can occur when a small window length is used for the trend removal operation in the ZFF method.

7.3 Approach adopted in this study

In this study, we take a different approach for analysing the characteristics of the aperiodic components. The approach is based on representing the characteristics of the excitation signal in the time domain, in terms of a sequence of impulses and their relative strengths. The strengths of the impulses in the aperiodic excitation are shown as amplitudes of the nonzero sample values at the locations of the impulses, corresponding to the instants of significant excitation, or epochs. The irregular intervals between epochs, along with the variable strengths of the impulses, are used to characterize the unique excitation characteristics of expressive voices. The information extracted using the voice signal samples around epochs may reflect the net effect of the vocal tract system. The effects of the nonuniform intervals of the impulse locations, with nonuniform strengths, and the vocal tract system characteristics can be studied to examine which of these components contribute to the unique characteristics of expressive voices. The illustrations of Noh voice in [55] are considered for comparing the characteristics of the aperiodic components studied by the XSX method [55, 79, 81] with the proposed time-domain method.

The impulse sequence in the time domain is initially extracted using the zero-frequency filtering (ZFF) method [130, 216], which is suitable mainly for modal voicing. The effect of different window lengths for the trend removal operation in the ZFF method is examined. The impulse sequence for an aperiodic signal is analysed in terms of saliency [55, 79, 80], to signify the perceived pitch. The effect of different window lengths for trend removal on the derived sequence of excitation impulses at epoch locations is examined in terms of saliency, for two synthetic cases. Two different sequences of impulses, formed by amplitude modulation (AM) and frequency modulation (FM) of a unit impulse sequence, are used for studying this effect on saliency. In order to eliminate the need for selecting an appropriate window length, and also to minimize the effect of the window length on the derived impulse sequence for an aperiodic signal like Noh voice, a modified zero-frequency filtering (modZFF) method is proposed. Deriving the time-domain impulse sequence using the modZFF method also involves preprocessing of the input signal, the advantage of which is first validated by using the Hilbert envelope of the LP residual of the signal. The characteristics of aperiodicity in expressive voices are then examined in terms of the impulse sequence derived using the modZFF method and the saliency [55] computed for this derived impulse sequence. The instantaneous fundamental frequency (F0) for expressive voices is obtained from the saliency information. The results obtained by using the proposed signal processing methods are validated through saliency plots, spectrograms and visual comparison with the results obtained by the XSX-based TANDEM-STRAIGHT method [55, 79, 81]. An analysis-synthesis approach is adopted for further validation and application of the results.
7.4 Method to compute saliency of expressive voices

For aperiodic excitation signals, it is difficult to fix an appropriate window length for trend removal, as the intervals between successive epochs may vary rapidly and randomly. Moreover, due to aperiodicity, the perception of pitch is also difficult to express. The term "saliency" is used to express the significance of the perceived pitch [55]. In this study, we consider the autocorrelation function derived from the signal to express the saliency of the perceived pitch frequency. The autocorrelation function is computed using the inverse discrete Fourier transform (IDFT) [137] of the low-pass (cut-off frequency 800 Hz) filtered spectrum of a 40 ms Hann-windowed segment of the signal. The locations of the peaks in the normalized autocorrelation function (for lags > 0.5 ms) are used as estimates of the perceived pitch periods, and the magnitudes of the peaks are used to represent the saliency (importance) of the estimates. The magnitudes of the normalized autocorrelation function are displayed in terms of gray levels as a function of pitch frequency (1/τ) for each analysis frame, where τ is the time lag of the autocorrelation function. The gray-level display of saliency values as a function of frequency and analysis frame index (frame size = 40 ms and frame shift = 1 ms) gives a spectrogram-like display. The resulting plot is called the saliency plot. The following steps are used to obtain the 'saliency plot' for a signal:

1. Select a segment s_w[n] of 40 ms of the signal s[n] for every 1 ms.

2. Multiply the segment with a Hann window w_1[n].

3. Compute the squared magnitude of the short-time DFT [137] (X_w[k]) of the Hann-windowed segment x_w[n], after appending a sufficiently large number of zeros to obtain adequate samples in the frequency domain. It can be expressed as

X_w[k] = \sum_{n=0}^{N-1} x_w[n] \exp\left(\frac{-j 2\pi n k}{N}\right) \qquad (7.1)

where x_w[n] = s_w[n] · w_1[n] and N is the number of points in the DFT. Here, N is a power of 2, and is taken sufficiently large.

4. Multiply the spectrum X_w[k] with a (half-Hamming) window function W_2[k] (in the frequency domain) to obtain an approximate low-pass (< 800 Hz) filtered spectrum (X_{w2}[k]).

5. Compute the inverse DFT (IDFT) [137] of the filtered spectrum (X_{w2}[k]) to obtain the autocorrelation function r[τ]. It can be expressed as

r[\tau] = \frac{1}{N} \sum_{k=0}^{N-1} |X_{w2}[k]|^2 \exp\left(\frac{j 2\pi \tau k}{N}\right) \qquad (7.2)

where X_{w2}[k] = X_w[k] · W_2[k] and N, which is a power of 2, is the number of points in the IDFT.

6. Use the normalized autocorrelation function r[τ] from lag τ = 0.5 ms to τ = 40 ms, expressed as a function of frequency (1/τ).

7. Plot the amplitudes of the autocorrelation function r[τ] as a function of the inverse of the time lag τ (i.e., frequency 1/τ), as gray levels for each analysis frame (frame rate = 1000 frames per second). The resulting plot is the saliency plot. A code sketch of these steps is given below.

Saliency plots are useful for studying the effects of window lengths on the extracted epoch sequences. In order to study the effect of the window size for the trend removal operation in the ZFF method [130, 216] on the estimated locations of epochs, the epoch locations are extracted for two cases of synthetic aperiodic impulse sequences: (i) an amplitude modulated (AM) pulse train and (ii) a frequency modulated (FM) pulse train. Fig. 7.2(a) shows the saliency plot of the AM pulse sequence shown in Fig. 7.2(b), whose base fundamental frequency is 160 Hz and whose subharmonic components are at 80 Hz, i.e., at F0/2.
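A minimal numpy sketch of steps 1-7 above is given below. The FFT size, the construction of the half-Hamming low-pass weighting W_2[k], and the way the columns are stacked for display are simplified assumptions made for illustration; they are not the exact settings used in the study.

```python
import numpy as np

def saliency_frame(segment, fs, nfft=8192, cutoff_hz=800.0):
    """One column of the saliency plot (steps 2-6) for a 40 ms segment."""
    xw = segment * np.hanning(len(segment))                  # step 2: Hann window
    spec = np.abs(np.fft.rfft(xw, n=nfft)) ** 2              # step 3: squared magnitude spectrum
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    w2 = np.zeros_like(spec)                                 # step 4: half-Hamming low-pass weighting
    k = np.searchsorted(freqs, cutoff_hz)
    w2[:k] = np.hamming(2 * k)[k:]                           # falling half of a Hamming window
    r = np.fft.irfft(spec * w2, n=nfft)                      # step 5: autocorrelation via IDFT
    r = r / (r[0] + 1e-12)                                   # step 6: normalize
    lags = np.arange(nfft) / fs
    keep = (lags >= 0.5e-3) & (lags <= 40e-3)
    return 1.0 / lags[keep], r[keep]                         # (frequency = 1/lag, saliency)

def saliency_plot(signal, fs, frame_ms=40.0, hop_ms=1.0):
    """Stack the per-frame saliency columns (step 7); the result can be shown
    as a gray-level image against frame index and frequency."""
    n, hop = int(frame_ms * 1e-3 * fs), int(hop_ms * 1e-3 * fs)
    cols = [saliency_frame(signal[i:i + n], fs)[1]
            for i in range(0, len(signal) - n, hop)]
    return np.array(cols).T
```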
Fig. 7.3 shows the saliency plots and the corresponding epoch sequences derived by using the ZFF method on the sequence in Fig. 7.2(b), for different window lengths for trend removal. Since the intervals between successive pulses are nearly the same, the ZFF method gives the epoch locations correctly when the window length for trend removal is in the range of one to two periods, as in Fig. 7.3(b) for the (fixed) window size of 7 ms.

Figure 7.2 (a) Saliency plot of the AM pulse train and (b) the synthetic AM sequence.

Figure 7.3 Saliency plots ((a),(c),(e),(g)) of the synthetic AM pulse train and the epoch sequences ((b),(d),(f),(h)) derived by using different window lengths for trend removal: 7 ms ((a),(b)), 3 ms ((c),(d)) and 1 ms ((e),(f)). Panels (g) and (h) show the saliency plot for the AM sequence and the cleaned SoE sequence for the 1 ms window length, respectively.

Figure 7.4 (a) Saliency plot of the FM pulse train and (b) the synthetic FM sequence.

Figure 7.5 Saliency plots ((a),(c),(e),(g)) of the synthetic FM pulse train and the epoch sequences ((b),(d),(f),(h)) derived by using different window lengths for trend removal: 7 ms ((a),(b)), 3 ms ((c),(d)) and 1 ms ((e),(f)). Panels (g) and (h) show the saliency plot for the FM sequence and the cleaned SoE sequence for the 1 ms window length, respectively.

It is interesting to note that for the aperiodic sequence derived using a small window length (1 ms), the saliency plot (Fig. 7.3(e)) matches well with the original (Fig. 7.2(a)). This indicates that the epoch sequence, along with the respective strengths, can be derived using a smaller window length for the trend removal. However, many spurious epochs with smaller strengths appear (Fig. 7.3(f)). Some of these can be eliminated by retaining the epoch with the largest strength within a 1 ms interval (arbitrarily chosen) of the current epoch, which gives the cleaned SoE sequence (a sketch of this cleaning step is given below). The resultant cleaned epoch sequence (Fig. 7.3(h)) matches well with the original AM sequence in Fig. 7.2(b). The saliency plot (Fig. 7.3(g)) of the epoch sequence in Fig. 7.3(h) also matches well with that of the original sequence (Fig. 7.2(a)). For aperiodic signals such as the AM sequence, longer window lengths for trend removal do not give a proper epoch sequence, as can be seen by comparing Fig. 7.3(b) and 7.3(d) with Fig. 7.2(b).
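One simple greedy variant of the cleaning step described above is sketched below: within any group of epochs closer together than the chosen interval, only the epoch with the largest strength is retained. The exact grouping rule used in the study may differ; this is only an illustration under that assumption.

```python
import numpy as np

def clean_soe(epochs, soe, fs, min_gap_ms=1.0):
    """Keep only the strongest epoch within any run of epochs spaced closer
    than min_gap_ms; epochs are sample indices, soe their strengths."""
    epochs, soe = np.asarray(epochs), np.asarray(soe)
    gap = min_gap_ms * 1e-3 * fs
    keep, best = [], 0
    for i in range(1, len(epochs) + 1):
        if i == len(epochs) or epochs[i] - epochs[best] > gap:
            keep.append(best)        # close the current group, keep its strongest epoch
            best = i
        elif soe[i] > soe[best]:
            best = i                 # a stronger epoch within the same group
    return epochs[keep], soe[keep]
```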
Removing the spurious epochs from Fig. 7.3(d) is also more difficult than removing those in Fig. 7.3(f). Fig. 7.4 shows a synthetic FM pulse sequence (base fundamental frequency = 160 Hz) and the corresponding saliency plot. Note that, since the intervals between epochs decrease with time, the saliency at 160 Hz appears to split into two diverging lines. Hence, for extracting very small intervals, as in the region 0.4 sec to 0.45 sec, it is necessary to use a small window length for trend removal. Fig. 7.5 shows the extracted epoch sequences in (b), (d) and (f), using three different window lengths for trend removal. The corresponding saliency plots are shown in (a), (c) and (e). It is clear that the use of a 1 ms window for trend removal produces all the epochs correctly, as shown in Fig. 7.5(f). But there are many spurious epochs, which can be removed by retaining the epoch with the highest strength within a 1 ms (arbitrarily chosen in this case) interval of the current epoch. The resulting cleaned epoch sequence and the corresponding saliency plot are shown in Fig. 7.5(h) and (g), respectively.

7.5 Modified zero-frequency filtering method for analysis of Noh voices

7.5.1 Need for modifying the ZFF method

As discussed in Section 3.5, the characteristics of the excitation signal can be represented in the time domain in terms of the locations of impulses and their relative strengths, i.e., the epoch sequence. The ZFF method [130, 216] used for deriving the impulse sequence representation for modal voicing has two limitations when applied to expressive voices: (i) a shorter window length would be required for trend removal for higher F0 [159], and (ii) the impulse sequence for aperiodic signals is affected by the choice of window length for trend removal. The analysis of the epoch sequence representation of aperiodic signals in terms of saliency, using synthetic AM/FM pulse trains in Section 7.4, establishes two points: (i) shorter window lengths bring out the information better for nonverbal sounds with a high degree of aperiodicity, and (ii) some of the additional zero-crossings obtained by using short window lengths may be spurious.

First, we examine whether these additional zero-crossings obtained by using short window lengths are related more to the excitation source component or to the vocal tract system. The signal (s[n]) is downsampled to 8000 Hz and 14th-order LP analysis is carried out. In order to suppress the system component, the linear prediction (LP) residual [112, 114, 153] e[n] (= x[n] − x̂[n]) is obtained as the difference of the downsampled signal x[n] and the predicted signal x̂[n] (using the LP coefficients {ak}). Then, the excitation source component in the LP residual (e[n]) is highlighted by taking its Hilbert envelope (he[n]) [153, 137].

Figure 7.6 Illustration of waveforms of (a) input speech signal, (b) LP residual, (c) Hilbert envelope of LP residual and (d) modZFF output, and (e) the SoE impulse sequence derived using the modZFF method. The speech signal is a segment of Noh singing voice used in Fig. 3 in [55].
This signal he[n], now carrying predominantly the excitation source information, is used as input to the proposed modified zero-frequency filtering (modZFF) method. Both limitations of the ZFF method for expressive voices, mentioned above, are addressed in the modZFF method by using gradually reducing window lengths for the trend removal operation. The trend is removed coarsely first, using window lengths of 20 ms, 10 ms and 5 ms. Then window lengths of 3 ms, 2 ms and 1 ms are used successively, to capture the finer variations in the excitation component. The impulse sequence is then obtained from the resultant modZFF output, whose positive-to-negative going zero-crossings give the impulse locations and whose slope around the zero-crossings gives the amplitudes (SoE). Fig. 7.6 is an illustration of the signal (s[n]), the LP residual (e[n]), the Hilbert envelope of the LP residual (he[n]), the modZFF output (zh[n]) and the SoE impulse sequence (ψ[n]) for a segment of Noh voice [55]. It may be observed in Fig. 7.6(b), (c) and (e) that the impulses of larger amplitude coincide with instants of significant excitation, which may possibly correspond to the glottal closure instants (GCIs), i.e., epochs. The impulses of smaller amplitude are located at intermediate points between two epochs. Since the vocal tract system component was substantially suppressed by taking first the LP residual and then its Hilbert envelope, these impulses of smaller amplitude most likely correspond to the excitation component, and not the system, though it is also possible that some of them are spurious.

Figure 7.7 Illustration of waveforms of (a) input speech signal, (b) preprocessed signal and (c) modZFF output, and (d) the SoE impulse sequence derived using the modZFF method. The speech signal is a segment of Noh singing voice used in Fig. 3 in [55].

A closer observation of the LP residual, the modZFF output and the impulse sequence representation obtained for different segments of Noh voice indicated that the excitation component is unlikely to be present beyond 1000 Hz. Hence, a preprocessing step prior to the ZFF step is proposed, in place of computing the LP residual and its Hilbert envelope. The preprocessing step involves downsampling the signal to 8000 Hz, smoothing over m sample points so as to obtain the equivalent effect of low-pass filtering with a cut-off frequency (Fc) around 1000 Hz, and then upsampling back to the original sampling frequency (fs) of the signal. The modZFF output is then obtained by performing the ZFF steps and the trend removal operation (using gradually reducing window lengths) on this preprocessed signal. In Fig. 7.7, an illustration of the signal (s[n]), the preprocessed signal (sp[n]), the modZFF output (zm[n]) and the SoE impulse sequence (ψ[n]) is shown for the same segment of Noh voice as in Fig. 7.6. The impulse sequence (ψ[n]) in Fig. 7.7(d) has the same or a smaller number of impulses than the impulse sequence obtained by using the Hilbert envelope of the LP residual as input (Fig. 7.6(e)). A visual comparison between zh[n] in Fig. 7.6(d) and zm[n] in Fig. 7.7(c) also indicates a reduction in the number of impulses with the preprocessing step, as can be seen in Fig. 7.7(d) as compared to Fig. 7.6(e). This is most likely due to a reduction in spurious impulses.
7.5.2 Key steps in the modZFF method

The key steps in the proposed modified zero-frequency filtering method can be summarized as follows (a code sketch of these steps is given below):

1. Preprocess the input signal (s[n]) by downsampling it to 8000 Hz, smoothing over m sample points to obtain the equivalent effect of low-pass filtering with a cut-off frequency (Fc) around 1000 Hz, and then upsampling back to the original sampling frequency (fs) of the signal.

2. Compute the differenced signal (x̃[n]) from the preprocessed signal (sp[n]), to obtain a zero-mean signal.

3. Pass the differenced signal (x̃[n]) through a cascade of two zero-frequency resonators (ZFRs), as in (8.1).

4. Remove the trend in the output of the cascaded ZFRs (ỹ1[n]), coarsely first by using the gradually reducing window lengths 20 ms, 10 ms and 5 ms in successive stages, and then using the smaller window lengths 3 ms, 2 ms and 1 ms successively, to better highlight the information related to aperiodicity. In each trend removal stage, implemented as in (8.2), the window has 2N + 1 sample points. The final output is called the modified zero-frequency filtered (modZFF) signal (zm[n]). An illustration of the modZFF output signal (zm[n]) is shown in Fig. 7.7(c) for a segment of Noh voice.

5. The positive-to-negative going zero-crossings of the modZFF signal (zm[n]) give the locations of the impulses. The slope of the modZFF signal (zm[n]) around each of these locations indicates the strength of excitation (SoE) of the impulse around that location. An illustration of the SoE impulse sequence (ψ[n]) is shown in Fig. 7.7(d).

Figure 7.8 Illustration of waveforms of the speech signal (in (a)), modZFF outputs (in (b),(d),(f),(h)) and SoE impulse sequences (in (c),(e),(g),(i)), for last window lengths of 2.5 ms, 2.0 ms, 1.5 ms and 1.0 ms. The speech signal is a segment of Noh voice used in Fig. 3 in [55].

7.5.3 Impulse sequence representation of source using modZFF method

The advantage of the modZFF method is its ability to derive an impulse sequence that represents the excitation component of an aperiodic signal. The main issue remaining to be addressed is what the last window length in the trend removal operation should be.
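A minimal numpy/scipy sketch of steps 1-5 above is given below. The smoothing length used in the preprocessing (a 5-point moving average at 8 kHz, approximating a cut-off around 1 kHz), the use of uniform_filter1d for the local-mean trend removal, the assumption of an integer sampling rate, and the two-sample slope estimate for SoE are all illustrative choices, not the exact implementation of the proposed method.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d
from scipy.signal import resample_poly

def modzff(signal, fs, last_win_ms=1.0):
    """Sketch of the modZFF method: preprocessing, cascade of two zero-frequency
    resonators, multi-stage trend removal, and impulse/SoE extraction."""
    # Step 1: preprocessing ~ low-pass around 1000 Hz (fs assumed integer)
    x8 = resample_poly(signal, 8000, fs)
    x8 = uniform_filter1d(x8, size=5)            # crude m-point smoothing at 8 kHz
    sp = resample_poly(x8, fs, 8000)
    # Step 2: differenced signal
    x = np.diff(sp, prepend=sp[0])
    # Step 3: cascade of two ZFRs (each a double integrator)
    y = x
    for _ in range(4):
        y = np.cumsum(y)
    # Step 4: trend removal with gradually reducing window lengths
    for w_ms in (20, 10, 5, 3, 2, last_win_ms):
        n = max(1, int(round(w_ms * 1e-3 * fs)))
        y = y - uniform_filter1d(y, size=2 * n + 1)
    # Step 5: positive-to-negative zero crossings give impulse locations; slope gives SoE
    idx = np.where((y[:-1] > 0) & (y[1:] <= 0))[0] + 1
    idx = idx[(idx > 0) & (idx < len(y) - 1)]
    soe = np.abs(y[idx + 1] - y[idx - 1]) / 2.0
    return y, idx, soe
```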
A related question is how sensitive the locations of the zero-crossings of the modZFF output (zm[n]) are to the choice of this last window length. In order to examine this, the trend removal operation was performed for a Noh voice segment in four different iterations, using a last window length of 2.5 ms, 2.0 ms, 1.5 ms or 1.0 ms, respectively. Fig. 7.8 shows an illustration of the input speech signal (s[n]) and the modZFF outputs (zmj[n], j = 1, 2, 3, 4) obtained by using these last window lengths. The corresponding SoE impulse sequence (ψj[n]) is also shown for each case. It is interesting to observe from Fig. 7.8 that the locations of the zero-crossings of the modZFF output, and also of the impulses having larger amplitudes, are nearly the same for last window lengths (wlast) in the range of 1.0 ms to 2.5 ms. The number of impulses, though, is marginally increased for some segments when shorter last window lengths are used. It may be inferred that the locations and amplitudes of the impulses in the SoE sequence obtained by the modZFF method are not sensitive to the choice of the last window length in the 1.0 ms to 2.5 ms range.

The main advantage of using the preprocessing step in the modZFF method is the smaller number of impulses obtained with preprocessing (Nwpp), in comparison with those obtained without any preprocessing (Norig), i.e., using the original signal itself. This reduction in the number of impulses resulting from the preprocessing step is possibly related to a reduction in spurious impulses. In Table 7.1, the number of impulses obtained by the modZFF method with/without preprocessing is given for three different segments of Noh voice [55]. The number of impulses obtained without preprocessing (Norig) and with preprocessing (Nwpp) are given in columns (b) and (c), respectively, for the choice of last window length (wlast) in column (a). The percentage difference in the number of these impulses relative to Norig, i.e., ∆Nimps = ((Norig − Nwpp)/Norig) × 100%, is given in column (d).

Table 7.1 Effect of preprocessing on the number of impulses: (a) last window length wlast (ms); number of impulses obtained (b) without preprocessing (Norig) and (c) with preprocessing (Nwpp); and (d) the difference ∆Nimps = ((Norig − Nwpp)/Norig) × 100%. The three Noh voice segments correspond to Figures 6, 7 and 8 in [55].

Voice segment                 (a) wlast (ms)   (b) Norig   (c) Nwpp   (d) ∆Nimps (%)
Segment 1 (Fig. 6 in [55])         2.5            112         111         0.89
                                   2.0            120         115         4.17
                                   1.5            133         130         2.26
                                   1.0            137         135         1.46
                                   0.5            152         147         3.29
Segment 2 (Fig. 7 in [55])         2.5            182         172         5.49
                                   2.0            201         195         2.99
                                   1.5            210         209         0.48
                                   1.0            211         210         0.47
                                   0.5            375         220        41.33
Segment 3 (Fig. 8 in [55])         2.5            300         270        10.00
                                   2.0            319         316         0.94
                                   1.5            326         325         0.31
                                   1.0            327         327         0.00
                                   0.5            381         353         7.35

Fig. 7.9 shows the results given in Table 7.1, obtained by plotting the percentage difference (∆Nimps) against the choice of the last window length (wlast), and fitting a 4th-order polynomial curve to the data points. Table 7.1 and Fig. 7.9 indicate that the difference in the number of impulses obtained with/without preprocessing in the modZFF method is near its minimum for a last window length of 1 ms (i.e., wlast = 1 ms), for each of the three signal segments considered.

Figure 7.9 Selection of the last window length: difference ∆Nimps (%) in the number of impulses obtained with/without preprocessing vs the choice of last window length wlast (ms), for three different segments of Noh singing voice [55]. [Solid line: segment 1, dashed line: segment 2, dotted line: segment 3]

Figure 7.10 (a) Saliency plot and (b) the SoE impulse sequence derived using the modZFF method (last window length = 1 ms), for the input (synthetic) AM sequence.

The ability of the modZFF method to give an impulse sequence representation of the excitation component is verified for the synthetic AM/FM pulse trains (see Section 7.4). Fig. 7.10 shows the saliency plot of the impulse sequence derived using the modZFF method for the synthetic AM sequence. Likewise, Fig. 7.11 shows the saliency plot of the impulse sequence derived using the modZFF method for the FM sequence.
It is interesting to observe that the derived impulse sequence and the saliency plot (Fig. 7.10) are very close to those of the original AM sequence (Fig. 7.2). Similarly, the derived impulse sequence and saliency (Fig. 7.11) are quite similar to those shown for the original FM sequence (Fig. 7.4). The similarity between the original AM/FM sequences (Fig. 7.2 and Fig. 7.4) and those derived using the modZFF method (Fig. 7.10 and Fig. 7.11), together with the similarity of the respective saliency plots, thus validates the proposed modZFF method and indicates its usefulness in capturing aperiodicity in expressive voices.

Figure 7.11 (a) Saliency plot and (b) the SoE impulse sequence derived using the modZFF method (last window length = 1 ms), for the input (synthetic) FM sequence.

Figure 7.12 (a) Signal waveform and spectrograms of (b) the signal, (c) its LP residual and (d) the SoE impulse sequence obtained using the modZFF method, for a Noh voice segment corresponding to Fig. 1 in [55].

7.6 Analysis of aperiodicity in Noh voice

The feasibility of representing the information of the excitation source through a sequence of impulses was examined in the previous sections. It is premised that the locations and amplitudes of the impulses in this sequence also carry the perceptually significant information of aperiodicity in expressive voices. Hence, it is necessary to first verify whether the impulse sequence derived using the modZFF method really represents the excitation source component, or whether it also carries information about the vocal tract system. After ascertaining this, we then examine the aperiodicity in expressive voices later in this section.

7.6.1 Aperiodicity in source characteristics

The aperiodic impulse sequences with relative amplitudes (i.e., SoE), representing the excitation source characteristics, can be obtained for different segments of Noh voice signals by using the modZFF method. The spectrograms of these epoch sequences reflect the aperiodic characteristics better than the spectrogram of the signal waveform, as the effect of the resonances of the vocal tract is suppressed in them. The spectrogram of the epoch sequence also highlights excitation characteristics such as harmonics, subharmonics, pitch modulations, and pitch rise and fall better than the spectrogram of the signal. An important feature of the spectrogram of an aperiodic signal is that the spectral features of the aperiodicity are mixed with the spectral features of the vocal tract system. Hence, it is usually difficult to identify the formant features corresponding to the vocal tract shape in the spectrogram. The linear prediction (LP) residual [112] of the speech signal is expected to reflect mainly the source characteristics [112, 114, 153].

Figure 7.13 (a) Signal waveform and spectrograms of (b) the signal, (c) its LP residual and (d) the SoE impulse sequence obtained using the modZFF method, for a Noh voice segment corresponding to Fig. 2 in [55].
The spectrogram of the LP residual may thus be used as a reference for representing the excitation source characteristics, although it also shows some vocal tract system features. Illustrations of the spectrograms of the speech signal, the LP residual and the SoE impulse sequence are shown in Fig. 7.12(b), (c) and (d), respectively. The spectrogram of the SoE impulse sequence in Fig. 7.12(d) displays the excitation source characteristics clearly. It may also be observed, in the region of 9-10 sec, that the features of harmonics and subharmonics change temporally. The overall spectral characteristics due to the aperiodic components are quite distinct from the spectral characteristics of modal voicing (compare the region of 9-10 sec with the region around 10-11 sec in Fig. 7.12(b) and (d)). It is likely that the nature and extent of aperiodicity differ across short segments, as shown in the spectrogram of the SoE impulse sequence in Fig. 7.12(d). Spectrograms obtained in a similar fashion for the SoE impulse sequences derived using the modZFF method, for the other two segments of Noh singing voice (corresponding to Fig. 2 and Fig. 3 in [55]), are shown in Fig. 7.13 and Fig. 7.14.

Figure 7.14 (a) Signal waveform and spectrograms of (b) the signal, (c) its LP residual and (d) the SoE impulse sequence obtained using the modZFF method, for a Noh voice segment corresponding to Fig. 3 in [55].

A visual comparison between the spectrograms of the signal (Fig. 7.12(b)-Fig. 7.14(b)) and of the LP residual (Fig. 7.12(c)-Fig. 7.14(c)) indicates that the LP residual also carries information about the vocal tract system to some extent. The vocal tract system information is completely suppressed in the spectrograms of the corresponding SoE impulse sequences. This is evident from a comparison of the spectrograms of the LP residuals (Fig. 7.12(c)-Fig. 7.14(c)) and of the SoE impulse sequences (Fig. 7.12(d)-Fig. 7.14(d)). The dark broader contours visible in the spectrograms of the signal (Fig. 7.12(b)-Fig. 7.14(b)) indeed carry the system information, as is shown later in this section.

7.6.2 Presence of subharmonics and aperiodicity

The regions of aperiodicity in expressive voices, such as the regions between 9-10 sec in Fig. 7.12, 8.9-9.5 sec and 13-14.5 sec in Fig. 7.13, and 8.0-9.5 sec and 13-14.5 sec in Fig. 7.14, possibly also contain subharmonics. In order to confirm this, the spectrograms expanded in the frequency range 0-800 Hz were examined more closely. An illustration of the expanded spectrograms of the signal and the SoE impulse sequence for the Noh voice signal, in the region between 13-14.5 sec in Fig. 7.14 and for the frequency range 0-800 Hz, is shown in Fig. 7.15(b) and (c), respectively. The differences between the two spectrograms in Fig. 7.15 can be observed in three distinct regions: (i) R1: 13.8-14.2 sec, (ii) R2: 14.25-14.45 sec and (iii) R3: 14.45-14.65 sec. The differences in these regions are more visible in the spectrogram of the SoE impulse sequence (in (c)). In the first region R1 (13.8-14.2 sec) in Fig. 7.15(c), the presence of periodicity, and thereby harmonics, is indicated by the regularity of the harmonic peaks.
In the second region R2 (14.25-14.45 sec), subharmonics are clearly visible around 100 Hz. In the third region R3 (14.45-14.65 sec), there is neither periodicity nor harmonics/subharmonics; the signal exhibits randomness and noise-like behaviour in this region. Hence, the second region, and possibly the third region also in some cases, may be called regions of 'aperiodicity'. The first region R1 is a region of 'periodicity'. It is interesting to note that the presence of subharmonics, indicated by a dark band around 100 Hz in the region R2 (14.25-14.45 sec), is highlighted better in the spectrogram of the SoE impulse sequence. Similar regions of aperiodicity are also observed in the other two segments of Noh voice, corresponding to Fig. 7.12 and Fig. 7.13. Hence, it may be inferred that regions of aperiodicity in expressive voices can indeed be analysed better from the source characteristics, as represented by the SoE impulse sequence derived using the modZFF method.

Figure 7.15 Expanded (a) signal waveform, and spectrograms of (b) the signal and (c) its SoE impulse sequence obtained using the modZFF method, for the Noh voice segment corresponding to Fig. 3 in [55]. The regions R1, R2 and R3 are marked along the time axis.

7.6.3 Decomposition of signal into source and system characteristics

Decomposition of speech signals for the analysis of aperiodic components of excitation was proposed in [214], using the ZFF method [130, 216], in which the vocal tract system characteristics were derived using group delay [128, 129]. In another study [81, 78, 79], the system characteristics were derived using the TANDEM-STRAIGHT method, with the XSX method [55] for extracting pitch information. The SoE impulse sequence obtained in the previous sections represents the excitation source and carries the aperiodicity information. But is this impulse sequence representation enough to derive the characteristics of both the source and the system? Also, do the dark broader contours visible in the spectrograms of the signal (Fig. 7.12(b)-Fig. 7.14(b)), and partially visible in the spectrograms of the LP residual (Fig. 7.12(c)-Fig. 7.14(c)), really pertain to the system component? We examine both of these questions now.

Figure 7.16 (a) Signal waveform, and spectrograms of (b) the signal and its decomposition into (c) source characteristics and (d) system characteristics, for a Noh voice segment corresponding to Fig. 3 in [55].

Assuming that the speech signal s[n] can be decomposed into the excitation source characteristics es[n] (i.e., the epoch sequence) and the vocal tract system characteristics hs[n] (i.e., the filter characteristics), the signal can be broadly represented as the convolution of the two, according to the source-filter model [153]. The power spectrum of the signal is given by

P_T(\omega) = E_s(\omega)\, H_s(\omega) \qquad (7.3)

where P_T(ω), E_s(ω) and H_s(ω) are the power spectra of the signal, the epoch sequence and the vocal tract system, respectively.
Using the discrete frequency-domain representations and taking only the magnitude spectrum part, P_T(ω) corresponds to |S[k]|^2, E_s(ω) to |E_s[k]|^2 and H_s(ω) to |H_s[k]|^2, where S[k], E_s[k] and H_s[k] are the DFTs [137] of s[n], es[n] and hs[n], respectively. Using these relations and (8.5), the magnitude spectrum of the system is given by

|H_s[k]|^2 = \frac{|S[k]|^2}{|E_s[k]|^2} \qquad (7.4)

The system characteristics hs[n] can be obtained by taking the IDFT [137] of H_s[k]. Hence, the vocal tract system characteristics can be broadly obtained from the signal spectrum, given the excitation source characteristics. In the illustration shown in Fig. 7.16(d), the wideband spectrogram of the vocal tract system characteristics is derived from the spectrograms of the signal and the source characteristics (by using (8.6)). The corresponding Noh voice signal and its wideband spectrogram are shown in Fig. 7.16(a) and (b), respectively. A narrowband spectrogram of the same signal segment can be seen in Fig. 7.14(b). For better clarity, the narrowband spectrogram of the source characteristics (i.e., the SoE impulse sequence derived using the modZFF method) is shown in Fig. 7.16(c). The visual similarity between the wideband spectrograms of the signal (Fig. 7.16(b)) and of the system (Fig. 7.16(d)) indicates that the spectrogram in Fig. 7.16(d) represents mainly the system characteristics. It also indicates that the dark broader contours in Fig. 7.16(b) (and in Fig. 7.14(b)) indeed represent the formant contours, which are suppressed in all the spectrograms of the SoE impulse sequences shown in Fig. 7.12(d)-Fig. 7.14(d). The spectrogram of the excitation component in Fig. 7.16(c) can be contrasted with that of the vocal tract system in Fig. 7.16(d). The system characteristics derived in a similar fashion for the other segments of Noh voice also exhibit a similar distinction between the spectrograms of the excitation source characteristics and the vocal tract system characteristics.

7.6.4 Analysis of aperiodicity using saliency

In this subsection, we validate the ability of the impulse sequence representation of the excitation source information to capture the perceptually significant pitch information. The representation in the form of SoE impulse sequences with different intervals and amplitudes is validated by using the saliency measure [55]. Saliency plots are first obtained for different segments of the Noh voice signal by using the LP residual, which can be used as a reference or ground truth. The saliency plots obtained by using the SoE impulse sequences are then compared with these reference plots, and also with the saliency plots obtained by using the XSX method [55].

Aperiodicity in expressive voices is analysed in terms of saliency, which is computed from the SoE impulse sequence representation of the excitation. Fig. 7.17(a) is the saliency plot for a segment of the Noh voice signal (in the region 9.4-9.8 sec in Fig. 7.12(b)), computed from its LP residual obtained by 14th-order LP analysis [112, 114, 153] of the signal downsampled to 8000 Hz. This saliency plot may be considered a reference, since it is computed from the derived excitation component of the signal. It may be noted that the higher harmonics (≥ 300 Hz) are not very visible in this saliency plot. Fig. 7.17(b) is the saliency plot obtained by the XSX method [55, 81], which is meant to represent the perceptually significant pitch information, as discussed in [55, 80].
It may be observed that Fig. 7.17(b) captures the prominent features (indicated by darker lines) of the saliency plot in Fig. 7.17(a) computed from the LP residual. Fig. 7.17(c) is the saliency plot computed from the SoE impulse sequence derived by using the modZFF method. The saliency plot in Fig. 7.17(c) (especially the dark bands indicating large saliency) matches well with those in Fig. 7.17(a) and (b). It may be observed that in Fig. 7.17(c), both the prominent features visible in the reference plot (Fig. 7.17(a)) and the higher harmonics visible in the results of the XSX-based method (Fig. 7.17(b)) can be seen clearly. The frequencies of large saliency may be interpreted as the perceived pitch harmonics and subharmonics. However, due to the large deviations from periodicity in the epoch (or SoE impulse) sequence, it is difficult to interpret the resulting perception as a small deviation from periodicity. It is indeed difficult to determine which frequency components of the excitation are perceived well by a human listener, as the significance of a frequency component need not be based only on its saliency. The saliency plots from the LP residual seem noisy, whereas the saliency plot from the SoE impulse sequence seems to pick up the high-saliency frequency components well. Thus, the SoE impulse sequence may also be used as a good representation of the excitation source from the perception point of view. Even for representation and manipulation, the epoch sequence with impulse amplitudes reflecting the strength of excitation (SoE) is a better choice than the LP residual.

Figure 7.17 Saliency plots computed with the LP residual (in (a),(d),(g)), using the XSX method (copied from [55]) (in (b),(e),(h)), and computed with the SoE impulse sequence derived using the modZFF method (in (c),(f),(i)). The signal segments S1, S2 and S3 correspond, respectively, to the vowel regions [o:] (Fig. 6 in [55]), [i] (Fig. 7 in [55]), and [o] (with pitch rise) (Fig. 8 in [55]) in Noh singing voice [55].

Detailed analyses of the aperiodic components in the specific regions considered in [55] are carried out using the SoE impulse sequences, in order to compare with the analyses made for the same segments using the XSX method [55, 81, 80]. The specific regions of the Noh voice signal selected for detailed study are the vowel segments considered in [55], namely 36.9-37.3 sec in Fig. 6, 56.2-57 sec in Fig. 7 and 109.4-110.3 sec in Fig. 8 of [55]. Here, these regions correspond to the regions between 9.39-9.81 sec in Fig. 7.12, 13.7-14.5 sec in Fig. 7.13, and 13.88-14.81 sec in Fig. 7.14, respectively. We compare the saliency plots computed from the SoE impulse sequences derived using the modZFF method with the saliency plots derived using the XSX method [55, 81], to show that all the important features are preserved. The SoE impulse sequence thus provides an alternative representation of the aperiodic component of the voiced excitation in expressive voices.

Fig. 7.17(f) and (i) show the saliency plots obtained from the SoE impulse sequences for the other two vowel segments of Noh voice [55], along with the corresponding saliency plots obtained by the XSX method (Fig. 7.17(e), (h)) taken from [55].
The effects of nonuniform intervals and amplitudes of the impulses in the epoch sequences are similar to those shown in the XSX based plots (Fig. 7.17(e), (h)), in terms of the regions of large saliency (dark lines). In the saliency plots obtained from the SoE impulse sequences, the time intervals of the harmonic and subharmonic pitch regions can be seen more clearly. The saliency plots (Fig. 7.17(f), (i)) for the SoE impulse sequences obtained by using the modZFF method show the prominent features seen in the reference saliency plots computed with the LP residual (Fig. 7.17(d), (g)). Note that it is difficult to set a threshold on the saliency to determine the significance of the corresponding pitch frequency. It is likely that human perception takes all the values to appreciate the artistic features of the voice in the excitation, rather than characterizing it in terms of a few harmonic and subharmonic components. From Fig. 7.17, visual comparison of the saliency plots for the three vowel segments computed from the SoE impulse sequences (Fig. 7.17(c), (f) and (i)) with those obtained from the LP residual (Fig. 7.17(a), (d) and (g)) validates three points: (i) The epoch sequence representation of the excitation source characteristics is sufficient to represent the aperiodicity of the source characteristics in expressive voices, since the system characteristics can also be derived from it in some cases. (ii) The epoch sequence representation is better than the LP residual for representing the excitation source characteristics, since the latter may also have traces of the spectral characteristics of the vocal tract system, as observed in the spectrograms in Fig. 7.12(c)-Fig. 7.14(c). (iii) The locations of the impulses and their relative amplitudes in the epoch sequence indeed capture the perceptually significant pitch information, which is represented better through saliency plots.
7.7 Significance of aperiodicity in expressive voices
The effect of nonuniform epoch intervals/amplitudes in the impulse sequence representation of the speech signal on the resulting harmonic features in spectrograms was examined in Section 7.6. From Fig. 7.12(d), Fig. 7.13(d) and Fig. 7.14(d), it can be inferred that the aperiodic structure preserves the harmonic and subharmonic structure. Theoretically, if these nonuniform epoch intervals are made uniform, then most of the subharmonic components will be lost. The study of AM and FM sequences and their saliency, examined in Section 7.4, also highlights the effect of amplitude/frequency modulation of the signal on pitch perception, which is likely to be present in expressive voices. The inferences drawn from these earlier sections lead to a few assertions, which are as follows: (1) Expressive voices like Noh singing voice have regions of aperiodicity, apart from regions of periodicity and randomness. The regions of aperiodicity are highlighted better in the saliency plots. (2) Regions of aperiodicity are more likely to have subharmonic structure, apart from the harmonic structure. The presence of subharmonics is better seen in spectrograms. (3) The instantaneous fundamental frequency (F0) changes rapidly in the regions of aperiodicity. Changes in F0 in these regions may appear to be random to some extent.
Figure 7.18 (a) FM sequence excitation, (b) synthesized output and (c) derived SoE impulse sequence.
(4) Human perception is likely to take into account all likely values of the changing pitch frequency in the regions of aperiodicity. (5) Aperiodicity, subharmonics and changes in pitch perception in expressive voice signals are more related to the excitation component, which can be represented through an SoE impulse sequence.
7.7.1 Synthesis using impulse sequence excitation
The effectiveness of the aperiodic sequence in capturing the perceptually salient excitation information can also be studied using synthesis and subjective listening. Speech is synthesized by exciting a 14th order LP model, computed for every 10 ms frame shifted by 1 ms, for three cases, using an excitation sequence consisting of one of the following: (a) an impulse sequence with local averaging of the pitch period, (b) an impulse sequence with the actual epoch intervals and constant amplitudes (unit impulses), and (c) an impulse sequence with the actual epoch intervals along with their respective amplitudes (i.e., SoE). Speech is synthesized for each of the three Noh voice signals corresponding to Fig. 7.12, Fig. 7.13 and Fig. 7.14. Speech synthesized with excitation (c), i.e., the SoE impulse sequence, sounds better than in the other two cases. It is interesting to note that it is the aperiodicity that contributes more to the expressive voice quality. The amplitudes of the impulses are not very critical. However, naturalness is lost if the excitation consists of only the aperiodic sequence of impulses, as it does not have the other residual information. Moreover, the 14th order LP model computed for every 10 ms smoothes the spectral information pertaining to the vocal tract, and hence the fast changes in the vocal tract system characteristics are also not reflected in the synthesis. The usefulness of the modZFF method in deriving an aperiodic impulse sequence representation of the source characteristics of Noh voice was also examined by a speech-synthesis experiment. A synthetic AM/FM sequence was used for exciting a 14th order all-pole model [153, 112, 114] derived from the vowel [a] in modal voice.
Figure 7.19 (a) AM sequence excitation, (b) synthesized output and (c) derived SoE impulse sequence.
The synthesized output thus consists of the excitation by the AM/FM sequence and the system characteristics of the vowel. In order to highlight the source features better, the LP residual of this synthesized output was taken. Then the SoE impulse sequence was derived from it, using the modZFF method.
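A minimal sketch of such an analysis-synthesis check is given below (it is not the thesis implementation): a synthetic impulse-train excitation with modulated epoch intervals drives an all-pole (LP-type) filter, and the excitation locations can then be compared with those recovered from the output. The sampling rate, pulse-spacing modulation and the two-pole resonator standing in for the 14th order vowel model are illustrative assumptions; scipy's lfilter performs the all-pole synthesis.

import numpy as np
from scipy.signal import lfilter

fs = 8000                       # assumed sampling rate (Hz)
dur = 0.5                       # 0.5 s of synthetic excitation
n = int(fs * dur)

# Frequency-modulated impulse train: epoch intervals drift around 10 ms,
# crudely mimicking the nonuniform intervals of an FM excitation sequence.
x_fm = np.zeros(n)
t = 0.0
while t < dur:
    x_fm[int(t * fs)] = 1.0
    t += 0.010 * (1.0 + 0.3 * np.sin(2 * np.pi * 3 * t))   # modulated period

# Illustrative two-pole resonator standing in for the all-pole vowel model;
# in the experiment the coefficients would come from LP analysis of vowel [a].
a = np.array([1.0, -1.8 * np.cos(2 * np.pi * 700 / fs), 0.81])
s_synth = lfilter([1.0], a, x_fm)

# The locations of the strongest peaks in s_synth can now be compared with
# the impulse locations in x_fm (a Fig. 7.18-style comparison).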
Fig. 7.18(a), (b) and (c) show the excitation FM sequence (x_FM[n]), the synthesized output (s_synth[n]) and the derived SoE impulse sequence (ψ_synth[n]), respectively. It is interesting to note that the locations of the impulse-like pulses in the synthesized output (Fig. 7.18(b)) correspond well to those in the excitation FM sequence (Fig. 7.18(a)). In the derived SoE impulse sequence also (Fig. 7.18(c)), the locations of the impulses can be observed to correspond fairly well to those in the excitation FM sequence (Fig. 7.18(a)). Changes in the amplitudes of the impulses are primarily due to the effect of the system characteristics. The synthesized output using the AM sequence as excitation is shown in Fig. 7.19, and it shows similar results. In both cases of excitation by the synthetic AM/FM sequences, the locations of the impulses derived from the synthesized output using the modZFF method correspond well to those in the excitation sequence, although some spurious impulses are also present. The modZFF method is thus helpful in getting back the locations of the impulses in the AM/FM sequence excitation, which carry the information of the harmonic/subharmonic structure discussed in Section 7.6.2 and shown through the saliency plots in Fig. 7.17. Retrieving this finer information, in the form of closely spaced impulses (in the FM sequence), from the synthesized output is difficult otherwise. The observations made using the synthetic AM/FM excitation may apply to expressive voices also, which do exhibit amplitude/frequency modulation of the excitation signal. The observations seem to suggest that: (i) aperiodicity is more related to the excitation source than to the system (this was observed in Section 7.6 also), and (ii) the information of aperiodicity is embedded perhaps more in the locations than in the amplitudes of the impulses.
Figure 7.20 Illustration of (a) speech signal waveform, (b) SoE impulse sequence derived using the modZFF method and (c) F0 contour extracted using the saliency information. The voice signal corresponds to the vowel region [o] (with pitch rise) in Noh singing voice (Ref. Fig. 8 in [55]).
7.7.2 F0 extraction in regions of aperiodicity
In general, it is difficult to compute F0 for an aperiodic signal. It is even more challenging to derive the F0 information guided by the pitch perception information, especially in the case of expressive voices. A method for F0 extraction was proposed in [55, 81, 80] by utilizing the pitch perception information captured through saliency, which was computed using the TANDEM-STRAIGHT method. In this thesis, an alternative method is proposed to compute saliency for an impulse sequence derived using the modZFF method, which also captures the pitch perception information. From the saliency plot, which is actually the autocorrelation r[τ] of the low-pass filtered magnitude spectrum of the signal, the highest N peaks are taken (in descending order of magnitude) for each frame. The reciprocal of the lag of the highest among these peaks, i.e., τ_max = arg max_τ r[τ], gives the frequency of the most salient pitch perceived in the frame at that time instant, i.e., F0 = 1/τ_max.
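A rough sketch of this per-frame, saliency-based F0 estimate is given below. The FFT size, the low-pass smoothing of the magnitude spectrum and the pitch search range are illustrative assumptions, not values from the thesis; only the single highest peak is used here, whereas the full saliency plot retains the N highest peaks.

import numpy as np
from scipy.signal import butter, filtfilt

def f0_from_saliency(frame, fs, n_fft=4096, lp_norm_cutoff=0.1, f0_range=(40.0, 600.0)):
    """Saliency-style F0 estimate for one frame (e.g., of the SoE impulse sequence):
    r[tau] is obtained as the inverse DFT of the low-pass filtered magnitude
    spectrum, and the lag of its highest peak within a plausible pitch range
    gives the most salient pitch as F0 = 1/tau."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    b, a = butter(2, lp_norm_cutoff)            # smooth the magnitude spectrum
    mag_lp = filtfilt(b, a, mag)
    r = np.fft.irfft(mag_lp, n_fft)             # r[tau], tau in samples

    # Restrict the peak search to lags corresponding to the assumed pitch range.
    lo = max(1, int(fs / f0_range[1]))          # shortest period (samples)
    hi = min(len(r) - 1, int(fs / f0_range[0])) # longest period (samples)
    tau = lo + np.argmax(r[lo:hi])
    return fs / tau                             # F0 of the most salient pitch (Hz)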
An illustration of the signal waveform, the SoE impulse sequence derived using the modZFF method, and the F0 information extracted using the saliency computed for a segment of Noh voice signal is shown in Fig. 7.20(a), (b) and (c), respectively. Similarly, F0 contours are obtained for the three segments of Noh voice signal considered in Fig. 7.17(a)-(c), (d)-(f) and (g)-(i) (i.e., the segments corresponding to Fig. 6, Fig. 7 and Fig. 8 in [55]). It is interesting to note that the saliency (pitch perception) information is thus useful in extracting the F0 information for expressive voices, which otherwise is difficult to obtain, especially in the regions of aperiodicity.
7.8 Summary
The aperiodic characteristics of the excitation component of expressive voice signals were studied in the context of Noh voices. The aperiodic information is captured in the form of a sequence of impulses with relative amplitudes. It was shown that the perceptual features of these voices are well preserved in the epoch/impulse sequence, and that the epoch sequence is derived directly from the speech signal, without computing the short-time spectrum. The aperiodic epoch sequence gives pitch estimation similar to what was obtained using the fluctuating harmonic components in the short-time spectrum [55, 80]. Since the aperiodicity information is available in the time domain, it is much easier to control the excitation by modifying the epoch sequence in a desired manner. Synthesis of Noh voices using an LP model for the system and the epoch sequence for the excitation indicates that the aperiodic component mainly contributes to the peculiarities of these voices. Key contributions of this study are two signal processing methods: first, the modZFF method for deriving an impulse/epoch sequence representation of the excitation component in an expressive voice signal, and second, a method for computing saliency that captures the pitch-perception information related to aperiodicity in expressive voices. The embedded harmonic/subharmonic structure is also examined using the spectrograms. The epoch sequence representation is considered adequate to represent the excitation source component in the speech signal. In some cases, it is also possible to derive the approximate vocal tract system characteristics from this representation. The role of amplitude/frequency modulation in aperiodicity and in harmonic/subharmonic structure is examined by using two synthetic AM/FM pulse trains. Examining the saliency plots for these and for different segments of Noh voice signal, it is confirmed that the epoch sequence representation does capture the aperiodicity and the salient pitch perception information quite well. In the impulse sequence representation, the information of aperiodicity is more related to the time intervals among the impulses than to their amplitudes. Validation of the results is carried out using spectrograms, saliency plots and an analysis-synthesis approach. Extraction of F0 information that captures the pitch perception information for expressive voices is also demonstrated. Only one set of samples of Noh voice is used in this study. Due to the inherent peculiarities of Noh singing voice, analysing its aperiodicity from the production characteristics is useful for analysing other expressive voices also. It is assumed that similar regions of aperiodicity and harmonic/subharmonic structures are present in other types of natural expressive voices.
The effectiveness of the signal processing methods like modZFF, saliency computation and F0 extraction has also been tested for other speech signals, such as laughter and modal voice. This study focused on representing the excitation source information through an epoch/impulse sequence that also adequately captures the pitch perception information. Further, it may be interesting to explore whether the impulse sequence can be generated directly from the pitch perception information extracted from the expressive voice signal. In any case, this study may be useful in characterizing and analysing the excitation source characteristics of natural expressive voices (laughter, cry etc.) that involve aperiodicity in the speech signal.
Chapter 8 Automatic Detection of Acoustic Events in Continuous Speech
8.1 Overview
Analysis of the nonverbal speech sounds has helped in identifying a few of their unique characteristics. Exploiting these, a few distinct features are extracted and parameters are derived that discriminate these sounds well from normal speech. Towards applications of the outcome of this research work, experimental systems are developed for the detection of a few acoustic events in continuous speech. Three prototype systems are developed for automatic detection of trills, shouts and laughter, named the automatic trills detection system, the shout detection system and the laughter detection system, respectively. In this chapter, we discuss the details, performance evaluation and results of these three systems. In Section 8.2, the feasibility of developing an 'Automatic Trills Detection System' (ATDS) is discussed, along with the results of limited testing of the experimental system. In Section 8.3, a prototype 'Shout Detection System' (SDS) is developed for automatic detection of regions of shout in continuous speech. The parameters are derived from the production features studied in Section 5.6. An algorithm is proposed for taking the shout decision using these parameters. Performance evaluation results are also discussed, in comparison with those from other methods. In Section 8.4, an experimental 'Laughter Detection System' (LDS) is discussed, which can be further developed using the laughter production features and parameters derived earlier in Section 6.5. A summary of this chapter is given in Section 8.5.
8.2 Automatic Trills Detection System
The system uses the production features of apical trills, studied in Section 4.2. In the first phase, the excitation source features F0 and SoE are used, along with an autocorrelation lag based feature proposed in [40]. In the second phase, synthesis using LF-model pulses is used for confirming the trills detected in the first phase. Limited testing of the ATDS has been carried out on a database consisting of 397 trills in different contexts, recorded in the voice of an expert male phonetician. The system gives a trill detection rate of 84.13%, with an accuracy of 98.82% for this test data. The experimental ATDS is developed mainly to examine the feasibility of developing an automatic system for spotting trills in continuous speech.
Figure 8.1 Schematic block diagram of prototype shout detection system
8.3 Shout Detection System
Automatic detection of shouted speech (or shout, in short) in continuous speech in real-life practical scenarios is a challenging task. We aim to exploit the changes in the production characteristics of shouted speech for automatic detection of shout regions in continuous speech.
Changes in the vibration characteristics of the vocal folds, and the associated changes in the vocal tract system, for shout relative to normal speech are exploited in discriminating the two modes. Significant changes take place in the excitation source characteristics, such as the vibration of the vocal folds at the glottis, during the production of shouted speech. However, very few attempts have been made to use these changes for automatic detection of shouted speech. We have studied the excitation source characteristics of shouted and normal speech signals along with Electroglottograph (EGG) signals, in Section 5.6 and Section 5.5, respectively. The closed phase quotient (α) is observed to be higher for shout than for normal speech [123]. Also, the spectral band energy ratio (β) is higher for shout in comparison to normal speech (refer Section 5.6.1). The excitation source features F0 and strength of excitation (SoE) are derived from the speech signal using the zero-frequency filtering (ZFF) method, and the effect of the associated changes in the vocal tract system is studied through the dominant frequency feature (FD), discussed in Section 5.6.3. In this section, we develop a decision logic for automatic detection of regions of shout in continuous speech. An experimental shout detection system is developed to examine the efficacy of the decision logic and the production features β, F0, SoE and FD. The feature β is computed using the short-time spectrum. The decision logic uses the degree of deviation in these features for the production of shout as compared to normal speech. Multiple pieces of evidence for the shout decision are collected for each speech segment. The temporal nature of changes in the features and their mutual relations are also exploited. Parameters capturing these changes are used in the shout decision for each speech segment. The decision for shout for each segment is taken by a linear classifier that uses these eight decision criteria. A schematic block diagram of the experimental shout detection system is shown in Fig. 8.1. The performance of the shout detection system is tested on four datasets of continuous speech in three languages.
8.3.1 Production features for shout detection
Comparison of the differenced EGG signals for normal and shouted speech in Section 5.5 shows that, in the case of shouted speech, the duration of the glottal cycle reduces, and the closed phase quotient within each glottal cycle increases. The reduction in the period of the glottal cycle, i.e., the rise in the instantaneous fundamental frequency (F0), gives the perception of a higher pitch. The larger closed phase quotient in each glottal cycle is related to increased air pressure at the glottis, and also to higher resonance frequencies. The study of spectral characteristics in Section 5.6 also indicates that the spectral energy in the higher frequency band (500-5000 Hz), i.e., EHF, increases and that in the lower frequency band (0-400 Hz), i.e., ELF, reduces for shout in comparison to normal speech. The effect of coupling between the excitation source and the vocal tract system is usually examined through spectral features like MFCCs or short-time spectra. But the effect can be seen better in the dominant resonance frequency (FD) of the vocal tract system, discussed in Section 5.6.3. Use of the spectral feature FD is also computationally convenient, in comparison to the HNGD spectrum discussed in Section 5.6.1.
The spectral energies EHF and ELF are computed using the short-time Fourier spectrum for computational convenience, instead of the HNGD spectrum. The excitation source features F0 and SoE are computed for each segment of the speech signal, using the ZFF method discussed in Section 3.5. Relative changes in the features F0, SoE and FD for shout in comparison to normal speech can be observed in the illustrations given in Figures 5.9 and 5.10. It may be noted that the features F0, SoE, FD and β are derived directly from the speech signal, using computationally efficient methods. This makes these features suitable for developing a decision logic for the shout detection system, discussed next.
8.3.2 Parameters for shout decision logic
The degree of changes in the production features F0, SoE and FD for shout indicates the extent of deviation from normal. The nature of changes relates to the temporal changes in the features. Parameters capturing both these aspects of the changes are exploited in the algorithm used in the shout detection system for the decision of shout. Changes in the spectral energies for shout are captured by the spectral band energy ratio (β), i.e., the ratio of spectral energies in the high and low frequency bands (β = EHF/ELF), as described in Section 5.6. The average values of F0, FD and β increase in the case of shouted speech. The degree of changes in F0, FD and β above respective threshold values is used for shout detection. These thresholds can be obtained either from average values computed for reference normal speech, or dynamically for each block of the input speech data. The temporal nature of changes in the contours of F0, SoE and FD can also be exploited for the detection of shout. It is interesting to note the relative fall/rise patterns in the contours of F0, SoE and FD. Their pairwise mutual relations can actually help in discriminating shouted speech from normal speech. For example, in some regions, SoE decreases for shout whereas F0 increases, and vice versa. Likewise, some patterns of fall/rise in the FD contour, relative to the F0 and SoE contours, can also be observed for shout. Such changes are negligible for normal speech. In total, eight parameters used in the decision logic are computed from the degree and nature of changes in F0, SoE, FD and the spectral energies. Three parameters capture the degree of changes in these features, two others capture the temporal nature (mainly gradients) of the changes, and the remaining three capture their mutual relations. Average values of these features are computed for each speech segment. The gradients of the F0, SoE and FD contours are computed from the changes in their average values for successive segments. This smoothing helps in reducing the transient effects of stray fluctuations in these features. It is also computationally convenient. Using these eight parameters, a decision is then taken for each speech segment: whether the segment belongs to a shout region or not.
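A minimal sketch of this segment-level parameter computation is given below. The band edges follow Section 5.6 (0-400 Hz and 500-5000 Hz), but the FFT size, windowing and the dictionary of parameter names are illustrative assumptions rather than the implementation used in the thesis; the per-segment averages of F0, SoE and FD are assumed to be already available (e.g., from the ZFF method).

import numpy as np

def band_energy_ratio(segment, fs, n_fft=1024):
    """Spectral band energy ratio beta = E_HF / E_LF for one speech segment,
    with E_LF over 0-400 Hz and E_HF over 500-5000 Hz (Section 5.6)."""
    mag2 = np.abs(np.fft.rfft(segment * np.hamming(len(segment)), n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    e_lf = mag2[(freqs >= 0) & (freqs <= 400)].sum()
    e_hf = mag2[(freqs >= 500) & (freqs <= 5000)].sum()
    return e_hf / max(e_lf, 1e-12)

def segment_parameters(f0_avg, soe_avg, fd_avg, beta):
    """Degree-of-change and gradient parameters from per-segment averages.
    Inputs are 1-D arrays of segment-level averages of F0, SoE, FD and beta;
    gradients are differences of the averages between successive segments."""
    grad = lambda x: np.diff(np.asarray(x, dtype=float), prepend=x[0])
    return {
        "F0": np.asarray(f0_avg), "SoE": np.asarray(soe_avg),
        "FD": np.asarray(fd_avg), "beta": np.asarray(beta),
        "dF0": grad(f0_avg), "dSoE": grad(soe_avg), "dFD": grad(fd_avg),
    }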
8.3.3 Decision logic for automatic shout detection system
In this section we develop an experimental system for shout detection using the features β, F0, SoE and FD. The average values of the features β, F0, SoE and FD are computed for each speech segment. The decision logic for shout detection uses eight parameters, capturing the degree of deviation in these features and the temporal nature of changes in their contours. Multiple decision criteria (d1, d2, d3, d4, d5, d6, d7, d8) are used to collect evidence of shout from these eight parameters for each speech segment. For a speech segment, a larger number of such pieces of evidence gives higher confidence in deciding that the segment is a shout. The key steps are as follows:
Step 1: The decision criteria d1, d2 and d3 are derived from the parameters using thresholds on the average values of the features F0, FD and β, respectively. The decision criteria d4 to d8 are derived from the parameters capturing the temporal nature of changes in the features F0, SoE and FD. The decision criteria d4 and d5 use the parameters with thresholds on the gradients of changes in the F0 and FD contours, respectively. The decision criteria d6, d7 and d8 use parameters capturing the pairwise mutual relations of the temporal changes in the F0, SoE and FD contours. Since the scales and units are different for these features, only the directions (signs) of changes in their gradients (rise/fall patterns) are used. The schematic diagram in Fig. 8.2 shows different scenarios for considering the pairwise mutual relations of the temporal changes in the gradients of the F0, SoE and FD contours. The six segments marked as shout candidates in Fig. 8.2(d) illustrate three possible types of evidence for shout. (i) Case 1 (d6): when the changes in the gradients of the F0 and SoE contours are in opposite directions, e.g., segments 1, 2 in Fig. 8.2(d). (ii) Case 2 (d7): when the changes in the gradients of the SoE and FD contours are in opposite directions, e.g., segments 3, 4 in Fig. 8.2(d). (iii) Case 3 (d8): when the changes in the gradients of both the F0 and FD contours are in the same direction, e.g., segments 5, 6 in Fig. 8.2(d). Cases 1, 2 and 3 correspond to the decision criteria d6, d7 and d8, respectively. The eight decision criteria {di}, based upon the parameters derived from the production features, can be summarized below, where θ denotes a threshold, g the gradient and G a high gradient. The gradients (g and G) are computed between the average values of the features (∆F0, ∆SoE and ∆FD) for successive segments.
Figure 8.2 Schematic diagram for decision criteria (d6, d7, d8) using the direction of change in gradients of (a) F0, (b) SoE and (c) FD contours, for the decision of (d) shout candidate segments 1 & 2 (d6), 3 & 4 (d7), and 5 & 6 (d8).

Ave(F0) > θ_F0 ⇒ d1    (8.1)
Ave(FD) > θ_FD ⇒ d2    (8.2)
Ave(β) > θ_β ⇒ d3    (8.3)
g_∆F0 > θ_g∆F0 ⇒ d4, G_∆F0    (8.4)
g_∆FD > θ_g∆FD ⇒ d5, G_∆FD    (8.5)
sign(G_∆F0) = −sign(g_∆SoE) ⇒ d6    (8.6)
sign(g_∆SoE) = −sign(G_∆FD) ⇒ d7    (8.7)
sign(G_∆F0) = sign(G_∆FD) ⇒ d8    (8.8)

Step 2: The shout decision for each speech segment uses these eight decision criteria {di} to classify it as shout or normal speech. Confidence scores for each speech segment are computed from these decision criteria {di}. If the total confidence score is above a desired level, the segment is decided as a shout segment.
Step 3: The final decision for a shout region is taken for each utterance or block of speech data. Only contiguous segments of shout candidates are considered in the final decision. This also minimizes spurious cases of wrong detection, since sporadic shout candidate segments are likely to be false alarms.
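The following is a minimal sketch of this decision logic, interpreting the criteria in (8.1)-(8.8) directly on the per-segment parameter dictionary from the earlier sketch. The thresholds, the requirement that the gradient already be a high gradient (G) before the sign relations in (8.6)-(8.8) are tested, the equal-weight confidence score and the decision level are all illustrative assumptions, since the thesis computes its thresholds dynamically per block.

import numpy as np

def shout_criteria(p, thr, min_evidence=5):
    """Evaluate the eight decision criteria d1..d8 of (8.1)-(8.8) per segment.
    p: dict with per-segment arrays F0, FD, beta, dF0, dSoE, dFD (averages and
    gradients); thr: dict of thresholds theta_F0, theta_FD, theta_beta,
    theta_gF0, theta_gFD. Returns a boolean shout-candidate flag per segment."""
    d = np.zeros((8, len(p["F0"])), dtype=bool)
    d[0] = p["F0"] > thr["theta_F0"]                # (8.1)
    d[1] = p["FD"] > thr["theta_FD"]                # (8.2)
    d[2] = p["beta"] > thr["theta_beta"]            # (8.3)
    d[3] = p["dF0"] > thr["theta_gF0"]              # (8.4): high gradient G_dF0
    d[4] = p["dFD"] > thr["theta_gFD"]              # (8.5): high gradient G_dFD
    d[5] = d[3] & (np.sign(p["dF0"]) == -np.sign(p["dSoE"]))        # (8.6)
    d[6] = d[4] & (np.sign(p["dSoE"]) == -np.sign(p["dFD"]))        # (8.7)
    d[7] = d[3] & d[4] & (np.sign(p["dF0"]) == np.sign(p["dFD"]))   # (8.8)
    score = d.sum(axis=0)                           # equal-weight confidence score
    return score >= min_evidence                    # Step 2: shout-candidate segments

Step 3 would then operate on this output, retaining only contiguous runs of candidate segments as detected shout regions.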
8.3.4 Performance evaluation
There is a large variability across speakers and languages in speech containing shout. Hence, it is challenging to use the distinguishing features of shouted speech (as identified in Section 5.6) for its automatic detection. Since there is large variability in the values of the features, parameters and thresholds across different scenarios, and the aim is to develop a speaker/language independent SDS, a heuristics based approach is adopted. Dynamic changes in the distinguishing features are captured. Empirical values of the thresholds are computed dynamically from the derived parameters, thereby factoring in the variability across scenarios. Performance evaluation of the shout decision logic has two limitations: the absence of any labelled shout database, and the nonavailability of ground truth in the natural data sourced from different media resources. Hence, the experimental SDS was tested on four sets of test data.
Test set 1: Concatenated speech (CS) data. It consists of 6 concatenated pairs of utterances of the same text in normal and shout modes, by 6 speakers. The data was drawn from 51 such pairs (by 17 speakers, for 3 different texts) recorded in the Speech and Vision Lab, IIIT, Hyderabad, in English (see Section 5.4).
Test set 2: Natural continuous speech (NCS) data. It has 6 audio files with shout content. The data was drawn from the IIIT-H AVE database of 1172 audio-visual clips sourced from movies/TV chat shows.
Test set 3: Mixed speech (MixS) data. It consists of 184 utterances (47 (27 neutral, 20 anger) + 72 (41 neutral, 31 anger) + 65 (30 neutral, 35 shout) = 98 neutral, 86 shout), by 24 speakers in 3 languages. The data was drawn from 3 databases: (i) the Berlin EMO-DB emotion database [19] in German, having 535 utterances for 7 emotions, (ii) the IIIT-H emotion database in Telugu, having 171 utterances for 4 emotions [89], and (iii) the IIIT-H AVE database in English, having 1172 utterances for 19 affective states.
Test set 4: Mixed speech (MixS2) data. It is the same as test set 3, but used by taking the shout decision for every 1 sec block, instead of at the utterance level.
The data in test sets 3 and 4 is for 645 sec (591 sec voiced). It is assumed here that anger speech is usually associated with the presence of shout regions. Hence, the test data includes anger regions. Ground truth was established by listening to the speech data. The testing results of the SDS over the four test sets are given in Table 8.1. The total numbers of ground truth speech utterances (or 1 sec blocks), speech regions detected correctly as shout/normal speech, missed shout detections, and normal speech regions detected wrongly as shout are given in columns (a), (b), (c) and (d), respectively. Three performance measures are used: (i) true detection rate (TDR) (= b/a), (ii) missed detection rate (MDR) (= c/a) and (iii) false alarm rate (FAR) (= d/a). The TDR, MDR and FAR for each test set are given in (%) in columns (e), (f) and (g), respectively.
Table 8.1 Results of performance evaluation of shout detection: number of speech regions (a) as per ground truth (GT), (b) detected correctly (TD), (c) missed detections of shout (MD) and (d) wrongly detected as shout (WD), and rates of (e) true detection (TDR), (f) missed detection (MDR) and (g) false alarm (FAR). Note: CS is concatenated speech, NCS is natural continuous speech and MixS is mixed speech.

Test set     Data type   (a) GT   (b) TD   (c) MD   (d) WD   (e) TDR(%)   (f) MDR(%)   (g) FAR(%)
Test set 1   CS          44       40       4        0        90.9         9.1          0
Test set 2   NCS         92       85       6        1        92.4         6.5          1.9
Test set 3   MixS        184      133      14       37       72.3         7.6          20.1
Test set 4   MixS2       591      471      45       75       79.7         7.6          12.7
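As a quick check of the rates in Table 8.1, they follow directly from the counts; for example, for the test set 1 row (a minimal sketch):

def detection_rates(gt, td, md, wd):
    """TDR = TD/GT, MDR = MD/GT and FAR = WD/GT, expressed in percent."""
    return tuple(round(100.0 * x / gt, 1) for x in (td, md, wd))

# Test set 1 (CS): GT = 44, TD = 40, MD = 4, WD = 0
print(detection_rates(44, 40, 4, 0))   # -> (90.9, 9.1, 0.0)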
The testing results obtained with a shout decision granularity of 1 sec blocks for test set 4 are better than those at the utterance level for test set 3. This is because shout is an unsustainable state, and the utterances in test set 3 are up to 15 sec long, during which shout and normal speech often get interspersed, reducing the decision accuracy. The shout/normal speech detection performance of 72.3-92.4%, with a false alarm rate of 1.9-20.1%, obtained using the proposed SDS is better than the 64.6-92% and 22.6-35.4%, respectively, reported in [132]. This performance is also better than the test results of the Gaussian mixture model (GMM) based classifier used in [219], which reported a shout detection performance (TDR) of 67.5%. The results are also comparable with the test results of the multiple model framework approach using a hidden Markov model with a support vector machine or GMM classifier, proposed in [217], which reported success rates of 63.8-83.3%, with a 5.6-11% miss rate (MDR) and an 11.1-25.3% error rate (FAR). In fact, the MDR of 6.5-9.1% and FAR of 1.9-20.1% achieved using the proposed algorithm are comparatively better.
8.4 Laughter Detection System
The excitation source features F0 and SoE are extracted for each voiced segment, using the modified zero-frequency filtering method. Parameters are then derived from the pitch period (T0), the strength of excitation (SoE), the ratio of strength of excitation to pitch period (R), the slope of the pitch period contour (Slope_pp), and the slope of the strength of excitation contour (Slope_SoE) [185]. The proposed method for laughter spotting consists of the following steps. (a) The signal is first segmented into voiced and nonvoiced regions. (b) Then the five features are extracted for every epoch in the voiced regions. (c) If a voiced segment has more epochs than determined by the 'fraction threshold' for at least 60% of the features, then that segment is considered a laughter segment. Regions of duration less than 30 ms have not been considered for detection, so as to minimize spurious detections. Performance of the LDS is evaluated on a limited dataset taken from the IIIT-H audio-visual emotion (AVE) database collected in the Speech and Vision Lab, IIIT, Hyderabad. Some audio files drawn from movies and TV soap-operas are used. Over a dataset of 180 laugh calls, a detection rate of 75% is achieved by the prototype, with a false detection rate of 19.44%. The performance is comparable with the state of the art, and can be improved by using the other features and parameters discussed in Section 6.5.
8.5 Summary
In this chapter, the feasibility of automatic detection of trills, shouted speech and laughter in continuous speech is examined. Three prototype systems are developed. Using the source features of apical trills, studied in Section 4.2, a trill detection system is developed. Limited testing of the ATDS, carried out on a trill database, gave encouraging results. This indicates the feasibility of developing a complete automatic system for spotting trills in continuous speech, using the production characteristics of apical trills. Testing of the ATDS has limitations, since only a few languages are rich in the usage of apical trills. The experimental system is tested only for apical trills and not for other types of trills. Automatic shout detection in real-life practical scenarios is challenging, yet important for a range of applications.
Changes in the vibration characteristics of the vocal folds, and the associated changes in the vocal tract system, for shout relative to normal speech are exploited in discriminating the two modes. Parameters capturing the changes in the production features β, F0, SoE and FD are used in developing the SDS. The decision criteria use parameters capturing the extent of deviation and the temporal nature of changes in these features. The decision for shout is taken for each segment. The performance of the SDS is evaluated on four test sets, drawn from three different databases. The results are comparable to, and in some cases better than, other reported results. Further, an online SDS can be developed for live real-life data. The feasibility of automatic detection of laughter (nonspeech-laugh or laughed-speech) in continuous speech is also explored in this chapter, by developing an experimental LDS. The initial performance evaluated on a limited dataset is encouraging. The performance can be improved by using the other features and parameters discussed in Section 6.5. Using these features and parameters, a complete online system can be developed further for automatic detection of laughter in continuous speech in different scenarios.
Chapter 9 Summary and Conclusions
9.1 Summary of the work
Nonverbal speech sounds are analysed in this research work, by examining the differences in their production characteristics from normal speech. Four categories of sounds are considered, namely, normal speech, emotional speech, paralinguistic sounds and expressive voices. These sound categories differ in the rapidity of pitch changes and the degree of aperiodicity, in increasing order. Voluntary control of the excitation source and the vocal tract system during the production of these sounds, or involuntary changes in their production characteristics, are other differences. The effects of source-system coupling and acoustic loading on the glottal vibration are studied first, for variations in normal speech sounds such as trills, fricatives and nasals. The source-system coupling effect, also present in emotional speech sounds, is studied next. Shouted speech is examined in this category of sounds. Laughter sounds and Noh voice are examined in the paralinguistic sounds and expressive voices categories, respectively. The production characteristics, those of the glottal excitation source in particular, are examined from both acoustic and EGG signals in each case. The excitation source features such as F0 and SoE are derived mainly using the zero-frequency filtering method and the modifications to it proposed in this thesis for nonverbal speech sounds. Changes in the source characteristics are also examined in terms of glottal pulse shape characteristics, such as the open/closed phase durations and the closed phase quotient derived from the EGG signal. Associated changes in the resonance characteristics of the vocal tract system are examined using the dominant frequencies FD1 and FD2, derived using LP analysis and the group delay function. Changes in the spectral characteristics are examined using the Hilbert envelope of the numerator of the group delay (HNGD) spectrum, derived using the zero-time liftering (ZTL) method. Other production characteristics derived using the Hilbert envelope of the LP residual of the acoustic signal are also used in the analysis in a few cases. Apart from using some standard signal processing techniques such as the short-time spectrum and recently proposed methods such as ZFF and ZTL, a few new methods are proposed in this thesis.
A modified ZFF method, a method to compute the first two dominant resonance frequencies, an alternative method for computing the saliency of pitch perception, and a method for extracting F0 in the regions of aperiodicity are proposed. Using these, features are extracted and parameters are derived that reflect the changes in the production characteristics of nonverbal speech sounds from those of normal speech. The efficacy of these features and parameters in discriminating the nonverbal sounds from normal speech is evaluated through three prototype systems developed for automatic detection of acoustic events such as trills, shouts and laughter in continuous speech. The representation of the excitation source in terms of a time domain impulse sequence, which has been sought in speech coding methods, is a challenge for nonverbal speech sounds. But using this representation is immensely useful for deriving the production features and later manipulating these for speech synthesis. One method for deriving such a representation, the zero-frequency filtering (ZFF) method, is suitable mainly for modal voicing. Using this, we have examined the effects of acoustic loading of the vocal tract system on the vibration characteristics of the vocal folds, and of system-source interaction, in the production of a selected set of six sound categories. Modifications to the ZFF method are used for extracting this impulse sequence representation of the source characteristics for the nonverbal sounds examined. To study the effect of coupling between the system and the source characteristics in the case of some emotional speech sounds, it is necessary to extract the spectral characteristics of the speech production mechanism with high temporal resolution, which is still a challenging task. Signal processing methods like HNGD (ZTL) that can represent the fine temporal variations of the spectral features are explored in this study. The production characteristics of speech at four loudness levels, i.e., soft, normal, loud and shout, are examined. It is shown that these temporal variations indeed capture the features of the glottal excitation that can discriminate shout vs normal speech. The effect of coupling between the excitation source and the vocal tract system during the production of shouted speech is examined in different vowel contexts, using the dominant frequency computation, along with source features such as F0 and SoE. The production characteristics of paralinguistic sounds like laughter are studied from the changes in the vibration characteristics of the glottal excitation source, using the modified ZFF method. Three cases, namely normal speech, laughed-speech and nonspeech-laugh, are considered. Associated changes in the vocal tract system characteristics are examined using the first two dominant frequencies FD1 and FD2. Other production characteristics of laughter are also examined using features derived from the Hilbert envelope of the LP residual of the speech signal. Parameters are derived that represent the changes in these features and help in distinguishing the three cases. The proposed modZFF method is used for deriving an impulse sequence representation of the excitation component in an expressive voice signal. A newly proposed method is used to compute saliency, which captures the pitch-perception information related to aperiodicity in expressive voices. The role of amplitude/frequency modulation in aperiodicity and in harmonic/subharmonic structure in expressive voices such as Noh voice is examined by using two synthetic AM/FM pulse trains.
Examining the saliency plots for these AM/FM sequences and for different segments of Noh voice signal, it is confirmed that the impulse/epoch sequence representation does capture the aperiodicity and the salient pitch-perception information quite well. This sequence is considered adequate to represent the excitation source component in the speech signal, and it is helpful in some cases to derive the approximate vocal tract system characteristics as well. In the impulse sequence, the information of aperiodicity is more related to the time intervals among the impulses than to their relative amplitudes. The embedded harmonic/subharmonic structure is examined using the spectrograms. Validation of the results is also carried out using saliency plots and an analysis-synthesis approach. Extraction of F0 that captures the pitch-perception information in expressive voices is also demonstrated. The analyses of the production characteristics of nonverbal speech sounds have helped in identifying a few of their unique characteristics. Using these, three experimental systems are developed for automatic detection of trills, shouts and laughter in continuous speech. The automatic shout detection system (SDS) is developed to an extent that is much closer to a complete online system. Performance evaluation of these systems, using specifically collected databases labelled with ground truth, gave encouraging results. The results indicate the feasibility of developing these prototypes into complete systems for automatic detection of such acoustic events in continuous speech in real-life scenarios. These experimental systems confirm that analyses of production features are indeed insightful in the case of nonverbal speech sounds. Also, the excitation impulse sequence representation of the source characteristics, guided by pitch perception, is helpful not only in the analyses of these sounds, but also for diverse purposes, possibly including the synthesis of natural-sounding speech. In this study, only a few representative sounds in each of the four categories are examined. The study is expected to be helpful in providing further insight into the production of these sounds, and also in developing systems for real-life applications, which need to be tested on large databases.
9.2 Major contributions of this work
The key contributions of this research work can be listed as follows: (i) The role of system-source coupling and the effect of acoustic loading of the vocal tract system on the glottal vibration are studied for a few dynamic voiced sounds such as trills, laterals, fricatives and nasals. These sounds are examined in the vowel context [a] on both sides, in modal voicing. (ii) Four categories of sounds, namely, normal speech, emotional speech, paralinguistic sounds and expressive voices, are analysed, both from the speech production and perception points of view. Shouted speech, laughter and Noh voice, in particular, are studied by examining changes in their source and system characteristics. Features are extracted to distinguish these sounds from normal speech. (iii) An impulse-sequence representation of the excitation information in the acoustic signal is proposed for each category of sounds. The representation of the excitation information by an impulse sequence that is guided by pitch perception is proposed for Noh voice and laughter sounds.
(iv) A few new signal processing methods are proposed, such as modified zero-frequency filtering method, method to compute saliency (i.e., a measure of pitch perception), dominant frequency computation method and a method for F0 extraction in the regions of aperiodicity. (v) Three prototype systems are developed for demonstrating the efficacy of features extracted and parameters derived, in distinguishing these nonverbal sounds from normal speech. Performance 168 evaluation results of these prototype systems indicate the feasibility of further developing complete systems for automatic detection of shouts, trill and laughter in continuous speech. 9.3 Research issues raised and directions for future work The impulse-sequence representation of the excitation source characteristics was explored in speech coding methods, mainly for normal speech. ‘Can the impulse-sequence representation of the excitation source information be obtained from acoustic signal for nonverbal speech also’, is explored in this work. The key challenge for these sounds is the ‘rapid changes in the F0 and pitch-perception’. Further, presence of the regions of aperiodicity and subharmonics poses another set of challenges in extracting F0 in these regions. Hence, it is important to explore, ‘can this impulse-sequence representation of the excitation source characteristics be guided by pitch-perception information for nonverbal sounds’? The signal-processing methods like ZFF, that work well for the modal voicing in normal speech, exhibit limitations in the case of nonverbal speech sounds. The proposed modified zero-frequency filtering (modZFF) method helps in deriving the impulse-sequence representation of the excitation, for laughter sounds and Noh voice that have rapid changes in F0 and pitch. A method is proposed to compute saliency, i.e., a measure for pitch-perception information, in the regions of subharmonics and aperiodicity. Using saliency and the impulse-sequence representation of excitation, the F0 is extracted for these nonverbal sounds, which otherwise is a difficult task. ‘Can the impulse-sequence representation of the excitation, be obtained only from the pitch perception information’ would be a further interesting problem and a research challenge. Analyses of nonverbal speech sounds from the production and perception points of view, has been a challenging research issue. Some studies have investigated the vocal tract system characteristics, but the excitation source characteristics are not explored much. In this study, these sounds are analysed by deriving these characteristics from both EGG and acoustic signals. However, ‘whether the glottal pulse shape characteristics that can be derived easily from EGG signal, can also be derived from the acoustic signal reliably’ is still a challenge, in spite of several studies like inverse-filtering, DYPSA or ZFF. Distinguishing features are extracted and parameters are derived for these sounds, that may help in automatic detection of acoustic events like trills, shouts and laughter in continuous speech. Systems for their automatic detection in natural speech may be developed further for diverse applications in real-life scenarios. But, developing these systems in natural environment would be different from investigating in lab environment, and that may possibly unearth new set of challenges as well. Deriving the ‘impulse-sequence representation of the excitation information from the saliency of pitch-perception’, may be attempted in future. 
Deriving this information would be more interesting for nonverbal sounds. Also, deriving the glottal pulse-shape characteristics reliably from the acoustic signal (rather than EGG signal), would provide further insight into the details of changes in production characteristics of these sounds. Using the production features, the systems can be developed for detection of more acoustic events like laughter or expressive voices in natural conversational speech. 169 Bibliography [1] T. Abe, T. Kobayashi, and S. Imai. The IF spectrogram: A new spectral representation. In Proc. International Symposium on Simulation, Visualization and Auralization for Acoustics Research and Education, ASVA’97, pages 423–430, April 1997. [2] P. Alku. Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering. Speech Communication, 11(2-3):109–118, June 1992. [3] P. Alku, T. Backstrom, and E. Vilkman. Normalized amplitude quotient for parametrization of the glottal flow. 112(2):701–710, Feb. 2002. [4] P. Alku and E. Vilkman. Amplitude domain quotient for characterization of the glottal volume velocity waveform estimated by inverse filtering. Speech Communication, 18(2):131–138, 1996. [5] B. Atal and M. R. Schroeder. Predictive coding of speech signals and subjective error criteria. IEEE Transactions on Acoustics, Speech and Signal Processing, 27(3):247–254, 1979. [6] B. S. Atal and B. E. Caspers. Periodic repetition of multi-pulse excitation. The Journal of the Acoustical Society of America, 74(S1):S51–S51, 1983. [7] B. S. Atal and S. L. Hanauer. Speech Analysis and Synthesis by Linear Prediction of the Speech Wave. Journal of the Acoustical Society of America, 50(2B):637–655, 1971. [8] B. S. Atal and J. R. Remde. A new model of LPC excitation for producing natural-sounding speech at low bit rates. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, volume 1, pages 614–617, May 1982. [9] A.V. Oppenheim, R.W. Schafer. Digital Signal Processing, chapter 3, pages 87–121. PHI Learning Private Limited, New Delhi, India, 2 edition, 1975. [10] J. A. Bachorowski and M. J. Owren. Not all laughs are alike: voiced but not unvoiced laughter readily elicits positive affect. Psychology Science, 12(3):252–257, May 2001. [11] J. A. Bachorowski, M. J. Smoski, and M. J. Owren. The acoustic features of human laughter. 110(3):1581– 1597, 2001. [12] A. Barney, C. H. Shadle, and P. O. A. L. Davies. Fluid flow in a dynamic mechanical model of the vocal folds and tract. I. Measurements and theory. The Journal of the Acoustical Society of America, 105(1):444–455, 1999. 170 [13] A. Barney, A. D. Stefano, and N. Henrich. The effect of glottal opening on the acoustic response of the vocal tract. Acta Acustica united with Acustica, 93(6):1046–1056, 2007. [14] A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, V. Aharonson, L. Kessous, and N. Amir. Whodunnit – Searching for the Most Important Feature Types Signalling Emotion-Related User States in Speech. Computer Speech and Language, Special Issue on Affective Speech in real-life interactions, 25(1):4–28, 2011. [15] M. Berouti, H. Garten, P. Kabal, and P. Mermelstein. Efficient computation and encoding of the multipulse excitation for LPC. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP ’84, volume 9, pages 384–387, 1984. [16] C. A. Bickley and S. Hunnicutt. Acoustic analysis of laughter. In Proc. 
Second International Conference on Spoken Language Processing, 1992 (ICSLP’92), pages 927–930. ISCA, Oct 13-16 1992. [17] D. Bitouk, R. Verma, and A. Nenkova. Class-level spectral features for emotion recognition. Speech Communication, 52(7-8):613–625, 2010. [18] B. Boashash. Estimating and interpreting the instantaneous frequency of a signal. I. Fundamentals. Proceedings of the IEEE, 80(4):520–538, April 1992. [19] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss. A database of german emotional speech. In in Proceedings of Interspeech, Lissabon, pages 1517–1520, 2005. [20] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. Chang, S. Lee, and S. S. Narayanan. Iemocap: Interactive emotional dyadic motion capture database. Journal of Language Resources and Evaluation, 42(4):335–359, Dec. 2008. [21] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. Narayanan. Analysis of emotion recognition using facial expressions, speech and multimodal information. In in Sixth International Conference on Multimodal Interfaces ICMI 2004, pages 205–211. ACM Press, 2004. [22] C. Busso, S. Lee, and S. Narayanan. Analysis of emotionally salient aspects of fundamental frequency for emotion detection. IEEE Transactions on Audio, Speech and Language Processing, 17(4):582–596, 2009. [23] R. Cai, L. Lu, H.-J. Zhang, and L.-H. Cai. Highlight sound effects detection in audio stream. In Proc. IEEE International Conference on Multimedia and Expo, 2003 (ICME’03), volume 3, pages 37–40, July 2003. [24] N. Campbell, H. Kashioka, and R. Ohara. No laughing matter. In Proc. 9th European Conference on Speech Communication and Technology, 2005 (INTERSPEECH’05), pages 465–468, Sep. 4-8 2005. [25] J. R. Carson and T. C. Fry. Variable Frequency Electric Circuit Theory with Application to the Theory of Frequency Modulation. Bell System Technical Journal, 16:513–540, Oct. 1937. [26] B. Caspers and B. Atal. Role of multi-pulse excitation in synthesis of natural-sounding voiced speech. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP ’87, volume 12, pages 2388–2391, 1987. 171 [27] B. E. Caspers and B. S. Atal. Changing pitch and duration in LPC synthesized speech using multipulse excitation. The Journal of the Acoustical Society of America, 73(S1):S5–S5, 1983. [28] J. C. Catford. Fundamental problems in phonetics, pages 1–278. Indiana University Press, Bloomington, USA, 1977. [29] J. C. Catford. A Practical Introduction to Phonetics, chapter four, pages 59–69. Oxford University Press Inc., New York, USA, second edition, 2001. [30] R. W. Chan and I. R. Titze. Dependence of phonation threshold pressure on vocal tract acoustics and vocal fold tissue mechanics. The Journal of the Acoustical Society of America, 119(4):2351–2362, 2006. [31] T. Chen and R. R. Rao. Audio-visual integration in multimodal communication. In Proc. IEEE, pages 837–852, 1998. [32] X. Chi and M. Sonderegger. Subglottal coupling and its influence on vowel formants. The Journal of the Acoustical Society of America, 122(3):1735–1745, 2007. [33] Z. Ciota. Emotion recognition on the basis of human speech. In IEEE ICECom2005, pages 1–4. IEEE, 2005. [34] L. Colantoni. Increasing periodicity to reduce similarity: An acoustic account of deassibilation in rhotics. In M. Diaz-Campos, editor, Selected Proceedings of the 2nd Conference on Laboratory Approaches to Spanish Phonetics and Phonology, pages 22–34. Cascadilla Proceedings Project, Somerville, MA, 2006. 
[35] R. Cowie and R. R. Cornelius. Describing the emotional states that are expressed in speech. Speech Communication, 40(1-2):5–32, Apr. 2003. [36] D. Crystal. Prosodic Systems and Intonation in English, chapter 2, pages 62–79. Cambridge Studies in Linguistics. Cambridge University Press, Cambridge, UK, 1976. [37] L. Deng and D. O’Shaughnessy. Speech Processing: A Dynamic and Optimization-oriented Approach, chapter seven, pages 213–226. Signal Processing and Communications Series. Marcel Dekker Incorporated, New York, USA, first edition, 2003. [38] N. Dhananjaya. Signal processing for excitation-based analysis of acoustic events in speech. PhD thesis, Dept. of Computer Science and Engineering, IIT Madras, Chennai, Oct. 2011. (last viewed Sep. 23, 2013). [39] N. Dhananjaya and B. Yegnanarayana. Voiced/nonvoiced detection based on robustness of voiced epochs. IEEE Signal Processing Letters, 17(3):273–276, Mar. 2010. [40] N. Dhananjaya, B. Yegnanarayana, and P. Bhaskararao. Acoustic analysis of trill sounds. The Journal of the Acoustical Society of America, 131(4):3141–3152, 2012. [41] M. Diaz-Campos. Variable production of the trill in spontaneous speech: Sociolinguistic implications. In L. Colantoni and J. Steele, editors, Selected Proceedings of the 3rd Conference on Laboratory Approaches to Spanish Phonology, pages 115–127. Cascadilla Proceedings Project, Somerville, MA, 2008. [42] W. G. Ewan. Can the Intrinsic F0 Differences between Vowels Be Explained by Source/Tract Coupling? Status Report on Speech Research, Haskins Laboratories, SR-51/52:197–199, 1977. 172 [43] W. G. Ewan and J. J. Ohala. Can intrinsic vowel F0 be explained by source/tract coupling? The Journal of the Acoustical Society of America, 66(2):358–362, 1979. [44] F. Eyben, M. W¨ollmer, A. Graves, B. Schuller, E. Douglas-Cowie, and R. Cowie. On-line Emotion Recognition in a 3-D Activation-Valence-Time Continuum using Acoustic and Linguistic Cues. Journal on Multimodal User Interfaces, Special Issue on Real-Time Affect Analysis and Interpretation: Closing the Affective Loop in Virtual Agents and Robots, 3(1–2):7–12, March 2010. [45] C. G. Fant. Descriptive analysis of the acoustic aspects of speech. LOGOS, 5(1):3–17, 1962. [46] G. Fant. Acoustic Theory of Speech Production, chapter 1.1, pages 15–24. second printing. Mounton Co. N. N. Publishers, The Hague, Netherlands, first edition, 1970. [47] G. Fant. Glottal source and excitation analysis. Speech Transmission Laboratory, KTH, Sweden, Quarterly Progress and Status Report, 20(1):85–107, 1979. [48] G. Fant. SPEECH ACOUSTICS AND PHONETICS Selected Writings, chapter 4.1, pages 143–161. Text, Speech and Language Technology, Volume 24. Kluwer Academic Publishers, Dordrecht, The Netherlands, first edition, 2004. [49] G. Fant and Q. Lin. Glottal source - vocal tract acoustic interaction. Speech Transmission Laboratory, KTH, Sweden, Quarterly Progress and Status Report, 28(1):13–27, 1987. [50] G. Fant, Q. Lin, and C. Gobl. Notes on glottal flow interaction. Speech Transmission Laboratory, KTH, Sweden, Quarterly Progress and Status Report, 26(2-3):21–45, 1985. [51] M. Filippelli, R. Pellegrino, I. Iandelli, G. Misuri, J. R. Rodarte, R. Duranti, V. Brusasco, and G. Scano. Respiratory dynamics during laughter. Journal of Applied Physiology, 90(4):1441–1446, Apr. 2001. [52] J. L. Flanagan. Speech Analysis Synthesis and Perception. Springer-Verlag, 2nd edition, 1972. [53] A. Fourcin and E. Abberton. First application of a new laryngograph. 
Medical and Biological Illustration, 21(3):172–182, Jul. 1971.
[54] M. Fratti, G. A. Mian, and G. Riccardi. An approach to parameter reoptimization in multipulse-based coders. IEEE Transactions on Speech and Audio Processing, 1(4):463–465, Oct. 1993.
[55] O. Fujimura, K. Honda, H. Kawahara, Y. Konparu, M. Morise, and J. C. Williams. Noh voice quality. Logopedics Phoniatrics Vocology, 34(4):157–170, 2009.
[56] D. Gabor. Theory of communication. Part 1: The analysis of information. Journal of the Institution of Electrical Engineers - Part III: Radio and Communication Engineering, 93(26):429–441, 1946.
[57] P. K. Ghosh and S. S. Narayanan. Joint source-filter optimization for robust glottal source estimation in the presence of shimmer and jitter. Speech Communication, 53(1):98–109, 2011.
[58] C. Gobl. Voice source dynamics in connected speech. STL-QPSR, KTH, Sweden, 29(1):123–159, 1988.
[59] C. Gobl. A preliminary study of acoustic voice quality correlates. STL-QPSR, KTH, Sweden, 30(4):9–22, 1989.
[60] M. Gordon and P. Ladefoged. Phonation types: a cross-linguistic overview. Journal of Phonetics, pages 383–406, 2001.
[61] W. Granzow, B. Atal, K. Paliwal, and J. Schroeter. Speech coding at 4 kb/s and lower using single-pulse and stochastic models of LPC excitation. In Proc. International Conference on Acoustics, Speech, and Signal Processing, 1991 (ICASSP-91), volume 1, pages 217–220, 1991.
[62] G. S. Hall and A. Allin. The psychology of tickling, laughing, and the comic. The American Journal of Psychology, 9(1):1–41, 1897.
[63] W. Hamza, R. Bakis, E. M. Eide, M. A. Picheny, and J. F. Pitrelli. The IBM expressive speech synthesis system. In Proc. of the 8th International Conference on Spoken Language Processing, Jeju, Korea, pages 14–16, 2004.
[64] H. Hatzikirou, W. T. Fitch, and H. Herzel. Voice instabilities due to source-tract interactions. Acta Acustica united with Acustica, 92(3):468–475, 2006.
[65] N. Henrich, C. d'Alessandro, B. Doval, and M. Castellengo. Glottal open quotient in singing: Measurements and correlation with laryngeal mechanisms, vocal intensity, and fundamental frequency. The Journal of the Acoustical Society of America, 117(3):1417–1430, 2005.
[66] N. C. Henriksen and E. W. Willis. Acoustic characterization of phonemic trill production in Jerezano Andalusian Spanish. In M. Ortega-Llebaria, editor, Selected Proceedings of the 4th Conference on Laboratory Approaches to Spanish Phonology, pages 115–127. Cascadilla Proceedings Project, Somerville, MA, 2010.
[67] J. Holmes. Formant excitation before and after glottal closure. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP'76, volume 1, pages 39–42, 1976.
[68] P. Hong, Z. Wen, and T. S. Huang. Real-time speech-driven face animation with expressions using neural networks. IEEE Trans. Neural Networks, 13:916–927, 2002.
[69] M. S. Howe and R. S. McGowan. On the role of glottis-interior sources in the production of voiced sound. The Journal of the Acoustical Society of America, 131(2):1391–1400, 2012.
[70] W. Huang, T. K. Chiew, H. Li, T. S. Kok, and J. Biswas. Scream detection for home applications. In The 5th IEEE Conference on Industrial Electronics and Applications (ICIEA), 2010, pages 2115–2120, Jun. 2010.
[71] A. I. Iliev, M. S. Scordilis, J. P. Papa, and A. X. Falco. Spoken emotion recognition through optimum-path forest classification using glottal features. Computer Speech and Language, 24(3):445–460, 2010.
[72] T. Irino and R. D. Patterson.
Segregating information about the size and shape of the vocal tract using a time-domain auditory model: The stabilised wavelet-Mellin transform. Speech Communication, pages 181–203, 2002.
[73] C. T. Ishi, H. Ishiguro, and N. Hagita. Automatic extraction of paralinguistic information using prosodic features related to F0, duration and voice quality. Speech Communication, 50(6):531–543, 2008.
[74] N. S. Jayant. Digital coding of speech waveforms: PCM, DPCM and DM quantization. In Proc. IEEE, volume 62, pages 611–632, May 1974.
[75] M. A. Joseph, S. Guruprasad, and B. Yegnanarayana. Extracting formants from short segments of speech using group delay functions. pages 1009–1012, Pittsburgh, PA, USA, Sep. 2006.
[76] N. Kamaruddin and A. Wahab. Speech emotion verification system (SEVS) based on MFCC for real time applications. In Proc. 4th International Conference on Intelligent Environments (IE 08), pages 1–7. IEEE, 2008.
[77] H. Kawahara, H. Katayose, A. de Cheveigné, and R. D. Patterson. Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of F0 and periodicity. In Proc. Eurospeech'99, volume 6, pages 2781–2784, 1999.
[78] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné. Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27(3-4):187–207, 1999.
[79] H. Kawahara and M. Morise. Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework. Sadhana, 36(5):713–727, 2011.
[80] H. Kawahara, M. Morise, T. Takahashi, R. Nisimura, H. Banno, and T. Irino. A unified approach for F0 extraction and aperiodicity estimation based on a temporally stable power spectral representation. In ISCA ITRW, Speech, Analysis and Processing for Knowledge Discovery, June 4-6 2008.
[81] H. Kawahara, M. Morise, T. Takahashi, R. Nisimura, T. Irino, and H. Banno. TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2008 (ICASSP 2008), pages 3933–3936, April 2008.
[82] M. Kaynak, Q. Zhi, A. Cheok, K. Sengupta, and K. C. Chung. Audio-visual modeling for bimodal speech recognition. In Systems, Man, and Cybernetics, 2001 IEEE International Conference on, volume 1, pages 181–186, 2001.
[83] L. S. Kennedy and D. P. W. Ellis. Laughter detection in meetings. In Proc. NIST ICASSP 2004 Meeting Recognition Workshop, pages 118–121, Montreal, Mar. 2004.
[84] S. Z. K. Khine, T. L. Nwe, and H. Li. Speech/laughter classification in meeting audio. In Proc. 9th Annual Conference of the International Speech Communication Association, 2008 (INTERSPEECH'08), pages 793–796, Sep. 22-26 2008.
[85] S. Kipper and D. Todt. The role of rhythm and pitch in the evaluation of human laughter. Journal of Nonverbal Behavior, 27(4):255–272, 2003.
[86] M. T. Knox and N. Mirghafori. Automatic laughter detection using neural networks. In Proc. 8th Annual Conference of the International Speech Communication Association, 2007 (INTERSPEECH'07), pages 2973–2976, Aug. 27-31 2007.
[87] K. J. Kohler. 'Speech-smile', 'Speech-laugh', 'Laughter' and their sequencing in dialogic interaction. Phonetica, 65:1–18, 2008.
[88] S. G. Koolagudi, A. Barthwal, S. Devliyal, and K. S. Rao.
Real life emotion classification from speech using Gaussian mixture models. In IC3, pages 250–261, 2012.
[89] S. G. Koolagudi, S. Maity, A. K. Vuppala, S. Chakrabarti, and K. S. Rao. IITKGP-SESC: Speech database for emotion analysis. In S. Ranka, S. Aluru, R. Buyya, Y.-C. Chung, S. Dua, A. Grama, S. K. S. Gupta, R. Kumar, and V. V. Phoha, editors, IC3, volume 40 of Communications in Computer and Information Science, pages 485–492. Springer, 2009.
[90] S. G. Koolagudi and K. S. Rao. Emotion recognition from speech using source, system, and prosodic features. Int. J. Speech Technol., 15(2):265–289, June 2012.
[91] A. Kounoudes, P. A. Naylor, and M. Brookes. The DYPSA algorithm for estimation of glottal closure instants in voiced speech. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2002, volume 1, pages 349–352, May 2002.
[92] J. Kreiman and D. V. L. Sidtis. Foundations of Voice Studies. Wiley-Blackwell, Malden, 2011.
[93] P. Kroon and B. Atal. Quantization procedures for the excitation in CELP coders. In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP '87, volume 12, pages 1649–1652, 1987.
[94] P. Kroon, E. Deprettere, and R. Sluyter. Regular-pulse excitation - a novel approach to effective and efficient multipulse coding of speech. IEEE Transactions on Acoustics, Speech and Signal Processing, 34(5):1054–1063, 1986.
[95] H. Kuwabara and Y. Sagisaka. Acoustic characteristics of speaker individuality: Control and conversion. Speech Communication, 16(2):165–173, 1995.
[96] P. Ladefoged. Vowels And Consonants: An Introduction To The Sounds Of Languages, chapter 13, pages 149–150. Blackwell Pub., 2003.
[97] P. Ladefoged. Vowels And Consonants: An Introduction To The Sounds Of Languages, chapter 2, pages 18–24. Blackwell Pub., 2003.
[98] P. Ladefoged, A. Cochran, and S. F. Disner. Laterals and trills. 7:46–54, 1977.
[99] P. Ladefoged and K. Johnson. A Course in Phonetics, chapter one, pages 4–7. Cengage Learning India Private Limited, Delhi, India, sixth edition, 2011.
[100] P. Ladefoged and I. Maddieson. The Sounds of the World's Languages, chapter 7, pages 217–236. Blackwell Publishing, Oxford, UK, 1996.
[101] E. Lasarcyk and J. Trouvain. Imitating conversational laughter with an articulatory speech synthesis. In Proc. of the Interdisciplinary Workshop on The Phonetics of Laughter, pages 43–48, Aug. 4-5 2007.
[102] J. Laver. Principles of Phonetics, chapter five, pages 119–158. Cambridge Textbooks in Linguistics. Cambridge University Press, 1994.
[103] T. Li and M. Ogihara. Content-based music similarity search and emotion detection. In Acoustics, Speech, and Signal Processing, 2004. Proceedings (ICASSP '04), IEEE International Conference on, volume 5, pages 705–708, May 2004.
[104] W.-H. Liao and Y.-K. Lin. Classification of non-speech human sounds: Feature selection and snoring sound analysis. In IEEE International Conference on Systems, Man and Cybernetics, 2009 (SMC 2009), pages 2695–2700, Oct. 2009.
[105] M. Lindau. The story of /r/. In V. Fromkin, editor, Phonetic Linguistics: Essays in Honor of P. Ladefoged, pages 157–167. Academic Press, Orlando, USA, 1985.
[106] J. Lipski. Latin American Spanish, pages 1–440. Longman Linguistics Library, New York, USA, 1994.
[107] A. Lockerd and F. Mueller. LAFCam: Leveraging affective feedback camcorder. In L. G. Terveen and D. R.
Wixon, editors, Extended Abstracts of the 2002 Conference on Human Factors in Computing Systems, CHI 2002, Minneapolis, Minnesota, USA, April 20-25, 2002, pages 574–575, New York, USA, 2002. ACM.
[108] J. C. Lucero, K. G. Lourenço, N. Hermant, A. V. Hirtum, and X. Pelorson. Effect of source-tract acoustical coupling on the oscillation onset of the vocal folds. The Journal of the Acoustical Society of America, 132(1):403–411, 2012.
[109] E. S. Luschei, L. O. Ramig, E. M. Finnegan, K. K. Baker, and M. E. Smith. Patterns of laryngeal electromyography and the activity of the respiratory system during spontaneous laughter. Journal of Neurophysiology, 96(1):442–450, Jul. 2006.
[110] I. Maddieson. Patterns of Sounds, pages 1–422. Cambridge University Press, Cambridge, UK, 1984.
[111] M. M. Makagon, E. S. Funayama, and M. J. Owren. An acoustic analysis of laughter produced by congenitally deaf and normally hearing college students. The Journal of the Acoustical Society of America, 124(1):472–483, 2008.
[112] J. Makhoul. Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4):561–580, Apr. 1975.
[113] J. D. Markel and A. H. Gray. A linear prediction vocoder simulation based upon the autocorrelation method. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-22(2):124–134, April 1974.
[114] J. E. Markel and A. H. Gray. Linear Prediction of Speech. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1982.
[115] H. Masubuchi and H. Kobayashi. An acoustic abnormal detection system. In Proceedings, 2nd IEEE International Workshop on Robot and Human Communication, 1993, pages 237–242, Nov. 1993.
[116] R. S. McGowan. Tongue-tip trills and vocal-tract wall compliance. The Journal of the Acoustical Society of America, 91(5):2903–2910, 1992.
[117] H. McGurk and J. W. MacDonald. Hearing lips and seeing voices. Nature, 264:746–748, 1976.
[118] C. Menezes and Y. Igarashi. The speech laugh spectrum. In Proc. 6th International Seminar on Speech Production, 2006 (ISSP'06), pages 157–524, Dec. 13-15 2006.
[119] A. Metallinou, S. Lee, and S. Narayanan. Audio-visual emotion recognition using Gaussian mixture models for face and voice. In Proceedings of the IEEE International Symposium on Multimedia, pages 250–257, Berkeley, CA, Dec. 2008.
[120] A. Metallinou, M. Wöllmer, A. Katsamanis, F. Eyben, B. Schuller, and S. Narayanan. Context-sensitive learning for enhanced audiovisual emotion classification. IEEE Transactions on Affective Computing, 3(2):184–198, April-June 2012.
[121] D. G. Miller. EGGs for Singers, 2012. (last viewed Apr. 1, 2013).
[122] V. K. Mittal, N. Dhananjaya, and B. Yegnanarayana. Effect of tongue tip trilling on the glottal excitation source. In Proc. INTERSPEECH 2012, 13th Annual Conference of the International Speech Communication Association, Portland, Oregon, USA, Sep. 2012.
[123] V. K. Mittal and B. Yegnanarayana. Effect of glottal dynamics in the production of shouted speech. The Journal of the Acoustical Society of America, 133(5):3050–3061, May 2013.
[124] V. K. Mittal and B. Yegnanarayana. Production features for detection of shouted speech. In Proc. 10th Annual IEEE Consumer Communications and Networking Conference, 2013 (CCNC'13), pages 106–111, Jan. 11-14, 2013.
[125] P. Moore and H. Von Leden. Dynamic variations of the vibratory pattern in the normal larynx. Folia Phoniat (Basel), 10(4):205–238, 1958.
[126] D. Morrison, R. Wang, and L. C. D. Silva. Ensemble methods for spoken emotion recognition in call-centres.
Speech Communication, 49(2):98–112, 2007.
[127] E. Mower, M. J. Mataric, and S. Narayanan. A framework for automatic human emotion classification using emotion profiles. Trans. Audio, Speech and Lang. Proc., 19(5):1057–1070, Jul. 2011.
[128] H. A. Murthy and B. Yegnanarayana. Formant extraction from group delay function. Speech Communication, 10(3):209–221, 1991.
[129] H. A. Murthy and B. Yegnanarayana. Group delay functions and its applications in speech technology. Sadhana, 36(5):745–782, 2011.
[130] K. S. R. Murty and B. Yegnanarayana. Epoch extraction from speech signals. IEEE Transactions on Audio, Speech, and Language Processing, 16(8):1602–1613, 2008.
[131] H. Nanjo, H. Mikami, S. Kunimatsu, H. Kawano, and T. Nishiura. A fundamental study of novel speech interface for computer games. In IEEE 13th International Symposium on Consumer Electronics, 2009 (ISCE '09), pages 558–560, May 2009.
[132] H. Nanjo, T. Nishiura, and H. Kawano. Acoustic-based security system: Towards robust understanding of emergency shout. In Fifth International Conference on Information Assurance and Security, 2009 (IAS '09), volume 1, pages 725–728, Aug. 2009.
[133] P. A. Naylor, A. Kounoudes, J. Gudnason, and M. Brookes. Estimation of glottal closure instants in voiced speech using the DYPSA algorithm. IEEE Transactions on Audio, Speech, and Language Processing, 15(1):34–43, Jan. 2007.
[134] E. E. Nwokah, P. Davies, A. Islam, H. C. Hsu, and A. Fogel. Vocal affect in three-year-olds: a quantitative acoustic analysis of child laughter. 94(6):3076–3090, Dec. 1993.
[135] E. E. Nwokah, H.-C. Hsu, P. Davies, and A. Fogel. The integration of laughter and speech in vocal communication: A dynamic systems perspective. J Speech Lang Hear Res, 42(4):880–894, 1999.
[136] J. J. Ohala and B. W. Eukel. Explaining the intrinsic pitch of vowels. In R. Channon and L. Shockey, editors, In Honor of Ilse Lehiste, pages 207–215, 1987.
[137] A. V. Oppenheim and R. W. Schafer. Digital Signal Processing, chapter 7, pages 337–365. Prentice Hall, Englewood Cliffs, New Jersey, USA, 1975.
[138] A. V. Oppenheim, R. W. Schafer, and J. R. Buck. Discrete-Time Signal Processing (Prentice Hall Signal Processing Series), chapter 2, pages 42–96. Pearson Prentice Hall, New Delhi, India, 2nd edition, Jan. 1999.
[139] M. J. Owren and J.-A. Bachorowski. Reconsidering the evolution of nonlinguistic communication: The case of laughter. Journal of Nonverbal Behavior, 27(3):183–200, 2003.
[140] K. Ozawa and T. Araseki. Low bit rate multi-pulse speech coder with natural speech quality. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP '86, volume 11, pages 457–460, 1986.
[141] A. Panat and V. Ingole. Affective state analysis of speech for speaker verification: Experimental study, design and development. In Proc. International Conference on Computational Intelligence and Multimedia Applications, pages 255–261, Los Alamitos, CA, USA, 2007. IEEE Computer Society.
[142] T.-L. Pao, Y.-T. Chen, and J.-H. Yeh. Emotion recognition from Mandarin speech signals. In Chinese Spoken Language Processing, 2004 International Symposium on, pages 301–304, 2004.
[143] T.-L. Pao, W.-Y. Liao, T.-N. Wu, and C.-Y. Lin. Automatic visual feature extraction for Mandarin audiovisual speech recognition. In SMC, pages 2936–2940. IEEE, 2009.
[144] J. S. Perkell and M. H. Cohen. An indirect test of the quantal nature of speech in the production of the vowels /i/, /a/ and /u/. Journal of Phonetics, 17:123–133, 1989.
[145] A. Perkis, E. B.
Ribbum, and E. T. Ramstad. Improving subjective quality in waveform coders by the use of postfiltering. Department of Elec. Eng. and Comp. Science, pages 60–65, 1985.
[146] J. Pohjalainen, P. Alku, and T. Kinnunen. Shout detection in noise. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pages 4968–4971, May 2011.
[147] J. Pohjalainen, T. Raitio, S. Yrttiaho, and P. Alku. Detection of shouted speech in noise: Human and machine. 133(4):2377–2389, Apr. 2013.
[148] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior. Recent advances in the automatic recognition of audio-visual speech. In Proc. IEEE, pages 1306–1326, 2003.
[149] S. R. M. Prasanna and D. Govind. Analysis of excitation source information in emotional speech. In INTERSPEECH, pages 781–784, 2010.
[150] R. R. Provine. Laughter: A Scientific Investigation. Viking, New York, USA, 2000.
[151] R. R. Provine and K. R. Fischer. Laughing, smiling, and talking: Relation to sleeping and social context in humans. Ethology, 83(4):295–305, 1989.
[152] R. R. Provine and Y. L. Yong. Laughter: A stereotyped human vocalization. Ethology, 89(2):115–124, 1991. (published by Blackwell Publishing Ltd).
[153] L. Rabiner, B. H. Juang, and B. Yegnanarayana. Fundamentals of Speech Recognition, chapter three, pages 88–113. Pearson Education Inc., New Delhi, India, Indian subcontinent adaptation, first edition, 2009.
[154] K. S. Rao and S. G. Koolagudi. Characterization and recognition of emotions from speech using excitation source information. International Journal of Speech Technology, 16:181–201, 2013.
[155] D. Recasens. On the production characteristics of apicoalveolar taps and trills. 19:267–280, 1991.
[156] G. Rigoll, R. Müller, and B. Schuller. Speech emotion recognition exploiting acoustic and linguistic information sources. In G. Kokkinakis, editor, Proceedings 10th International Conference Speech and Computer, SPECOM 2005, volume 1, pages 61–67, Patras, Greece, October 2005.
[157] P. Roach. English Phonetics and Phonology: A Practical Course, chapter 4, pages 26–35. Cambridge University Press, Cambridge, UK, 1998.
[158] M. Rothenberg. Acoustic interaction between the glottal source and the vocal tract. In K. N. Stevens and M. Hirano, editors, Vocal Fold Physiology, pages 305–323. University of Tokyo Press, Tokyo, 1981.
[159] H. Rothganger, G. Hauser, A. C. Cappellini, and A. Guidotti. Analysis of laughter and speech sounds in Italian and German students. Naturwissenschaften, 85(8):394–402, 1998.
[160] J.-L. Rouas, J. Louradour, and S. Ambellouis. Audio events detection in public transport vehicle. In IEEE Intelligent Transportation Systems Conference, 2006 (ITSC '06), pages 733–738, Sep. 2006.
[161] W. Ruch and P. Ekman. The expressive pattern of laughter. In A. W. Kaszniak, editor, Emotion, Qualia, and Consciousness, pages 426–443. World Scientific, Tokyo, 2001.
[162] M. Ruhlen. A Guide to the World's Languages, Vol. 1: Classification, pages 1–492. Stanford University Press, Stanford, USA, 1987.
[163] N. Ruty, X. Pelorson, and A. V. Hirtum. Influence of acoustic waveguides lengths on self-sustained oscillations: Theoretical prediction and experimental validation. The Journal of the Acoustical Society of America, 123(5):3121–3121, 2008.
[164] R. W. Schafer and L. R. Rabiner. System for automatic formant analysis of voiced speech. The Journal of the Acoustical Society of America, 47(2B):634–648, 1970.
[165] M. Schroeder and B. Atal.
Code-excited linear prediction (CELP): High-quality speech at very low bit rates. In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP '85, volume 10, pages 937–940. IEEE, April 1985.
[166] M. R. Schroeder. Recent progress in speech coding at Bell Telephone Laboratories. In Proc. III Int. Congress on Acoustics. Elsevier Publishing Co., Amsterdam.
[167] B. Schuller, A. Batliner, S. Steidl, and D. Seppi. Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge. Speech Communication, Special Issue on Sensing Emotion and Affect - Facing Realism in Speech Processing, 53(9/10):1062–1087, November/December 2011.
[168] B. Schuller, B. Vlasenko, F. Eyben, M. Wöllmer, A. Stuhlsatz, A. Wendemuth, and G. Rigoll. Cross-corpus acoustic emotion recognition: Variances and strategies. IEEE Transactions on Affective Computing, 1(2):119–131, July-December 2010.
[169] B. Schuller, Z. Zhang, F. Weninger, and F. Burkhardt. Synthesized speech for model training in cross-corpus recognition of human emotion. International Journal of Speech Technology, Special Issue on New and Improved Advances in Speaker Recognition Technologies, 15(3):313–323, 2012.
[170] G. Seshadri and B. Yegnanarayana. Perceived loudness of speech based on the characteristics of glottal excitation source. The Journal of the Acoustical Society of America, 126(4):2061–2071, 2009.
[171] C. H. Shadle. Intrinsic fundamental frequency of vowels in sentence context. The Journal of the Acoustical Society of America, 78:1562–1567, 1985.
[172] M. Shami and W. Verhelst. An evaluation of the robustness of existing supervised machine learning approaches to the classification of emotions in speech. Speech Communication, 49(3):201–212, 2007.
[173] S. Singhal. Optimizing pulse amplitudes in multipulse excitation. The Journal of the Acoustical Society of America, 74(S1):S51–S51, 1983.
[174] S. Singhal and B. Atal. Improving performance of multi-pulse LPC coders at low bit rates. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, volume 9, pages 9–12, 1984.
[175] M. J. Solé. Aerodynamic characteristics of trills and phonological patterning. 30:655–688, 2002.
[176] M. A. Sonderegger. Subglottal coupling and vowel space: An investigation in quantal theory. Physics B.S. Thesis, Massachusetts Institute of Technology, Cambridge, MA, 2004.
[177] M. Song, J. Bu, C. Chen, and N. Li. Audio-visual based emotion recognition - a new approach. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 1020–1025, 2004.
[178] S. Spajic, P. Ladefoged, and P. Bhaskararao. The trills of Toda. 26(1):1–21, 1996.
[179] S. Steidl, A. Batliner, D. Seppi, and B. Schuller. On the impact of children's emotional speech on acoustic and language models. EURASIP Journal on Audio, Speech, and Music Processing, Special Issue on Atypical Speech, 2010(Article ID 783954), 2010.
[180] K. N. Stevens. Airflow and turbulence noise for fricative and stop consonants: Static considerations. The Journal of the Acoustical Society of America, 50(4B):1180–1192, 1971.
[181] K. N. Stevens. Physics of laryngeal behavior and larynx modes. Phonetica, 34(4):264–279, 1977.
[182] K. N. Stevens. Acoustic Phonetics, chapter two, pages 55–126. Current Studies in Linguistics 30. The MIT Press, Cambridge, Massachusetts, London, first edition, 1998.
[183] K. N. Stevens. Acoustic Phonetics, chapter three, pages 167–198. Current Studies in Linguistics 30. MIT Press, Cambridge, first edition, 2000.
[184] K. N.
Stevens, D. N. Kalikow, and T. R. Willemain. A miniature accelerometer for detecting glottal waveforms and nasalization. Journal of Speech, Language, and Hearing Research, 18:594–599, 1975.
[185] K. Sudheer Kumar, M. Sri Harish Reddy, K. Sri Ram Murty, and B. Yegnanarayana. Analysis of laugh signals for detecting in continuous speech. In Proc. 10th Annual Conference of the International Speech Communication Association, 2009 (INTERSPEECH'09), pages 1591–1594. ISCA, Sep. 6-10 2009.
[186] S. Sundaram and S. Narayanan. Automatic acoustic synthesis of human-like laughter. The Journal of the Acoustical Society of America, 121(1):527–535, 2007.
[187] H. Tanaka and N. Campbell. Acoustic features of four types of laughter in natural conversational speech. In Proc. 17th International Congress of Phonetic Sciences, 2011 (ICPhS XVII), pages 1958–1961, Aug. 17-21 2011.
[188] T. Tanaka, T. Kobayashi, D. Arifianto, and T. Masuko. Fundamental frequency estimation based on instantaneous frequency amplitude spectrum. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002, volume 1, pages I-329–I-332, 2002.
[189] I. Titze, T. Riede, and P. Popolo. Nonlinear source-filter coupling in phonation: Vocal exercises. The Journal of the Acoustical Society of America, 123(4):1902–1915, Apr. 2008.
[190] I. R. Titze. The physics of small-amplitude oscillation of the vocal folds. Journal of the Acoustical Society of America, 83(4):1536–1552, 1988.
[191] I. R. Titze. Theory of glottal airflow and source-filter interaction in speaking and singing. Acta Acustica united with Acustica, 90(4):641–648, 2004.
[192] I. R. Titze. Nonlinear source-filter coupling in phonation: Theory. The Journal of the Acoustical Society of America, 123(5):2733–2749, 2008.
[193] I. R. Titze and B. H. Story. Acoustic interactions of the voice source with the lower vocal tract. The Journal of the Acoustical Society of America, 101(4):2234–2243, 1997.
[194] K. P. Truong and D. A. V. Leeuwen. Automatic detection of laughter. In Proc. of 9th European Conference on Speech Communication and Technology, 2005 (INTERSPEECH'05), pages 485–488, Sep. 4-8 2005.
[195] K. P. Truong and D. A. V. Leeuwen. Automatic discrimination between laughter and speech. Speech Communication, 49(2):144–158, Feb. 2007.
[196] K. P. Truong and S. Raaijmakers. Automatic recognition of spontaneous emotions in speech using acoustic and lexical features. In A. Popescu-Belis and R. Stiefelhagen, editors, MLMI, volume 5237 of Lecture Notes in Computer Science, pages 161–172. Springer, 2008.
[197] C. K. Un and D. T. Magill. The residual-excited linear prediction vocoder with transmission rate below 9.6 kbits/s. IEEE Transactions on Communications, 23(12):1466–1474, 1975.
[198] J. Urbain, R. Niewiadomski, E. Bevacqua, T. Dutoit, A. Moinet, C. Pelachaud, B. Picart, J. Tilmanne, and J. Wagner. AVLaughterCycle. Journal on Multimodal User Interfaces, 4(1):47–58, 2010.
[199] G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti. Scream and gunshot detection and localization for audio-surveillance systems. In IEEE Conference on Advanced Video and Signal Based Surveillance, 2007 (AVSS 2007), pages 21–26, Sep. 2007.
[200] J. Van Den Berg. Myoelastic-aerodynamic theory of voice production. Journal of Speech and Hearing Research, 1(3):227–244, 1958.
[201] B. Van Der Pol. The fundamental principles of frequency modulation.
Journal of the Institution of Electrical Engineers - Part III: Radio and Communication Engineering, 93(23):153–158, 1946.
[202] P. W. J. Van Hengel and T. C. Andringa. Verbal aggression detection in complex social environments. In IEEE Conference on Advanced Video and Signal Based Surveillance, 2007 (AVSS 2007), pages 15–20, Sep. 2007.
[203] D. Ververidis and C. Kotropoulos. Emotional speech recognition: Resources, features, and methods. Speech Communication, 48(9):1162–1181, 2006.
[204] J. Ville. Theory and Applications of the Notion of Complex Signal, volume 2A. RAND Corporation, Santa Monica, CA.
[205] T. Vogt, E. André, and J. Wagner. Automatic recognition of emotions from speech: a review of the literature and recommendations for practical realisation. In LNCS 4868, pages 75–91, 2008.
[206] J. Wagner, J. Kim, and E. André. From physiological signals to emotions: Implementing and comparing selected methods for feature extraction and classification. In ICME, pages 940–943. IEEE, 2005.
[207] M. Wöllmer, M. Kaiser, F. Eyben, B. Schuller, and G. Rigoll. LSTM-modeling of continuous emotions in an audiovisual affect recognition framework. Image and Vision Computing, Special Issue on Affect Analysis in Continuous Input, 31(2):153–163, February 2013.
[208] M. Wöllmer, M. Kaiser, F. Eyben, F. Weninger, B. Schuller, and G. Rigoll. Fully automatic audiovisual emotion recognition - voice, words, and the face. In T. Fingscheidt and W. Kellermann, editors, Proceedings of Speech Communication; 10. ITG Symposium, pages 1–4, Braunschweig, Germany, September 2012. ITG, IEEE.
[209] J. Yamagishi, K. Onishi, T. Masuko, and T. Kobayashi. Acoustic modeling of speaking styles and emotional expressions in HMM-based speech synthesis. IEICE - Trans. Inf. Syst., E88-D(3):502–509, March 2005.
[210] B. Yang and M. Lugger. Emotion recognition from speech signals using new harmony features. Signal Processing, 90(5):1415–1423, 2010.
[211] N. Yang, R. Muraleedharan, J. Kohl, I. Demirkol, W. Heinzelman, and M. Sturge-Apple. Speech-based emotion classification using multiclass SVM with hybrid kernel and thresholding fusion. In SLT, pages 455–460. IEEE, 2012.
[212] B. Yegnanarayana. Formant extraction from linear prediction phase spectra. 63(5):1638–1640, May 1978.
[213] B. Yegnanarayana and N. G. Dhananjaya. Spectro-temporal analysis of speech signals using zero-time windowing and group delay function. Speech Communication, 55(6):782–795, 2013.
[214] B. Yegnanarayana, M. A. Joseph, V. G. Suryakanth, and N. Dhananjaya. Decomposition of speech signals for analysis of aperiodic components of excitation. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2011), pages 5396–5399, May 2011.
[215] B. Yegnanarayana and H. A. Murthy. Significance of group delay functions in spectrum estimation. 40(9):2281–2289, September 1992.
[216] B. Yegnanarayana and K. S. R. Murty. Event-based instantaneous fundamental frequency estimation from speech signals. IEEE Transactions on Audio, Speech, and Language Processing, 17(4):614–624, 2009.
[217] P. Zelinka, M. Sigmund, and J. Schimmel. Impact of vocal effort variability on automatic speech recognition. Speech Communication, 54(6):732–742, 2012.
[218] Z. Zeng, J. Tu, B. Pianfetti, and T. S. Huang. Audio-visual affective expression recognition through multistream fused HMM. IEEE Transactions on Multimedia, 10(4):570–577, 2008.
[219] C. Zhang and J. H. L. Hansen. Analysis and classification of speech mode: whispered through shouted.
pages 2289–2292, Antwerp, Belgium, 2007. ISCA.
[220] S. Zhang, Y. Xu, J. Jia, and L. Cai. Analysis and modelling of affective audio visual speech based on PAD emotion space. 2008.
[221] Z. Zhang, J. Neubauer, and D. A. Berry. The influence of subglottal acoustics on laboratory models of phonation. The Journal of the Acoustical Society of America, 120(3):1558–1569, 2006.

List of Publications

Papers in refereed Journals

1. V. K. Mittal and B. Yegnanarayana, "Effect of glottal dynamics in the production of shouted speech", The Journal of the Acoustical Society of America, vol. 133, no. 5, pp. 3050-3061, May 2013.
2. V. K. Mittal, B. Yegnanarayana and P. Bhaskararao, "Study of the effects of vocal tract constriction on glottal vibration", The Journal of the Acoustical Society of America, vol. 136, no. 4, pp. 1932-1941, Oct. 2014.
3. Vinay Kumar Mittal and Bayya Yegnanarayana, "Analysis of production characteristics of laughter", Computer Speech and Language, published by Elsevier, http://dx.doi.org/10.1016/j.csl.2014.08.004, Sep. 2014.
4. Vinay Kumar Mittal and Bayya Yegnanarayana, "Study of characteristics of aperiodicity in expressive voices", submitted to The Journal of the Acoustical Society of America (under review since 28 July 2014).

Papers in Conferences

1. V. K. Mittal, N. Dhananjaya and B. Yegnanarayana, "Effect of Tongue Tip Trilling on the Glottal Excitation Source", in Proc. INTERSPEECH 2012, 13th Annual Conference of the International Speech Communication Association, Sep. 9-13, 2012, Portland, Oregon, USA.
2. V. K. Mittal and B. Yegnanarayana, "Production Features for Detection of Shouted Speech", in Proc. 10th Annual IEEE Consumer Communications and Networking Conference, 2013 (CCNC'13), pp. 106-111, Jan. 11-14, 2013, USA.
3. Vinay Kumar Mittal and B. Yegnanarayana, "Study of Changes in Glottal Vibration Characteristics During Laughter", in Proc. INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, pp. 1777-1781, Sep. 14-18, 2014, Singapore.
4. Vinay Kumar Mittal and B. Yegnanarayana, "Significance of Aperiodicity in the Pitch Perception of Expressive Voices", in Proc. INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, pp. 504-508, Sep. 14-18, 2014, Singapore.
5. V. K. Mittal and B. Yegnanarayana, "An Automatic Shout Detection System Using Speech Production Features", in Proc. Workshop on Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction, INTERSPEECH 2014 (15th Annual Conference of ISCA), to appear in LNAI by Springer in Dec. 2014, Sep. 14, 2014, Singapore.

Other related Papers

1. P. Gangamohan, V. K. Mittal and B. Yegnanarayana, "Relative Importance of Different Components of Speech Contributing to Perception of Emotion", in Proc. 6th International Conference on Speech Prosody (ISCA), pp. 657-660, May 22-25, 2012, Shanghai, China.
2. P. Gangamohan, V. K. Mittal and B. Yegnanarayana, "A Flexible Analysis Synthesis Tool (FAST) for studying the characteristic features of emotion in speech", in Proc. 9th Annual IEEE Consumer Communications and Networking Conference, 2012 (CCNC'12), pp. 250-254, Jan. 14-17, 2012, USA.