Monaural Speaker Segregation Using Group Delay Spectral Matrix Factorization

Karan Nathwani*, Anurag Kumar† and Rajesh M Hegde*
*Indian Institute of Technology Kanpur, India; †Carnegie Mellon University, USA
*Email: {nathwani, rhegde}@iitk.ac.in, †[email protected]

Abstract—Non-negative matrix factorization (NMF) methods have been widely used in single-channel speaker separation. Conventional NMF methods use the magnitude of the Fourier transform to train the basis vectors. In this paper, a method that factorizes the spectral magnitude matrix obtained from the group delay function (GDF) is described. During training, pre-learning is applied to a training set of original sources, and the bases are trained iteratively to minimize the approximation error. Separation of the mixed speech signal involves factorizing its non-negative group delay spectral matrix against the fixed stacked bases computed during training: the matrix is decomposed into a linear combination of the trained bases of each individual speaker contained in the mixed speech signal. The estimated spectral magnitude for each speaker is combined with the phase of the mixed signal to reconstruct that speaker's signal, and the separated signals are further refined using a min-max masking method. Subjective evaluation, objective evaluation, and multi-speaker speech recognition experiments on the TIMIT and GRID corpora indicate reasonable improvements over other conventional methods.

I. INTRODUCTION

Monaural speaker separation, where the speech signal is acquired by a single microphone, is challenging in the presence of a competing speaker, since the intelligibility of the target speaker degrades in the presence of the interferer. Non-negative matrix factorization (NMF) [1], [2], [3] has been used in this context to separate a mixture of speech signals effectively by decomposing the spectral matrix obtained from the magnitude of the Fourier transform [4], [5].
Sparsity-promoting methods [6] have also been used in the NMF context for training the bases of individual speakers. Separation based on latent variable and latent Dirichlet decompositions is described in [7] and [8], and an instantaneous frequency based method for audio source separation is described in [9].

In this paper, the non-negative spectral magnitude matrix is computed from the group delay function (GDF) [10] and used in the NMF decomposition in place of the FFT-based spectral magnitude. The magnitude is obtained from the group delay spectrum via the weighted cepstrum, exploiting the relation between the spectral magnitude and the group delay function through the cepstral coefficients. During the training phase, the non-negative group delay spectral matrix of each individual speaker is decomposed into non-negative basis vectors and their corresponding non-negative weights. During testing, the GDF-based spectral magnitude of the mixed signal is decomposed into a linear combination of non-negative weights for each speaker, using the fixed stacked bases computed during training. The estimated spectral magnitude for each speaker is combined with the phase of the mixed signal to reconstruct that speaker's signal, and the separated signals are further refined using a min-max masking method. Subjective, objective, and speech recognition experiments on the TIMIT and GRID corpora indicate reasonable improvements over other conventional methods, and demonstrate the relevance of the proposed method to automatic speech recognition.

978-1-4799-2361-8/14/$31.00 © 2014 IEEE

The remainder of this paper is organized as follows. Section II formulates the speaker separation problem. Section III explains the computation of the spectral magnitude from the group delay function.
The algorithm for speaker separation using group delay matrix factorization is presented in Section IV. The diversity analysis of the GDF-based spectral magnitude is presented in Section V, the performance evaluation of the proposed method is discussed in Section VI, and Section VII presents a brief conclusion.

II. PROBLEM FORMULATION

Consider a mixed speech signal s(n) consisting of two speakers, m(n) and f(n), where n indexes the time samples. The objective of speaker separation is to obtain estimates of m(n) and f(n). Since the speech signal is generally time varying, its characteristics are difficult to capture directly; the short-time Fourier transform (STFT) is therefore used to compute the discrete Fourier transform of each speaker signal. Let S_k(ω), M_k(ω) and F_k(ω) represent the STFTs of s(n), m(n) and f(n) respectively, where k indexes the short-time frame over which the Fourier transform is evaluated and ω is the frequency index. Since the STFT is linear,

S_k(ω) = M_k(ω) + F_k(ω)    (1)

S e^{jφ_{S_k}(ω)} = M e^{jφ_{M_k}(ω)} + F e^{jφ_{F_k}(ω)}    (2)

With the assumption of magnitude-only reconstruction [10], the magnitude of the mixed speech signal S can be written as the sum of the magnitudes of the first and second speakers,

S = M + F    (3)

where M and F are the unknown magnitude spectra, to be estimated from the mixed speech data in an iterative manner. M, F and S are functions of the time and frequency indices. The magnitude spectrum of the mixed signal is estimated from the group delay spectrum of the mixed signal; the computation of this spectral magnitude, and the advantage of the GDF over the conventional FFT-based magnitude, are described in the next section.

III. COMPUTING THE SPECTRAL MAGNITUDE FROM THE GROUP DELAY FUNCTION

In this work, the spectral magnitude is computed from the group delay function [10]. The group delay function τ_x(ω) [11], [10], [12], [13] of a sequence x(n) is defined as the negative derivative of the unwrapped short-time phase spectrum [14].
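The magnitude-only model of Equation 3 is an approximation to the exact STFT additivity of Equation 1: it is exact only in bins where the two sources are in phase. A small numpy check (the two sinusoidal "speakers" and the frame length are illustrative assumptions, not the paper's data) verifies the bin-wise bound and measures the modeling error:

```python
import numpy as np

n = np.arange(512)
m = np.sin(2 * np.pi * 0.03 * n)          # target frame (illustrative)
f = 0.5 * np.sin(2 * np.pi * 0.11 * n)    # interferer frame (illustrative)
s = m + f                                 # Eq. (1) holds exactly per frame

S, M, F = (np.abs(np.fft.rfft(x)) for x in (s, m, f))

# |S| <= |M| + |F| in every bin (triangle inequality); equality only
# where the two sources are in phase, so Eq. (3) is an approximation.
assert np.all(S <= M + F + 1e-9)
err = np.linalg.norm(M + F - S) / np.linalg.norm(S)
print(f"relative error of the magnitude-additivity model: {err:.3f}")
```

The error stays small here because the two sources occupy well-separated frequency bands, which is the regime in which magnitude-domain NMF separation is most effective.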
The group delay function measures the degree of nonlinearity of the phase: its values move away from a constant as the phase departs from linearity. It can be computed as

τ_x(ω) = [X_R(ω)Y_R(ω) + X_I(ω)Y_I(ω)] / |X(ω)|²    (4)

where the subscripts R and I denote the real and imaginary parts of the Fourier transform, and X(ω) and Y(ω) are the Fourier transforms of x(n) and nx(n), respectively. The weighted cepstrum ŵ(n) is estimated from τ_x(ω) as

ŵ(n) = IDFT(τ_x(ω)),  n = 0, 1, ..., N − 1    (5)

The sequence w1(n) is generated from ŵ(n) as

w1(0) = 0,  w1(n) = ŵ(n)/n,  w1(N − n + 1) = w1(n)    (6)

where n goes from 1 to N/2. The N-point DFT of w1(n), denoted Y1(ω), is then computed, and ln|Ys(ω)| = Real[Y1(ω)]. The estimated smooth spectrum obtained from the group delay function is given by 2 ln|Ys(ω)|.

The GDF has a high resolution property [11], [15], [12] which makes it more robust for spectrum estimation than the conventional FFT-based method of calculating the magnitude spectrum. The GDF is used to compute the non-negative magnitude spectrum matrix of each speaker m(n) and f(n) in the training phase, and of the mixed signal s(n) in the testing phase.

IV. SPEAKER SEGREGATION USING GROUP DELAY MATRIX FACTORIZATION

In this section, speaker separation using group delay matrix factorization (GDMF) is discussed. GDMF is defined as the factorization, in the NMF framework, of the magnitude spectrum matrix estimated from the group delay function. Once the basis vectors of the individual sources are learned separately, they can be used to estimate each source from a monaural mixture: the testing phase combines the GDF-based spectral magnitude of the mixed signal with the per-speaker bases obtained during training to produce excitation weights for both speakers. Hence, supervised separation of two sources from a monaural mixture is performed.

A. Training the Bases

In the training phase, pre-learning is applied to a training set of original sources. The magnitude spectrum M calculated from the group delay function for the target speaker is decomposed into a non-negative basis matrix H_m and its corresponding non-negative weight matrix W_m:

M ≈ H_m W_m    (7)

The weighted linear combination of the basis vectors in the columns of H_m is approximately equal to every column vector in the matrix M; the corresponding column of W_m holds the weights for the basis vectors. To allow the data in M to be approximated as a non-negative linear combination of its constituent vectors, the non-negative basis matrix H_m must be optimized. The objective of GDMF is to minimize the divergence between M and the product H_m W_m, as defined in [4], [6], by updating H_m and W_m with the multiplicative rules

H_m = H_m ⊗ [(M / (H_m W_m)) W_m^T] ⊘ [1 W_m^T]    (8)

W_m = W_m ⊗ [H_m^T (M / (H_m W_m))] ⊘ [H_m^T 1]    (9)

where ⊗ and ⊘ denote element-wise multiplication and division, and 1 is a matrix of ones of the same size as M. H_m and W_m are randomly initialized, and Equations 8 and 9 are iterated to solve Equation 7. M has normalized columns, and the columns of H_m are also normalized at each iteration.

Similarly, the divergence between the magnitude spectrum F (computed from the GDF) and the product of a non-negative basis matrix H_f and non-negative weight matrix W_f is minimized:

F ≈ H_f W_f    (10)

The updates for H_f and W_f follow the same form as those for H_m and W_m, and the columns of H_f are likewise normalized at each iteration. A larger number of basis vectors yields a lower approximation error but a longer computation time; the appropriate number of bases depends on the dimension and the signal type.

B. Decomposition of the Mixed Signal

In the testing phase, the magnitude spectrum of the mixed speech signal S_n(ω) is obtained from the GDF [10]. This magnitude spectrum S is decomposed using the conventional NMF [16] technique.
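Sections III and IV can be condensed into a short end-to-end sketch: GDF-based magnitudes, KL-divergence NMF training (Equations 7–9), decomposition of the mixture against stacked fixed bases, and the hard mask. This is an illustrative numpy sketch, not the paper's implementation: the frame length, rank, iteration counts, numerical floors and clipping, and the synthetic two-sinusoid "speakers" are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
EPS = 1e-9

def gdf_magnitude(x):
    """Non-negative spectral magnitude via the group delay route
    (Eqs. 4-6); 0-based indexing and clipping are assumptions."""
    N = len(x)
    X = np.fft.fft(x)
    Y = np.fft.fft(np.arange(N) * x)                  # FT of n*x(n)
    denom = np.maximum(np.abs(X) ** 2, 1e-8)          # floor for Eq. (4)
    tau = (X.real * Y.real + X.imag * Y.imag) / denom
    w_hat = np.real(np.fft.ifft(tau))                 # Eq. (5)
    w1 = np.zeros(N)
    for k in range(1, N // 2):                        # Eq. (6)
        w1[k] = w_hat[k] / k
        w1[N - k] = w1[k]
    log_mag = np.clip(np.real(np.fft.fft(w1)), -30, 30)  # ln|Ys| = Re[Y1]
    return np.exp(log_mag)

def gdf_spectrogram(sig, frame=256):
    frames = sig[: len(sig) // frame * frame].reshape(-1, frame)
    return np.array([gdf_magnitude(f) for f in frames]).T  # freq x time

def kl_nmf(V, rank, iters=100, H=None):
    """Multiplicative KL-NMF updates (Eqs. 8-9); H fixed if given."""
    fixed = H is not None
    H = (rng.random((V.shape[0], rank)) + EPS) if H is None else H
    W = rng.random((rank, V.shape[1])) + EPS
    ones = np.ones_like(V)
    for _ in range(iters):
        R = V / (H @ W + EPS)
        W *= (H.T @ R) / (H.T @ ones + EPS)           # Eq. (9)
        if not fixed:
            R = V / (H @ W + EPS)
            H *= (R @ W.T) / (ones @ W.T + EPS)       # Eq. (8)
            scale = H.sum(axis=0, keepdims=True) + EPS
            H /= scale                                # normalize columns
            W *= scale.T                              # keep H @ W unchanged
    return H, W

n = np.arange(4096)
spk_m = np.sin(2 * np.pi * 0.01 * n) + 0.5 * np.sin(2 * np.pi * 0.05 * n)
spk_f = np.sin(2 * np.pi * 0.09 * n) + 0.5 * np.sin(2 * np.pi * 0.15 * n)

# Training (Eqs. 7 and 10): per-speaker bases from GDF magnitudes.
rank = 8
Hm, _ = kl_nmf(gdf_spectrogram(spk_m), rank)
Hf, _ = kl_nmf(gdf_spectrogram(spk_f), rank)

# Testing (Eq. 11): factor the mixture against fixed stacked bases.
S = gdf_spectrogram(spk_m + spk_f)
_, W = kl_nmf(S, 2 * rank, H=np.hstack([Hm, Hf]))
M_hat, F_hat = Hm @ W[:rank], Hf @ W[rank:]           # Eqs. (12)-(13)

# Eq. (14): hard mask from the two estimates.
mask = (M_hat > F_hat).astype(float)
M_refined = mask * M_hat
assert M_hat.shape == S.shape and np.all(M_refined >= 0)
```

In an actual system the refined magnitudes would then be combined with the phase of the mixed signal and inverted frame by frame, as described in the text.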
The bases obtained during the training stage for each speaker are stacked together and used in the decomposition of S:

S ≈ [H_m H_f] W    (11)

Here, H_m and H_f are obtained by solving Equations 7 and 10. During testing, the weight matrix W is randomly initialized, and the weight update of Equation 9 is used to solve Equation 11 with the bases matrix held fixed. The estimated magnitude spectrum of each speaker is obtained by multiplying that speaker's bases matrix with its corresponding weights in the matrix W obtained from Equation 11:

M̂ = H_m W_m    (12)

F̂ = H_f W_f    (13)

where W_m and W_f are the submatrices of the weight matrix W that belong to the two speaker components in Equation 11.

Fig. 1. The spectrograms of the reference target signal and the mixed signal (above), and the reconstructed signals without and with mask (below), when the target-to-interference ratio (TIR) is 0 dB.

C. Masking

λ = 1 if M̂ > F̂,  0 if M̂ < F̂    (14)

In Equation 14, λ is a spectrographic mask matrix computed using M̂ and F̂, which are functions of time and frequency; M̂ and F̂ are the estimated magnitude spectra of the target and interfering speakers respectively. The hard mask is applied to the estimated spectrogram M̂ and multiplied by the phase of the mixed speech signal to obtain an improved version of the original signal M_n(ω); the same process yields an improved version of the original signal F_n(ω). It is clear from Figure 1 that the GDMF algorithm segregates the target speaker effectively after the application of the mask, illustrating the improved separation achieved by the proposed method.

V. DIVERSITY ANALYSIS OF THE SPECTRAL MAGNITUDE COMPUTED FROM THE GDF

In this section, the diversity of the spectral magnitude computed from the GDF is analyzed through spectral smoothness and robustness measures, comparing the magnitudes computed from the GDF and the FFT.
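The robustness comparison in this section rests on overlaying many noisy realizations of an estimated magnitude spectrum. A minimal numpy sketch, using a four-sinusoid stand-in for the synthetic formant signal described below and showing only the FFT branch, illustrates how averaging suppresses the noise-driven fluctuations (the signal model and realization count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n = np.arange(512)
# Rough stand-in for the synthetic formant signal: four sinusoids at
# the stated formant frequencies, 10 kHz sampling.
clean = sum(np.sin(2 * np.pi * f * n / 10000.0)
            for f in (1020, 1400, 1780, 2300))

# 300 noisy realizations at roughly 20 dB SNR.
noise_std = np.sqrt(np.mean(clean ** 2) / 10 ** (20 / 10))
mags = np.array([np.abs(np.fft.rfft(clean + noise_std * rng.standard_normal(512)))
                 for _ in range(300)])

# Two single realizations fluctuate against each other far more than
# two independent 150-realization averages do.
d_single = np.linalg.norm(mags[0] - mags[1])
d_avg = np.linalg.norm(mags[:150].mean(axis=0) - mags[150:].mean(axis=0))
assert d_avg < d_single
```

The text's stronger claim, that the GDF-based spectrum achieves this stability without the bias that FFT averaging introduces, requires the full group delay pipeline and is what Figure 3 illustrates.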
1) Diversity Analysis via Spectral Smoothness: Consider the system shown in Figure 2(a), where signals are generated synthetically with formants at 1020 Hz, 1400 Hz, 1780 Hz and 2300 Hz, sampled at 10 kHz with a pitch period of 10 ms, in the presence of additive white Gaussian noise at a signal-to-noise ratio of 20 dB. In Figure 2(b), the group delay is calculated for this system; the formants are clearly distinguished owing to the high resolution property of the GDF, in contrast to the FFT spectrum [15].

Fig. 2. Comparison of the magnitude spectrum computed from the FFT (c) and from the GDF (d) for the system shown in (a).

The magnitude spectrum computed from the conventional FFT in Figure 2(c) has large variance due to large fluctuations. These fluctuations are significantly reduced in the magnitude spectrum estimated from the GDF in Figure 2(d), which is visibly smoother than the FFT spectrum of Figure 2(c). This aids the decomposition of the GDF-based magnitude spectrum into its corresponding bases and weight matrices: in general, a smoother magnitude spectrum yields a more effective decomposition of the spectral magnitude.

Fig. 3. Comparison of the average magnitude estimated from the FFT and the GDF.

2) Diversity Analysis via Spectral Robustness: The average magnitudes for the FFT and the GDF are obtained by averaging 300 overlaid realizations of the estimated magnitude spectrum in the presence of noise, as shown in Figure 3. The magnitude spectrum is obtained for the synthetic signal generated with formants at 1020 Hz, 1400 Hz, 1780 Hz and 2300 Hz, sampled at 10 kHz with a pitch period of 10 ms. Figure 3 shows that averaging the FFT-based spectral magnitude reduces the fluctuations, but at the expense of a large bias, whereas the fluctuations are significantly reduced in the averaged GDF-based spectrum.
This is due to the high resolution and robustness properties of the GDF [15], [10], which yield a robust magnitude spectrum. Because of this robustness, the spectrum estimated from the GDF exhibits less variance and more smoothness than the magnitude spectrum estimated by the conventional FFT method, and is less dependent on the analysis window. As Figure 3 shows, for every realization of the magnitude estimate in the presence of AWGN at an SNR of 20 dB, the GDF-based spectra are least affected by the additive noise: the spectra estimated from the GDF across realizations overlap almost perfectly, unlike those estimated from the FFT. For these reasons, the spectral magnitude computed from the GDF is used in the NMF framework in place of the FFT-based spectral magnitude.

Fig. 4. The subjective scores based on GQ, TP, OSS and ANS for the reconstructed target signal.

VI. PERFORMANCE EVALUATION

In this section, the performance of the proposed GDMF separation method is evaluated in terms of speech enhancement and multi-speaker speech recognition. The experiments are performed on a large number of sentences from the GRID corpus [17] and the TIMIT database [18].

A. Subjective Evaluation

The speech separation task is best judged by the most efficient filter available, the human ear; subjective evaluation incorporates the judgment of the listener and is perhaps the most accurate way of assessing the quality of any separation system. In the subjective test, 25 listeners rate the separated signals on four parameters: global quality (GQ), target preservation (TP), other signal separation (OSS) and artificial noise separation (ANS). The instructions followed by the listeners during judgment are discussed in detail in [19]. The subjective listening scores for the reconstructed target (m̂(t)) and interfering (f̂(t)) signals are shown in Figures 4 and 5 respectively for the GRID database. The results indicate that the GDMF algorithm performs reasonably better than the standard FFT, instantaneous frequency (IF) [9], latent variable decomposition (LVD) [7], latent Dirichlet decomposition (LDD) [8], and NMF methods.

Fig. 5. The subjective scores based on GQ, TP, OSS and ANS for the reconstructed interfering signal.

TABLE I. MEAN PSM AND PSMt SCORES FOR THE PROPOSED METHOD (GDMF) AND OTHER CONVENTIONAL METHODS AT VARIOUS TIR VALUES.

                            GRID                                    TIMIT
          TIR=-6dB     TIR=-3dB     TIR=0dB      TIR=-6dB     TIR=-3dB     TIR=0dB
Method    PSM   PSMt   PSM   PSMt   PSM   PSMt   PSM   PSMt   PSM   PSMt   PSM   PSMt
GDMF      0.69  -0.20  0.75  -0.13  0.77  -0.03  0.61  -0.21  0.63  -0.17  0.68  -0.11
FFT       0.55  -0.26  0.59  -0.20  0.65  -0.12  0.40  -0.32  0.45  -0.26  0.51  -0.22
IF        0.57  -0.25  0.61  -0.18  0.68  -0.09  0.41  -0.31  0.45  -0.24  0.52  -0.22
LVD       0.68  -0.23  0.73  -0.15  0.74  -0.05  0.56  -0.25  0.59  -0.21  0.63  -0.17
LDD       0.68  -0.21  0.74  -0.14  0.75  -0.04  0.57  -0.24  0.61  -0.19  0.68  -0.11
NMF       0.67  -0.22  0.73  -0.16  0.75  -0.04  0.56  -0.24  0.60  -0.20  0.69  -0.11

B. Objective Evaluation

The PEMO-Q technique [20], based on the work of Huber and Kollmeier [21], is used to predict perceived quality differences between audio signals. PEMO-Q computes internal representations of signal pairs, which are compared quantitatively by calculating the linear cross-correlation coefficient. The mean perceptual similarity measure (PSM) and the instantaneous audio quality score (PSMt) of the proposed GDMF method are better than those of the FFT, IF, LVD, LDD and NMF methods; the results at various TIR values are listed in Table I for the GRID and TIMIT databases. As expected, the perceptual similarity measure drops at low TIR levels. When the TIR is close to or below zero, the PSMt scores (the averaged value of the instantaneous PSM vector) are negative, which indicates poor separation. These PSM and PSMt scores corroborate the results obtained from the subjective evaluation of the sound files. The proposed method has better subjective and objective performance than the others because of the robust spectral magnitude computed from the GDF, which results in an effective decomposition in the NMF framework.

C. Experiments on Multi-Speaker Speech Recognition

The GRID corpus and the TIMIT database are used for the experiments on multi-speaker speech recognition. Models for the speech recognition system are built from both databases using the complete set of two-talker sentence-pair files, with different sentences for each speaker in a mixed clip. A 15-state, 3-mixture triphone HMM with 39-dimensional MFCC features (including delta and acceleration coefficients) is used in the recognition experiments for each database separately. The baseline triphone models of the recognition system are trained on clean speech from the two databases. Test sets are generated at different TIRs (0 dB, ±3 dB, ±6 dB) from the two databases and processed by the proposed GDMF algorithm and the other methods used for comparison; the reconstructed files generated by these methods are used to test the recognition system. The recognition results are reported as word error rates (WER) after recognizing the complete sentence, where

WER = (S + D + I) / N    (15)

Here, S is the number of substitutions, D the number of deletions, I the number of insertions, and N the number of words in the reference. Human WER was obtained by running a series of listening and identification experiments and computing the WER for the task at the various TIR values.
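Equation 15 can be computed with a standard Levenshtein alignment over words. A minimal sketch (the function name and example sentences are ours, not the paper's):

```python
def word_error_rate(ref, hyp):
    """WER via Levenshtein alignment (Eq. 15): (S + D + I) / N."""
    ref, hyp = ref.split(), hyp.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

# One substitution plus one deletion against a 4-word reference:
assert word_error_rate("a b c d", "a x c") == 0.5
```

Note that WER can exceed 1 when the hypothesis contains many insertions, which is why it is reported as a percentage rather than clamped to [0, 1].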
TABLE II. COMPARISON OF THE WORD ERROR RATE (%) AT SEVERAL TIR VALUES FOR THE GRID AND TIMIT DATABASES.

                    GRID, TIR (dB)
Method      6       3       0      -3      -6
GDMF      38.67   40.33   41.67   43.00   45.67
HP         8.00   17.00   25.00   28.00   29.00
FFT       48.33   50.33   51.67   52.00   54.67
IF        47.67   48.00   48.67   51.67   53.33
LVD       46.00   47.33   48.00   49.67   50.33
LDD       46.67   47.00   48.00   49.33   50.00
NMF       45.00   47.00   47.67   49.33   50.33

                    TIMIT, TIR (dB)
Method      6       3       0      -3      -6
GDMF      40.33   42.33   43.67   45.33   46.67
HP        12.00   20.00   29.00   31.00   33.00
FFT       68.67   71.00   71.67   75.67   77.00
IF        65.33   69.67   70.33   74.00   76.67
LVD       46.67   48.00   49.67   52.67   55.33
LDD       46.67   48.33   49.33   51.67   55.00
NMF       46.00   48.33   50.00   52.33   55.33

Table II lists the percentage WER of the extracted target signals for the different methods at various TIR values. Vocabulary sizes of 853 and 52 were used to evaluate the system on the TIMIT and GRID databases respectively, with test perplexities of 853 and 52. It is clear from Table II that the proposed GDMF algorithm has a lower WER than the other algorithms, and that it comes closest to the human performance (HP) results.

VII. CONCLUSION

The single-channel speaker separation method proposed in this work uses non-negative matrix factorization to decompose the spectral magnitude obtained from the group delay function. The method reduces the error in estimating the basis vectors of the individual speakers in a mixture, primarily due to the high resolution property of the group delay function. Subjective evaluation of the proposed method shows reasonable improvement over other speaker separation methods, and lower word error rates are noted in the multi-speaker speech recognition experiments. Although a reasonable improvement is obtained by using the magnitude derived from the group delay, the spiky nature of the group delay function caused by pitch periodicity effects, especially when multiple speakers are involved, still needs to be addressed. Abrupt transitions in the group delay function can lead to erroneous estimation of the spectral magnitude of the mixed speech signal; their effect on the factorization and on training the bases also needs to be addressed.

REFERENCES

[1] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[2] A. Ozerov and C. Fevotte, "Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 550–563, March 2010.
[3] P. Smaragdis and J. C. Brown, "Non-negative matrix factorization for polyphonic music transcription," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2003, pp. 177–180.
[4] E. M. Grais and H. Erdogan, "Single channel speech music separation using nonnegative matrix factorization and spectral masks," in 17th International Conference on Digital Signal Processing (DSP), 2011, pp. 1–6.
[5] C. Demir, M. U. Dogan, A. T. Cemgil, and M. Saraçlar, "Single-channel speech-music separation using NMF for automatic speech recognition," in IEEE 19th Signal Processing and Communications Applications Conference (SIU), 2011, pp. 486–489.
[6] M. Schmidt and R. Olsson, "Single-channel speech separation using sparse non-negative matrix factorization," 2006.
[7] B. Raj and P. Smaragdis, "Latent variable decomposition of spectrograms for single channel speaker separation," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005, pp. 17–20.
[8] B. Raj, M. V. S. Shashanka, and P. Smaragdis, "Latent Dirichlet decomposition for single channel speaker separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2006, vol. 5.
[9] L. Gu, Single-channel Speech Separation Based on Instantaneous Frequency, Ph.D. thesis, 2010.
[10] B. Yegnanarayana and H. A. Murthy, "Significance of group delay functions in spectrum estimation," IEEE Transactions on Signal Processing, vol. 40, no. 9, pp. 2281–2289, 1992.
[11] H. A. Murthy and B. Yegnanarayana, "Formant extraction from group delay function," vol. 10, pp. 209–221, 1991.
[12] H. A. Murthy and B. Yegnanarayana, "Group delay functions and its applications in speech technology," Sadhana, vol. 36, part 5, 2011.
[13] K. Nathwani, P. Pandit, and R. M. Hegde, "Group delay based methods for speaker segregation and its application in multimedia information retrieval," IEEE Transactions on Multimedia, vol. 15, no. 6, pp. 1326–1339, 2013.
[14] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing, Prentice Hall, Upper Saddle River, NJ, 1989.
[15] R. M. Hegde, H. A. Murthy, and V. R. R. Gadde, "Significance of joint features derived from the modified group delay function in speech processing," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2007, no. 1, 2007.
[16] A. Lefevre, F. Bach, and C. Fevotte, "Online algorithms for nonnegative matrix factorization with the Itakura-Saito divergence," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 2011, pp. 313–316.
[17] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," The Journal of the Acoustical Society of America, vol. 120, p. 2421, 2006.
[18] J. S. Garofolo, TIMIT: Acoustic-Phonetic Continuous Speech Corpus, Linguistic Data Consortium, 1993.
[19] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, "Subjective and objective quality assessment of audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2046–2057, 2011.
[20] R. Huber and B. Kollmeier, "PEMO-Q—A new method for objective audio quality assessment using a model of auditory perception," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 1902–1911, 2006.
[21] J. Damaschke, R. Huber, V. Hohmann, and B. Kollmeier, "PRO-DASP: An audio quality testbench for optimizing low-power chip designs of speech processing algorithms," vol. 3, pp. 50–54, 2002.