Monaural Speaker Segregation Using Group Delay Spectral Matrix Factorization

Karan Nathwani*, Anurag Kumar† and Rajesh M. Hegde*
*Indian Institute of Technology Kanpur, India
†Carnegie Mellon University, USA
Email: {nathwani, rhegde}@iitk.ac.in, [email protected]
Abstract—Non-negative matrix factorization (NMF) methods have been widely used in single-channel speaker separation. NMF methods typically use the magnitude of the Fourier transform for training the basis vectors. In this paper, a method that factorizes the spectral magnitude matrix obtained from the group delay function (GDF) is described. During training, pre-learning is applied on a training set of original sources, and the bases are trained iteratively to minimize the approximation error. Separation of the mixed speech signal involves the factorization of the non-negative group delay spectral matrix using the fixed stacked bases computed during training. This matrix is decomposed into a linear combination of trained bases for each individual speaker contained in the mixed speech signal. The estimated spectral magnitude for each speaker is modulated by the phase of the mixed signal to reconstruct each speaker signal. The separated speaker signals are further refined using a min-max masking method. Experiments on subjective evaluation, objective evaluation and multi-speaker speech recognition on the TIMIT and GRID corpora indicate reasonable improvements over other conventional methods.
I. INTRODUCTION
Monaural speaker separation, where the speech signal is acquired by a single microphone, is challenging in the presence of a competing speaker, which degrades the intelligibility of the target speaker's speech. Non-negative matrix factorization (NMF) [1], [2], [3] has been used in this context to effectively separate a mixture of speech signals by decomposing the spectral matrix obtained from the magnitude of the Fourier transform [4], [5]. Sparsity-promoting methods [6] have also been used within NMF for training the bases of individual speakers. Separation based on latent variable decomposition and latent Dirichlet decomposition has been described in [7] and [8], respectively. In [9], an instantaneous-frequency-based method is described for audio source separation.
In this paper, the non-negative spectral magnitude matrix is computed from the group delay function (GDF) [10], and this GDF-based spectral magnitude is used in the NMF decomposition instead of the spectral magnitude computed from the FFT. The magnitude is derived from the group delay spectrum via the weighted cepstrum, exploiting the relation between the spectral magnitude and the group delay function through the cepstral coefficients. During the training phase, the non-negative group delay spectral matrix of each individual speaker is decomposed into non-negative basis vectors and their corresponding non-negative weights. During testing, the spectral magnitude of the mixed speaker signal computed from the GDF is decomposed into a linear combination of non-negative weights for each individual speaker using the fixed stacked bases computed during training. The estimated spectral magnitude for each speaker is modulated by the phase of the mixed signal to reconstruct the signal for each speaker. The separated speaker signals are further refined using a min-max masking method.
Subjective, objective and speech recognition experiments on the TIMIT and GRID corpora indicate reasonable improvements over other conventional methods, and the experimental results demonstrate the significance of the proposed method for automatic speech recognition. The remainder of this paper is organized as follows. Section II describes the problem formulation for speaker separation. In Section III, the computation of the spectral magnitude from the group delay function is explained. The algorithm for speaker separation using group delay matrix factorization is presented in Section IV. Section V presents a diversity analysis of the spectral magnitude computed from the GDF. The performance evaluation of the proposed method is discussed in Section VI, and Section VII presents a brief conclusion.
II. PROBLEM FORMULATION
Consider a mixed speech signal s(n) consisting of two speakers m(n) and f(n), where n denotes the time sample index. The objective of speaker separation is to obtain estimates of m(n) and f(n). Speech signals are generally time varying, which makes it difficult to obtain their characteristics directly; the short-time Fourier transform (STFT) is therefore used to compute the discrete Fourier transform of each speaker signal over short frames. Let Sk(ω), Mk(ω) and Fk(ω) represent the STFTs of s(n), m(n) and f(n) respectively, where k indexes the short-time frame over which the Fourier transform is evaluated and ω is the frequency index. Since the STFT is linear, we can write
Sk(ω) = Mk(ω) + Fk(ω)    (1)

S e^{jφSk(ω)} = M e^{jφMk(ω)} + F e^{jφFk(ω)}    (2)
Under the assumption of magnitude-only reconstruction [10], the magnitude of the mixed speech signal S can be written as the sum of the magnitudes of the first and second speakers,

S = M + F    (3)

where M and F are the unknown magnitude spectra that must be estimated from the mixed speech data in an iterative manner. M, F and S are functions of the time and frequency indices. The magnitude spectrum of the mixed signal is estimated using the group delay spectrum of the mixed signal. The computation of the spectral magnitude, and the advantage of the GDF over the conventional FFT-based method in calculating the magnitude spectrum, are described in the next Section.
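To make the formulation concrete, the following Python sketch (ours, not from the paper) builds a toy two-source mixture and measures how closely the magnitude additivity of Equation 3 holds; the test tones and STFT settings are placeholder assumptions.

```python
# A minimal sketch of the magnitude-additivity assumption of Equation 3.
import numpy as np
from scipy.signal import stft

fs = 16000
t = np.arange(fs) / fs
m = np.sin(2 * np.pi * 220 * t)    # stand-in for speaker m(n)
f = np.sin(2 * np.pi * 340 * t)    # stand-in for speaker f(n)
s = m + f                          # mixed signal s(n)

# STFT of each signal; linearity gives Sk(w) = Mk(w) + Fk(w) exactly.
_, _, M = stft(m, fs=fs, nperseg=512)
_, _, F = stft(f, fs=fs, nperseg=512)
_, _, S = stft(s, fs=fs, nperseg=512)

# Magnitude additivity |S| ~= |M| + |F| holds only approximately, since the
# cross term depends on the phase difference between the two sources.
rel = np.linalg.norm(np.abs(S) - (np.abs(M) + np.abs(F))) / np.linalg.norm(np.abs(S))
print(f"relative magnitude-additivity error: {rel:.3f}")
```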
III. COMPUTING THE SPECTRAL MAGNITUDE FROM THE GROUP DELAY FUNCTION

In this work, the spectral magnitude is computed from the group delay function [10]. The group delay function τx(ω) [11], [10], [12], [13] of a sequence x(n) is defined as the negative derivative of the unwrapped short-time phase spectrum [14]; it measures the degree of nonlinearity of the phase as its values deviate from a constant. It can be computed as

τx(ω) = [XR(ω)YR(ω) + YI(ω)XI(ω)] / |X(ω)|²    (4)

where the subscripts R and I denote the real and imaginary parts of the Fourier transform, and X(ω) and Y(ω) are the Fourier transforms of x(n) and nx(n), respectively. The weighted cepstrum ŵ(n) is estimated from τx(ω) as

ŵ(n) = IDFT(τx(ω)),  n = 0, 1, ..., N−1    (5)

The sequence w1(n) is then generated from ŵ(n) as

w1(0) = 0,  w1(n) = ŵ(n)/n,  w1(N−n+1) = w1(n)    (6)

where n runs from 1 to N/2. The N-point DFT of w1(n), denoted Y1(ω), is then computed, and ln|Ys(ω)| = Real[Y1(ω)]. The estimated smooth spectrum obtained from the group delay function is given by 2 ln|Ys(ω)|. The GDF has a high resolution property [11], [15], [12], which makes it more robust for spectrum estimation than the conventional FFT-based method of calculating the magnitude spectrum. The GDF is used to compute the non-negative magnitude spectrum matrix for each of the speakers m(n) and f(n) in the training phase, and for the mixed signal s(n) in the testing phase.
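For concreteness, a minimal Python sketch of Equations 4-6 for a single windowed frame is given below. The window choice, FFT size, and the 0-based indexing used for the symmetry in Equation 6 are our assumptions, not details fixed by the paper.

```python
# A minimal sketch of estimating a smooth spectral magnitude from the GDF.
import numpy as np

def gdf_magnitude(frame):
    N = len(frame)
    n = np.arange(N)
    X = np.fft.fft(frame)              # DFT of x(n)
    Y = np.fft.fft(n * frame)          # DFT of n*x(n)
    # Equation 4: group delay from the real/imaginary parts of X and Y.
    tau = (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-12)
    # Equation 5: weighted cepstrum as the inverse DFT of the group delay.
    w_hat = np.real(np.fft.ifft(tau))
    # Equation 6: build the symmetric sequence w1(n) from the weighted cepstrum.
    w1 = np.zeros(N)
    for k in range(1, N // 2):
        w1[k] = w_hat[k] / k
        w1[N - k] = w1[k]              # even symmetry (0-based indexing assumed)
    # N-point DFT of w1; its real part gives ln|Ys(w)|, and the smooth
    # log-magnitude spectrum is 2*ln|Ys(w)|.
    Y1 = np.fft.fft(w1)
    return np.exp(2.0 * Y1.real)       # linear-scale smooth magnitude spectrum

frame = np.hamming(512) * np.random.randn(512)   # placeholder frame
print(gdf_magnitude(frame)[:5])
```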
IV. SPEAKER SEGREGATION USING GROUP DELAY MATRIX FACTORIZATION

In this Section, speaker separation using group delay matrix factorization (GDMF) is discussed. GDMF is defined as the factorization, within the NMF framework, of the magnitude spectrum matrix estimated from the group delay function. Once the basis vectors of the individual sources are learned separately, they can be used to estimate each source from a monaural mixture. The testing phase uses the spectral magnitude computed from the GDF of the mixed signal, together with the bases of each speaker obtained during training, to produce excitation weights for both speakers. Hence, supervised separation of two sources from a monaural mixture is performed.

A. Training the Bases

In the training phase, pre-learning is applied on a training set of original sources. The magnitude spectrum M, calculated from the group delay function of the target speaker signal, is decomposed into a non-negative basis matrix Hm and its corresponding non-negative weight matrix Wm:

M ≈ Hm Wm    (7)

The weighted linear combination of the basis vectors in the columns of Hm is approximately equal to each column vector of the matrix M, and the corresponding column of Wm contains the weights for the basis vectors. In order to allow the data in M to be approximated as a non-negative linear combination of its constituent vectors, the non-negative basis matrix Hm must be optimized. The objective of GDMF is to minimize the divergence between M and the product of Hm and Wm, as defined in [4], [6], by updating Hm and Wm using the multiplicative rules

Hm = Hm . {[M/(Hm Wm)] WmT} / (1 WmT)    (8)

Wm = Wm . {HmT [M/(Hm Wm)]} / (HmT 1)    (9)

where "." and "/" denote element-wise multiplication and division, and 1 is a matrix of ones of the same size as M. Hm and Wm are randomly initialized, and Equations 8 and 9 are iterated to solve Equation 7. M has normalized columns, and at each iteration the columns of Hm are also normalized.

Similarly, the divergence between the magnitude spectrum F (computed from the GDF) and the product of the non-negative basis matrix Hf and the non-negative weight matrix Wf, as shown in Equation 10, is minimized:

F ≈ Hf Wf    (10)

The updates for Hf and Wf are obtained along the same lines as those for Hm and Wm, and the columns of Hf are likewise normalized at each iteration, consistent with the normalized columns of F. A large number of basis vectors results in a lower approximation error but also in a longer computation time; the appropriate number of bases depends on the dimension and the signal type.
B. Decomposition of Mixed Signal

In the testing phase, the magnitude spectrum of the mixed speech signal is obtained from the GDF [10]. This magnitude spectrum S is decomposed using the conventional NMF technique [16]. The bases obtained during the training stage for each speaker are stacked together and used in the decomposition of S:

S ≈ [Hm Hf] W    (11)

Here, Hm and Hf are obtained by solving Equations 7 and 10. During testing, the weight matrix W is randomly initialized, and the weight update of Equation 9 is used to solve Equation 11 with the bases matrix held fixed. The estimated magnitude spectrum for each speaker is obtained by multiplying the bases matrix of that speaker's signal with its corresponding weights in the matrix W obtained from Equation 11:

M̂ = Hm Wm    (12)

F̂ = Hf Wf    (13)

where Wm and Wf are the submatrices of the weight matrix W that belong to the two speaker components in Equation 11.
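The testing-phase decomposition of Equation 11 can be sketched as follows, with the stacked bases held fixed and only the weights updated; again, the iteration count is a placeholder assumption.

```python
# A minimal sketch of Equations 11-13: the mixture magnitude S is explained by
# the stacked, fixed bases [Hm Hf], and only the weights are updated.
import numpy as np

def separate(S, Hm, Hf, n_iter=200, eps=1e-12):
    H = np.hstack([Hm, Hf])                             # stacked trained bases
    W = np.abs(np.random.rand(H.shape[1], S.shape[1]))  # random init of W
    ones = np.ones_like(S)
    for _ in range(n_iter):
        R = S / (H @ W + eps)
        W *= (H.T @ R) / (H.T @ ones + eps)             # weight update, bases fixed
    km = Hm.shape[1]
    M_hat = Hm @ W[:km]                                 # Equation 12
    F_hat = Hf @ W[km:]                                 # Equation 13
    return M_hat, F_hat
```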
Fig. 1. The spectrograms of the reference target signal and the mixed signal (above), and the reconstructed signals without and with the mask (below), when the target-to-interference ratio (TIR) is 0 dB.
C. Masking

λ = 1 if M̂ > F̂,  0 if M̂ < F̂    (14)

In Equation 14, λ is a spectrographic mask matrix computed from M̂ and F̂, which are functions of time and frequency and denote the estimated magnitude spectra of the target and interfering speakers respectively. The hard mask so obtained is applied to the estimated spectrogram M̂ and multiplied by the phase of the mixed speech signal to obtain an improved version of the original signal Mn(ω). A similar process is followed to obtain an improved version of the original signal Fn(ω). It is clear from Figure 1 that the GDMF algorithm segregates the target speaker effectively after the application of the mask; the proposed method thus results in better separation of the signals, as illustrated in Figure 1.
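A minimal sketch of the hard mask of Equation 14 and of the reconstruction using the mixture phase is given below; the inverse-STFT parameters must match the analysis STFT and are placeholder assumptions here.

```python
# A minimal sketch of the min-max (hard) mask of Equation 14 and of signal
# reconstruction using the phase of the mixed signal.
import numpy as np
from scipy.signal import istft

def reconstruct_target(M_hat, F_hat, S_complex, fs=16000, nperseg=512):
    mask = (M_hat > F_hat).astype(float)        # Equation 14: 1 where M_hat > F_hat
    phase = np.exp(1j * np.angle(S_complex))    # phase of the mixed signal
    target_stft = mask * M_hat * phase          # masked magnitude + mixture phase
    _, m_hat_time = istft(target_stft, fs=fs, nperseg=nperseg)
    return m_hat_time
```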
V. DIVERSITY ANALYSIS OF THE SPECTRAL MAGNITUDE COMPUTED FROM THE GDF
In this Section, a diversity analysis of the spectral magnitude computed from the GDF is presented. The analysis includes spectral smoothness and robustness measures for the magnitudes computed from the GDF and the FFT.

1) Diversity Analysis via Spectral Smoothness: Consider the system shown in Figure 2(a), where signals are generated synthetically with formants at 1020 Hz, 1400 Hz, 1780 Hz and 2300 Hz, sampled at 10 kHz, with a pitch period of 10 ms, in the presence of additive white Gaussian noise at a signal-to-noise ratio of 20 dB. In Figure 2(b), the group delay is calculated for the system shown in Figure 2(a); the formants are clearly distinguished due to the high resolution property of the GDF, in contrast to the FFT spectrum [15]. It can be noted that the magnitude spectrum computed from the conventional FFT in Figure 2(c) exhibits large fluctuations and hence large variance. These fluctuations are significantly reduced in the magnitude spectrum estimated from the GDF in Figure 2(d), which is visibly smoother than the FFT spectrum of Figure 2(c). This helps in efficient decomposition of the magnitude spectrum computed from the GDF into its corresponding bases and weight matrices; in general, a smoother magnitude spectrum results in a more effective decomposition of the spectral magnitude.

Fig. 2. Comparison of the magnitude spectrum computed from the FFT (c) and from the GDF (d) for the system shown in (a).
Fig. 3. Comparison of the average magnitude estimated from the FFT and the GDF.
2) Diversity Analysis via Spectral Robustness: The average magnitude spectra for the FFT and the GDF are obtained by averaging 300 overlaid realizations of the estimated magnitude spectrum in the presence of noise, as shown in Figure 3. The magnitude spectrum is obtained for a synthetic signal generated with formants at 1020 Hz, 1400 Hz, 1780 Hz and 2300 Hz, sampled at 10 kHz, with a pitch period of 10 ms.

It can be noted from Figure 3 that averaging the spectral magnitude computed from the FFT reduces the fluctuations, but at the expense of a large bias. These fluctuations are significantly reduced in the spectrum computed from the GDF after averaging. This is due to the high resolution and robustness properties of the GDF [15], [10], which yield a robust magnitude spectrum. Owing to this robustness, the spectrum estimated from the GDF exhibits less variance and more smoothness than the magnitude spectrum estimated by the conventional FFT method. The magnitude spectrum estimated from the GDF is also less dependent on the analysis window than FFT-based methods.

It can further be observed from Figure 3 that, for every realization of the magnitude estimate in the presence of AWGN at an SNR of 20 dB, the spectra estimated from the GDF are least affected by the additive noise and overlap almost perfectly, in contrast to the spectra estimated from the FFT. Because of this robustness, the spectral magnitude computed from the GDF is used in the NMF framework instead of the spectral magnitude computed from the FFT.
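The averaging experiment described above can be sketched as follows; the block reuses the gdf_magnitude function from the Section III sketch, and the synthetic signal below is a simplified stand-in for the formant synthesis of Figure 3, not the exact excitation used in the paper.

```python
# A minimal sketch of the robustness comparison: overlay many noisy
# realizations and compare the per-bin spread of FFT- and GDF-based
# magnitude estimates (gdf_magnitude is defined in the Section III sketch).
import numpy as np

fs, N, n_runs, snr_db = 10000, 1024, 300, 20
t = np.arange(N) / fs
clean = sum(np.sin(2 * np.pi * f0 * t) for f0 in (1020, 1400, 1780, 2300))
sigma = np.sqrt(np.mean(clean ** 2) / 10 ** (snr_db / 10))  # AWGN at 20 dB SNR

fft_mags, gdf_mags = [], []
for _ in range(n_runs):
    noisy = np.hamming(N) * (clean + sigma * np.random.randn(N))
    fft_mags.append(np.abs(np.fft.fft(noisy)))
    gdf_mags.append(gdf_magnitude(noisy))

# Per-bin standard deviation across realizations: a smaller spread for the
# GDF-based estimate would reflect the robustness discussed above.
print("FFT spread:", np.std(fft_mags, axis=0).mean())
print("GDF spread:", np.std(gdf_mags, axis=0).mean())
```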
Fig. 4. The subjective scores based on GQ, TP, OSS & ANS for the reconstructed target signal.
VI. PERFORMANCE EVALUATION
In this Section, the performance of the proposed GDMF separation method is evaluated in terms of speech enhancement and multi-speaker speech recognition. The experiments are performed on a large number of sentences from the GRID corpus [17] and the TIMIT database [18] in order to evaluate the separation results.
A. Subjective Evaluation

The speech separation task is best judged by the most efficient filter, the human ear. Subjective methods incorporate the judgment of the listener, and a subjective measure is perhaps the most accurate way of evaluating the quality of any separation system. In the subjective test, the separated signals are rated by 25 listeners on four parameters: global quality (GQ), target preservation (TP), other signal separation (OSS) and artificial noise separation (ANS). The instructions followed by the listeners during judgment are discussed in detail in [19].

The subjective listening scores for the reconstructed target (m̂(t)) and interfering (f̂(t)) signals on the four parameters are shown in Figure 4 and Figure 5, respectively, for the GRID database. The results indicate that the GDMF algorithm performs reasonably better than the standard FFT, instantaneous frequency (IF) [9], latent variable decomposition (LVD) [7], latent Dirichlet decomposition (LDD) [8], and NMF methods.

Fig. 5. The subjective scores based on GQ, TP, OSS & ANS for the reconstructed interfering signal.
TABLE I. MEAN PSM AND PSMt SCORES FOR THE PROPOSED METHOD (GDMF) AND OTHER CONVENTIONAL METHODS AT VARIOUS TIR VALUES.

                             GRID                                  TIMIT
            TIR=-6dB    TIR=-3dB    TIR=0dB      TIR=-6dB    TIR=-3dB    TIR=0dB
Methods     PSM  PSMt   PSM  PSMt   PSM  PSMt    PSM  PSMt   PSM  PSMt   PSM  PSMt
GDMF        0.69 -0.20  0.75 -0.13  0.77 -0.03   0.61 -0.21  0.63 -0.17  0.68 -0.11
FFT         0.55 -0.26  0.59 -0.20  0.65 -0.12   0.40 -0.32  0.45 -0.26  0.51 -0.22
IF          0.57 -0.25  0.61 -0.18  0.68 -0.09   0.41 -0.31  0.45 -0.24  0.52 -0.22
LVD         0.68 -0.23  0.73 -0.15  0.74 -0.05   0.56 -0.25  0.59 -0.21  0.63 -0.17
LDD         0.68 -0.21  0.74 -0.14  0.75 -0.04   0.57 -0.24  0.61 -0.19  0.68 -0.11
NMF         0.67 -0.22  0.73 -0.16  0.75 -0.04   0.56 -0.24  0.60 -0.20  0.69 -0.11
B. Objective Evaluation

The PEMO-Q technique [20], based on the work of Huber and Kollmeier [21], is used to predict perceived quality differences between audio signals. PEMO-Q computes internal representations of signal pairs, which are then compared quantitatively by calculating the linear cross-correlation coefficient.

The mean perceptual similarity measure (PSM) and instantaneous audio quality (PSMt) scores for the proposed GDMF method are better than those of the FFT, IF, LVD, LDD and NMF methods. The results are listed at various TIR values in Table I for the GRID and TIMIT databases. As expected, the perceptual similarity measure drops at low TIR levels. When the TIR is close to or below zero, the PSMt scores (the averaged values of the instantaneous PSM vector) are negative, which indicates poor separation. These PSM and PSMt scores corroborate the results obtained from the subjective evaluation of the sound files. The proposed method has better subjective and objective performance than the others owing to the robust spectral magnitude computed from the GDF, which results in effective decomposition in the NMF framework.
C. Experiments on Multi-Speaker Speech Recognition
The GRID and TIMIT databases are used for the experiments on multi-speaker speech recognition. Models are built from the GRID and TIMIT databases for the speech recognition system using all of the two-talker sentence-pair files; the sentences are different for each speaker in a mixed clip. We have used 15-state, 3-mixture triphone HMMs with 39-dimensional MFCC features (including delta and acceleration coefficients) in the speech recognition experiments for each database separately. The baseline triphone models of the recognition system are trained on clean speech training data from the two databases. The test data sets are generated at different TIRs (0 dB, -3 dB, 3 dB, -6 dB, 6 dB) from the two databases and processed by the proposed GDMF algorithm along with the other methods used for comparison. The reconstructed files generated by these methods are used for testing the recognition system. The recognition results are presented in the form of word error rates (WER) after recognizing the complete sentence, where the WER is defined as

WER = (S + D + I) / N    (15)
Here, S is the number of substitutions, D the number of
deletions, I the number of insertions and N is the number of
words in the reference. Human WER has been calculated by
running a series of listening and identification experiments and
then calculating the WER for the task at various TIR values.
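For reference, Equation 15 can be computed from a standard edit-distance alignment between the reference and hypothesis word sequences, as in the following sketch; the example utterances are illustrative only.

```python
# A minimal sketch of the WER of Equation 15: S, D and I are obtained from a
# minimum edit-distance alignment of reference and hypothesis words.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                              # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                              # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)      # (S + D + I) / N

# One substitution ("at" -> "in") and one deletion ("now"): WER = 2/6.
print(word_error_rate("place blue at f two now", "place blue in f two"))
```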
TABLE II. COMPARISON OF THE WORD ERROR RATE (%) AT SEVERAL TIR VALUES ON THE GRID AND TIMIT DATABASES.

                     GRID, TIR (dB)
Methods      6        3        0       -3       -6
GDMF       38.67    40.33    41.67    43.00    45.67
HP          8       17       25       28       29
FFT        48.33    50.33    51.67    52.00    54.67
IF         47.67    48.00    48.67    51.67    53.33
LVD        46       47.33    48.00    49.67    50.33
LDD        46.67    47.00    48.00    49.33    50.00
NMF        45.00    47       47.67    49.33    50.33

                     TIMIT, TIR (dB)
Methods      6        3        0       -3       -6
GDMF       40.33    42.33    43.67    45.33    46.67
HP         12       20       29       31       33
FFT        68.67    71.00    71.67    75.67    77.00
IF         65.33    69.67    70.33    74.00    76.67
LVD        46.67    48.00    49.67    52.67    55.33
LDD        46.67    48.33    49.33    51.67    55.00
NMF        46.00    48.33    50.00    52.33    55.33
Table II lists the percentage WER of the extracted target signals for the different methods at various TIR values. Vocabulary sizes of 853 and 52 are used to evaluate the system on the TIMIT and GRID databases respectively, with test perplexities of 853 and 52. It is clear from Table II that the proposed GDMF algorithm has a lower WER than the other algorithms. Moreover, the proposed method comes closest to the human performance (HP) results among all the methods compared.
VII. CONCLUSION
The single-channel speaker separation method proposed in this work uses non-negative matrix factorization to decompose the spectral magnitude obtained from the group delay function. This method helps reduce the error in estimating the basis vectors of the individual speakers in a mixture, primarily due to the high resolution property of the group delay function. Subjective evaluation of the proposed method shows reasonable improvement over other speaker separation methods, and a lower word error rate is noted in multi-speaker speech recognition experiments. Although a reasonable improvement is obtained by using the magnitude computed from the group delay, the spiky nature of the group delay function caused by pitch periodicity effects, especially when multiple speakers are involved, still needs to be addressed. The abrupt transitions in the group delay function can lead to erroneous estimation of the spectral magnitude of the mixed speech signal; hence, their effect on factorization and on training the bases also needs to be addressed.
REFERENCES

[1] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[2] A. Ozerov and C. Fevotte, "Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 550–563, March 2010.
[3] P. Smaragdis and J. C. Brown, "Non-negative matrix factorization for polyphonic music transcription," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2003, pp. 177–180.
[4] E. M. Grais and H. Erdogan, "Single channel speech music separation using nonnegative matrix factorization and spectral masks," in Proc. 17th International Conference on Digital Signal Processing (DSP), 2011, pp. 1–6.
[5] C. Demir, M. U. Dogan, A. T. Cemgil, and M. Saraçlar, "Single-channel speech-music separation using NMF for automatic speech recognition," in Proc. IEEE 19th Signal Processing and Communications Applications Conference (SIU), 2011, pp. 486–489.
[6] M. Schmidt and R. Olsson, "Single-channel speech separation using sparse non-negative matrix factorization," in Proc. Interspeech, 2006.
[7] B. Raj and P. Smaragdis, "Latent variable decomposition of spectrograms for single channel speaker separation," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2005, pp. 17–20.
[8] B. Raj, M. V. S. Shashanka, and P. Smaragdis, "Latent Dirichlet decomposition for single channel speaker separation," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2006, vol. 5.
[9] L. Gu, Single-Channel Speech Separation Based on Instantaneous Frequency, Ph.D. thesis, 2010.
[10] B. Yegnanarayana and H. A. Murthy, "Significance of group delay functions in spectrum estimation," IEEE Transactions on Signal Processing, vol. 40, no. 9, pp. 2281–2289, 1992.
[11] H. A. Murthy and B. Yegnanarayana, "Formant extraction from group delay function," Speech Communication, vol. 10, pp. 209–221, 1991.
[12] H. A. Murthy and B. Yegnanarayana, "Group delay functions and its applications in speech technology," Sadhana, vol. 36, part 5, 2011.
[13] K. Nathwani, P. Pandit, and R. M. Hegde, "Group delay based methods for speaker segregation and its application in multimedia information retrieval," IEEE Transactions on Multimedia, vol. 15, no. 6, pp. 1326–1339, 2013.
[14] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing, Prentice Hall, Upper Saddle River, NJ, 1989.
[15] R. M. Hegde, H. A. Murthy, and V. R. R. Gadde, "Significance of joint features derived from the modified group delay function in speech processing," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2007, no. 1, 2007.
[16] A. Lefevre, F. Bach, and C. Fevotte, "Online algorithms for nonnegative matrix factorization with the Itakura-Saito divergence," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 2011, pp. 313–316.
[17] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," The Journal of the Acoustical Society of America, vol. 120, p. 2421, 2006.
[18] J. S. Garofolo, TIMIT: Acoustic-Phonetic Continuous Speech Corpus, Linguistic Data Consortium, 1993.
[19] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, "Subjective and objective quality assessment of audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2046–2057, 2011.
[20] R. Huber and B. Kollmeier, "PEMO-Q–A new method for objective audio quality assessment using a model of auditory perception," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 6, pp. 1902–1911, 2006.
[21] J. Damaschke, R. Huber, V. Hohmann, and B. Kollmeier, "PRO-DASP: An audio quality testbench for optimizing low-power chip designs of speech processing algorithms," vol. 3, pp. 50–54, 2002.