Speaker Identification by Combining MFCC and Phase Information

Longbiao Wang
(Nagaoka University of Technology, Japan)
Seiichi Nakagawa
(Toyohashi University of Technology, Japan)
Background

• The importance of phase in human speech recognition has been reported.

• In conventional speaker recognition methods based on mel-frequency cepstral coefficients (MFCCs), phase information has hitherto been ignored.
Purpose and method

• We aim to use phase information for speaker recognition.

• We propose a phase information extraction method that normalizes the phase variation caused by the clipping position of the input speech and combines the phase information with MFCCs.
Investigating the effect of phase
• Conventional MFCCs, which capture vocal tract information, cannot distinguish speaker characteristics caused by the vocal source.
• The phase, in contrast, is greatly influenced by vocal source characteristics.
• To investigate this, we generated speech waves with different vocal sources and pitches, and a fixed vocal tract shape corresponding to the vowel /a/.
Phase information extraction

• The short-term spectrum S(ω, t) of the frame at time t is obtained by the DFT of the input speech signal sequence.
• For conventional MFCCs, only the power spectrum is used; the phase information is ignored.
• In this paper, the phase is also extracted as one of the feature parameters for speaker recognition.
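A hedged reconstruction of the spectrum decomposition implied here (the symbols X, Y, and θ for the real part, imaginary part, and phase are my own labels, not taken from the slides):

```latex
% Short-term spectrum of the frame at time t, split into magnitude and phase.
% X, Y, and \theta are assumed notation for the real part, imaginary part, and phase.
S(\omega, t) = X(\omega, t) + jY(\omega, t)
             = \sqrt{X^{2}(\omega, t) + Y^{2}(\omega, t)}\; e^{\,j\theta(\omega, t)}
```

MFCCs keep only the magnitude factor; the θ(ω, t) term is what this work additionally models.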
Problem of unnormalized phase
• However, the phase changes depending on the clipping position of the input speech, even at the same frequency ω.
• The unnormalized wrapped phases of two windows clipped at different positions therefore become quite different.

[Figure: Example of the effect of the clipping position on the phase for the Japanese vowel /a/]
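A short note on why this happens (my own reasoning, not stated on the slide): shifting the analysis window in time multiplies each spectral bin by a linear-phase term, so for a roughly stationary frame the measured phase is offset in proportion to the frequency.

```latex
% Effect of a window offset \Delta t on the short-term spectrum (approximate,
% assuming the frame content is quasi-stationary; this is my illustration).
S(\omega, t + \Delta t) \approx S(\omega, t)\, e^{\,j\omega\Delta t}
\quad\Rightarrow\quad
\theta(\omega, t + \Delta t) \approx \theta(\omega, t) + \omega\Delta t \pmod{2\pi}
```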
Phase normalization (1/2)

• To overcome this problem, the phase at a certain basis radian frequency is set to a constant for all frames, and the phases at the other frequencies are estimated relative to it. In the experiments discussed in this paper, the basis radian frequency is set to 2π × 1000 (i.e., 1000 Hz).

• For example, the phase at the basis radian frequency can be set to π/4; a reconstruction of the resulting expression is sketched below.
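A hedged reconstruction of the missing formula, writing the basis radian frequency as ω̃ (notation assumed, not from the slides): fixing its phase to π/4 replaces the spectrum at ω̃ by

```latex
% Basis-frequency constraint; \tilde{S} denotes the normalized spectrum and
% \tilde{\omega} = 2\pi \times 1000 the assumed basis radian frequency.
\tilde{S}(\tilde{\omega}, t) = \left|S(\tilde{\omega}, t)\right| e^{\,j\pi/4}
```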
Phase normalization (2/2)

• The difference between the target constant phase and the unnormalized wrapped phase at the basis frequency is computed for each frame.
• For every other radian frequency ω = 2πf, this difference is scaled in proportion to the frequency.
• The spectrum at frequency ω is then rotated by the scaled difference, which yields the normalized phase (see the sketch below).
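A hedged reconstruction of the equations this slide refers to, using the same assumed notation ω̃ = 2πf̃ for the basis radian frequency: the normalization amounts to the time shift that would move the basis-frequency phase to π/4, applied consistently at every frequency.

```latex
% Reconstruction (assumed notation). Phase difference at the basis frequency,
% its frequency-proportional counterpart at \omega = 2\pi f, the rotated
% spectrum, and the normalized phase.
\Delta\theta(t) = \frac{\pi}{4} - \theta(\tilde{\omega}, t),
\qquad
\Delta\theta_{\omega}(t) = \frac{\omega}{\tilde{\omega}}\,\Delta\theta(t)
                         = \frac{f}{\tilde{f}}\left(\frac{\pi}{4} - \theta(\tilde{\omega}, t)\right)

\tilde{S}(\omega, t) = \left|S(\omega, t)\right|
  e^{\,j\left(\theta(\omega, t) + \frac{\omega}{\tilde{\omega}}\left(\frac{\pi}{4} - \theta(\tilde{\omega}, t)\right)\right)},
\qquad
\tilde{\theta}(\omega, t) = \theta(\omega, t)
  + \frac{\omega}{\tilde{\omega}}\left(\frac{\pi}{4} - \theta(\tilde{\omega}, t)\right)
```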
Comparison of unnormalized phase and normalized phase
• After normalizing the wrapped phase, the phase values of the two windows become very similar.

[Figure: Example of the effect of the clipping position on the phase for the Japanese vowel /a/]
From phase {θ} to phase {cos θ, sin θ}

• There is a problem with this method when comparing two phase values, because the wrapped phase is discontinuous at ±π. For example, for the two values π − δ1 and −π + δ2 (with small δ1, δ2), the numerical difference is nearly 2π, despite the two phases being very similar to one another.

• Therefore, in this research, we map the phase onto coordinates on the unit circle, that is, {cos θ, sin θ}.
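A minimal sketch of this wrap-around problem and the unit-circle fix (my own illustration with made-up numbers; the slides give no code):

```python
import numpy as np

# Two phases that are nearly identical on the unit circle but lie on opposite
# sides of the +/- pi wrapping boundary.
theta1 = np.pi - 0.05     # just below  +pi
theta2 = -np.pi + 0.05    # just above  -pi

# Comparing the raw wrapped phases: the difference looks huge (about 2*pi).
print(abs(theta1 - theta2))                    # ~6.18

# Mapping each phase to {cos(theta), sin(theta)} puts both points next to each
# other on the unit circle, so their Euclidean distance is small.
p1 = np.array([np.cos(theta1), np.sin(theta1)])
p2 = np.array([np.cos(theta2), np.sin(theta2)])
print(np.linalg.norm(p1 - p2))                 # ~0.10
```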
How to synchronize the splitting section

Combination method
• The GMM based on MFCCs is combined with the GMM based on phase information.

• The likelihood of MODEL 1 (MFCC-based) is linearly coupled with that of MODEL 2 (phase-based) to produce a new score (a reconstruction of the formula is sketched below), where the likelihoods are those produced by the n-th speaker model based on MFCC and the n-th speaker model based on phase, n = 1, 2, …, N, with N being the number of registered speakers.
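A hedged reconstruction of the combination score; the likelihood symbols and the weight α are my notation, since the slide only states that the two likelihoods are linearly coupled:

```latex
% Linearly coupled score for the n-th registered speaker (assumed notation).
L^{\mathrm{comb}}_{n} = (1 - \alpha)\, L^{\mathrm{MFCC}}_{n} + \alpha\, L^{\mathrm{phase}}_{n},
\qquad 0 \le \alpha \le 1,\quad n = 1, \dots, N
```

Identification then picks the registered speaker with the highest combined score.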
Experimental setup (1/3)

• NTT database
  - # speakers: 35 (22 males and 13 females)
  - # sessions: 5 (1990.8, 1990.9, 1990.12, 1991.3, 1991.6)
  - # training utterances: 5 (1990.8)
  - # test utterances: 1 (about 4 seconds), 35 × 4 × 5 = 700 trials

• JNAS database
  - # speakers: 270 (135 males and 135 females)
  - # training utterances: 5 (about 2 seconds / sentence)
  - # test utterances: 1 (about 5.5 seconds), about 95 sentences / person, 270 × 95 = 25,650 trials
Experimental setup (2/3)

• Noise
  - Stationary noise (in a computer room)
  - Non-stationary noise (in an exhibition hall)

• Noisy speech
  - Noise was added to clean speech at average signal-to-noise ratios (SNRs) of 20 dB and 10 dB.
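A minimal sketch (my own, not from the slides; the function and variable names are made up) of how noise can be scaled and mixed into clean speech to reach a target average SNR:

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` and add it to `speech` so the average SNR equals `snr_db` dB.

    Hypothetical helper for illustration; `speech` and `noise` are 1-D float arrays.
    """
    # Tile/truncate the noise so it covers the whole speech signal.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    # Average powers, then the scale factor that yields the requested SNR.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

# Example with synthetic stand-ins for clean speech and recorded noise.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # 1 s of "speech" at 16 kHz
noise = rng.standard_normal(8000)    # shorter "noise" recording
noisy_20db = add_noise_at_snr(clean, noise, snr_db=20.0)
noisy_10db = add_noise_at_snr(clean, noise, snr_db=10.0)
```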
Experimental setup (3/3)
                      MFCC                                 Phase
Sampling frequency    16 kHz                               16 kHz
Frame length          25 ms                                12.5 ms
Frame shift           12.5 ms                              5 ms
Dimensions            25                                   {θ}: 12, {cos θ, sin θ}: 24
GMMs                  8 mixtures with full                 64 mixtures with diagonal
                      covariance matrices                  covariance matrices
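A minimal sketch of the two speaker models implied by the table above, assuming scikit-learn's GaussianMixture as the GMM implementation and random placeholder arrays in place of real MFCC and phase features:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder feature matrices (frames x dims) for one registered speaker;
# in practice these come from the MFCC and phase extraction front-ends.
rng = np.random.default_rng(0)
mfcc_frames = rng.standard_normal((2000, 25))    # 25-dim MFCC features
phase_frames = rng.standard_normal((5000, 24))   # 24-dim {cos, sin} phase features

# MFCC-based speaker model: 8 mixtures with full covariance matrices.
mfcc_gmm = GaussianMixture(n_components=8, covariance_type="full").fit(mfcc_frames)

# Phase-based speaker model: 64 mixtures with diagonal covariance matrices.
phase_gmm = GaussianMixture(n_components=64, covariance_type="diag").fit(phase_frames)

# At test time, score_samples() returns per-frame log-likelihoods; average them
# per model and combine the two averages linearly (weight alpha is an assumption).
test_mfcc = rng.standard_normal((300, 25))
test_phase = rng.standard_normal((700, 24))
alpha = 0.3
combined = (1 - alpha) * mfcc_gmm.score_samples(test_mfcc).mean() \
           + alpha * phase_gmm.score_samples(test_phase).mean()
```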
Speaker identification using clean speech
Speaker identification result on NTT database (1/2)

[Figure: Speaker identification results using the combination of MFCC-based GMM and the original phase {θ}]
Speaker identification result on NTT database (2/2)

[Figure: Speaker identification results using the combination of MFCC-based GMM and the modified phase {cos θ, sin θ}]
Speaker identification result on JNAS database

[Figure: Speaker identification results using the combination of MFCC-based GMM and the modified phase {cos θ, sin θ}]
Speaker identification under stationary/non-stationary noisy conditions
Speaker identification results under noisy conditions (1/2)

[Bar chart, NTT database: speaker identification rate (%) for MFCC, phase, and their combination, comparing the clean model and the clean model + frame deletion]
Speaker identification results under noisy conditions (2/2)

[Bar chart, JNAS database: speaker identification rate (%) for MFCC, phase, and their combination, comparing the clean model and the clean model + frame deletion]
Conclusion

• We proposed a phase information extraction method that normalizes the phase variation caused by the clipping position of the input speech and integrates the phase information with MFCCs.

• The experimental results showed that combining phase information with MFCCs improves speaker recognition performance remarkably compared with the MFCC-only method.
Thank you for your attention!