AGE AND GENDER ESTIMATION FROM AUDIO
FEATURES USING DISCRIMINANT ANALYSIS
AND NN FRAMEWORK
A thesis submitted in partial fulfilment of the requirements for
The award of the degree of
M.Tech.
in
COMMUNICATION SYSTEMS
By
PUJARI SUJAY GIRISH
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
NATIONAL INSTITUTE OF TECHNOLOGY
TIRUCHIRAPPALLI – 620 015.
MAY 2011
BONAFIDE CERTIFICATE
This is to certify that the project titled “AGE AND GENDER ESTIMATION
FROM AUDIO FEATURES USING DISCRIMINANT ANALYSIS AND NN
FRAMEWORK” is a bonafide record of the work done by
PUJARI SUJAY GIRISH (208109013)
in partial fulfilment of the requirements for the award of the degree of Master of
Technology in Communication Systems of the NATIONAL INSTITUTE OF
TECHNOLOGY, TIRUCHIRAPPALLI, during the year 2010-2011.
S. DEIVALAKSHMI
Guide Head of the Department
Project Viva-voce held on _____________________________
Internal Examiner External Examiner
ABSTRACT
In the field of speech processing, applications such as interactive voice response (IVR)
systems and artificial intelligence need to replicate human behaviour; one such behaviour
is the human auditory ability to perceive the sex and approximate age of a speaker from
voice alone.
With the help of selected features extracted from an unknown speaker's voice, the proposed
automated system estimates the age group and gender of that person. For classification,
7 classes are defined, namely young male/female, adult male/female, senior male/female
and child.
The system consists of two main parts: feature extraction from real-time samples captured
through a microphone, followed by two-stage feature classification. Pitch, MFCC and
delta-MFCC features are extracted, and a combination of canonical discriminant analysis
and an NN framework is applied for classification.
For the experiments, the required stimulus databases were collected from 192
different speakers.
Keywords: Speech Processing, Age, Gender, Discriminant analysis, Neural Network.
ACKNOWLEDGEMENTS
I take this opportunity to express my sincere thanks & deep sense of gratitude to my
project guide Mrs S. Deivalakshmi, Assistant Professor, Department of Electronics
and Communication Engineering, National Institute of Technology, Tiruchirappalli
for her guidance, and kind co-operation.
With immense pleasure, I record my profound gratitude and indebtedness to
Prof. Sanjay Patil, Department of Electronics and Telecommunication, Maharashtra
Academy of Engineering, Pune University for his needful suggestions & guidance.
I would like to express my sincere thanks to Prof. P. Somaskandan, Professor and
Head of the Department, Department of Electronics and Communication Engineering,
National Institute of Technology, Tiruchirappalli, for providing all the departmental
facilities needed for the successful completion of this project.
I express my deep sense of gratitude to Dr S. Raghavan, Prof., Department of ECE,
and Mr M. Bhaskar, Associate Prof., Dept. of ECE, for giving me the much-needed
lab facilities; I would also like to thank them for their motivation and support.
My special thanks to Anil, Jamuna, Nithyananth, Senkathir, Kishore and Pardu for
their encouragement and invaluable help to collect audio database.
I would like to thank all the teaching staff, my classmates and the computer support
group staff for their sincere help.
Last but not least, I dedicate this work to my parents and my family.
Sujay Pujari
May 2011
TABLE OF CONTENTS
Title Page No
ABSTRACT………………………………………………….... iii
ACKNOWLEDGEMENTS…………………………………... iv
TABLE OF CONTENTS……………………………………... v
LIST OF FIGURES………………………………………….... vi
LIST OF TABLES…………………………………………….. viii
ABBREVIATIONS……………………………………………. ix
CHAPTER 1 INTRODUCTION
1.1 Objectives and Approach ……………………………………………....... 1
1.2 Database Collection …….…………………………………………………. 3
1.3 Study Outline …………………………………………………………. 5
CHAPTER 2 LITERATURE REVIEW 6
CHAPTER 3 FEATURE EXTRACTION
3.1 Pitch………………………………………………………………………. 9
3.2 MFCC (Mel Frequency Cepstral Coefficient )………………………….. 12
3.3 Windowing………………………………………………………………… 15
CHAPTER 4 FEATURE CLASSIFICATION
4.1 First Stage with Discriminant Analysis ………………………………….. 20
4.2 Second Stage with NN frameworks………………………………………. 22
CHAPTER 5 RESULTS AND DISCUSSION
5.1 Unknown Stimuli Results ……….…………………………………………. 30
5.2 Classification Stage One Results …….……………………………………. 34
5.3 Classification Stage Two Results ………………………..………………… 36
CHAPTER 6 CONCLUSION & FURTHER WORK 41
REFERENCES 42
LIST OF FIGURES
Figure No Title Page
No
1.1 Proposed System ……………………………………………………2
1.2 Snapshot of recording with Wave Surfer……………….…………...4
3.1 An example, input sinusoidal signal. ………………….…………...10
3.2 Autocorrelation of given input frame ….……………………………11
3.3 Number of samples between 2 maximums. ………………………..11
3.4 MFCC feature extraction steps ………………………………….....12
3.5 Mel frequency Vs Frequency ………………………………………13
3.6 Mel filter bank ……………………………………………….……..14
3.7 Hamming window …………………………………………………..15
3.8 Overlapped frames followed by windowing function ……………...16
3.9 Reconstructed waveform (Above) after windowing and original wave
(below) …………………………………………………………..….16
3.10 Cross correlation between original signal and reconstructed one…...17
3.11 Welch method, all periodograms and their average…………………18
4.1 Abstract Flow of Proposed classification stages …………….……...19
4.2 Classification based on Discriminant analysis followed by decision
based on Euclidean distance for stage one classification ……..…….21
4.3 Equivalent decision C1 of NN framework based on 3 neural networks
output ……………………………………………………………….22
4.4 Euclid. distance method for decision based on stage 2……...............22
4.5 Neural Network structure for NNA, NNB and NNC ……………….23
4.6 Classification algorithm in stage 2…………………………………..24
4.7 Matlab nprtool for NN implementation……………………...……...26
5.1 Average pitch for Females of all 4 classes from database 1 ….......27
5.2 Average pitch for Males of all 4 classes from database 1 …………..28
5.3 Waveform of one of the record from database 1 ...............................28
5.4 Pitch track for waveform shown in fig. 5.3 ………………...............29
5.5 Pitch track- Unknown stimuli ……………………………………...30
5.6 13 MFCC coefficients- Unknown stimuli …………………………..30
5.7 12 dMFCC coefficients- Unknown stimuli …...................................31
5.8 11 ddMFCC coefficients- Unknown stimuli ……………………….31
5.9 Feature vector of 37 x 1 for Unknown stimuli ……………………..32
5.10 Discriminant score plot for all 3 groups ………………………...….35
5.11 NN1 framework – NNA network…………………………………...36
5.12 NN1 framework – NNB network …………………………………...36
5.13 NN1 framework – NNC network …………………………………...37
5.14 NN2 framework – NNA network…………………………………...37
5.15 NN2 framework – NNB network ………………………………...…38
5.16 NN2 framework – NNC network …………………………………...38
5.17 Stage 1 + NN2 framework – all females…………………………….39
5.18 Stage 1 + NN1 framework – males………………………………….39
5.19 Overall classification result ....…...………………………………….40
6.1 Comparison chart for successful estimation of class………………..41
LIST OF TABLES
Table no Title Page no
1.1 Classification Groups………………………………………..………4
5.1 Neural networks output –unknown stimuli……………………...….33
5.2 Canonical Discriminant function coefficients………………………34
5.3 Functions at group centroids……..…………………………………34
5.4 Classification result of stage 1 with database 1……………………...35
ABBREVIATIONS
CDF Canonical Discriminant function
DA Discriminant Analysis
DS Discriminant score
MFCC Mel Frequency Cepstrum Coefficients
NN Neural Network
YM Young Male
YF Young Female
AM Adult Male
AF Adult Female
SM Senior Male
SF Senior Female
DCT Discrete Cosine Transform
PSD Power Spectral Density
CHAPTER-1
INTRODUCTION
Automatic speech recognition (ASR) based algorithms are widely deployed for customer
care and service applications. ASR research is currently moving from mere “speech-to-text”
(STT) systems towards “rich transcription” (RT) systems, which annotate recognized text
with non-verbal information such as speaker identity and emotional state. In interactive
voice response systems, this approach is already being used to identify dialogs involving
angry customers, which can then be analyzed with the goal of automatically identifying
problematic dialogs, transferring unsatisfied customers to an agent, and other purposes.
Also, the first adaptive dialogs are now appearing, particularly in systems exposed to
inhomogeneous user groups. These can adapt the degree of automation, the order of
presentation, waiting-queue music, or other properties to attributes of the caller such as
age or gender. As an example, it would be possible to offer different advertisements to
children and adults in the waiting queue. In non-personalized services, speaker
classification must be based on the caller's speech data. While classifier performance is
only one factor influencing the utility of this approach in an IVR system, it is certainly
a major one.
The proposed algorithm for automatic age and gender estimation helps in the same
regard: it classifies a speaker's voice into one of the defined classes, thereby predicting
the speaker's gender and approximate age group.
1.1 Objectives and Approach
The ultimate aim of the proposed system is to predict the age group and gender of a
speaker from a stimulus of any length, in real time. Such systems mainly consist of two
stages: feature extraction and selection, followed by classification based on the extracted
features. To find features that give distinct values for the different classes, a database is
needed, so one of the first tasks was to collect this audio database; feature extraction and
classification followed.
For feature extraction, pitch, MFCC and delta-MFCC coefficients are worked out. In the
following classification stage we adopted two different methods, namely canonical
discriminant analysis (CDA) and an NN framework. The neural networks are trained with
the help of all 290 stimuli present in the database, and the trained networks are then used
for real-time classification.
Fig. 1.1 Proposed system: audio input (human voice) → feature set extraction →
classification stage 1 based on discriminant analysis → classification stage 2 based on
NN frameworks → one of the seven output classes (1 Child, 2 Young Female, 3 Adult
Female, 4 Senior Female, 5 Young Male, 6 Adult Male, 7 Senior Male)
1.2 Database Collection
For classification purposes we collected data for the following 8 groups:
I. Child –Boy (age <15)
II. Child –Girl (age <15)
III. Young Men (age<30)
IV. Young Women (age<30)
V. Adult Gents (age<55)
VI. Adult Ladies (age<55)
VII. Senior citizen –Male (age>55)
VIII. Senior citizen –Female (age>55)
Initially, we collected 2 stimuli from each of 105 speakers, namely
1) “HAPPY BIRTHDAY”
2) To tell your name in your mother tongue.
For example,
<My name is Sujay> or
<Maz nav ... ahe.> (Marathi) or
<En peyar ...> (Tamil) or
<Naa peru ...> (Telugu)
with the following specifications:
a) Fs sampling rate=8000 samples/sec
b) Bits per sample =16 bit
c) Mono channel
Experiments with these 2 stimuli made it quite clear that distinguishing between
groups I and II was practically impossible. So we finally fixed the classification at 7
classes by merging groups I and II into a single group, Child, and adopted the
classification groups given in Table 1.1 at the end of this section.
3) After this we adopted a third stimulus, “OM”, with the condition that it be sustained
for more than 10 seconds in a single breath,
like “OOOOOOOOOOOMMMMMMMMMMMmmmmmmmm”.
We have 87 such recordings, with the following specifications:
A. Fs sampling rate=16000 samples/sec
B. Bits per sample =16 bit
C. Mono channel
Fig. 1.2 Snapshot of recording with Wave Surfer
For recording and editing with the desired specifications we used the open-source tool
“Wave Surfer 1.8.8p3”.
We refer to these three databases as Database 1, Database 2 and Database 3. All files
are stored in the form “Name_Age.wav”.
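As an illustration of how such a database can be indexed, the following Matlab sketch (not part of the original work; the folder name 'database1' is a hypothetical placeholder) lists the .wav files of one database and recovers each speaker's name and age from the “Name_Age.wav” pattern:

% List the recordings of one database and parse "Name_Age.wav".
files = dir(fullfile('database1', '*.wav'));
for i = 1:numel(files)
    tok = regexp(files(i).name, '^(\w+)_(\d+)\.wav$', 'tokens', 'once');
    if ~isempty(tok)
        name = tok{1};                  % speaker name
        age  = str2double(tok{2});      % speaker age in years
        fprintf('%s: speaker %s, age %d\n', files(i).name, name, age);
    end
end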
Table 1.1 Classification Groups

Group no   Symbol   Category
1          C        Child (age < 18)
2          YF       Young Female (age < 30)
3          AF       Adult Female (age < 55)
4          SF       Senior Female (age > 55)
5          YM       Young Male (age < 30)
6          AM       Adult Male (age < 55)
7          SM       Senior Male (age > 55)
1.3 Study Outline
This thesis is organized as follows. Chapter 2 reviews the literature and background of
the algorithms adopted for estimating age and gender. The materials and methods used
in this study are discussed in Chapters 3 and 4: Chapter 3 deals with feature extraction
and Chapter 4 with feature classification. Chapter 5 provides the results and discussion,
and Chapter 6 concludes the thesis with future directions.
CHAPTER-2
LITERATURE REVIEW
In this chapter, the important literature used to implement the proposed algorithm is
reviewed.
Minematsu, N. et al. (1993),
in “Automatic estimation of one's age with his/her speech based upon acoustic modelling
techniques of speakers”, proposed a technique to identify subjectively elderly speakers
using prosodic features such as MFCC-based speech rate.
William R. Klecka (1980), in “Discriminant Analysis”, presents a lucid and simple
introduction to the related statistical procedures known as discriminant analysis, and
introduces the canonical discriminant function (CDF) of the variables used in discriminant
analysis. Professor Klecka derives the canonical discriminant function coefficients,
provides a spatial interpretation of them, and gives a clear discussion of the interpretation
of CDFs and of unstandardized versus standardized coefficients.
The SPSS ver. 14 algorithms manual titled “Discriminant” explains all the steps involved
in classification based on CDF coefficients.
Braun, A. et al. (1999),
in “Estimating speaker age across languages”, conducted an analysis showing the
correlation between calendar age and perceived age with the help of Italian and Dutch
stimuli, and further concluded that male and female listeners can safely be combined.
Cerrato, L. et al. (2000),
in “Subjective age estimation of telephonic voices”, carried out a statistical analysis
showing that listeners are capable of assigning a general chronological age category to a
voice without seeing or knowing the speaker, and that they are able to distinguish between
male and female voices transmitted over a telephone line.
Krauss, R. M. et al. (2002),
in “Inferring speakers' physical attributes from their voices”, examined listeners' ability to
make accurate inferences about speakers from the non-linguistic content of their speech.
Shafran, I. et al. (2003),
in “Voice signatures”, explores the problem of extracting a voice signature from a
speaker's voice, and found standard Mel-warped cepstral features, speaking rate and
shimmer to be useful.
Rabiner, L. et al. (1976),
in “A comparative performance study of several pitch detection algorithms”, discusses the
main pitch detection algorithms. According to this study, pitch can be as low as 40 Hz (for
a very low-pitched male voice) or as high as 600 Hz (for a very high-pitched female or
child's voice).
Rabiner, L. et al. (1977),
in “On the use of autocorrelation analysis for pitch detection”, explains a pitch detection
technique based on short-time autocorrelation analysis.
McLeod, P. and Wyvill, G. (2005),
in “A smarter way to find pitch”, found that existing pitch algorithms that work in the
Fourier domain suffer from spectral leakage, and suggested windowing as a remedy.
Metze, F. et al. (2007),
in “Comparison of four approaches to age and gender recognition for telephone
applications”, compares different approaches to age and gender classification on
telephone speech with short and long utterance lengths.
Welch, P. D. (1967),
in “The Use of Fast Fourier Transform for the Estimation of Power Spectra: A Method
Based on Time Averaging over Short, Modified Periodograms”, presents the use of the
FFT for PSD estimation.
Moller, M. (1993),
in “A scaled conjugate gradient algorithm for fast supervised learning”, introduces SCG,
whose performance is benchmarked against back-propagation. It is fully automated,
includes no critical user-dependent parameters and avoids a time-consuming line search.
Huang, X. et al. (2001),
in “Spoken Language Processing: A Guide to Theory, Algorithm, and System
Development”, Prentice Hall, describes prosodic phenomena such as pitch together with
the available algorithms.
Childers, D. et al. (1977),
in “The Cepstrum: A guide to processing”, gives the pragmatic details of cepstrum concepts.
Spiegl, W. et al. (2009),
in “Analysing Features for Automatic Age Estimation on Cross-Sectional Data”, developed
an acoustic feature set for estimating a person's age from a recorded speech signal, and
demonstrated that age can be effectively estimated using a feature vector of prosodic,
spectral and cepstral features.
CHAPTER-3
FEATURE EXTRACTION
In this chapter the feature extraction algorithms are explained. In this work, two types of
features were found most suitable, namely pitch and MFCC coefficients.
3.1 PITCH
Pitch represents the perceived fundamental frequency of a sound and may be quantified
as a frequency in cycles per second (hertz); however, pitch is not a purely objective
physical property, but a subjective psychoacoustic attribute of sound.
According to Huang, X. [ref 10], prosody is a complex weave of physical and phonetic
effects employed to express attitude, assumptions, and attention as a parallel channel in
our daily speech communication.
The semantic content of a spoken or written message is referred to as its denotation, while
the emotional and attentional effects intended by the speaker or inferred by a listener are
part of the message’s connotation. Prosody has an important supporting role in guiding a
listener’s recovery of the basic messages (denotation) and a starring role in signalling
connotation, or the speaker’s attitude toward the message, toward the listener(s), and
toward the whole communication event.
From the listener’s point of view, prosody consists of systematic perception and recovery of
a speaker’s intentions based on:
I. Pauses: to indicate phrases and to avoid running out of air.
II. Pitch: rate of vocal-fold cycling (fundamental frequency) as a function of time.
III. Rate/relative duration: phoneme durations, timing, and rhythm.
IV. Loudness: relative amplitude/volume.
Pitch is the most expressive of the prosodic phenomena. As we speak, we systematically
vary our fundamental frequency to express our feelings about what we are saying, or to
direct the listener's attention to especially important aspects of our spoken message.
3.1.1 PITCH DETECTION
According to Naotoshi Seo [ref 15], pitch can be detected in the following ways:
a. Autocorrelation method
b. Cepstrum method
c. Harmonic product spectrum method (HPS)
d. Linear predictive coding (LPC)
In our work we adopted the first method. To calculate the pitch, at least two peaks must
lie within the block over which pitch is measured. To ensure this, the block size must be
greater than 3 wavelengths of the lowest possible frequency, which is known as the
pitch floor.
The minimum number of samples required per frame is therefore

  Nmin = 3 * Fs / PitchFloor   samples/frame

According to this method, to get the pitch we take the autocorrelation of the signal over
the given block or frame. The number of samples K between the maximum at lag zero and
the second highest peak then gives the fundamental frequency, where Fs is the sampling
frequency:

  Pitch = Fs / K   Hz

Fig. 3.1: An example input sinusoidal signal.
Fig. 3.2: Autocorrelation of given input frame
Fig. 3.3 Number of samples between 2 maximums.
For example, as shown in fig. 3.3,
  K = (41 - 1) = 40
so for Fs = 8000,
  Pitch = 8000 / 40 = 200 Hz
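As an illustration of the method, a minimal Matlab sketch is given below. This is not the thesis code: the function name is illustrative, and the 40 Hz pitch floor is taken from the Rabiner range quoted in Chapter 2.

function p = pitch_autocorr(frame, Fs)
% Estimate the pitch of one voiced frame by the autocorrelation method.
pitchFloor = 40;                            % Hz; lowest expected pitch
Nmin = ceil(3 * Fs / pitchFloor);           % minimum samples per frame
assert(numel(frame) >= Nmin, 'frame shorter than Nmin');
r = xcorr(frame(:));                        % autocorrelation of the frame
r = r(numel(frame):end);                    % keep non-negative lags; r(1) = lag 0
i0 = find(r < 0, 1);                        % step past the main lobe at lag 0
if isempty(i0), i0 = 2; end
[~, k] = max(r(i0:end));                    % second highest maximum
K = i0 + k - 2;                             % lag in samples between the two maxima
p = Fs / K;                                 % e.g. K = 40 at Fs = 8000 gives 200 Hz
end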
3.2 MFCC (Mel Frequency Cepstral Coefficients)
We use MFCCs because they are popular and efficient to compute. The representation
incorporates a perceptual Mel frequency scale, separates source and filter, and the IDFT
(DCT) decorrelates the features, which in turn makes differences between classes easier
to capture.
Fig. 3.4 MFCC feature extraction steps
3.2.1 MEL SCALE
Human hearing is not equally sensitive to all frequency bands; it is less sensitive at higher
frequencies, roughly above 1000 Hz. Human perception of frequency is therefore
non-linear: the Mel scale is approximately linear below 1 kHz and logarithmic above 1 kHz.
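A commonly used closed form for this mapping, not written out in the original but consistent with fig. 3.5, is

  mel(f) = 2595 * log10(1 + f/700)

which is close to linear below 1 kHz (1000 Hz maps to roughly 1000 mel) and logarithmic above it.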
Fig. 3.5 Mel frequency Vs Frequency
For our work we use 13 Mel filter banks, as shown in fig. 3.6, which in turn give 13 MFCC
coefficients.
Fig. 3.6 Mel filter bank
3.2.2 LOG ENERGY
The logarithm compresses the dynamic range of values:
o Human response to signal level is logarithmic: humans are less sensitive to slight
differences in amplitude at high amplitudes than at low amplitudes.
o It makes frequency estimates less sensitive to slight variations in input (e.g. power
variation due to the speaker's mouth moving closer to the microphone).
o Phase information is not helpful in speech.
3.2.3 CEPSTRUM
According to Childers, D. [ref 4], the cepstrum is the spectrum of a spectrum.
Computing the cepstrum requires Fourier analysis, but since we are going from frequency
space back to time, we apply the inverse DFT. Because the log power spectrum is real
and symmetric, the inverse DFT reduces to a Discrete Cosine Transform (DCT).
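In the standard MFCC formulation (a textbook form consistent with the pipeline of fig. 3.4, not spelled out in the original), the n-th cepstral coefficient is the DCT of the log filter-bank energies Em:

  c(n) = sum over m = 1..M of log(Em) * cos( pi * n * (m - 0.5) / M ),   n = 0, 1, ..., 12

with M = 13 Mel filters, giving the 13 MFCC coefficients used in this work.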
3.2.4 DELTA MFCC and DOUBLE DELTA MFCC
These are the variations of the MFCCs (delta) and the variations of those variations
(delta-delta). For 13 MFCC coefficients we get 12 delta-MFCC and 11 delta-delta-MFCC
coefficients.
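From the 13/12/11 counts, the deltas here appear to be differences taken along the coefficient index; under that assumption, and with mfcc as the [13 x 1] vector of MFCCs, two Matlab lines reproduce them:

dmfcc  = diff(mfcc);    % 12 delta-MFCC coefficients
ddmfcc = diff(dmfcc);   % 11 delta-delta-MFCC coefficients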
3.3 Windowing
Instead of processing the entire recorded audio signal at once, we use windowing with
overlapping. This limits the buffer length to 1024 samples while letting us reuse previous
frame results such as the PSD, and it allows methods such as Welch's to be used to find
the periodogram of non-stationary signals.
Fig. 3.7 Hamming window - W (n) = 0.54 - 0.46 * COS (2*pi*n/N)
In our case we adopt windowing for pitch estimation (the pitch track) and for PSD
estimation when extracting MFCC coefficients. We use a Hamming window of length
1024 with 50% overlap.
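The framing itself can be written in a few lines of Matlab. This is a sketch assuming x is a column-vector signal; with a Hamming window at exactly 50% overlap, the window values at overlapping positions add up to a near-constant 1.08, which is why the overlap-added reconstruction in fig. 3.9 matches the original waveform.

N = 1024;  hop = N/2;                       % frame length and 50% overlap
w = hamming(N);                             % the Hamming window of fig. 3.7
M = floor((length(x) - N)/hop) + 1;         % number of frames
frames = zeros(N, M);
for m = 1:M
    frames(:, m) = w .* x((m-1)*hop + (1:N));   % overlapped, windowed frame
end
y = zeros(length(x), 1);                    % overlap-add reconstruction (fig. 3.9)
for m = 1:M
    idx = (m-1)*hop + (1:N);
    y(idx) = y(idx) + frames(:, m);
end
y = y / 1.08;                               % windows sum to ~1.08 at 50% overlap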
Fig. 3.8 Overlapped frames followed by windowing function
Fig. 3.9: Reconstructed waveform (above) after windowing and original wave (below)
Fig. 3.10: Cross-correlation between the original signal and the reconstructed one
3.3.1 PSD ESTIMATION – Welch's Method
Welch's method for estimating power spectra divides the time signal into successive
blocks, forms the periodogram of each block, and averages them.
Denote the m-th windowed, zero-padded frame of the signal x by

  xm(n) = w(n) * x(n + m*R),   n = 0, 1, ..., N-1

where R is the window hop size, and let K denote the number of available frames. The
periodogram of the m-th block is then

  Pm(wk) = (1/N) * | sum over n of xm(n) * e^(-j*2*pi*n*k/N) |^2

and the Welch estimate of the PSD is their average:

  S_welch(wk) = (1/K) * sum over m = 0..K-1 of Pm(wk)

In our work we use a Hamming window of N = 1024 with 50% overlap.
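These equations translate directly into Matlab. The sketch below assumes a column-vector signal x sampled at Fs; the toolbox function pwelch performs the same computation.

N = 1024;  R = N/2;  w = hamming(N);        % window and 50% overlap (hop R)
K = floor((length(x) - N) / R) + 1;         % number of available frames
S = zeros(N, 1);
for m = 0:K-1
    xm = w .* x(m*R + (1:N));               % m-th windowed frame
    Pm = abs(fft(xm)).^2 / N;               % periodogram of the m-th block
    S  = S + Pm / K;                        % average periodogram (Welch estimate)
end
f = (0:N-1)' * Fs / N;                      % frequency axis in Hz
% plot(f(1:N/2), S(1:N/2)) reproduces the averaged curve of fig. 3.11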
Fig. 3.11: Welch method, all periodograms and their average.
CHAPTER-4
FEATURE CLASSIFICATION
In this chapter we propose a combination of two classification algorithms as a two-stage
classification. There are 7 classes to separate: C, YF, AF, SF, YM, AM and SM. It was
found that with a statistical classifier, canonical discriminant analysis, we can predict
whether a speaker is Male, Female or Child; that is the first stage of classification. To
then decide young, adult or senior within the Male and Female groups we use NN
frameworks, which form the second stage.
Fig 4.1 Abstract flow of the proposed classification stages: the feature set first passes
through classification based on discriminant analysis, which separates Child (1), Female
(groups 2 to 4) and Male (groups 5 to 7); NN Framework 1 then resolves the male group
into YM (5), AM (6) or SM (7), and NN Framework 2 resolves the female group into
YF (2), AF (3) or SF (4).
4.1 First Stage with Discriminant Analysis
Feature classification stage 1 is done using Discriminant Analysis (DA). In this method,
2 canonical discriminant functions are determined from the extracted features. Only two
features, pitch and delta-delta MFCC (10), are used as the input vector at this stage. For
training we used 39 Female, 27 Child and 37 Male stimuli from Database 1.
After extracting the features for all training cases we determined the unstandardized
coefficients along with the group centroids for the 2 functions. The discriminant score
can then be determined for an unknown feature set, and classification is done with the
Euclidean distance rule.
4.1.1 Steps for Canonical Discriminant Analysis:
The selected 103 samples can be referred to as the training database. Using the SPSS
package we find the canonical discriminant functions; in our case there are 3 classes, so
we end up with 2 functions. For that we give three feature matrices, each of size
(number of samples from that group x 2 feature values), as input, along with a
(total samples x 1) vector depicting the true class of each sample.
After following the steps explained in Klecka [ref 5] we have the following information
for each function:
1. Unstandardized coefficients D  2. Constant D0  3. Function values at group centroids
The canonical discriminant function can then be evaluated as
f = D0 + XD
where X is the (1 x P) feature vector for a given stimulus.
4.1.2 Classification based on CDF
Substituting the input value Xinput in the obtained CDFs gives finput; this value is the
discriminant score (DS) for the given input feature vector, and with 2 functions we get 2
different values of finput.
Therefore,
F1 = [finput1 finput2]
We have 3 group-centroid vectors of size [2 x 1]; the group whose centroid has the
minimum Euclidean distance from F1 is selected as the classified group.
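With the trained parameters reported later in Tables 5.2 and 5.3, the whole of stage 1 reduces to a few Matlab lines. This is a sketch with illustrative variable names, using the worked example of Section 5.1 as input:

X  = [106.4090, 0.1201];                    % [pitch, ddMFCC(10)] of the input
D  = [0.027, -0.006;                        % unstandardized coefficients
      0.779,  2.380];                       % (Table 5.2), one column per function
D0 = [-5.939, 0.926];                       % constants
F1 = D0 + X * D;                            % the two discriminant scores
C  = [ 2.529, -0.315;                       % group centroids (Table 5.3): Child
       0.417,  0.388;                       % Female
      -2.284, -0.179];                      % Male
[~, g] = min(sqrt(sum((C - F1).^2, 2)));    % nearest-centroid (Euclidean) rule
% g = 3 here, i.e. Male, matching the worked example of Section 5.1.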
Fig. 4.2 Classification based on discriminant analysis followed by a decision based on
Euclidean distance for stage one: the input X (pitch and d2MFCC (10)) is scored by the
2 trained functions (unstandardized coefficients D, constant D0 and function values at
the group centroids), giving F1; the nearest of the three centroids C1, C2, C3 decides the
class M, F or C.
M
4.2 Second Stage with NN frameworks
Now we apply the NN framework corresponding to the male or female speaker. Both NN
frameworks use the same algorithm, shown in fig. 4.6: the [37 x 1] feature vector is
applied simultaneously to 3 neural networks. The output of each network is treated as
coordinates in a 3-D space, with the third coordinate taken as zero. From the three
position vectors P1, P2 and P3 we obtain the centroid C1. Among [1 0 0], [0 1 0] and
[0 0 1], which are the target values for the 3 subclasses of male and female, the target
with minimum distance from the centroid C1 decides the selected class.
Fig. 4.3 Equivalent output C1 of the NN framework based on the 3 neural network outputs
Fig. 4.4 Euclidean distance method for the decision in classification stage 2
These 3 neural networks are trained networks obtained by considering 2 of the classes as
target outputs at a time, so C(3,2) = 3 networks are required.
Fig. 4.5 Neural network structure for NNA, NNB and NNC (input layer: 37 elements;
hidden layer: 40 elements; target: 2 elements, giving outputs op1 and op2; W1 and W2
are the layer weight matrices)
Then,
  From NNA, P1 = [op1 op2 0]
  From NNB, P2 = [0 op1 op2]
  From NNC, P3 = [op1 0 op2]
  C1 = centroid of (P1, P2, P3)
  L1 = dist(C1, [1 0 0]),  L2 = dist(C1, [0 1 0]),  L3 = dist(C1, [0 0 1])
The following method is applicable to both the NN1 and NN2 frameworks (a Matlab
sketch of the decision rule is given after the list):
NNA is a neural network trained with only the Young and Adult (male or female) stimuli.
NNB is a neural network trained with only the Senior and Adult (male or female) stimuli.
NNC is a neural network trained with only the Young and Senior (male or female) stimuli.
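Written out in Matlab, the decision of figs. 4.3 and 4.4 takes only a few lines (a sketch; opA, opB and opC stand for the [op1 op2] output pairs of the three trained networks):

P1 = [opA(1), opA(2), 0];                   % NNA output as a point in 3-D
P2 = [0, opB(1), opB(2)];                   % NNB output
P3 = [opC(1), 0, opC(2)];                   % NNC output
C1 = (P1 + P2 + P3) / 3;                    % centroid of the three points
T  = eye(3);                                % targets [1 0 0], [0 1 0], [0 0 1]
L  = sqrt(sum((T - C1).^2, 2));             % distances L1, L2, L3
[~, cls] = min(L);                          % 1 = Young, 2 = Adult, 3 = Senior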
Fig. 4.6 Classification algorithm in stage 2: the X input [37 x 1] feature vector
{pitch, 13 MFCC, 12 dMFCC, 11 ddMFCC} is fed to NNA, NNB and NNC; their outputs
give P1, P2, P3 and the centroid C1, and the smallest of the distances L1, L2, L3 decides
the class (Young, Adult or Senior).
4.2.2 Neural Network Implementation
For neural network implementation we used the Matlab Neural Network Pattern
Recognition tool, which can be invoked with the “nprtool” command.
This tool uses scaled conjugate gradient back-propagation, as explained in Moller [ref 7],
through the “trainscg” training function. The speciality of this SCG method is that it can
train any network as long as its weight, net-input and transfer functions have derivative
functions; the algorithm is based on conjugate directions and does not perform a line
search at each iteration.
Training stops when any one of the following occurs:
1. The maximum number of epochs is reached.
2. The maximum amount of time is reached.
3. Performance is minimised to the goal.
4. The performance gradient falls below the minimum gradient.
5. Validation performance has increased more than the maximum-fail count.
We take the number of neurons in the hidden layer equal to 40, and use all 3 databases
combined for training. In the tool itself we can specify the percentage of samples for
training, validation and testing; we use a ratio of 70%, 15% and 15% respectively. For
neuron modelling it uses the hyperbolic tangent sigmoid transfer function.
After satisfactory training, i.e. a good classification rate, the network can be saved in
memory and invoked at any time for testing.
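The same training can also be scripted without the GUI; nprtool generates code equivalent to the following sketch (the matrices features, 37 x n, and targets, 2 x n, as well as the file name, are assumptions):

net = patternnet(40, 'trainscg');           % 40 hidden neurons, SCG back-propagation
net.divideParam.trainRatio = 0.70;          % 70% of samples for training
net.divideParam.valRatio   = 0.15;          % 15% for validation
net.divideParam.testRatio  = 0.15;          % 15% for testing
net = train(net, features, targets);        % features: 37 x n, targets: 2 x n
op = net(featureVector);                    % [op1; op2] for one unknown stimulus
save('NNA.mat', 'net');                     % keep the trained network for testing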
Fig. 4.7 Matlab nprtool for NN implementation
CHAPTER-5
RESULTS AND DISCUSSION
From Database 1 we extracted the pitch feature. It was found that groups F1 and M1 do
not show any distinct features and can safely be combined into a single Child class. At
the same time, using pitch one can clearly distinguish between children and men; but
between children and women, pitch alone was found not to be completely reliable.
o Children: ≤ 15 years, male (M1) and female (F1)
o Young people: 15-30 years, male (M2) and female (F2)
o Adults: 30-55 years, male (M3) and female (F3)
o Seniors: ≥ 55 years, male (M4) and female (F4)
Fig. 5.1 Average pitch for Females of all 4 classes from database 1
Fig. 5.2 Average pitch for Males of all 4 classes from database 1
Fig. 5.3 Waveform of one of the record from database 1 “Happy birthday”
Fig. 5.4 Pitch track for waveform shown in fig. 5.3
While plotting the pitch track (pitch contour) for Databases 1 and 2, we found dramatic
pitch variations within a stimulus. This motivated the collection of Database 3, in which
the pitch contour stays near the average pitch value throughout.
With the help of Database 3 one more fact came to light: males under 12 and females
under 18 show distinct results compared to the others. It was observed that for boys the
pitch changes after the age of 12 years, whereas for girls this happens at 18 years. This
was the deciding factor for fixing the classification groups as given in Table 1.1,
according to which Child covers any speaker less than 18 years of age.
In the following section, the results obtained and the calculations involved in the
algorithm are illustrated with one example stimulus.
5.1 Unknown stimuli results
Fig. 5.5 Pitch track- Unknown stimuli
Fig. 5.6: 13 MFCC coefficients- Unknown stimuli
Fig. 5.7: 12 dMFCC coefficients- Unknown stimuli
Fig. 5.8: 11 ddMFCC coefficients- Unknown stimuli
Fig. 5.9 Feature vector of 37 x 1 for Unknown stimuli
[Mean (pitch) 13-MFCC 12-dMFCC 11-ddMFCC]
Finput = [pitch = 106.4090, ddMFCC(10) = 0.1201]

Discriminant scores:
  DS1 = [106.4090 0.1201] * [0.027 0.779]' - 5.939 = -2.9325
  DS2 = [106.4090 0.1201] * [-0.006 2.3797]' + 0.9263 = 0.5349
  F1 = [-2.93 0.53]

Group centroids (Table 5.3):
  c0 (Child)  = [ 2.5286 -0.3146]
  c1 (Female) = [ 0.4167  0.3881]
  c2 (Male)   = [-2.2844 -0.1795]

Here the distance between c2 and F1 is smaller than the other distances, and c2 belongs
to the Male group.

Classification stage 1 result: Male

In classification stage 2 the feature vector therefore goes through the NN1 framework,
where it passes through the NNA, NNB and NNC networks.
Table 5.1 Neural networks output –unknown stimuli
Network   op1      op2
NNA       0.7367   0.1617
NNB       0.9899   0.0201
NNC       0.9628   0.0194
Therefore,
  P1 = [0.7367 0.1617 0],  P2 = [0 0.9628 0.0194],  P3 = [0.9899 0 0.0201]
  C1 = [0.5755 0.3748 0.0132] (centroid);  L1 = 0.5664,  L2 = 0.8499,  L3 = 1.2023
This again means class 1, i.e. the Young group, so the final classification is Male-Young:
group 5, YM. The output of our algorithm in the Matlab environment is:

--------Group no-------------------------------------------------------
Child = 1
Female<30 = 2
Female<55 = 3
Female>55 = 4
Male<30 = 5
Male<55 = 6
Male>55 = 7
----------------------------
-----And answer is--------
group = 5
---------------------------------------------------------------------

Group 5 is YM, so this was a true positive result.
5.2 Classification results of stage one – Canonical Discriminant Analysis
Classification results of stage one, using Database 1 as the training database.
Table 5.2 Canonical Discriminant Function Coefficients (unstandardized)

              Function 1   Function 2
pitch            .027        -.006
ddmfcc10         .779        2.380
(Constant)     -5.939         .926

Table 5.3 Functions at Group Centroids (unstandardized canonical discriminant
functions evaluated at group means)

group            Function 1   Function 2
.00 (Child)         2.529       -.315
1.00 (Female)        .417        .388
2.00 (Male)        -2.284       -.179
Fig. 5.10 Discriminant score plot for all 3 groups
Table 5.4 Classification results of stage 1 with Database 1

                               Predicted Group Membership
                      group     .00    1.00    2.00    Total
Original        Count  .00       22      5       0       27
                       1.00       4     31       4       39
                       2.00       0      1      36       37
                %      .00     81.5   18.5      .0    100.0
                       1.00    10.3   79.5    10.3    100.0
                       2.00      .0    2.7    97.3    100.0
Cross-validated Count  .00       21      6       0       27
                       1.00       4     31       4       39
                       2.00       0      1      36       37
                %      .00     77.8   22.2      .0    100.0
                       1.00    10.3   79.5    10.3    100.0
                       2.00      .0    2.7    97.3    100.0

a. Cross-validation is done only for the cases in the analysis. In cross-validation, each
case is classified by the functions derived from all cases other than that case.
b. 86.4% of original grouped cases correctly classified.
c. 85.4% of cross-validated grouped cases correctly classified.
5.3 Classification results (confusion matrices) of stage two – Neural Network
Fig 5.11 NN1 framework – NNA network; deals with the YM and AM categories
Fig 5.12 NN1 framework – NNB network; deals with the SM and AM categories
Fig. 5.13 NN1 framework – NNC network; deals with the YM and SM categories
Fig. 5.14 NN2 framework – NNA network; deals with the YF and AF categories
Fig. 5.15 NN2 framework – NNB network; deals with the SF and AF categories
Fig. 5.16 NN2 framework – NNC network; deals with the YF and SF categories
Fig. 5.17 Stage 1 + NN2 framework – all females
Fig. 5.18 Stage 1 + NN1 framework – all males
Fig 5.19 Overall classification result with the whole of Database 3 as testing samples
CHAPTER-6
CONCLUSION AND FUTURE WORK
The proposed automatic age and gender estimation system is implemented with the help
of Matlab toolboxes. Figure 6.1 compares the classification rates obtained by applying
Database 3 for testing at the end of the second (last) classification stage.
It is found that the male categories as a whole have a good classification rate. Except for
AF, the results are quite satisfactory, including the overall classification rate, which
was 69.4%.
Fig. 6.1 Comparison chart for successful estimation of class.
As further work, the neural networks should be retrained whenever the true class of a
user is known but a different class is reported. There is a need not only to collect more
stimuli but also to explore more features.
REFERENCES
1. Welch, P. D. (1967); “The Use of Fast Fourier Transform for the Estimation of Power Spectra: A Method Based on Time Averaging over Short, Modified Periodograms”, IEEE Transactions on Audio and Electroacoustics, Volume AU-15, pages 70-73.
2. Rabiner, L. et al. (1976); “A comparative performance study of several pitch
detection algorithms”, IEEE Trans. Acoustics, Speech and Signal Processing, Volume
24, Issue 5, pages 399-418.
3. Rabiner, L. et al. (1977); “On the use of autocorrelation analysis for pitch
detection”, IEEE Trans. Acoustics, Speech and Signal Processing, Volume 25, Issue 1,
pages 24-33.
4. Childers, D. et al. (1977); “The Cepstrum: A guide to processing”, Proc. IEEE,
Volume 65, Issue 10, pp 1428-1443.
5. William R. Klecka (1980); “Discriminant Analysis”, Sage University Paper.
6. Minematsu, N. et al. (1993); “Automatic estimation of one's age with his/her speech
based upon acoustic modelling techniques of speakers”, ICASSP-93, 1993 IEEE
International Conference on Acoustics, Speech, and Signal Processing.
7. Moller, M. (1993); “A scaled conjugate gradient algorithm for fast supervised
learning”, Neural Networks, Volume 6(4), pages 523-533.
8. Braun, A. et al. (1999); “Estimating speaker age across languages”, International
Congress of Phonetic Sciences (ICPhS 99).
9. Cerrato, L. et al. (2000); “Subjective age estimation of telephonic voices”, Speech
Communication, Volume 31, Issue 2-3 (June 2000), Elsevier.
10. Huang, X. et al. (2001); “Spoken Language Processing: A Guide to Theory,
Algorithm, and System Development”, Prentice Hall.
11. Krauss, R. M. et al. (2002); “Inferring speakers' physical attributes from their voices”,
Journal of Experimental Social Psychology, 38, pages 618-625.
12. Shafran, I. et al. (2003); “Voice signatures”, in Proc. IEEE Automatic Speech
Recognition and Understanding Workshop (ASRU 2003).
13. McLeod, P. and Wyvill, G. (2005); “A smarter way to find pitch”, Proc. International
Computer Music Conference, Barcelona, July 2005, pp 300-303.
14. Metze, F. et al. (2007); “Comparison of four approaches to age and gender
recognition for telephone applications”, ICASSP.
15. Naotoshi Seo (2008); “ENEE632 Project 4 Part I: Pitch Detection”, ECE Dept.,
University of Maryland.
16. Spiegl, W. et al. (2009); “Analysing Features for Automatic Age Estimation on
Cross-Sectional Data”, 10th Annual Conference of the International Speech
Communication Association, Brighton, pages 1-4.
17. SPSS ver. 14 manual on algorithms titled “Discriminant”