Detection of hand gestures in micro-Doppler radar data using deep learning techniques

Jasper Sarrazin

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Engineering Physics

Supervisors: Prof. dr. ir. Tom Dhaene, Dr. ir. Ivo Couckuyt
Counsellor: ir. Nicolas Knudde

Department of Information Technology
Chair: Prof. dr. ir. Bart Dhoedt
Faculty of Engineering and Architecture
Academic year 2017-2018

Preface

A year ago I found myself musing about what I would make of my future, prompted by the important choice a student has to make in their career: the thesis. A thesis, apart from its results, shows where the student's intentions and interests lie. The Engineering Physics programme offers a wide range of disciplines essential for an engineer: electromagnetism, quantum mechanics, nuclear physics, statistical physics, and more. A thesis related to machine learning seemed a fruitful extension to my diploma. It would teach me the technological aspects of being an engineer, and especially of being a computer engineer. After all, machine learning techniques are useful in any engineering discipline, and the experience will be advantageous for future studies and a career now that the digital era has arrived. The gap between computers and humans is narrowing. Computers have always been superior at solving complex computational problems, where the human brain lacks the capacity; yet computers long failed at easy tasks that the human brain performs instantly, e.g. recognition. It is a blessing to be working in this research domain.

I am thankful to Professor Dhaene, who stimulated my interests and gave me the chance to explore the deep learning domain. I also thank the other members of the department: a word of thanks to Dr. Couckuyt and Dr. van der Herten for their practical help when I struggled with computer issues, and to ir. Nicolas Knudde for his weekly feedback and clear view on the topic, which kept me on track for the goal of this thesis. Nicolas also taught me how to work properly in Keras and Tensorflow. Through the Slack chat, problems and results could be discussed in a friendly way, which only benefited the results and the work progress. I thank everyone who attended the meetings and coached me in presenting the obtained results. Through these presentations, and with the assistance of Nicolas, I expanded my knowledge of how to set up a step-by-step plan for similar problems. While writing this thesis, it became clear once again that the aid of all the people mentioned was indispensable in delivering this work.

Furthermore I want to thank all the people supporting me, especially my parents. The frustrations a thesis brings with it sometimes made me an unpleasant person; they tolerated my grumpy presence and made me forget my worries. At home, they created the optimal environment to work in: I had nothing to worry about and could focus on my work. By extension, this is also true for the past and coming years. My friends and my girlfriend Karen helped me keep a good mood, for which I am grateful. I hope to be surrounded by the same people in upcoming projects, which I am able to sail into with the experience gained in this thesis year.

The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the copyright terms have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation.

Jasper Sarrazin
June 1, 2018


Abstract

This thesis aims to close the gap between humans and machine interfaces by recognizing hand gestures in micro-Doppler radar data. It is based on Google's Soli project [1], where the gestures are captured in millimeter-wave radar data. The Soli project built a deep neural network in the PyTorch library containing a sequence of convolutional, recurrent and fully connected layers. The imitation in Keras, built in this thesis, performs similarly on the last frame with an accuracy of 90%, but scores 6% worse over all frames (81%). The performance of this deep neural network is used as a reference for a newer technique: the variational recurrent neural network. Variational recurrent neural networks assume that the perceived data is part of a process involving a lower-dimensional space on which the data depends, called the latent space. They model the posterior p(z|x) and the likelihood p(x|z), where x represents the data space and z the latent space. With the learned parameters of the posterior distribution, data augmentation can be performed by sampling from the latent space. This is a Bayesian way of modeling and prevents the model from overfitting. A classifier built on top of this latent space outperforms the deep neural network both over all frames and on the last frame: it achieves an accuracy of 90% over all frames and 96% on the last frame.

Key words: Deep learning, Bayesian modeling, micro-Doppler radar


Detection of hand gestures in micro-Doppler radar data using deep learning techniques

Jasper Sarrazin

Supervisor(s): Prof. dr. ir. Tom Dhaene, dr. ir. Ivo Couckuyt, ir. Nicolas Knudde

Abstract—In this article, neural networks are built to recognize hand gestures. State-of-the-art techniques are compared to deep neural networks (DNNs). The DNNs contain a sequence of convolutional, recurrent and fully connected layers, while the state-of-the-art techniques are based on variational autoencoders. The data used is available at [1] and contains 40 frames of micro-Doppler data per gesture sequence. An additional study of cross-validation strategies is performed. In the first strategy, each person delivers data to both the training set and the test set; in the second strategy, a person delivers data to either the training set or the test set. When applying the first cross-validation strategy, the DNN has an accuracy of 90% on the last frame, whereas the variational autoencoder achieves an accuracy of 96.8%. When applying the second cross-validation strategy, the accuracies on the last frame are 72.7% and 80.5% respectively. The state-of-the-art techniques obtain higher accuracies than the DNNs. The first cross-validation strategy scores better than the second one; however, it suffers from data leakage.

Keywords—Deep learning, gesture recognition, variational autoencoders

I. INTRODUCTION

In hand gesture recognition, a machine is taught to recognize hand gestures. Succeeding in this could narrow the gap between humans and machine interfaces. Next to recognizing hand gestures, the motion of other limbs or organs could be detected; fast recognition of uncommon heart contractions, for instance, could be applicable in the automotive and health care industries. Several companies are active in this domain, using cameras or microphones as input for the machine. The Google Soli project explores the capability of radar systems for recognizing hand gestures. Earlier, radars were only used to detect vehicles; due to recent advances in semiconductor technology, the radar has been miniaturized and hence is more applicable in embedded devices. The data fed to the models is available at [1] and consists of micro-Doppler radar frames. The term 'micro' stresses the ability to detect micro-motions w.r.t. the moving object. Furthermore, [1] proposes a train-test split of the available data in which each person performs gestures for both the training set and the test set. As this can suffer from data leakage, a second cross-validation strategy is researched, in which a person performs gestures for either the training set or the test set. Section II starts with a description of the data, which is fed to DNNs. The architecture of the DNNs and their results are listed in Section III. The relevant concepts of machine and deep learning are explained in [2]. Section IV introduces the mathematical framework of the state-of-the-art techniques: variational recurrent neural networks (VRNNs). The results of applying these are given in Section V.

J. Sarrazin is a thesis student at the SUMO Lab (UGent), Ghent, Belgium. E-mail: [email protected].

II. MICRO-DOPPLER RADAR DATA

A common radar system transmits an electromagnetic wave and scans the object in front of it. This mechanism is called beam steering and results in a high spatial resolution. The spatial resolution is given by Equation 1, where r is the range between object and radar, l the aperture size and λ the wavelength.

\[ \text{res}_a = \frac{r\lambda}{l} \tag{1} \]

Assume a radar at a working frequency of 60 GHz. If this radar needs to have a spatial resolution of 1 cm at a range of 20 cm, its aperture will have to measure 10 × 10 cm. Radar systems of that size are not usable in wearable technologies. The Soli radar does not use this beam steering technique; instead it illuminates the complete object with one wide beam. The lack of spatial resolution is compensated by a high temporal resolution. The Soli sensor chip is an FMCW (frequency modulated continuous wave) SiGe radar chip. It modulates the transmitted wave, leading to periodic pulses at a high frequency. The period of one such pulse is given by the RRI, the radar repetition interval. The transmitted signal is expressed in Equation 2, where u describes the envelope of the modulated signal.

\[ s_{tr}(t, T) = u(t - T)\, e^{j2\pi f_c t} \tag{2} \]

There are two distinct time scales involved. The unit of the slow time T is the RRI; the fast time t is the time scale during one period.
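As a quick numerical check of Equation 1 and the example above, a short Python sketch (variable names are illustrative, not from the thesis):

```python
# Sketch: required aperture for a given spatial resolution (Equation 1).
# res_a = r * wavelength / l  =>  l = r * wavelength / res_a
c = 3e8                          # speed of light [m/s]
fc = 60e9                        # carrier frequency [Hz]
wavelength = c / fc              # 5 mm at 60 GHz
r = 0.20                         # range between object and radar [m]
res_a = 0.01                     # desired spatial resolution [m]

l = r * wavelength / res_a       # required aperture size [m]
print(f"required aperture: {l * 100:.0f} cm")   # -> 10 cm, i.e. a 10 x 10 cm antenna
```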

The hand to be recognized is modeled by N_sc scattering centres. These centres scatter the transmitted wave, and reflections of these scatterings are captured by the receiving antenna element in the radar. A general expression for the superposition of these scattered waves is given by Equation 3, with ρ_i(T) the reflectivity parameter and r_i(T) the radial distance of scattering centre i; these do not vary during one modulation period.

\[ y(r, T) = \sum_{i=1}^{N_{sc}} \rho_i(T)\, \delta\!\left(r - r_i(T)\right) \tag{3} \]

A more exact expression for the raw received signal is given by Equations 4 and 5.

\[ s_{raw}(t, T) = \sum_{i=1}^{N_{sc}} s_{i,raw}(t, T) \tag{4} \]

\[ s_{i,raw}(t, T) = \frac{\rho_i(T)}{r_i^4(T)}\, u\!\left(t - \frac{2r_i(T)}{c}\right) e^{j2\pi f_c \left(t - \frac{2r_i(T)}{c}\right)} \tag{5} \]


After processing the raw signals, the received signal for each scattering centre is given by Equation 6, where h is the point target response. It depends on the modulation scheme, the transmission parameters and the preprocessing steps.

\[ s_{i,rec}(t, T) = \frac{\rho_i(T)}{r_i^4(T)}\, e^{j\frac{4\pi r_i(T)}{\lambda}}\, h\!\left(t - \frac{2r_i(T)}{c}\right) \tag{6} \]

From the fast time delay of h, the range can be deduced. The exceptional motion sensitivity of the Soli radar is due to the phase change. The phase change of a centre i moving during the interval [T1, T2] is given by Equation 7.

\[ \Delta\phi_i(T_1, T_2) = \frac{4\pi}{\lambda}\left(r_i(T_2) - r_i(T_1)\right) \tag{7} \]

The velocity of a scattering centre is assumed to be constant over the coherent processing time T_cpi. T_cpi has to be larger than the RRI, as the range was constant during a modulation period. Using the definition of angular frequency and Equation 7, the Doppler frequency (shifted frequency) of the received signal can be related to the velocity of the detected scattering centre.

\[ f_{D,i}(T) = \frac{1}{2\pi}\,\omega_i(T) = \frac{1}{2\pi}\,\frac{d\phi_i(T)}{dT} = \frac{2v_i(T)}{\lambda} \tag{8} \]

The received signal s_rec(t, T) can now be transformed to a more informative representation. The transformation at hand is a Fourier transformation in the slow time dimension. As mentioned, the slow time scale is the relevant scale over which the scattering centres' properties change; Equation 8 does not depend on fast time. The transformation of the signal is expressed in Equation 9. This leads to a 3D representation of the data. The slow time T is still a dimension of the data, because the Fourier transformation is only performed over intervals of length T_cpi instead of over the complete domain.

\[ S(t, f, T) = \int_{T}^{T + T_{cpi}} s_{rec}(t, T')\, e^{-j2\pi f T'}\, dT' \tag{9} \]

The frequency dimension f in Equation 9 can be converted to a velocity dimension using Equation 8. The fast time dimension t can be converted to a range dimension using the fast time delay t = 2r/c. This leads to micro-Doppler radar frames.

\[ RD(r, v, T) = S\!\left(\frac{2r}{c},\, \frac{2v}{\lambda},\, T\right) \tag{10} \]
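A minimal sketch of this transformation, assuming a demodulated fast-time × slow-time matrix is available (shapes and window length are illustrative, not the actual Soli processing chain):

```python
import numpy as np

# Sketch: turn a complex fast-time x slow-time matrix into one range-Doppler
# frame, mimicking Equations 9-10 with a discrete Fourier transform over one
# coherent processing interval.
def range_doppler_frame(s_rec, n_cpi=32):
    """s_rec: complex array (n_fast, n_slow); returns |RD| of shape (n_fast, n_cpi)."""
    window = s_rec[:, :n_cpi]                        # slow-time interval of length T_cpi
    # FFT along slow time: fast-time bins map to range, Doppler bins to velocity
    rd = np.fft.fftshift(np.fft.fft(window, axis=1), axes=1)
    return np.abs(rd)

rng = np.random.default_rng(0)
s_rec = rng.standard_normal((32, 64)) + 1j * rng.standard_normal((32, 64))
print(range_doppler_frame(s_rec).shape)              # (32, 32), like the dataset frames
```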

The Soli radar system has 4 receiving antenna elements, leading to 4 different representations (channels). Only one channel is used, as the cross correlations between the channels do not improve the accuracies, as stated by [1]. More information can be found in [3]. The data consists of 40 frames, each 32 × 32 large. After reshaping, the dimensions of the data are: set size × 40 × 1024. A train-test split is proposed by [1]: the training set contains 10 different persons performing gestures, and the testing set contains gestures of the same 10 persons. In the cross-validation strategy proposed in [1], the training set is split into 10 folds, each containing the data of one person. This strategy will be denoted '10-fold cross-validation'. As the proposed train-test split can suffer from data leakage, another split is made, in which the data of the 10 persons is split so that the gestures of a person can only occur in either the training set or the test set. A similar cross-validation strategy is used on this split; however, as there is only data of 5 different persons in the training set, a 5-fold cross-validation is performed. This strategy will be denoted '5-fold cross-validation'. The number of folds thus indicates which train-test split is used.
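A sketch of the leakage-free, person-wise split using scikit-learn's GroupKFold; the arrays X, y and person_id are hypothetical placeholders:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Sketch: person-wise cross-validation, so that one person's gestures never
# appear in both training and test folds. X, y and person_id stand in for the
# reshaped data (set size x 40 x 1024), the labels and the person per sample.
rng = np.random.default_rng(0)
X = rng.random((50, 40, 1024))
y = rng.integers(0, 11, size=50)
person_id = np.arange(50) % 5                 # 5 persons, 10 samples each

for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=person_id):
    # each person ends up entirely on one side of the split: no data leakage
    assert not set(person_id[train_idx]) & set(person_id[test_idx])
```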

III. DNN PERFORMANCES

The model used to classify the data samples is a DNN based on [1]. The complete architecture is given in Table I. The values of the dropout layers are changed w.r.t. [1], based on accuracies obtained on validation sets: in [1] the first three dropout layers have a dropout rate of 0.4 and the last two a rate of 0.5, while in the self-built DNN the respective rates are 0.25 and 0.3. The convolutional layers use a kernel of size 3 × 3 and strides 1 × 1. Each layer has a rectified linear unit (ReLU) activation function, except for the last layer, which has a softmax activation. The DNN is always trained with the Adam optimizer using a learning rate of 10^-5 for 50 epochs; the batch size equals 25. The training data is min-max scaled over all pixels before training, and the obtained min-max scaler is subsequently applied to the test set. The DNN models will be the benchmarks for the state-of-the-art techniques described in Section IV.

TABLE I
SUMMARY OF THE ARCHITECTURE OF THE DNN.

Type of layer | Output dimension
Input layer | batch × 40 × 1024
Reshape layer | batch × 40 × 1 × 32 × 32
Batch normalization | batch × 40 × 1 × 32 × 32
Time-distributed convolutional layer | batch × 40 × 32 × 30 × 30
Batch normalization | batch × 40 × 32 × 30 × 30
Dropout layer | batch × 40 × 32 × 30 × 30
Time-distributed convolutional layer | batch × 40 × 64 × 28 × 28
Batch normalization | batch × 40 × 64 × 28 × 28
Dropout layer | batch × 40 × 64 × 28 × 28
Time-distributed convolutional layer | batch × 40 × 128 × 26 × 26
Batch normalization | batch × 40 × 128 × 26 × 26
Dropout layer | batch × 40 × 128 × 26 × 26
Reshape layer | batch × 40 × 86528
Fully connected layer | batch × 40 × 512
Batch normalization | batch × 40 × 512
Dropout layer | batch × 40 × 512
Fully connected layer | batch × 40 × 512
Batch normalization | batch × 40 × 512
Dropout layer | batch × 40 × 512
LSTM | batch × 40 × 512
Batch normalization | batch × 40 × 512
Dropout layer | batch × 40 × 512
Fully connected layer (output) | batch × 40 × 11

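A condensed Keras sketch of the Table I architecture with the training settings stated above (a sketch of the approach, not the exact thesis code; Keras places channels last, whereas Table I lists them first):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of the Table I architecture: time-distributed CNN -> dense -> LSTM.
# Dropout rates 0.25/0.3 and Adam(1e-5) follow the text; other details are
# illustrative and may differ from the thesis implementation.
def build_dnn(n_frames=40, n_classes=11):
    inp = keras.Input(shape=(n_frames, 1024))
    x = layers.Reshape((n_frames, 32, 32, 1))(inp)   # channels-last frames
    x = layers.BatchNormalization()(x)
    for filters in (32, 64, 128):                    # 3x3 kernels, 1x1 strides
        x = layers.TimeDistributed(layers.Conv2D(filters, 3, activation="relu"))(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(0.25)(x)
    x = layers.TimeDistributed(layers.Flatten())(x)  # batch x 40 x 86528
    for _ in range(2):
        x = layers.TimeDistributed(layers.Dense(512, activation="relu"))(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(0.3)(x)
    x = layers.LSTM(512, return_sequences=True)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.3)(x)
    out = layers.TimeDistributed(layers.Dense(n_classes, activation="softmax"))(x)
    model = keras.Model(inp, out)
    model.compile(optimizer=keras.optimizers.Adam(1e-5),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```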

A. 10-fold cross-validation

Figure 1 shows the obtained accuracies on all frames, while Figure 2 shows the obtained accuracies on the last frame. The average accuracy over all frames is 81.8%, while on the last frame the average accuracy is 90%.


[Fig. 1: Normalized confusion matrix over all frames (true versus predicted label, classes 0-10). Preprocessing consisted only of a min-max scaling. The validation strategy was a 10-fold cross-validation.]

[Fig. 2: Normalized confusion matrix on the last frame (true versus predicted label, classes 0-10). Preprocessing consisted only of a min-max scaling. The validation strategy was a 10-fold cross-validation.]

B. 5-fold cross-validation

Figure 3 shows the obtained accuracies on all frames, while Figure 4 shows the obtained accuracies on the last frame. The average accuracy over all frames is 64.8%, while on the last frame the average accuracy is 72.7%.

IV. VARIATIONAL RECURRENT NEURAL NETWORKS (VRNNS)

The DNNs will be compared to variational recurrent neural networks (VRNNs), which are discussed in [4]. A VRNN is a type of variational autoencoder (VAE), discussed in [5]. This section introduces the concepts of VAEs. VAEs assume that the occurrence of a data sample x_i is part of a process in which initially an underlying, unknown state z_i is drawn; consequently x_i depends on the state z_i. This is shown schematically in the graphical model of Figure 5. A VAE tries to learn the identity function to reconstruct the input x. It seems trivial to construct a function which maps its input onto itself.

[Fig. 3: Normalized confusion matrix over all frames (true versus predicted label, classes 0-10). Preprocessing consisted only of a min-max scaling. The validation strategy was a 5-fold cross-validation.]

[Fig. 4: Normalized confusion matrix on the last frame (true versus predicted label, classes 0-10). Preprocessing consisted only of a min-max scaling. The validation strategy was a 5-fold cross-validation.]

However, some constraints are imposed on the VAE: at some point during learning, the input has to be transformed into a lower-dimensional latent space representation. This representation can be used for dimensionality reduction (a classic autoencoder) or data augmentation (VAE). A classic autoencoder only learns the representations, while a VAE learns their underlying distribution, from which samples can be drawn.

In the light of probability theory, several distributions can be constructed from the two spaces: the evidence p_θ(x), the likelihood p_θ(x|z), the posterior p_θ(z|x) and the prior p_θ(z). The θ denotes the parameters of the corresponding distribution. In addition, VAEs assume that the prior is a unit normal distribution and the likelihood a general normal distribution. The variational autoencoder tries to maximize p_θ(x) by inferring it from the modeled joint probability p(x, z). The joint probability can be constructed using the assumption of the graphical model: first a latent space state z_i is drawn from p_θ(z), followed by a data sample drawn from the likelihood p_θ(x|z), given z_i. To infer p_θ(z|x) and p_θ(x), the integral appearing in Bayes' law has to be evaluated.


[Fig. 5: Graphical model for a variational autoencoder: an arrow from z to x inside a plate repeated N times. A filled circle means the information is known; arrows indicate the information flow.]


\[ p_\theta(z|x) = \frac{p_\theta(x|z)\, p_\theta(z)}{p_\theta(x)} = \frac{p_\theta(x|z)\, p_\theta(z)}{\int p_\theta(x|z)\, p_\theta(z)\, dz} \tag{11} \]

However, this integral is intractable, as the integration ranges over the complete latent space. The integral can be discretized, but the summation would take too long: it depends on the random sampling of latent space states, and most of these random samples do not lead to a relevant data space counterpart, so p_θ(x|z) will be very low. To feed the summation with relevant latent space states, the posterior could be used, but then the problem loops back to the integral in Equation 11. To break the loop, the posterior is approximated by q_φ(z|x), which is also assumed to be a normal distribution. This is not necessary, but it simplifies the mathematics. Constructing the loss function starts from maximizing the evidence.

\[ \log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x, z)}{p_\theta(z|x)}\right] = \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x, z)\right] - \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(z|x)\right] \tag{12} \]

This cannot be used as a loss function, as the intractable posterior is involved. With the addition and subtraction of E_{q_φ(z|x)}[log q_φ(z|x)], the evidence can be transformed into

\[ \log p_\theta(x) = \mathcal{L}(\phi, \theta; x) + D_{KL}\!\left[q_\phi(z|x)\,\|\,p_\theta(z|x)\right] \tag{13} \]

with DKL the Kullback-Leibler (KL) divergence and

\[ \mathcal{L}(\phi, \theta; x) = \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x, z)\right] - \mathbb{E}_{q_\phi(z|x)}\!\left[\log q_\phi(z|x)\right] \tag{14} \]

Applying Bayes' rule to the joint distribution in Equation 14 leads to Equation 15.

\[ \mathcal{L}(\phi, \theta; x) = \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right] - D_{KL}\!\left[q_\phi(z|x)\,\|\,p_\theta(z)\right] \tag{15} \]

The KL divergence measures the similarity between distributions. It can be shown that the KL divergence is non-negative. From this property it follows that the function L in Equation 13 is a lower bound on the evidence. As the true posterior p_θ(z|x) and the approximated posterior q_φ(z|x) have to be as similar as possible, the KL divergence appearing in Equation 13 has to be as small as possible, while the evidence on the left-hand side of this expression needs to be maximized. The problem of maximizing the evidence in Equation 13 can thus be converted to maximizing the lower bound L(φ, θ; x), given in Equation 15, in which only tractable distributions appear. Consequently Equation 15 will be used as the loss function of the VAE.
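For a unit normal prior and diagonal Gaussian encoder and decoder, the negative of Equation 15 takes a closed form; a minimal TensorFlow sketch (tensor names are illustrative):

```python
import tensorflow as tf

# Sketch: negative ELBO (Equation 15) for a VAE with a unit-normal prior and a
# diagonal-Gaussian encoder. mu/logvar are the encoder outputs, x_hat is the
# decoder mean; names are illustrative, not from the thesis code.
def neg_elbo(x, x_hat, mu, logvar, dec_var=1e-6):
    # Gaussian log-likelihood term E_q[log p(x|z)] (up to an additive constant)
    rec = tf.reduce_sum(tf.square(x - x_hat), axis=-1) / (2.0 * dec_var)
    # Closed-form KL[q(z|x) || N(0, I)] for diagonal Gaussians
    kl = 0.5 * tf.reduce_sum(tf.exp(logvar) + tf.square(mu) - 1.0 - logvar, axis=-1)
    return tf.reduce_mean(rec + kl)
```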

A VRNN contains a VAE at each time step. In contrast with a standard VAE, p_θ(z) is not assumed to be the unit normal distribution; instead the prior parameters are conditioned on the previous state of an LSTM, h_{t-1}.

\[ \left(\mu_{prior,t},\ \sigma^2_{prior,t}\right) = \psi^{prior}_\tau(h_{t-1}) \tag{16} \]

The function ψ^prior_τ denotes the neural network learning the prior parameters at time step τ. The likelihood parameters depend on the latent space input z, but now, in a similar way as for the prior, a dependency on h_{t-1} is added.

\[ \left(\mu_{dec,t},\ \sigma^2_{dec,t}\right) = \psi^{dec}_\tau\!\left(\psi^{z}_\tau(z_t),\ h_{t-1}\right) \tag{17} \]

The neural network learning the likelihood parameters is called the 'decoding' network ψ^dec_τ. The function ψ^z_τ indicates that, instead of the latent space samples themselves, a transformation of them can be used. For the neural network learning the posterior parameters a similar expression can be written down; this is called the 'encoding' network.

\[ \left(\mu_{enc,t},\ \sigma^2_{enc,t}\right) = \psi^{enc}_\tau\!\left(\psi^{x}_\tau(x_t),\ h_{t-1}\right) \tag{18} \]

Once the posterior parameters are known, latent samples can be drawn to feed the decoding network. This is done with the reparameterization trick, introduced by [6]. Instead of treating z as a random latent variable, it can be treated as a deterministic variable depending on an auxiliary random variable ε, which is normally distributed, N(0, I). If ε is drawn from this normal distribution, z can be computed as μ_enc + ε ⊙ σ_enc. The loss function of a VRNN is the lower bound given in Equation 15, averaged over the time steps. At each time step, the loss function consists of the sum of the separate contributions per latent dimension, each of which is univariate normally distributed.
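A minimal sketch of the reparameterization trick (illustrative names):

```python
import tensorflow as tf

# Sketch: the reparameterization trick. Sampling z ~ N(mu, sigma^2) is rewritten
# as a deterministic function of (mu, sigma) plus auxiliary noise eps ~ N(0, I),
# so that gradients can flow through mu and sigma during training.
def sample_latent(mu, logvar):
    eps = tf.random.normal(tf.shape(mu))          # eps ~ N(0, I)
    return mu + eps * tf.exp(0.5 * logvar)        # z = mu + eps * sigma
```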

For the practical implementation of the VRNN, two data inputs are required: on the one hand x_t is used to feed ψ^x_τ; on the other hand x_{t-1}, a shifted version of x_t, is used to feed the LSTM and obtain h_{t-1}. Once the posterior/encoding parameters are known, latent space representations of the input can be drawn, on which a classification model is built. The slightly noisy drawn samples can be seen as a regularizing technique. A schematic representation of the VRNN is given in Figure 6. The decoding variance is set to 10^-6 in order to make the reconstruction more effective.

V. LATENT SPACE PERFORMANCES

The model with the latent space samples as input is called the 'post-VRNN classifier'. This is a sequential model; its architecture is given in Table II. Each layer has a ReLU activation, except for the last layer, which has a softmax activation. The post-VRNN classifier is trained for 50 epochs with the Adam optimizer, using a learning rate of 10^-4 and a batch size of 25.


[Fig. 6: Schematic representation of the architecture of the variational recurrent autoencoder. The inputs x_t (through φ^x(x_t)) and x_{t-1} (through the LSTM state h_{t-1}) feed fully connected layers producing the encoder parameters μ_enc and log σ²_enc; sampling yields z_t and φ^z(z_t), which feed the decoder parameters μ_dec (with σ²_dec = 10^-6); h_{t-1} also feeds fully connected layers producing the prior parameters μ_prior and σ²_prior. Blue represents input, and orange represents information needed for calculating the loss function.]

The VRNN has the same training specifics, except that it is trained for 1000 epochs instead of 50. The hyperparameters of both the VRNN and the post-VRNN classifier are optimized with the tree-structured Parzen estimator (TPE) using 25 optimization loops; the TPE optimizer from the Python Hyperopt library is used. More information on the optimization algorithm can be found in [7]. This section lists the results of the optimization and the performance of the VRNN combined with the post-VRNN classifier. The input data is always transformed to a latent space of 20 dimensions. The VRNN is optimized for one fold and the results are extended to all other folds; the post-VRNN classifier is optimized for each fold separately.
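A sketch of this TPE search with Hyperopt, mirroring the hyperparameter priors reported in Tables III and IV; train_and_validate is a hypothetical stand-in for training the VRNN and classifier:

```python
from hyperopt import fmin, tpe, hp

# Sketch: TPE search over the VRNN dimensions with Hyperopt, using the priors
# of Tables III/IV (discrete uniform on [10, 500] with step size 10).
space = {
    "dim_h":     hp.quniform("dim_h",     10, 500, 10),
    "dim_phi_x": hp.quniform("dim_phi_x", 10, 500, 10),
    "dim_phi_z": hp.quniform("dim_phi_z", 10, 500, 10),
}

def train_and_validate(dim_h, dim_phi_x, dim_phi_z):
    # Hypothetical stand-in: train the VRNN + classifier for these dimensions
    # and return the validation error (1 - accuracy).
    return (dim_h + dim_phi_x + dim_phi_z) / 1500.0

def objective(params):
    return train_and_validate(**{k: int(v) for k, v in params.items()})

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=25)
print(best)
```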

TABLE II
SUMMARY OF THE GENERAL ARCHITECTURE OF THE POST-VRNN CLASSIFIER.

Type of layer
Input layer
Batch normalization
Time-distributed fully connected layer
Dropout layer
Batch normalization
LSTM
Dropout layer
Batch normalization
Time-distributed fully connected layer (output)
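A Keras sketch of the Table II classifier on top of the 20-dimensional latent sequences (the hidden width and dropout rate are assumptions; the tuned per-fold values are omitted in the article):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of the Table II classifier over latent sequences. Width and dropout
# rate are placeholders; the optimized per-fold values are not reported here.
def build_post_vrnn_classifier(n_frames=40, latent_dim=20, n_classes=11, width=128):
    model = keras.Sequential([
        layers.InputLayer(input_shape=(n_frames, latent_dim)),
        layers.BatchNormalization(),
        layers.TimeDistributed(layers.Dense(width, activation="relu")),
        layers.Dropout(0.3),
        layers.BatchNormalization(),
        layers.LSTM(width, return_sequences=True),
        layers.Dropout(0.3),
        layers.BatchNormalization(),
        layers.TimeDistributed(layers.Dense(n_classes, activation="softmax")),
    ])
    model.compile(optimizer=keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```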

A. 10-fold cross-validation

Table III shows the optimized values for the hyperparameters of the VRNN. The optimized hyperparameters of the post-VRNN classifier differ for each fold and are omitted for brevity. With the optimized VRNN and post-VRNN classifier, the following results are obtained: Figure 7 shows the accuracies on all frames, while Figure 8 shows the accuracies on the last frame. The average accuracy over all frames is 90.7%; on the last frame it is 96.8%.

TABLE III
SUMMARY OF THE OPTIMIZED HYPERPARAMETERS OF THE VRNN IN THE 10-FOLD CROSS-VALIDATION STRATEGY.

Hyperparameter | Prior distribution on hyperparameter | Result
dimension h_{t-1} | discrete uniform [10, 500], step size 10 | 310
dimension φ^x(x_t) | discrete uniform [10, 500], step size 10 | 370
dimension φ^z(z_t) | discrete uniform [10, 500], step size 10 | 330


[Fig. 7: Normalized confusion matrix over all frames obtained with the post-VRNN classifier (true versus predicted label, classes 0-10). Preprocessing consisted only of a min-max scaling. The validation strategy was a 10-fold cross-validation.]

[Fig. 8: Normalized confusion matrix on the last frame obtained with the post-VRNN classifier (true versus predicted label, classes 0-10). Preprocessing consisted only of a min-max scaling. The validation strategy was a 10-fold cross-validation.]


B. 5-fold cross-validation

Table IV shows the optimized values for the hyperparameters of the VRNN. The optimized hyperparameters of the post-VRNN classifier differ for each fold and are omitted for brevity. With the optimized VRNN and post-VRNN classifier, the following results are obtained: Figure 9 shows the accuracies on all frames, while Figure 10 shows the accuracies on the last frame. The average accuracy over all frames is 67%; on the last frame it is 80.5%.


TABLE IV
SUMMARY OF THE OPTIMIZED HYPERPARAMETERS OF THE VRNN IN THE 5-FOLD CROSS-VALIDATION STRATEGY.

Hyperparameter | Prior distribution on hyperparameter | Result
dimension h_{t-1} | discrete uniform [10, 500], step size 10 | 250
dimension φ^x(x_t) | discrete uniform [10, 500], step size 10 | 300
dimension φ^z(z_t) | discrete uniform [10, 500], step size 10 | 260

[Fig. 9: Normalized confusion matrix over all frames obtained with the post-VRNN classifier (true versus predicted label, classes 0-10). Preprocessing consisted only of a min-max scaling. The validation strategy was a 5-fold cross-validation.]

VI. SUMMARY & CONCLUSIONS

Table V summarizes the results of this report.

TABLE V
SUMMARY OF THE OBTAINED RESULTS WITH THE DNN AND LATENT SPACE MODEL.

Model | Folds | Accuracy over all frames | Accuracy on last frame
DNN | 5 | 64.8% | 72.7%
DNN | 10 | 81.8% | 90%
post-VRNN | 5 | 67% | 80.5%
post-VRNN | 10 | 90.7% | 96.8%

It is clear that the latent space models consistently achieve considerably higher accuracies than the sequential DNNs; they are an improvement on the model from [1]. Even more worth mentioning is that the latent space models use approximately 23 times fewer trainable parameters than the DNN. Furthermore, the 10-fold cross-validation strategy performs better than the 5-fold strategy. However, the 10-fold strategy does not generalize well to the recognition of gestures, as each person is represented in both training and test set; the 5-fold strategy is better suited to recognizing hand gestures universally. Based on all the confusion matrices in this report, the gestures can be split into two groups: a group which is easily recognized and a group which is harder to recognize. The first group has fairly high accuracies over all frames. The latter only has respectable accuracies on the last frame, meaning these gestures benefit from the temporal modeling by the LSTM; in these gestures the fingers contain the most information, like pinch pinky. The more easily recognized gestures are those where the hand performs the motion as a whole, like push or slow swipe.

[Fig. 10: Normalized confusion matrix on the last frame obtained with the post-VRNN classifier (true versus predicted label, classes 0-10). Preprocessing consisted only of a min-max scaling. The validation strategy was a 5-fold cross-validation.]


The same can be concluded by comparing the results from the 10-fold cross-validation with those of the 5-fold cross-validation. The 5-fold strategy was the better-suited approach for universal recognition, and the gestures which are more easily recognized under the 5-fold strategy are again those where the hand moves as a whole.

ACKNOWLEDGMENTS

This report is the summary of a thesis conducted at the SUMO Lab at UGent. The work was supported and supervised by ir. Nicolas Knudde, dr. ir. Ivo Couckuyt and prof. dr. ir. Tom Dhaene.

REFERENCES

[1] Wang, S. (2016). Gesture Recognition Using Neural Networks with Google's Project Soli Sensor. GitHub, consulted on February 1, 2018 via https://github.com/simonwsw/deep-soli

[2] Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press, consulted on July 25, 2018 via https://github.com/janishar/mit-deep-learning-book-pdf

[3] Lien, J., Gillian, N., Karagozler, E., Amihood, P., Schwesig, C., Olson, E., Raja, H. & Poupyrev, I. (2016). Soli: Ubiquitous Gesture Sensing with Millimeter Wave Radar. ACM Transactions on Graphics (TOG), 35, Article No. 142.

[4] Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. & Bengio, Y. (2015). A Recurrent Latent Variable Model for Sequential Data. NIPS'15 Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 2, pages 2980-2988. Consulted on September 26, 2017 via https://arxiv.org/pdf/1506.02216.pdf

[5] Kingma, D. P. & Welling, M. (2013). Auto-Encoding Variational Bayes. CoRR. Consulted on September 26, 2017 via https://arxiv.org/pdf/1312.6114.pdf

[6] Kingma, D. P., Salimans, T. & Welling, M. (2015). Variational Dropout and the Local Reparameterization Trick. NIPS'15 Proceedings of the 28th International Conference on Neural Information Processing Systems, 28(2), pages 2575-2583. Consulted on May 31, 2018 via https://arxiv.org/abs/1506.02557

[7] Bergstra, J., Bardenet, R., Bengio, Y. & Kégl, B. (2011). Algorithms for Hyper-Parameter Optimization. NIPS'11 Proceedings of the 24th International Conference on Neural Information Processing Systems, pages 2546-2554. Consulted on April 25, 2018 via https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf


Contents

Abbreviations

1 Introduction 1

2 Machine Learning and Neural Network Concepts 3
  2.1 Machine Learning - Linear Regression 3
    2.1.1 Loss Functions 3
    2.1.2 Underfitting versus overfitting 5
    2.1.3 Bias versus variance 5
    2.1.4 Cross-validation 6
    2.1.5 Common regularization techniques 7
    2.1.6 Generative models and Bayesian estimation 8
  2.2 Neural Network 10
    2.2.1 Basics 10
    2.2.2 Activation functions 11
    2.2.3 Backpropagation 12
  2.3 Convolutional neural network (CNN) 14
  2.4 Recurrent neural network (RNN) 14
  2.5 Long short-term memory unit (LSTM) 15
  2.6 Adaptive moment (Adam) optimizer 17
  2.7 Regularization of Neural Networks 18
  2.8 Hyperparameter Optimization Technique 19
    2.8.1 Gaussian processes (GPs) 19
    2.8.2 Tree-structured Parzen estimator (TPE) 20

3 Data Description 22
  3.1 Doppler effect 22
  3.2 Soli hardware and micro-Doppler radar data 22
  3.3 The Google Deepsoli dataset 24

4 Classification with deep neural networks 28
  4.1 Preprocessing 28
  4.2 Model 28
  4.3 Results 30
    4.3.1 10-fold cross-validation with logarithmic scaling 30
    4.3.2 10-fold cross-validation without logarithmic scaling 31
    4.3.3 5-fold cross-validation with logarithmic scaling 33
    4.3.4 5-fold cross-validation without logarithmic scaling 34
    4.3.5 Summary 35

5 Autoencoders (AEs) 37
  5.1 Standard autoencoder (AE) 37
  5.2 Variational autoencoder (VAE) 37
  5.3 Variational recurrent neural network (VRNN): type I 39
  5.4 Variational recurrent autoencoder: type II 40


6 Classification using variational recurrent neural networks (VRNNs) 42
  6.1 Results VRNN type I 42
    6.1.1 10-fold cross-validation 43
    6.1.2 5-fold cross-validation 45
  6.2 Results VRNN type II 46
    6.2.1 10-fold cross-validation 47
    6.2.2 5-fold cross-validation 49
  6.3 Summary 50
  6.4 Hyperparameter optimization 51
  6.5 Intermediate results in the latent space 52

7 Conclusion 54

A Kullback-Leibler divergence in case of normal distributions 56
  A.1 Univariate normal distributions 56
  A.2 Multivariate normal distributions 57

B Non-negativity of Kullback-Leibler divergence 59
  B.1 Jensen's inequality 59
  B.2 Non-negativity of KL divergence 60


Abbreviations

AdaGrad adaptive gradient. 18
Adam adaptive moment. 17, 18, 28, 42, 46, 51, 54
AE autoencoder. 37, 38
BO Bayesian optimization. 19, 20
CNN convolutional neural network. 14, 15
DNN deep neural network. 29
FMCW frequency modulated continuous wave. 23
GP Gaussian process. 19–21
KL Kullback-Leibler. 4, 5, 38–40, 60
LSTM long short-term memory unit. 15, 16, 28, 29, 36, 39
MAP maximum a posteriori. 20
ML maximum likelihood. 8, 20
MSE mean squared error. 3, 5, 6, 8, 9
PoI Probability of Improvement. 20
ReLU rectified linear unit. 11–13, 28, 39, 42, 46
RMSE root mean squared error. 3, 52
RMSProp root mean square propagation. 18
RNN recurrent neural network. 14–16, 18, 39
SGD stochastic gradient descent. 17
TPE tree-structured Parzen estimator. 20, 51, 52
VAE variational autoencoder. 37–39
VRNN variational recurrent neural network. 37, 39–43, 45–47, 49–55


Chapter 1

Introduction

The goal of this thesis is to close the gap between humans and machine interfaces. In particular, a machine is taught to recognize and understand hand gestures in micro-Doppler radar data: swiping a finger in the air could, for example, cause scrolling on a distant screen. Cameras have traditionally been preferred over radars in such settings; radars were in the past mainly used to detect large vehicles, like airplanes. Nevertheless, radars have unique assets compared to cameras. Thanks to recent advances in semiconductor technology, radars can be miniaturized and are consequently more easily integrated in embedded devices, as was already the case for cameras. Moreover, incidental light, obstructing objects, smoke and mist can ruin the utility of camera data, while radars can detect motion behind obstacles, like a sheet of cloth. This makes radars suitable for processing into clothing, where they could monitor muscles other than those in the hand, e.g. the heart.

An application of such a radar in the automotive industry could be detecting a tired driver and warning him. A radar could also detect pedestrians and their corresponding speed, which could assist the driver or be implemented in autonomous vehicles; the detection of pedestrians using radars is discussed in [2]. In the health care industry, a heart scan requires (time-)expensive settings; this, combined with worried patients, only lengthens the queues. Monitoring the heart with a miniature radar could filter patients out of the queue. Another application is automatic fall detection, as described in [3]: a fallen elderly person who is found too late can have incurred extra injuries, which prolongs the recovery. Gesture recognition is also useful in the smart home environment. By detecting several movements, information about the absence or presence of a person can be retrieved; this information could automatically control household devices, like multimedia devices and central heating. The security industry is interested as well in the use of micro-Doppler radar frames: in critical infrastructures, such as airports and power plants, perimeters have to be set up, and objects approaching these perimeters have to be detected and classified. Such experiments were conducted by [4].

Several companies like Intel, Microsoft and Elliptic Labs are leading projects in this domain, using cameras or microphones as input channel. As mentioned, the use of micro-Doppler radar data is getting more attention. Human activities like running and walking are captured in micro-Doppler radar data and classified with support vector machines in [5] and [6], already obtaining accuracies of 90%. The Google Soli project is the best-known project making use of miniature radars and recognizing hand gestures with deep neural networks. This thesis is based on their previous work [1]. They developed the radar sensor, took care of the signal processing, described in [7], and established a recognition model. The last step of this pipeline is the guideline of this thesis: state-of-the-art techniques are compared to their deep sequential neural networks. Deep sequential neural networks are a form of deterministic modeling; this dissertation researches the performance of generative models. The models can be compared, as the data used is openly available at [8]. Everything is built in the Keras library using the Tensorflow backend, while the Soli project relies on the PyTorch library.

Chapter 2 of this report starts with the elementary concepts of machine learning and deep learning, in particular of sequential models. Chapter 5 is an extension of Chapter 2, including the theory of autoencoders. For a more extensive description, more information can be retrieved from Ian Goodfellow's book [9].


Most of the concepts are illustrated in a regression context to make analogies throughout this report. The first two steps of Soli's pipeline, the description of the data and the signal processing, are briefly discussed in Chapter 3, which also contains an analysis of the available data. Combining the information of Chapters 2 and 3, Chapter 4 presents the result of the imitation of Soli's deep sequential model; two preprocessing and two cross-validation strategies are compared therein. These results are the benchmarks against which the state-of-the-art techniques described in Chapter 5 are compared. The results obtained with the state-of-the-art techniques are presented in Chapter 6; two of these models are optimized using a Bayesian optimization technique, described in Chapter 2. Chapter 6 ends with a comparison in performance and architecture between the optimized state-of-the-art models and the deep sequential models. This comparison is closed with a short conclusion in Chapter 7.



Chapter 2

Machine Learning and Neural Network Concepts

2.1 Machine Learning - Linear Regression

Machine learning is a very general term. As stated by [10], in machine learning a computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E. There are several examples of typical machine learning problems. The standard textbook example is linear regression for predicting house prices: in this example T is the prediction itself, E consists of houses whose prices are known together with their specifics, and P is in this case the root mean squared error (RMSE). Another example is predicting whether a patient has cancer or not: here T is again the prediction, E consists of the known cases and P is, for example, the misclassification error. Both problems are said to be supervised, which means that the cases appearing in E have a label. There are also unsupervised problems, but these are beyond the scope of this section. The difference between the two problems is that the first predicts a continuous value, while the second predicts a discrete value; hence the first problem is called supervised regression and the second supervised classification. In order to tackle these problems a model has to be built. This includes preprocessing the data (E), choosing the appropriate algorithm (with the appropriate hyperparameters) to train on the data and choosing the appropriate loss function (P). Training boils down to minimizing this loss function using (a variant of) gradient descent.

2.1.1 Loss Functions

There are several loss functions which can be minimized. Table 2.1 gives an overview of the most common ones; y represents a true value, while ŷ represents a predicted value.

MSE is the loss function used in linear regression problems. It is a measure of the errors made between the real value and the predicted value.

Name | Expression
Mean squared error (MSE) | $\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$
Hinge loss | $\sum_{i=1}^{N}\max(0,\ 1 - y_i\hat{y}_i)$
Cross entropy | $-\sum_{i=1}^{N} y_i \log(\hat{y}_i)$
Kullback-Leibler divergence | $-\sum_{i=1}^{N} y_i \log\!\left(\frac{\hat{y}_i}{y_i}\right)$

Table 2.1: Overview of the most common loss functions.
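A small numpy sketch of the Table 2.1 losses (ŷ written as y_hat; for the cross entropy and KL divergence, y and y_hat are taken to be distributions):

```python
import numpy as np

# Sketch: the loss functions of Table 2.1 for vectors of true values y and
# predictions y_hat.
def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def hinge(y, y_hat):                      # labels y in {-1, +1}
    return np.sum(np.maximum(0.0, 1.0 - y * y_hat))

def cross_entropy(y, y_hat):              # y, y_hat interpreted as distributions
    return -np.sum(y * np.log(y_hat))

def kl_divergence(y, y_hat):
    return -np.sum(y * np.log(y_hat / y))

print(mse(np.array([1.0, 2.0]), np.array([1.1, 1.8])))    # 0.025
```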



Cross entropy, which finds its origin in information theory, is the loss function most used in classification problems. The entropy of a distribution P is defined in Equation 2.1; it represents the amount of information in an event drawn from this distribution P.

\[ H(P) = -\mathbb{E}_{x\sim P}\left[\log P(x)\right] = -\sum_x P(x)\log P(x) \tag{2.1} \]

The definition of (cross) entropy can be illustrated with the flipping of a coin. In such an experiment there are two possible outcomes: p is the probability that the outcome is heads, hence 1 − p is the probability that the outcome is tails. Following Equation 2.1, the entropy of such a distribution equals −(1 − p) log(1 − p) − p log(p). The entropy as a function of p is plotted in Figure 2.1. The expected amount of information is minimal for an experiment with a 'cheated' coin, where the result is consistently heads or tails (p = 1 or p = 0), and maximal when a fair coin is used (p = 0.5).

[Figure 2.1: The entropy H(p) of flipping a coin as a function of p. If the outcome is always heads (e.g. a cheated coin, p = 1), the distribution has minimal entropy: the expected amount of information per throw is 0. The expected amount of information is maximal when p = 0.5.]

In the light of information theory, the entropy is the expected number of bits (if the log is base 2) or nats (if the log is base e) needed to pass the information of an event drawn from a distribution P. On the one hand, if p were 0 or 1, no information would have to be passed; on the other hand, if p were 0.5, the information of the experiment could be passed by encoding the event as a 0 or a 1 (= 1 bit). If the true distribution of an experiment is P, but a distribution Q is used to encode the experiments, a different number of bits will be used. This is quantified by the cross entropy, given by Equation 2.2.

\[ H(P, Q) = -\mathbb{E}_{x\sim P}\left[\log Q(x)\right] \tag{2.2} \]

In the context of constructing a loss function, P corresponds to the true distribution from which an event x_i is drawn, and hence y_i = P(x_i) is the corresponding exact probability. Q, on the other hand, corresponds to the approximated distribution; the event x_i has another probability ŷ_i = Q(x_i) in this approximated distribution. The similarity between two distributions P and Q is measured by the Kullback-Leibler (KL) divergence D_KL(P||Q), which can be seen as the entropy of P relative to Q. From statistical physics it is known that entropy is additive; hence the cross entropy can also be written as in Equation 2.3.

\[ H(P, Q) = H(P) + D_{KL}(P\,\|\,Q) \tag{2.3} \]

Or, in other words, the KL divergence is defined as in Equation 2.4. One key property of the KL divergence is its non-negativity (see Appendix B).


\[
\begin{aligned}
D_{KL}(P\,\|\,Q) &= -\mathbb{E}_{x\sim P}\left[\log Q(x)\right] + \mathbb{E}_{x\sim P}\left[\log P(x)\right] \\
&= -\mathbb{E}_{x\sim P}\!\left[\log \frac{Q(x)}{P(x)}\right] \\
&= -\sum_x P(x)\log \frac{Q(x)}{P(x)}
\end{aligned}
\tag{2.4}
\]

In a similar way as the cross entropy, the KL divergence can be translated into a useful loss function. Furthermore, each summation can be replaced by an integration in the case of continuous variables.
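A quick numerical check of Equation 2.3 on the coin-flip example (a sketch; the biased coin Q is an arbitrary choice):

```python
import numpy as np

# Sketch: verify H(P, Q) = H(P) + D_KL(P || Q) (Equation 2.3) for two coins.
p = np.array([0.5, 0.5])      # fair coin: the true distribution P
q = np.array([0.8, 0.2])      # biased coin: the encoding distribution Q

entropy = -np.sum(p * np.log2(p))                # H(P) = 1 bit
cross_entropy = -np.sum(p * np.log2(q))          # H(P, Q)
kl = np.sum(p * np.log2(p / q))                  # D_KL(P || Q) >= 0

print(np.isclose(cross_entropy, entropy + kl))   # True
```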

2.1.2 Underfitting versus overfitting

The most important issue in training a machine learning algorithm is finding the balance between overfitting and underfitting. The training data is split in at least two parts, a training part and a testing part. In the case of underfitting, the model (algorithm) is too simple to describe the data: it fails on both the training and the test set, and hence the error on both sets is high. Again, the prediction of house prices is the best example to illustrate this concept. Suppose a model has to learn the dependency between the house price and the available living area (in m²). The model can be of the form ŷ_i = θ_1 x_i + θ_0, with ŷ_i the predicted house price of the sample at hand and x_i its available living area. Plugging this expression into the MSE loss function leads to Equation 2.5.

\[ L(\theta_0, \theta_1; x) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \theta_1 x_i - \theta_0\right)^2 \tag{2.5} \]

The derivatives with respect to θ_0 and θ_1 can be calculated and lead to their optimal values via gradient descent. However, the underlying truth of the dependency could be quadratic and not linear, as the model a priori insinuates. This is an example of underfitting, illustrated in Figure 2.2. Next to choosing a model that is too simple, the model can also be chosen too complex, for example ŷ = θ_4 x⁴ + θ_3 x³ + θ_2 x² + θ_1 x + θ_0. In this case the error on the training set will be low; however, the model does not generalize enough and the error on the test set is still high. The case of overfitting is illustrated in Figure 2.3.

Figure 2.2: Example of underfitting. The model will never be able to predict well on unseen samples.

Figure 2.3: Example of overfitting. The model will never be able to predict well on unseen samples.
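The behaviour of Figures 2.2 and 2.3 can be reproduced with a few lines of numpy (a minimal sketch; the quadratic ground truth and noise level are assumptions of this example, not thesis data):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n=20):
    # hypothetical quadratic ground truth with additive noise
    x = rng.uniform(0, 1, n)
    return x, 2.0 * x**2 + 0.5 + 0.05 * rng.normal(size=n)

x_train, y_train = sample()
x_test, y_test = sample()

for degree in (1, 2, 4):    # too simple, adequate, too complex
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares fit of the MSE
    mse_train = np.mean((np.polyval(coeffs, x_train) - y_train)**2)
    mse_test = np.mean((np.polyval(coeffs, x_test) - y_test)**2)
    print(degree, round(mse_train, 4), round(mse_test, 4))
```

The degree-1 model shows a high error on both sets (underfitting), while the degree-4 model tends to show a low training error but a higher test error (overfitting).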

2.1.3 Bias versus variance

In the light of underfitting and overfitting in regression problems, two other concepts are introduced: the bias and the variance of a model. Assume a function f has to be approximated in the presence of additive noise on the image space (see Equation 2.6). The noise has zero mean and variance σ²_noise.

\[ y(x) = f(x) + \varepsilon \qquad (2.6) \]

The model bias is the capability of the model to approach the ground truth. The model variance describes the sensitivity of the model to variability in the training data. A model depends on the function hypothesis space, the training approach, the approach for tuning the hyperparameters, the training data set and the training set size. The model bias and variance refer to the mean and variance across estimators trained on all possible same-size sets of training data which could be sampled from the world. A more mathematical expression is given in Equations 2.7 and 2.8. The expectation ranges over different θ due to the several different training sets. The data sample x is an unseen sample. These averages are impossible to compute, because f(x) is not exactly known.

\[ \mathrm{bias}_\theta = \mathbb{E}\left[f_\theta(x) - f(x)\right] \qquad (2.7) \]

\[ \mathrm{var}_\theta = \mathbb{E}\left[\left(f_\theta(x) - \mathbb{E}\left[f_\theta(x)\right]\right)^2\right] = \mathbb{E}\left[(f_\theta(x))^2\right] - \left(\mathbb{E}\left[f_\theta(x)\right]\right)^2 \qquad (2.8) \]

To bypass this problem, the expected MSE for the unseen data sample x can be calculated. Equation 2.9 shows that the expected MSE can be written as a function of a model bias term, a model variance term and a noise term. The latter is the lower bound of the expected MSE. The calculations make use of the definition of the variance of a random variable. From Equation 2.6 and the distribution of the noise it follows that E[y(x)] = f(x) and var(y(x)) = σ²_noise. From the fact that f(x) is deterministic (it does not depend on model parameters), it follows that E[f(x)] = f(x).

\begin{align*}
\mathbb{E}\left[\left(y(x) - f_\theta(x)\right)^2\right] &= \mathbb{E}\left[(y(x))^2 - 2 f_\theta(x) y(x) + (f_\theta(x))^2\right] \\
&= \mathbb{E}\left[(y(x))^2\right] - 2\,\mathbb{E}\left[f_\theta(x) y(x)\right] + \mathbb{E}\left[(f_\theta(x))^2\right] \\
&= \mathrm{var}(y(x)) + \left(\mathbb{E}[y(x)]\right)^2 - 2\,\mathbb{E}\left[f_\theta(x) y(x)\right] + \left(\mathbb{E}\left[f_\theta(x)\right]\right)^2 + \mathrm{var}_\theta \\
&= \sigma^2_{\text{noise}} + \left(\mathbb{E}[f(x)]\right)^2 - 2\,\mathbb{E}\left[f_\theta(x)\right]\mathbb{E}\left[y(x)\right] + \left(\mathbb{E}\left[f_\theta(x)\right]\right)^2 + \mathrm{var}_\theta \\
&= \sigma^2_{\text{noise}} + \left(\mathbb{E}[f(x)]\right)^2 - 2\,\mathbb{E}\left[f_\theta(x)\right]\mathbb{E}\left[f(x)\right] + \left(\mathbb{E}\left[f_\theta(x)\right]\right)^2 + \mathrm{var}_\theta \\
&= \sigma^2_{\text{noise}} + \left(\mathbb{E}\left[f_\theta(x) - f(x)\right]\right)^2 + \mathrm{var}_\theta \\
&= \mathrm{var}_\theta + \left(\mathrm{bias}_\theta\right)^2 + \sigma^2_{\text{noise}} \qquad (2.9)
\end{align*}

Increasing variance leads to decreasing bias and vice versa, which leads to a balancing problem: the model has to generalise well, but also has to fit the training data well. A model suffering from high variance is likely to overfit, while a model suffering from high bias is likely to underfit.

2.1.4 Cross-validation

In the illustration of under- and overfitting in Section 2.1.2, the data was split in a training and a testing part. The reason for this splitting is that the score obtained on the training set is not representative for the performance of the model; an independent set is required to validate the model. In the light of linear regression, Section 2.1.2 shows that the degree of the model introduces a non-explicit hyperparameter. It cannot directly be optimized with gradient descent. The training set has to be split once again for tuning this hyperparameter. It is not possible to select a few degrees, train the model for each degree, calculate the score on the testing set and select the degree with the best testing score: this is again not representative for the model performance. Therefore the dataset has to be split in three sets: a training, a validation and a test set. The training set is used to learn the parameters θᵢ for several degrees. The obtained models for each possible degree are then validated using the validation set; this set is used to choose the optimal degree. Finally the test set is again the independent set to validate the model. The dataset can also be split in multiple parts. This is done in k-fold cross-validation. A first split is made for testing, and this part will only be used for calculating the performance of the model. The rest is split in k parts. The model is trained k times on k − 1 parts (green parts in Figure 2.4) and validated on the remaining part (red parts in Figure 2.4). The k scores on the validation parts are averaged and the hyperparameters corresponding with the best mean score are used to calculate the performance of the model.


Figure 2.4: A schematic representation of a 5-fold cross-validation. The green parts represent the training parts, the red parts represent the validation parts.

K-fold cross-validation can also be used to make k predictions on the test set. If k is too low, the training set is small and will probably not be a good representation of the test set; the model will suffer from high bias. If k is too high, the consecutive training sets will be strongly correlated or nearly the same (the extreme case is leaving one sample out for validation); in such a case the model will suffer from high variance. In most problems k is chosen between 5 and 10. The sketch below illustrates the procedure.
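A minimal scikit-learn sketch of this scheme, here with a ridge regressor and made-up data (the candidate hyperparameter values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# First split: the test set is reserved for the final performance estimate.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
mean_scores = {}
for alpha in (0.01, 0.1, 1.0, 10.0):           # candidate hyperparameter settings
    scores = []
    for train_idx, val_idx in kf.split(X_tr):
        # train on k-1 folds, validate on the remaining fold
        model = Ridge(alpha=alpha).fit(X_tr[train_idx], y_tr[train_idx])
        scores.append(model.score(X_tr[val_idx], y_tr[val_idx]))
    mean_scores[alpha] = np.mean(scores)       # average over the 5 validation folds

best_alpha = max(mean_scores, key=mean_scores.get)
final_model = Ridge(alpha=best_alpha).fit(X_tr, y_tr)
print(best_alpha, final_model.score(X_te, y_te))   # score on the held-out test set
```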

2.1.5 Common regularization techniques

In general, regularization is preventing the model from being too complex. This can be done in several ways; those for neural networks will be discussed separately. This section covers the most common strategies, ending with Bayesian estimation, which will be one of the key strategies in Chapter 5.

Feature selection

Feature selection is one of the most straightforward techniques. The polynomial degrees in the house pricing problem can be seen as an example of feature selection. A fourth degree model was too complex, but a quadratic model would suffice. This is feature selection on a polynomial expansion of one single feature (x₁, x₁², x₁³, ...). Instead of a polynomial expansion of the feature 'available living area', other features could be used (categorical or continuous): number of rooms, city, ... A feature selection can also be performed on such a set of features (x₁, x₂, x₃, ...). Examples of feature selection algorithms are Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).

Weight decay

In weight decay the weights of the model are forced towards smaller values. This can be done by adding an extra term to the loss function, as shown in Equation 2.10 for the linear model of Section 2.1.2. Note that the inner product of the weight vector w does not include the constant term!

\[ \mathcal{L}(w; x) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - w^\top x_i - w_0\right)^2 + \lambda\, w^\top w \qquad (2.10) \]


The model now has an additional hyperparameter λ. If λ is large, the additional term will cause the weights to decrease; in the extreme case all weights are zero, except for the constant term, and the model is too biased. If λ = 0, the model is prone to overfitting. There are several ways of applying weight decay. The term w⊤w is the inner product ||w||₂², also called the ℓ₂-norm. A more general definition is given in Equation 2.11.

\[ \|w\|_p = \left(\sum_{i=1}^{N} |w_i|^p\right)^{\frac{1}{p}} \qquad (2.11) \]

There exist more examples of norms, each satisfying certain algebraic conditions, but these are out of the scope of this thesis.
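To make the effect of the penalty term concrete, the following numpy sketch performs gradient descent on the regularized loss of Equation 2.10 (made-up data; for brevity the constant term is simply omitted rather than excluded from the penalty):

```python
import numpy as np

def weight_decay_grad(w, X, y, lam):
    """Gradient of L(w) = (1/N)||y - Xw||^2 + lam * w^T w with respect to w.
    The weight-decay term 2*lam*w pulls every weight towards zero."""
    N = X.shape[0]
    return -2.0 / N * X.T @ (y - X @ w) + 2.0 * lam * w

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
w, eps, lam = np.zeros(5), 0.1, 0.01
for _ in range(200):                        # plain gradient descent
    w -= eps * weight_decay_grad(w, X, y, lam)
print(np.linalg.norm(w))                    # a larger lam shrinks this norm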

2.1.6 Generative models and Bayesian estimation

In the beginning of this chapter, two supervised problems were mentioned: the house pricing problem and the cancer detection problem. They can both be summarized as predicting the output y, given some input x. Introducing probability theory, this can be rephrased as modeling or maximizing the likelihood p(y|x). These are called discriminative models. They are superior in prediction, but are not able to generate data; for discriminative models, generating data implies knowledge of p(x). The counterparts of discriminative models are called generative models, which model the joint distribution p(x, y). Such models are able to generate (x, y) pairs.

First it will be shown that linear regression is equivalent to maximum likelihood (ML) estimation in the presence of normally distributed noise. Assume that the targets y follow Equation 2.12.

\[ y = f(x; w) + \varepsilon = x^\top w + \varepsilon \qquad (2.12) \]

Here f is the function to be estimated, f: X → Y, with X and Y respectively the domain space and the image space. The last term ε represents the noise, which is assumed to be normally distributed. Hence the distribution p(y|x, w, β) can be expressed as N(y|f(x, w), β⁻¹). β is the inverse of the variance and is called the precision. The training set is denoted by X = [x₁, x₂, ..., x_N] with corresponding training labels y = [y₁, y₂, ..., y_N]. The data samples are assumed to be independent, so the conditional probability can be factorized as shown in Equation 2.13.

\[ p(\mathbf{y}|X, w, \beta) = \prod_{i=1}^{N} p(y_i|x_i, w, \beta) = \prod_{i=1}^{N} \sqrt{\frac{\beta}{2\pi}}\, e^{-\beta\frac{\left(y_i - x_i^\top w\right)^2}{2}} = \mathcal{N}\left(\mathbf{y}|X^\top w, \beta^{-1}I\right) \qquad (2.13) \]

The likelihood in Equation 2.13 has to be maximized, but the calculations simplify if the logarithm of the likelihood is maximized. As the log function is strictly increasing, this does not change the maximization solution.

\begin{align*}
\log\left[p(\mathbf{y}|X; w, \beta)\right] &= \sum_{i=1}^{N} \log\left[\mathcal{N}\left(y_i|f(x_i;w), \beta^{-1}\right)\right] \\
&= \sum_{i=1}^{N} \frac{1}{2}\left[\log\left(\frac{\beta}{2\pi}\right) - \beta\left(y_i - f(x_i;w)\right)^2\right] \\
&= \frac{N}{2}\log\left(\frac{\beta}{2\pi}\right) - \frac{\beta}{2}\sum_{i=1}^{N}\left(y_i - f(x_i;w)\right)^2 \\
&= -\frac{\beta}{2}\sum_{i=1}^{N}\left(y_i - f(x_i;w)\right)^2 + C \qquad (2.14)
\end{align*}

The log likelihood is simplified in Equation 2.14. Maximizing the last expression is, except for some factors, equal to minimizing the MSE loss function. This shows that linear regression in discriminative modeling can be translated to maximizing the likelihood in probability theory. In addition, probability theory also gives a measure for the uncertainty of the estimation.


ML estimation started from the question: what is the probability that the given data and parameters have generated the considered output? This is optimized w.r.t. the parameters; hence the parameters are assumed to have fixed values. But the question could also be changed to: what is the probability that a certain model with parameters w has generated these input-output pairs? Therefore a prior p(w|α) has to be introduced. The prior is the knowledge one has (or assumes to have) about the parameters. For simplicity the prior is assumed to be a normal distribution N(0, α⁻¹I). Including this prior leads to a generative way of modeling. With this prior, a posterior distribution can be constructed using Bayes' theorem. Equation 2.15 shows only the proportionality of Bayes' theorem: as the denominator is only used for normalization, it can be omitted.

\[ p(w|\mathbf{y}, X, \beta, \alpha) \propto p(\mathbf{y}|w, X, \beta)\, p(w|\alpha) \qquad (2.15) \]

Again the logarithm of this expression will simplify the maximization.

\[ \log\left[p(w|\mathbf{y}, X, \beta, \alpha)\right] \propto \log\left[p(\mathbf{y}|w, X, \beta)\right] + \log\left[p(w|\alpha)\right] \qquad (2.16) \]

Inserting Equations 2.17 and 2.18 for the normal distributions in Equation 2.16 leads to Equation 2.19.

\[ p(w|\alpha) = \left(\frac{\alpha}{2\pi}\right)^{\frac{M+1}{2}} e^{-\frac{\alpha}{2} w^\top w}, \qquad \log\left[p(w|\alpha)\right] = C_1 - \frac{\alpha}{2}\, w^\top w \qquad (2.17) \]

\[ p(\mathbf{y}|w, X, \beta) = \prod_{i=1}^{N} \sqrt{\frac{\beta}{2\pi}}\, e^{-\beta\frac{\left(y_i - f(x_i, w)\right)^2}{2}}, \qquad \log\left[p(\mathbf{y}|w, X, \beta)\right] = C_2 - \frac{\beta}{2}\sum_{i=1}^{N}\left(y_i - f(x_i, w)\right)^2 \qquad (2.18) \]

\[ \log\left[p(w|\mathbf{y}, X, \beta, \alpha)\right] \propto C_1 + C_2 - \frac{\beta}{2}\sum_{i=1}^{N}\left(y_i - f(x_i, w)\right)^2 - \frac{\alpha}{2}\, w^\top w \qquad (2.19) \]

Maximizing Equation 2.19 is equivalent to minimizing the MSE with an additive weight decay term. In contrast with the ML method (discriminative modeling), the MAP method (generative modeling) has a built-in regularization mechanism, which finds its origin in the prior belief on the weights. If a strongly peaked prior is assumed on the weights, α will be high and the model will be highly regularized, and vice versa. A more general prior on the weights w is N(0, Σ_p). Using this prior and the expression for the likelihood in Equation 2.13, the proportionality of the posterior can be expressed as Equation 2.20. After some calculations, the proportionality can be rewritten as Equation 2.21.

\[ p(w|X, \mathbf{y}) \propto \exp\left(-\frac{\beta\, (\mathbf{y} - X^\top w)^\top(\mathbf{y} - X^\top w)}{2}\right) \exp\left(-\frac{1}{2}\, w^\top \Sigma_p^{-1} w\right) \qquad (2.20) \]

\begin{align*}
p(w|X, \mathbf{y}) &\propto \exp\left(-\frac{1}{2}(w - \bar{w})^\top\left(\beta X X^\top + \Sigma_p^{-1}\right)(w - \bar{w})\right) \\
&\propto \exp\left(-\frac{1}{2}(w - \bar{w})^\top A\, (w - \bar{w})\right) \\
&\sim \mathcal{N}\left(\beta A^{-1} X \mathbf{y},\; A^{-1}\right) \qquad (2.21)
\end{align*}

Equation 2.21 shows that the posterior is again a normal distribution, with mean w̄ = β(βXX⊤ + Σ_p⁻¹)⁻¹X y = βA⁻¹X y and covariance matrix A⁻¹, where A = βXX⊤ + Σ_p⁻¹.


To make a prediction on a test sample x*, an average is calculated over all possible weights, weighted by their posterior probability. This is different from the initial discriminative approach to regression problems, where the weights w were fixed. Equation 2.22 shows the resulting predictive distribution, which can be obtained after extensive calculations. In these calculations it is assumed that the first distribution appearing in the integral is also a normal distribution (see the beginning of this section). The mean of the resulting normal distribution can easily be interpreted as the MAP estimate of the weights (see Equation 2.21) multiplied with the test sample.

\[ p\left(f(x_*)|x_*, X, \mathbf{y}\right) = \int p\left(f(x_*)|x_*, w\right) p(w|X, \mathbf{y})\, dw = \mathcal{N}\left(\beta\, x_*^\top A^{-1} X \mathbf{y},\; x_*^\top A^{-1} x_*\right) \qquad (2.22) \]

An extension to a higher-dimensional feature space can easily be made by replacing X with φ(X). For example, a scalar input can be transformed to the polynomial feature space φ(x) = (1, x, x², ...). The predictive distribution of Equation 2.22 then changes to the one defined in Equation 2.23.

\[ p\left(f(x_*)|x_*, X, \mathbf{y}\right) = \mathcal{N}\left(\beta\, \phi(x_*)^\top A^{-1} \phi(X)\, \mathbf{y},\; \phi(x_*)^\top A^{-1} \phi(x_*)\right) \qquad (2.23) \]

With Φ = φ(X) and φ* = φ(x*) and some algebra, it is possible to prove that the terms appearing in the mean and variance of the predictive distribution are of the form Φ⊤Σ_pΦ, φ*⊤Σ_pΦ or φ*⊤Σ_pφ*. These are generalized inner products. As the covariance matrix Σ_p is positive definite, Σ_p^{1/2} can be defined. Introducing ψ(x) = Σ_p^{1/2}φ(x), all the generalized inner products can be expressed as classic inner products: ψ(X)⊤ψ(X), ψ(x*)⊤ψ(X) and ψ(x*)⊤ψ(x*). Algorithms which only depend on such inner products can profit from using kernel functions K(x, x′) = ψ(x)⊤ψ(x′). Instead of thinking about the appropriate feature space, all the engineering is embedded in the kernel function.

2.2 Neural Network

The previous section discussed the basics of linear estimators. They are linear as the target scales linearly with the weights; in other words, the classification decision boundaries are linear. Some estimators can be extended to non-linear variants. Examples are linear regression, logistic regression and support vector machines. Due to increasing memory and computation capacities, neural networks are used more and more often. The main difference between neural networks and the previously listed estimators is the capability of learning the relevant features. For the aforementioned estimators this is not possible and the machine learning engineer has to select the relevant features. Although a neural network seems to be superior, there are some drawbacks. Minimizing the loss function is more complex as the parameter space is non-convex: the gradient descent can get stuck in local minima, which will be discussed later. Another drawback is the tendency to overfit.

2.2.1 Basics

A neural network consists of several layers of neurons with a certain activation. The name 'neuron' is based on the similarity between a neural network and the operation of a human brain. Figure 2.5 illustrates the working of a small neural network in a classification context with two layers of two neurons (= hidden units). Normally each layer except for the output layer contains an extra hidden unit with value +1. This is the bias term and is similar to the θ₀ in the house pricing estimation example. The output y of the small neural network in Figure 2.5 can be written as a function of the hidden variables h_1^{(1)} and h_2^{(1)}.

\[ y = \sigma_2\left(h_1^{(1)} w_{11}^{(2)} + h_2^{(1)} w_{12}^{(2)}\right) \qquad (2.24) \]

The hidden variables can in turn be written as a function of the inputs x₁ and x₂.

\[ h_1^{(1)} = \sigma_1\left(x_1 w_{11}^{(1)} + x_2 w_{12}^{(1)}\right) = \sigma_1\left(a_1^{(1)}\right) \qquad (2.25) \]



Figure 2.5: A schematic representation of a small neural network.

\[ h_2^{(1)} = \sigma_1\left(x_1 w_{21}^{(1)} + x_2 w_{22}^{(1)}\right) = \sigma_1\left(a_2^{(1)}\right) \qquad (2.26) \]

In these equations σ₁ and σ₂ are called the activation functions. They introduce the non-linearities in the estimator. Typical activation functions are the rectified linear unit (ReLu) and the softmax function. The output y is fed to a loss function and this loss function is minimized w.r.t. the parameters w_{11}^{(1)}, w_{12}^{(1)}, w_{21}^{(1)}, w_{22}^{(1)}, w_{11}^{(2)} and w_{12}^{(2)}. A small numeric sketch of this forward pass is given below.
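The following numpy sketch evaluates Equations 2.24-2.26 for the network of Figure 2.5 (the weight values are made up for illustration; biases are omitted, as in the equations):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical weights: row j of W1 holds w_j1^(1), w_j2^(1).
W1 = np.array([[0.2, -0.5],
               [0.7,  0.1]])
w2 = np.array([0.3, -0.4])      # w_11^(2), w_12^(2)

x = np.array([1.0, 2.0])
h = relu(W1 @ x)                # hidden layer: h_j^(1) = sigma_1(a_j^(1))
y = sigmoid(w2 @ h)             # output: y = sigma_2(h_1^(1) w_11^(2) + h_2^(1) w_12^(2))
print(y)
```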

2.2.2 Activation functions

Figure 2.6 shows the sigmoid activation function, given in Equation 2.27. This activation function is used in binary classification. As shown in Equation 2.27, a model takes an input x and produces a value a; the activation function converts this value a to the probability that the input x belongs to the first class.

\[ a = \ln\left(\frac{P(C_1|x)}{P(C_2|x)}\right) = \ln\left(\frac{P(C_1|x)}{1 - P(C_1|x)}\right), \qquad P(C_1|x) = \frac{1}{1 + e^{-a}} \qquad (2.27) \]


Figure 2.6: The sigmoid activation function

The sigmoid activation function can be extended to multiclass classification problems, in which case it is called the softmax activation function. Its expression is given by Equation 2.28.

\[ P(C_j|x) = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} = \frac{e^{x^\top w_j}}{\sum_{k=1}^{K} e^{x^\top w_k}} \qquad (2.28) \]
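In practice Equation 2.28 is implemented with a small numerical trick: subtracting the maximum of z leaves the ratio unchanged but avoids overflow of the exponentials. A minimal sketch (illustrative scores):

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) multiplies numerator and denominator of
    # Equation 2.28 by the same factor e^{-max(z)}, so the result
    # is identical but cannot overflow.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, -1.0])
p = softmax(z)
print(p, p.sum())   # class probabilities over K classes, summing to 1
```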


Figure 2.7 shows the ReLu function; its practical use will become clear when explaining backpropagation. For negative values the ReLu is 0, for positive values the ReLu is linear in x. The softplus activation function is a smooth approximation of the ReLu activation function, which is also shown in Figure 2.7. Its expression is given by Equation 2.29. The smooth approximation is differentiable for all x ∈ R, while the ReLu function is not differentiable at x = 0, which can be disadvantageous in the backpropagation algorithm.


Figure 2.7: The ReLu activation function (blue), with its smooth approximation, the softplus activation function (red)

\[ \sigma(z) = \log(1 + e^z) \qquad (2.29) \]

2.2.3 Backpropagation

In order to minimize the loss function with a gradient descent-like algorithm, the gradient has to be determined. Figure 2.8, a more general case, will be used as an example. The loss function will be denoted as L(y, y_true). The layers are indexed from 1 to L. The number of hidden neurons in each layer is written as M^(l), with l the index of the layer. A certain hidden state of a certain layer can be calculated as in Equation 2.30.

\[ h_j^{(l)} = \sigma\left(\sum_{i=0}^{M^{(l-1)}} w_{ji}^{(l-1)} h_i^{(l-1)}\right) = \sigma\left(a_j^{(l)}\right) \qquad (2.30) \]

The derivative of a hidden state h_j^{(l)} with respect to a weight w_{ji}^{(l-1)} can consequently be expressed as follows.

\[ \frac{\partial h_j^{(l)}}{\partial w_{ji}^{(l-1)}} = \frac{\partial \sigma}{\partial a_j^{(l)}} \frac{\partial a_j^{(l)}}{\partial w_{ji}^{(l-1)}} = \sigma'\left(a_j^{(l)}\right) h_i^{(l-1)} \qquad (2.31) \]

Before calculating the derivatives of the loss function, an error symbol δ is introduced.

\[ \delta_j^{(l)} = \frac{\partial \mathcal{L}}{\partial a_j^{(l)}} \qquad (2.32) \]

In other words, it is the error on the input of a layer. The error on the input of the last layer is shown in Figure 2.8 in blue. If for one layer l all the errors δ_i^{(l)} are known, then the errors on layer l − 1 can be calculated using the chain rule, as shown in Equation 2.33.


\begin{align*}
\delta_i^{(l-1)} = \frac{\partial \mathcal{L}}{\partial a_i^{(l-1)}} &= \sum_{j=1}^{M^{(l)}} \frac{\partial \mathcal{L}}{\partial a_j^{(l)}} \frac{\partial a_j^{(l)}}{\partial a_i^{(l-1)}} = \sum_{j=1}^{M^{(l)}} \delta_j^{(l)} \frac{\partial a_j^{(l)}}{\partial a_i^{(l-1)}} \\
&= \sum_{j=1}^{M^{(l)}} \delta_j^{(l)} \frac{\partial a_j^{(l)}}{\partial h_i^{(l-1)}} \frac{\partial h_i^{(l-1)}}{\partial a_i^{(l-1)}} = \sum_{j=1}^{M^{(l)}} \delta_j^{(l)} w_{ji}^{(l-1)} \sigma'\left(a_i^{(l-1)}\right) \qquad (2.33)
\end{align*}

Hence this is called backpropagation; it is illustrated with the red arrows in Figure 2.8. To calculate the loss function, the input has to move from left to right through the network (forward propagation), while to calculate the errors, the calculation starts at the right and moves to the left (backward propagation). Consecutive application of Equation 2.33 leads to a multiplication of derivatives σ′. The smaller these are, the faster the product goes to zero. This problem is called 'vanishing gradients'. The ReLu activation function is a good solution to this problem, as its derivative is equal to 1 on its positive domain.


Figure 2.8: A schematic representation of a neural network. Here the bias neurons are added, unlike in Figure 2.5.

With these ingredients it is possible to calculate the derivative of the loss function with respect to a certain weight w_{ji}^{(l-1)}, as expressed in Equation 2.34.

\[ \frac{\partial \mathcal{L}}{\partial w_{ji}^{(l-1)}} = \frac{\partial \mathcal{L}}{\partial a_j^{(l)}} \frac{\partial a_j^{(l)}}{\partial w_{ji}^{(l-1)}} = \delta_j^{(l)} h_i^{(l-1)} \qquad (2.34) \]


2.3 Convolutional neural network (CNN)

Convolutional layers use, as their name suggests, the convolution operator. There are several definitions of the convolution operator, but only the two-dimensional definition will be discussed. The convolution between an image I and a two-dimensional kernel K is given by Equation 2.35.

\[ S(i,j) = (I * K)(i,j) = \sum_m \sum_n I(m,n)\, K(i-m,\, j-n) \qquad (2.35) \]

The most basic convolution operation is illustrated below.

\[ \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} * \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} = \begin{bmatrix} a_{11}b_{11} + a_{12}b_{12} + a_{21}b_{21} + a_{22}b_{22} & a_{12}b_{11} + a_{13}b_{12} + a_{22}b_{21} + a_{23}b_{22} \\ a_{21}b_{11} + a_{22}b_{12} + a_{31}b_{21} + a_{32}b_{22} & a_{22}b_{11} + a_{23}b_{12} + a_{32}b_{21} + a_{33}b_{22} \end{bmatrix} \]

A theoretical description of the use of CNNs in image and speech recognition was first given by [11]. Nowadays, CNNs have been successfully applied to object recognition on e.g. the ImageNet dataset, which consists of 15 million images spread over 22,000 classes. A first implementation involving CNNs on this dataset was performed by [12] in 2012.

There exist several variants. The input can be zero-padded, so that the output has the same dimensions as the non-zero-padded input. The moving kernel can also skip some positions, which is called using strides. If the strides for a two-dimensional convolution are for example (2, 3), the kernel jumps 2 places horizontally and 3 places vertically. The most important advantage of CNNs is the concept of parameter sharing: in a fully connected layer each weight is used only once, while this is not the case for a convolutional layer. This is illustrated in Figures 2.9 and 2.10.

Figure 2.9: Illustration of the weights in a fully connected layer. Every black edge represents a different weight. The filled circles in the last layer are the ones affected by the gray one in the first layer.

Figure 2.10: Illustration of the parameter sharing in a CNN. Each edge with the same color represents the same weight. The filled circles in the last layer are the ones affected by the gray one in the first layer.

A CNN can therefore be seen as a fully connected neural network with a strong prior on some weights: some of the weights are namely set to zero. This prior is justified for image classification due to the spatial correlations. In Figure 2.10, one set of red, green and blue edges leaving a node is called a filter. In one convolutional layer typically more filters are learned, leading to several representations of the input, as shown in Figure 2.11.

In most cases, pooling layers are added. Pooling layers simplify the output of a layer. The principle is the same as that of a convolution, but instead of calculating a convolution, averages or maximums are calculated on the parts overlapping the input matrix.

Figure 2.11: Example of a typical CNN (Figure from [13])

2.4 Recurrent neural network (RNN)

Next to fully connected networks and CNNs, RNNs are important. Where convolutional layers are useful for two-dimensional data, recurrent layers are useful for data containing time series. They can be used for the classification of time series or for forecasting. Again the concept of parameter sharing is useful. The problem of predicting something at timestep t from a time series input x(τ) can be solved using the graph in Figure 2.12.


Figure 2.12: An unfolded graph

The output h(t) can be written as a function of all the inputs so far: h(t) = g(x(t), x(t−1), ..., x(2), x(1)). Following the graph, each hidden state takes into account the previous state or input. Hence such a model could be used to predict the next word in a sentence. However, a different model has to be built for each different sequence length. Parameter sharing solves this problem: instead of applying a function g, the function g can be factorized into consecutive applications of a function f. The unfolded graph in Figure 2.12 wraps up to the graph in Figure 2.13.


Figure 2.13: A folded graph. The black square denotes a delay of one time step.

Instead of finding the right parameters for each function g for each sequence length, the parameters of f have to be found only once and can be reused for every sequence length, as the sketch below illustrates.
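A minimal numpy sketch of this parameter sharing: one step function f with fixed weight matrices, applied repeatedly (the tanh recursion and the sizes are illustrative assumptions, not the thesis model):

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x):
    # the same function f (same W_h, W_x) is applied at every time step
    return np.tanh(W_h @ h_prev + W_x @ x_t)

rng = np.random.default_rng(0)
W_h, W_x = rng.normal(size=(4, 4)), rng.normal(size=(4, 1))

h = np.zeros(4)
for x_t in rng.normal(size=(10, 1)):   # works for any sequence length
    h = rnn_step(h, x_t, W_h, W_x)
print(h)
```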

2.5 Long Short-Term Memory unit (LSTM)

Several different constructions can be built with recurrent edges, but the most interesting one, and the one which will be used extensively in this dissertation, is the LSTM, first described by [14] in 1997. RNNs are also sensitive to vanishing gradients. In fact the recursion step in an RNN can be described by a matrix multiplication with the weight matrix W. For a model without input x or activation function, the state h(t) is equal to Wh(t−1), or W^t h(0). The weight matrix W can be transformed to its eigenvalue decomposition W = QΛQ⊤. The eigenvalues of W^t are equal to the eigenvalues of W raised to the power t. Eigenvalues which are smaller than 1 will vanish; on the contrary, eigenvalues which are larger than 1 will explode. Every component of h(0) not aligned with the eigenvector space of eigenvalues close to 1 will be discarded. In other words, it is harder for an RNN to learn long term interactions than short term interactions.

There are several ways to tackle those vanishing gradients. One way is to simply add edges skipping d timesteps. The gradients for these edges vanish d times slower than the gradients which are updated each timestep. However, the latter can still vanish or explode. Furthermore, skipping d timesteps is a possibly false prior imposed on the dataset. Instead of adding long term edges, the long term edges can also replace the short term edges.

Another way is the usage of leakage units. This can be illustrated with the running average of values v, as shown in the following equation.

\[ \mu^{(t)} = \alpha \mu^{(t-1)} + (1 - \alpha)\, v^{(t)} \qquad (2.36) \]

The α determines whether the average remembers previous values of v or not. If α is close to 1, the running average will remember all the previous values, but once α is set to 0, the running average restarts. The hidden units of an RNN behave similarly when using leaky gates.

Gated RNNs, of which the LSTM is an example, are a generalisation of RNNs with leakage units. The leakage units accumulate the information over previous timesteps, but at some point they have to forget the gathered information. An example is the prediction of a word in a sentence: the words in the sentence itself are probably relevant for the prediction, but the words from seven sentences away are less relevant. Instead of manually deciding to forget the current state, a gated RNN can learn it itself. Figure 2.14 shows the structure of an LSTM cell. The gate named input makes a prediction based on the input vector x(t). The gate named input gate takes in addition the previous hidden state vector h(t−1) into account; the outcome of this gate filters the predictions from the input. The output gate chooses an output amongst all possible final predictions. But the most critical part of the block diagram is the forget gate: it learns which states can be forgotten. A copy of the previous hidden state vector h(t−1) is made and fed to the edge called self-loop. The output of the forget gate is a vector containing numbers between 0 and 1. As long as a certain output number is not zero, the corresponding information in h(t−1) will not leave the loop. Hence the neural network can teach itself how long it has to keep some information.

Figure 2.14: Block diagram of an LSTM cell. The circles with a plus symbol represent an elementwise addition. The circles with a cross symbol represent an elementwise multiplication. The circles with an s-like symbol represent a neural network with a sigmoid activation function. (Figure from [9])
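One step of such a cell can be sketched in numpy as follows (a minimal sketch of the common LSTM formulation; the stacking of the four gate parameter blocks into W, U and b is a convention of this example):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step following the gates of Figure 2.14. W, U and b
    hold the parameters of the four gates stacked on top of each other:
    input gate, forget gate, output gate and the input (candidate)."""
    n = h_prev.size
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0*n:1*n])     # input gate: filters the candidate values
    f = sigmoid(z[1*n:2*n])     # forget gate: what may leave the self-loop
    o = sigmoid(z[2*n:3*n])     # output gate: selects the output
    g = np.tanh(z[3*n:4*n])     # candidate computed from the input
    c = f * c_prev + i * g      # the self-loop (cell) state
    h = o * np.tanh(c)          # new hidden state
    return h, c

rng = np.random.default_rng(0)
n, d = 4, 3
W, U, b = rng.normal(size=(4*n, d)), rng.normal(size=(4*n, n)), np.zeros(4*n)
h, c = np.zeros(n), np.zeros(n)
for x_t in rng.normal(size=(5, d)):
    h, c = lstm_step(x_t, h, c, W, U, b)
```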


2.6 Adam (Adaptive moment) optimizer

As mentioned in the beginning of this chapter, training a machine learning model or estimator means finding a minimum of the loss function, which depends on the estimator's (hyper)parameters. Such a minimization is typically solved by a gradient descent-like algorithm. The most common gradient descent is defined as follows. A loss function L is defined, which is the summation of the individual errors of each training sample. The loss function L depends on a parameter set denoted by θ and can be seen as a hypersurface in a hyperspace spanned by its parameters. From calculus it is known that the gradient is the vector pointing in the direction where the loss function increases the most. A minimization starts with a parameter initialization. Then the loss function and its gradient are calculated using all the training samples (or a part of the training samples, to speed up the calculations). After the gradient is known, the initialized parameters change according to a fraction of the direction the negative gradient points to. This fraction is called the learning rate and is denoted by ε. Equation 2.37 expresses everything mentioned mathematically and is called an update equation. Starting from some point θₙ in the parameter space, the learning rate combined with the gradient changes this point to θₙ₊₁, where the loss function and gradient are again calculated. This is repeated until the gradient becomes zero and a minimum is achieved. The gradient term can be seen as a measure of parameter change. Note that this minimum is not necessarily the global minimum. If the learning rate is too small, it is possible that the solution is a local minimum; on the other hand, if the learning rate is too high, it is possible that the algorithm jumps over the global minimum without noticing. Consequently the learning rate is in fact a hyperparameter to tune.

\[ \theta \leftarrow \theta - \varepsilon \nabla_\theta \mathcal{L}(\theta) \qquad (2.37) \]

The above described algorithm is called the stochastic gradient descent (SGD) algorithm. A first modification is the introduction of momentum: the update equation changes from Equation 2.37 to 2.38. In physics, momentum is defined as mass times velocity. In the gradient descent algorithm the mass is considered to be unity, hence the momentum can also be seen as a velocity. The momentum and gradient terms are added as shown in the first part of Equation 2.38. The momentum accumulates all the previously calculated gradients. The parameter α, yet another hyperparameter, indicates how important the previous gradients are.

\begin{align*}
v &\leftarrow \alpha v - \varepsilon \nabla_\theta \mathcal{L}(\theta) \\
\theta &\leftarrow \theta + v \qquad (2.38)
\end{align*}

Figure 2.15: The red path indicates the results of stochastic gradient descent with momentum. The black path is without momentum. (Figure from [9])

A second modification to SGD is keeping track of all partial derivatives separately. The learning rate for each parameter scales with the inverse of the square root of the sum of all squared partial derivatives: the faster a parameter changes, the faster its learning rate will decrease. In the adaptive gradient (AdaGrad) algorithm, first published by [15], the algorithm keeps track of all the partial derivatives. Another algorithm, root mean square propagation (RMSProp), first published in [16], uses an exponentially weighted average of the squared partial derivatives. The RMSProp update equations are listed in Equation 2.39. The weighted average of the squared gradient (r) is initialized at zero.

\begin{align*}
g &\leftarrow \nabla_\theta \mathcal{L}(\theta) \\
r &\leftarrow \rho r + (1 - \rho)\, g \odot g \\
\Delta\theta &\leftarrow -\frac{\varepsilon}{\sqrt{r + \delta}} \odot g \\
\theta &\leftarrow \theta + \Delta\theta \qquad (2.39)
\end{align*}

The Adam optimizer, first introduced by [17] in 2015, is a combination of the RMSProp optimizer with momentum. The similarity between the Adam optimizer and RMSProp is that the Adam optimizer also calculates the second moment of the gradients, but unlike RMSProp, the Adam optimizer also calculates correction factors. With a closer look at the second update equation of Equation 2.39, it is clear that if the initial guess of r is 0, the first updates for r will be biased by this initial guess. This can be solved by dividing by (1 − ρᵗ), which is a correction factor. This can easily be checked for the first few steps with some standard values: r_initial = 0 and ρ = 0.98. For the first two steps r₁ will be equal to 0.02||g||₂² and r₂ will be equal to 0.0396||g||₂²; the correction factor changes this to ||g||₂² in both cases. Next to this correction factor, the Adam optimizer incorporates momentum through the calculation of the first moment of the gradients, for which it also calculates correction factors. The update equations are listed in Equation 2.40. s and ŝ are respectively the first moment of the gradients and its bias-corrected version; r and r̂ are respectively the second moment of the gradients and its bias-corrected version.

\begin{align*}
g &\leftarrow \nabla_\theta \mathcal{L}(\theta) \\
t &\leftarrow t + 1 \\
s &\leftarrow \rho_1 s + (1 - \rho_1)\, g \\
\hat{s} &\leftarrow \frac{s}{1 - \rho_1^t} \\
r &\leftarrow \rho_2 r + (1 - \rho_2)\, g \odot g \\
\hat{r} &\leftarrow \frac{r}{1 - \rho_2^t} \\
\Delta\theta &\leftarrow -\varepsilon \frac{\hat{s}}{\sqrt{\hat{r}} + \delta} \\
\theta &\leftarrow \theta + \Delta\theta \qquad (2.40)
\end{align*}

Of these optimizers, Adam proved the most robust and will be used extensively in this thesis. A minimal implementation is sketched below.
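The sketch implements Equation 2.40 in numpy and applies it to a toy quadratic loss (the default hyperparameter values are the commonly used ones, an assumption of this example):

```python
import numpy as np

def adam_update(theta, grad, s, r, t, eps=0.1, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam step following Equation 2.40."""
    t += 1
    s = rho1 * s + (1 - rho1) * grad           # first moment (momentum)
    r = rho2 * r + (1 - rho2) * grad * grad    # second moment (RMSProp-like)
    s_hat = s / (1 - rho1**t)                  # bias-correction factors
    r_hat = r / (1 - rho2**t)
    theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r, t

# minimize L(theta) = ||theta||^2 as a toy example
theta = np.array([2.0, -3.0])
s, r, t = np.zeros(2), np.zeros(2), 0
for _ in range(1000):
    grad = 2.0 * theta
    theta, s, r, t = adam_update(theta, grad, s, r, t)
print(theta)   # close to the minimum at the origin
```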

2.7 Regularization of Neural Networks

There are several ways to apply regularization to a neural network. Dropout is one of the possibilities, where some edges are removed in each training loop. Recently, [18] has shown that variational inference over the RNN weights is identical to implementing dropout in RNNs with the same network units dropped at each time step; variational inference will be discussed in Chapter 5. Another way of applying regularization is early stopping: after each training loop, the loss is calculated on a validation set, and training stops the moment the loss on the validation set starts increasing. A Bayesian way of modeling also prevents the model from overfitting, as discussed in Section 2.1.6. The sketch below shows dropout and early stopping in Keras.
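A minimal sketch using the modern tensorflow.keras API (the layer sizes and patience value are illustrative; X_train, y_train, X_val and y_val are hypothetical placeholders):

```python
from tensorflow.keras import callbacks, layers, models

model = models.Sequential([
    layers.Dense(128, activation='relu', input_shape=(32,)),
    layers.Dropout(0.3),                 # randomly drops units each training step
    layers.Dense(11, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')

# early stopping: halt when the validation loss stops improving
stop = callbacks.EarlyStopping(monitor='val_loss', patience=5,
                               restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[stop])
```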


2.8 Hyperparameter Optimization Technique

From the previous sections it should be clear that the hyperparameters of a model are important: they can lead to overfitting or underfitting. The effect of the hyperparameters has again to be evaluated on well chosen unseen data samples to obtain a generalization error. This can be done with the cross-validation described in the previous sections. The most straightforward way to select the optimal hyperparameters is looping over each possible combination and evaluating them on unseen samples. This is called a grid search. Achieving the optimal solution this way is only doable if the training and evaluation times are limited, which is not the case for neural networks. Instead of evaluating all possible combinations, a random search selects some combinations at random to evaluate. This reduces the computation time. However, in selecting random combinations it is risky to skip the ideal solutions and calculate too many combinations of hyperparameters in a bad part of the hyperspace. In other words, the random search algorithm tries to find the minimum of a certain unknown hyperparameter cost function by randomly sampling hyperparameter settings.

Another approach is the application of a second machine learning algorithm, next to the built model whose hyperparameters have to be optimized. If a few hyperparameter settings are evaluated, a machine learning algorithm can solve a regression problem to approximate the unknown hyperparameter cost function. The minimum of this solution can easily be found. With the hyperparameter settings corresponding to this minimum, a new evaluation of the hyperparameter cost function can be made. This evaluation can be added to the ones which were already known. On this augmented set of evaluations the second machine learning algorithm can find a new solution for the hyperparameter cost function and its minimum. After a few iterations, the minimal solution should converge. Now two machine learning models are involved: on the one hand a model which solves a general classification or regression problem, and on the other hand a model which is built to find the optimal hyperparameters of the former. It would be inefficient if the latter model also had a lot of hyperparameters, hence non-parametric models are used for hyperparameter optimization. These non-parametric models can again be either discriminative or generative. The latter will be referred to as Bayesian optimization (BO) techniques, of which Gaussian processes and the tree-structured Parzen estimator are two examples.

2.8.1 Gaussian processes (GPs)

The concepts of GPs and their use in regression or classification problems are extensively described in [19], while their use in hyperparameter optimization is discussed by [20].

A GP is a generalization of the multivariate Gaussian distribution. A GP is a collection of random variables, any finite number of which have a joint Gaussian distribution. In fact a GP can be seen as an infinite-dimensional normal distribution. A Gaussian distribution of up to 2 dimensions can easily be represented in an axial system. For higher dimensional Gaussian distributions another representation scheme can be used: parallel coordinates. Both schemes are shown in Figures 2.16 and 2.17 for a bivariate Gaussian distribution.


Figure 2.16: A scatter plot of samples drawn from a bivariate Gaussian distribution. The red dot represents the mean.


Figure 2.17: A parallel coordinates plot of a bivariate Gaussian distribution. The red line represents the mean.


A multivariate Gaussian distribution is denoted by N(µ, Σ), with µ the mean and Σ the covariance matrix. A GP is denoted by GP(m(x), k(x, x′)), with m(x) the mean function and k(x, x′) the covariance function.

GPs can be applied to regression problems. A Bayesian weight-space view on the regression problem was already discussed, leading to ML and maximum a posteriori (MAP) estimation, in Section 2.1.6. That section ended with the introduction of kernel functions, which lead to GPs. With GPs a function-space view is taken: instead of averaging over the distribution of weights, a function can be drawn from a GP.

\[ f(x) \sim \mathcal{GP}\left(m(x),\, k(x, x')\right) \qquad (2.41) \]

Compared with the weight-space view, where f(x) = φ(x)⊤w with w ∼ N(0, Σ_p), the corresponding GP has the mean and covariance function defined in Equation 2.42.

\begin{align*}
\mathbb{E}\left[f(x)\right] &= \phi(x)^\top \mathbb{E}\left[w\right] = 0 \\
\mathbb{E}\left[f(x) f(x')\right] &= \phi(x)^\top \mathbb{E}\left[w w^\top\right] \phi(x') = \phi(x)^\top \Sigma_p\, \phi(x') \qquad (2.42)
\end{align*}

Sampling functions from this GP prior alone is not interesting: the knowledge of the training data has to be incorporated. Given the definition of a GP, the joint distribution of the prior and the training samples (x_i, f_i) is also normally distributed, as given in Equation 2.43. Making a prediction is nothing more than conditioning the distribution in Equation 2.43 on f, which is expressed in Equation 2.44.

\[ \begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left(0,\; \begin{bmatrix} K(X,X) & K(X,X_*) \\ K(X_*,X) & K(X_*,X_*) \end{bmatrix}\right) \qquad (2.43) \]

\[ p(\mathbf{f}_*|X_*, X, \mathbf{f}) \sim \mathcal{N}\left(K(X_*,X)\, K(X,X)^{-1} \mathbf{f},\; K(X_*,X_*) - K(X_*,X)\, K(X,X)^{-1} K(X,X_*)\right) \qquad (2.44) \]

GPs can be used as a hyperparameter optimization technique. The hyperparameter optimization boils down to the minimization of the generalization error, which can be seen as a function of the several hyperparameters. The GP prior defines a distribution over functions. Next, this prior can be conditioned on evaluations of the machine learning model, leading to a posterior distribution. After the posterior is known, it would be useful if the algorithm could choose for itself which set of hyperparameters is promising to evaluate. Therefore the proxy optimization of an acquisition function is introduced. An acquisition function depends on the hyperparameters of the GP as well as on the previous observations. The most common acquisition function is the Probability of Improvement (PoI), defined in Equation 2.45, where f_min represents the best value found so far by the algorithm.

\[ P(y_* < f_{min}) = \int_{-\infty}^{f_{min}} \mathcal{N}\left(y|\mu_*, \Sigma_*\right) dy \qquad (2.45) \]

BO algorithms, like the one using GPs, differ from gradient descent-like algorithms in the sense that gradient descent only performs local calculations. BO algorithms, on the other hand, construct a probabilistic model for the function f and then exploit the complete hyperspace.

2.8.2 Tree-structured Parzen estimator (TPE)

The GP with PoI is an algorithm making use of the best observation. As training a neural network model always contains some randomness, making use of the best observation is disadvantageous: possible best observations could be missed due to the randomness. BO using GPs will therefore not be used in this thesis. Instead the tree-structured Parzen estimator (TPE) algorithm will be used, introduced by [21]. It overcomes the problem of only using the best observation.


Assume a certain distribution for each hyperparameter. Some results are obtained for hyperparameter settings selected from these distributions. The results are split in two groups based on a performance threshold: the first group typically contains the 10-25% best results, the second group contains the rest. Both groups define a likelihood distribution which can be constructed with Parzen-window kernel estimators. The likelihood for the first group is denoted by l(x) and the likelihood for the second group by g(x). This likelihood estimation differs from the GP approach, where the posterior is estimated. With the two likelihood distributions a new acquisition function can be built, which is given by Equation 2.46.

\[ \text{acquisition}(x) = \frac{l(x)}{g(x)} \qquad (2.46) \]

The split in two groups ensures that not only the best observation is used, but the collection of the n best results. A usage sketch with a TPE implementation is given below.
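In practice TPE is available through the hyperopt library by the authors of [21]. The sketch below assumes that library and uses a stand-in objective (the search space, its bounds and the objective expression are all illustrative):

```python
from hyperopt import fmin, hp, tpe

# hypothetical search space for two hyperparameters of a model
space = {
    'learning_rate': hp.loguniform('learning_rate', -12, -7),
    'dropout':       hp.uniform('dropout', 0.1, 0.5),
}

def objective(params):
    # in a real setting: train the model with these hyperparameters and
    # return the validation loss (here a stand-in expression)
    return (params['dropout'] - 0.3)**2 + abs(params['learning_rate'])

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)
```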


Chapter 3

Data Description

The concepts introduced in the previous chapter will now be used to classify hand gestures of different persons based on micro-Doppler radar data. This chapter describes what the Soli micro-Doppler radar data looks like and how it is obtained. Additional information can be found in [1] and [7].

3.1 Doppler effect

The Doppler effect, named after the Austrian physicist Christian Doppler, is a phenomenon where the relative motion between the sender and receiver of a signal causes a shift in the perceived frequency. The best known example is a siren passing a pedestrian: the siren that the pedestrian hears when the vehicle approaches is different from the siren when the vehicle is leaving. This is illustrated in Figure 3.1, where the mid white circle represents the moving vehicle. This is a situation where the sender is moving and the receiver is not; the same holds for the reversed situation. The shift in the perceived frequency is a measure for the velocity of an object. The relation is captured in Equation 3.1, where the subscripts 'r' and 's' stand respectively for receiver and sender, and c is the velocity of the wave.

\[ f_r = f_s\, \frac{c + v_r}{c + v_s} \qquad (3.1) \]

3.2 Soli hardware and micro Doppler radar data

There exist several kinds of radars, but they all work by the same basic principle. A radar system consists of at least one antenna transmitting electromagnetic waves, which reflect on objects in front of it. The informative reflections are captured by at least one receiving antenna.

Figure 3.1: Illustration of the Doppler effect. The white mid circle is a source moving to the left. The wavefronts to the left are packed closer than those to the right, leading to a higher frequency (w.r.t. the original source frequency) to the left and a lower frequency to the right. (Figure from [22])


The Soli project uses a solid-state millimeter-wave radar: the wavelengths are of the order of millimeters. Classic radars prefer a high spatial resolution. In this section it will become clear that the Soli sensor differs from classic radars in the sense that it prioritizes a high temporal resolution. To obtain a high temporal resolution, it transmits modulated pulses at a high repetition frequency (between 1-10 kHz). The Soli sensor chip is a 12 × 12 mm frequency modulated continuous wave (FMCW) SiGe radar chip manufactured by Infineon using embedded Wafer Level Ball Grid Array technology. The chip consists of two transmitting antenna elements and four receiving antenna elements. The receiving elements deliver 4 channels of data. For the recognition models in the following chapters, paper [1] states that due to cross-correlations one channel suffices to perform the computations on. This radar system provides the micro-Doppler data.

To describe the gathered data, some radar fundamentals have to be introduced, starting with the range resolution. The radar range resolution is the minimum physical separation between two distinguishable points in the line-of-sight of the radar. An expression for the range resolution is given by Equation 3.2, where c stands for the velocity of light and BW for the bandwidth of the radar.

\[ res_r = \frac{c}{2\, BW} \qquad (3.2) \]

Common gesture detection radar systems consist of antenna arrays which are digitally driven to produce a scanning beam. This is called beam steering and provides a sufficiently high angular resolution. The angular/spatial resolution at a range r is given by Equation 3.3, where b is the antenna beam width, λ the wavelength and l the aperture size.

\[ res_a = r\, b = \frac{r \lambda}{l} \qquad (3.3) \]

Assume an antenna with a working frequency of 60 GHz. If this antenna needs to have an angular/spatial resolution of 1 cm at a range of 20 cm, an aperture size of 10 cm × 10 cm is required. Radar systems of that size are not usable in wearable technologies. Consequently the Soli radar does not use the beam steering technique; instead it illuminates the hand with one wide beam of 150 degrees. The lack in spatial resolution is compensated by a high temporal resolution. The transmitted signal is modulated to a high-frequent periodic waveform, expressed in Equation 3.4. The function u describes the envelope of the waveform in one period. There are two distinct time scales involved. The unit of the slow time T is the radar repetition interval (RRI), the time between two starts of consecutive modulation periods. With this RRI, a radar repetition frequency RRF = 1/RRI can be defined. The fast time t is the time scale during one modulation period.

\[ s_{tr}(t, T) = u(t - T)\, \exp(j 2\pi f_c t), \qquad T = 0,\, RRI,\, 2\,RRI,\, \ldots \qquad (3.4) \]

The gesture performing hand is modeled using N_sc scattering centres. Each scattering centre is parameterized by a complex reflectivity parameter ρ_i(T) and its radial distance r_i(T). The superposition of all centres is given in Equation 3.5.

\[ y(r, T) = \sum_{i=1}^{N_{sc}} \rho_i(T)\, \delta\left(r - r_i(T)\right) \qquad (3.5) \]

It is assumed that the reflectivity parameter and the radial distance of each centre do not change during one modulation period; therefore Equation 3.5 does not depend on the fast time t. The raw received signal s_raw(t, T) consists of all reflections from each scattering centre.

\[ s_{raw}(t, T) = \sum_{i=1}^{N_{sc}} s_{i,raw}(t, T) \qquad (3.6) \]

Each individual reflected wave can be expressed as in Equation 3.7.

\[ s_{i,raw}(t, T) = \frac{\rho_i(T)}{r_i^4(T)}\, u\!\left(t - \frac{2 r_i(T)}{c}\right) \exp\!\left(j 2\pi f_c \left(t - \frac{2 r_i(T)}{c}\right)\right) \qquad (3.7) \]


The amplitude is proportional to r_i^{-4}(T). This path loss is idealized, but the range measurement will not be based on the amplitude. After processing the raw signals, the received signal for each scattering centre is transformed to Equation 3.8, where h is the point target response. This point target response depends on the modulation scheme, the transmission parameters and the preprocessing steps.

\[ s_{i,rec}(t, T) = \frac{\rho_i(T)}{r_i^4(T)}\, \exp\!\left(j \frac{4\pi r_i(T)}{\lambda}\right) h\!\left(t - \frac{2 r_i(T)}{c}\right) \qquad (3.8) \]

Each scattering centre reflects a point target response, each delayed in fast time, which is informative for the range. In addition, the range influences the phase of the received signal, while the reflectivity influences the amplitude. The exceptional motion sensitivity of the Soli radar is due to the captured phase change. The phase change of a scattering centre i, moving in a time interval [T₁, T₂], is given by Equation 3.9.

\[ \Delta\varphi_i(T_1, T_2) = \frac{4\pi}{\lambda}\left(r_i(T_2) - r_i(T_1)\right) \qquad (3.9) \]

Furthermore, the velocity of a scattering centre is assumed to be constant over some coherent processing time T_cpi. T_cpi has to be larger than RRI, as the range was assumed to be constant over one modulation period. The angular frequency of a wave and its phase are related through Equation 3.10.

\[ \omega_i(T) = \frac{d\varphi_i(T)}{dT} \qquad (3.10) \]

Using Equations 3.9 and 3.10, the (Doppler) frequency of a reflected wave can be related to the velocity of the corresponding scattering centre.

\[ f_{D,i}(T) = \frac{1}{2\pi}\,\omega_i(T) = \frac{1}{2\pi}\frac{d\varphi_i(T)}{dT} = \frac{2 v_i(T)}{\lambda} \qquad (3.11) \]

The received signal s_rec(t, T) can now be transformed to a more informative representation. The transformation at hand is a Fourier transformation in the slow time dimension. As mentioned, the slow time scale is the relevant scale over which the scattering centre properties change; Equation 3.11 does not depend on the fast time. The transformation of the signal is expressed in Equation 3.12. This leads to a 3D representation of the data. The slow time T is still a dimension of the data, because the Fourier transformation is only performed over time intervals of length T_cpi instead of over the complete domain.

\[ S(t, f, T) = \int_{T}^{T + T_{cpi}} s_{rec}(t, T')\, \exp(-j 2\pi f T')\, dT' \qquad (3.12) \]

The frequency dimension f in Equation 3.12 can be converted to a velocity dimension using Equation 3.11. The fast time dimension t can be converted to a range dimension using the fast time delay t = 2r/c.

\[ RD(r, v, T) = S\!\left(\frac{2r}{c},\, \frac{2v}{\lambda},\, T\right) \qquad (3.13) \]

The resulting frames are micro-Doppler radar frames, of which Figure 3.2 is an example. The horizontal axis of a frame indicates the velocity of the detected scattering centres; the vertical axis indicates the range between the detected scattering centres and the radar. The radar can detect small movements w.r.t. the movement of the complete object, for example fingers moving faster w.r.t. the hand. The capability to detect these small movements is what the term 'micro' refers to.

3.3 The Google Deepsoli dataset

The Google Deepsoli dataset, available at [8], is a dataset containing 11 different hand gestures, which are shown in Figure 3.3.


Figure 3.2: An example of a micro-Doppler radar frame. The horizontal axis represents velocity, while the vertical axis represents the range of the detected hand. The colorbar indicates the captured intensity.

Figure 3.3: The 11 different hand gestures with their corresponding label (Figure from [1])


Figure 3.4: Distribution of performers in the training set

Figure 3.5: Distribution of performers in the test set

Figure 3.6: The distribution over hand gesture labels in the training set

The dataset contains 5500 samples of different sequence lengths. Half of the gestures are performed by one user; this part of the set should be used for per-user classification, which is not the purpose of this thesis. The other half is performed by 10 different persons and will be used for training, validation and testing. At [8], a random train-test split of approximately 50%-50% is proposed. The training set contains 1386 samples and the test set contains 1364 samples. Figures 3.4 and 3.5 give respectively the distribution of performers in the training and test set. The labels of the performers have to be known for a cross-validation, but this will be discussed in Chapter 4.

Figure 3.6 shows that the distribution of hand gesture labels in the training set is perfectly balanced: each hand gesture is represented 126 times. Figure 3.2 showed an example of a micro-Doppler frame; in those frames only a few pixels are illuminated. Figures 3.7 and 3.8 show the mean and standard deviation of the pixel values over all frames. These figures show where and with which velocity the hand gestures are mostly performed. As preprocessing is a part of the built model, the influence of preprocessing will be shown in Chapter 4.


Figure 3.7: Mean over frames in the training set

Figure 3.8: Standard deviation over frames in the training set


Chapter 4

Classification with deep neural networks

Combining the concepts and data of respectively Chapters 2 and 3, a model will be built using Keras. The model is based on [1] and will be a benchmark for the novel models. All models are trained on GeForce GTX 1080 GPUs produced by NVIDIA. The GPUs had a capacity of 8112 MB, which made it possible to run cross-validations in parallel.

4.1 Preprocessing

Paper [1] proposes to perform a logarithmic scaling of the data followed by a min-max scaling. As the data is very sparse, a logarithmic scaling of the data would be computationally unstable; the addition of a dummy value η = 10⁻² overcomes this instability. Although [1] proposes this logarithmic scaling, the results of omitting it will also be investigated. The min-max scaling is performed over all pixel values and not per pixel, as such a scaling would destroy the spatial correlations. In a similar way, a scaling per frame would destroy the temporal correlations. Scaling is necessary for gradient descent to work better. A sketch of this preprocessing is given below.
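A minimal numpy sketch of this pipeline (the random stand-in array only mimics the shape of the radar sequences; function and variable names are illustrative):

```python
import numpy as np

ETA = 1e-2   # dummy value to stabilize the logarithm of near-zero pixels

def preprocess(frames):
    """Logarithmic scaling followed by a min-max scaling over ALL pixel
    values (not per pixel or per frame), so spatial and temporal
    correlations are preserved."""
    log_frames = np.log(frames + ETA)
    lo, hi = log_frames.min(), log_frames.max()   # global extrema
    return (log_frames - lo) / (hi - lo)

X = np.abs(np.random.randn(100, 40, 1024)) * 1e-3   # stand-in for radar sequences
X_scaled = preprocess(X)                            # values in [0, 1]
```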

4.2 Model

As the model is based on [1], no complete hyperparameter tuning has been performed. Sometimes the chosen hyperparameters differ from those in [1], based on the result on the validation set; this can be seen as a random search over 1 or 2 hyperparameters. The architecture consists of several convolutional layers followed by fully connected layers. Finally an LSTM layer and a fully connected layer with softmax activation output are added. All these layers are extensively described in Chapter 2. The first convolutional layers learn which representations of the data are suited as input for the LSTM layer; in turn, the LSTM layer learns the dynamic behaviour in the data. The global architecture is listed in Table 4.1. Each convolutional layer uses a kernel of size 3 × 3 with strides of 1 × 1. The input of the convolutional layers is not zero-padded. In each layer the ReLu activation function is used, except for the last one, where a softmax activation is used. The first three dropout layers have a dropout rate of 0.25, while the last three have a dropout rate of 0.3, whereas [1] proposes rates of respectively 0.4 and 0.5; cross-validations with those rates led to unsatisfying results on the validation sets. The loss function to minimize is the categorical cross entropy, defined in Chapter 2. The working model uses a batch size of 25 and is trained for 50 epochs with the Adam optimizer using a learning rate of 10⁻⁵.

As proposed in [1], a leave-one-person-out cross-validation is performed, leading to 10 folds (= 10-fold cross-validation strategy). However, in this case all performers of the training set are also in the test set. Another train-test split is investigated, where the data of one person is in either the training set or the test set. All the data samples of performers 2, 3, 5, 6 and 8 will be used for a cross-validation in the training set, while all the samples of performers 9, 10, 11, 12 and 13 will be used for testing. The same idea of leave-one-person-out is used, leading to 5 folds (= 5-fold cross-validation strategy). The number of folds denotes which train-test split is applied.

The results of 4 different cases will be shown; they differ in either the preprocessing or the cross-validation strategy.

Type layer                              | Output dimension
Input layer                             | batch × 40 × 1024
Reshape layer                           | batch × 40 × 1 × 32 × 32
Batch Normalization                     | batch × 40 × 1 × 32 × 32
Time distributed convolutional layer    | batch × 40 × 32 × 30 × 30
Batch Normalization layer               | batch × 40 × 32 × 30 × 30
Dropout layer                           | batch × 40 × 32 × 30 × 30
Time distributed convolutional layer    | batch × 40 × 64 × 28 × 28
Batch Normalization layer               | batch × 40 × 64 × 28 × 28
Dropout layer                           | batch × 40 × 64 × 28 × 28
Time distributed convolutional layer    | batch × 40 × 128 × 26 × 26
Batch Normalization layer               | batch × 40 × 128 × 26 × 26
Dropout layer                           | batch × 40 × 128 × 26 × 26
Reshape layer                           | batch × 40 × 86528
Fully connected layer                   | batch × 40 × 512
Batch Normalization                     | batch × 40 × 512
Dropout layer                           | batch × 40 × 512
Fully connected layer                   | batch × 40 × 512
Batch Normalization                     | batch × 40 × 512
Dropout layer                           | batch × 40 × 512
LSTM                                    | batch × 40 × 512
Batch Normalization                     | batch × 40 × 512
Dropout layer                           | batch × 40 × 512
Fully connected layer = output          | batch × 40 × 11

Table 4.1: Summary of the architecture of the deep neural network (DNN).


4.3 Results

4.3.1 10-fold cross-validation with logarithmic scaling

Figures 4.1 and 4.2 show the learning histories for one of the ten different cross-validations. In these figures the data of person 5 is used as validation set. All other cross-validations result in similar figures.

Figure 4.1: Learning history of the cross entropy loss. Blue represents the loss on the training set containing performers 2, 3, 6, 8, 9, 10, 11, 12 and 13. Orange represents the loss on the validation set containing performer 5.

Figure 4.2: Learning history of the accuracy. Blue represents the accuracy on the training set containing performers 2, 3, 6, 8, 9, 10, 11, 12 and 13. Orange represents the accuracy on the validation set containing performer 5.

With the 10 cross-validations, the following results are obtained on the test set. Figure 4.3 shows the confusion matrix over all frames. The average accuracy is 64.8%. Evidently, the model cannot predict on the first frame as well as on the last frame: at the first frame the gesture has barely been executed. Therefore, Figure 4.4 shows how the accuracy per gesture changes over the sequence length. As mentioned, the accuracy generally starts low and increases. The confusion matrix on the last frame is shown in Figure 4.5. The average accuracy at the last frame is 73.6%. The frequency at which the frames are obtained is 40 Hz; consequently, on the last frame, the gesture has been performed for 1 second.

[Figure 4.3: Normalized confusion matrix (true label vs. predicted label) over all frames. Preprocessing consisted of a logarithmic and min-max scaling. The validation strategy was a 10-fold cross-validation.]

[Figure 4.4: Accuracy per frame per gesture. Preprocessing consisted of a logarithmic and min-max scaling. The validation strategy was a 10-fold cross-validation.]


[Figure 4.5: Normalized confusion matrix (true label vs. predicted label) on the last frame. Preprocessing consisted of a logarithmic and min-max scaling. The validation strategy was a 10-fold cross-validation.]

4.3.2 10-fold cross-validation without logarithmic scaling

Figures 4.6 and 4.7 show the learning histories for one of the ten different cross-validations. In these figures the data of person 5 is used as validation set. All other cross-validations result in similar figures.

Figure 4.6: Learning history of the cross entropy loss. Blue represents the loss on the training set containing performers 2, 3, 6, 8, 9, 10, 11, 12 and 13. Orange represents the loss on the validation set containing performer 5.

Figure 4.7: Learning history of the accuracy. Blue represents the accuracy on the training set containing performers 2, 3, 6, 8, 9, 10, 11, 12 and 13. Orange represents the accuracy on the validation set containing performer 5.

With the 10 cross-validations, the following results are obtained on the test set. Figure 4.8 shows the confusion matrix over all frames. The average accuracy is 81.8%. Evidently, the model cannot predict on the first frame as well as on the last frame: at the first frame the gesture has barely been executed. Therefore, Figure 4.9 shows how the accuracy per gesture changes over the sequence length. As mentioned, the accuracy generally starts low and increases. The confusion matrix on the last frame is shown in Figure 4.10. The average accuracy at the last frame is 91%. The frequency at which the frames are obtained is 40 Hz; consequently, on the last frame, the gesture has been performed for 1 second.


[Figure 4.8: Normalized confusion matrix (true label vs. predicted label) over all frames. Preprocessing consisted only of a min-max scaling. The validation strategy was a 10-fold cross-validation.]

[Figure 4.9: Accuracy per frame per gesture. Preprocessing consisted only of a min-max scaling. The validation strategy was a 10-fold cross-validation.]

[Figure 4.10: Normalized confusion matrix (true label vs. predicted label) on the last frame. Preprocessing consisted only of a min-max scaling. The validation strategy was a 10-fold cross-validation.]


4.3.3 5-fold cross-validation with logarithmic scaling

Figures 4.11 and 4.12 show the learning histories for one of the five different cross-validations. In these figures the data of person 5 is used as validation set. All other cross-validations result in similar figures.

Figure 4.11: Learning history of the cross entropy loss. Blue represents the loss on the training set containing performers 2, 3, 6 and 8. Orange represents the loss on the validation set containing performer 5.

Figure 4.12: Learning history of the accuracy. Blue represents the accuracy on the training set containing performers 2, 3, 6 and 8. Orange represents the accuracy on the validation set containing performer 5.

With the 5 cross-validations, the following results are obtained on the test set. Figure 4.13 shows the confusion matrix over all frames. The average accuracy is 43.5%. Evidently, the model cannot predict on the first frame as well as on the last frame: at the first frame the gesture has barely been executed. Therefore, Figure 4.14 shows how the accuracy per gesture changes over the sequence length. As mentioned, the accuracy generally starts low and increases. The confusion matrix on the last frame is shown in Figure 4.15. The average accuracy at the last frame is 45.2%. The frequency at which the frames are obtained is 40 Hz; consequently, on the last frame, the gesture has been performed for 1 second.

[Figure 4.13: Normalized confusion matrix (true label vs. predicted label) over all frames. Preprocessing consisted of a logarithmic and min-max scaling. The validation strategy was a 5-fold cross-validation.]

[Figure 4.14: Accuracy per frame per gesture. Preprocessing consisted of a logarithmic and min-max scaling. The validation strategy was a 5-fold cross-validation.]


[Figure 4.15: Normalized confusion matrix (true label vs. predicted label) on the last frame. Preprocessing consisted of a logarithmic and min-max scaling. The validation strategy was a 5-fold cross-validation.]

4.3.4 5-fold cross-validation without logarithmic scaling

Figures 4.16 and 4.17 show the learning histories for one of the five different cross-validations. In these figures the data of person 5 is used as validation set. All other cross-validations result in similar figures.

Figure 4.16: Learning history of the cross entropy loss. Blue represents the loss on the training set containing performers 2, 3, 6 and 8. Orange represents the loss on the validation set containing performer 5.

Figure 4.17: Learning history of the accuracy. Blue represents the accuracy on the training set containing performers 2, 3, 6 and 8. Orange represents the accuracy on the validation set containing performer 5.

With the 5 cross-validations, the following results are obtained on the test set. Figure 4.18 shows the confusion matrix over all frames. The average accuracy is 64.8%. Evidently, the model cannot predict on the first frame as well as on the last frame: at the first frame the gesture has barely been executed. Therefore, Figure 4.19 shows how the accuracy per gesture changes over the sequence length. As mentioned, the accuracy generally starts low and increases. The confusion matrix on the last frame is shown in Figure 4.20. The average accuracy at the last frame is 72.7%. The frequency at which the frames are obtained is 40 Hz; consequently, on the last frame, the gesture has been performed for 1 second.


[Figure 4.18: Normalized confusion matrix (true label vs. predicted label) over all frames. Preprocessing consisted only of a min-max scaling. The validation strategy was a 5-fold cross-validation.]

[Figure 4.19: Accuracy per frame per gesture. Preprocessing consisted only of a min-max scaling. The validation strategy was a 5-fold cross-validation.]

[Figure 4.20: Normalized confusion matrix (true label vs. predicted label) on the last frame. Preprocessing consisted only of a min-max scaling. The validation strategy was a 5-fold cross-validation.]

4.3.5 Summary

The results are summarized in Table 4.2. It is clear that omitting the logarithmic scaling is superior to using it. Furthermore, the model using a 10-fold cross-validation performs better than the one using a 5-fold cross-validation. It seems that the model using the 10-fold cross-validation suffers from some data leakage. Based on the learning curves, the 5-fold cross-validation with logarithmic scaling leads to a biased model, as the validation loss does not decrease in Figure 4.11.

preprocess used    folds    overall    last frame    best                       worst
log + min-max      10       64.8%      73.6%         5, 6, 7                    0, 2, 4, 8, 9
log + min-max      5        43.5%      45.2%         8                          0, 4
min-max            10       81.8%      91%           1, 4, 5, 6, 7, 8, 9, 10    0, 3
min-max            5        64.8%      72.7%         6, 9                       3, 4

Table 4.2: Overview of the results with direct classification.

The best results are obtained with the 10-fold cross-validation without logarithmic scaling. The results in [1] are also obtained using a 10-fold cross-validation, but with a logarithmic scaling. The accuracies over all frames are compared in Table 4.3. The model in [1] is in general better than the self-built model.

gesture         accuracy own model (10-fold + only min-max)    accuracy paper model [1]
Pinch Index     53%                                            67.72%
Pinch Pinky     88%                                            71.09%
Finger Slide    55%                                            77.75%
Finger Rub      56%                                            94.48%
Slow Swipe      87%                                            84.84%
Fast Swipe      93%                                            98.45%
Push            98%                                            98.63%
Pull            88%                                            88.89%
Palm Tilt       86%                                            94.85%
Circle          98%                                            89.56%
Palm Hold       94%                                            92.63%
Overall         81.8%                                          87.17%

Table 4.3: Overview of the results with direct classification compared to the classifier in [1].

Figures 4.21 and 4.22 compare the accuracies over the sequence progress (note that the colors in both figures do not match). Both figures share some similarities. Simple gestures like palm hold and push are easily recognized: they are accurately classified on the first frame and do not really need the LSTM layer. Other gestures benefit from modeling the temporal effects with an LSTM. In general these are the gestures in which the fingers are the most informative, like pinch index and finger slider. The accuracies on the last frame are similar for all gestures, except for pinch pinky, which is classified worse by the self-built model than in [1].

Figure 4.21: Accuracy during sequence progress of the classifier in [1]. The labels are converted to lowercase letters.

Figure 4.22: Accuracy during sequence progress of the self-built classifier. Preprocessing consisted only of a min-max scaling. The validation strategy was a 10-fold cross-validation.

Figure 4.23: This is the same figure as Figure 3.3, to easily match the class labels in Figures 4.21 and 4.22 with the corresponding gestures.


Chapter 5

Autoencoders (AEs)

This chapter introduces an alternative way to make a classification. It is based on a VRNN, of which two types will be described. Before moving on to the variational autoencoder (VAE), the principles of a basic AE in image/video reconstruction are introduced.

5.1 Standard autoencoder (AE)

In previous sections neural networks were used for supervised problems. Standard AEs can be used for unsupervised problems. An AE tries to map the input samples x onto themselves; in other words, it tries to learn the identity function. The error between an image input sample x^{(i)} ∈ R^d and its reconstruction output x̂^{(i)} ∈ R^d is given by Equation 5.1.

L(x^{(i)}, \hat{x}^{(i)}) = \frac{1}{d} \sum_{k=1}^{d} \left( x_k^{(i)} - \hat{x}_k^{(i)} \right)^2 \qquad (5.1)

It seems trivial to learn the identity function, but imposing constraints on the neural network complicates the learning process. These constraints, however, make it possible to discover interesting structures in the data. The most important constraint is the number of hidden units. This is illustrated in Figure 5.1, which shows a simple AE based on fully connected layers. The left part of the AE is called the encoder: it encodes the d-dimensional data space into a 2-dimensional latent space. The right part is called the decoder: it decodes the 2-dimensional latent space back to the original d-dimensional data space. The layers of an AE are not limited to fully connected layers; every possible architecture may be used as encoder or decoder. The most important use of a standard AE is dimensionality reduction. AEs with convolutional layers have recently been applied to micro-Doppler data by [23].
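A minimal tf.keras sketch of such an AE with a 2-dimensional bottleneck is shown below; the layer sizes and activations are illustrative assumptions, not prescriptions.

    from tensorflow.keras import layers, models

    def build_autoencoder(d=1024, latent_dim=2):
        # Minimal fully connected AE with a 2-dimensional bottleneck (cf. Figure 5.1).
        inp = layers.Input(shape=(d,))
        z = layers.Dense(latent_dim, activation='relu')(inp)    # encoder
        out = layers.Dense(d, activation='sigmoid')(z)          # decoder
        ae = models.Model(inp, out)
        ae.compile(optimizer='adam', loss='mse')                # reconstruction loss of Eq. 5.1
        return ae

    # The AE learns the identity function: it is fit on (x, x) pairs,
    # e.g. ae.fit(x_train, x_train, epochs=..., batch_size=...).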

5.2 Variational autoencoder (VAE)

VAEs are the Bayesian extension of standard AEs. An excellent introduction to VAEs is provided by [24]. The most important advantage of a VAE is the ability to generate new data. To arrive at this ability, some concepts are introduced, starting from probability theory. The VAE is an example of the Bayesian way of modeling described in Chapter 2. The model starts from the assumption that the real data x is drawn from an underlying unknown process involving variables z, called the latent variables. In graphical modeling this is illustrated by Figure 5.2, in which x represents samples in the data space and z samples in the latent space.

With these two spaces, several distributions can be constructed: pθ(x) is called the evidence (p is the name of the distribution function and θ represents its corresponding parameters), pθ(z) is called the prior, pθ(x|z) the likelihood and pθ(z|x) the posterior. In a process described by Figure 5.2, first a random latent value is drawn from the prior, after which the perceived data is drawn from the likelihood. The likelihood describes the probability that the data sample is perceived given a latent variable.


[Figure 5.1: A schematic representation of a simple standard AE, with inputs x1, ..., xd, latent units z1, z2 and reconstructions x̂1, ..., x̂d.]

[Figure 5.2: Graphical model for a VAE, with latent variable z and data x. A filled circle means the information is known. Arrows indicate the information flow. The N within the rounded rectangle means this process is repeated each time.]

The prior pθ(z) is assumed to be the unit normal distribution. This is not necessary, but it simplifies the mathematics. The likelihood pθ(x|z) is also assumed to be a normal distribution. With these two probabilities, the joint probability can be constructed as in Equation 5.2.

p(x, z) = p(x|z)p(z) (5.2)

Once this joint probability is known, the posterior pθ(z|x) and the evidence p(x) can be inferred. Often p(x) has to be maximized. However, to do so, the integral in Equation 5.3 has to be calculated. This integral is intractable, as it requires integrating over every possible latent state.

p_\theta(z|x) = \frac{p_\theta(x|z)\, p_\theta(z)}{p(x)} = \frac{p_\theta(x|z)\, p_\theta(z)}{\int p_\theta(x|z)\, p_\theta(z)\, dz} \qquad (5.3)

To make the likelihood integral more tractable, it is replaced by a summation \frac{1}{N} \sum_{i=1}^{N} p_\theta(x|z^{(i)})\, p_\theta(z^{(i)}). Sampling this summation at random would still take too much time: random samples in the latent space will most of the time lead to irrelevant samples in the data space, or p_\theta(x|z^{(i)}) will be too low. The MNIST dataset is a clear example, in which random sampling in the latent space will most of the time lead to randomly colored pixels in the data space. The posterior p_\theta(z|x) can be used to feed the summation with relevant samples from the latent space, but then the problem loops back to Equation 5.3. To break the loop, the posterior is approximated by q_\phi(z|x). The similarity between the true posterior and the approximation can be measured with the KL divergence. On the one hand, this divergence has to be minimized; on the other hand, log p(x) has to be maximized. Both expressions are intractable, and hence gradient descent cannot be applied directly. However, with some calculus both expressions can be combined and rewritten into a useful loss function. Starting from p(x) and sampling latent samples z leads to Equation 5.4.

\log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)} \left[ \log \frac{p_\theta(x,z)}{p_\theta(z|x)} \right] = \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x,z)] - \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(z|x)] \qquad (5.4)

With the addition and subtraction of \mathbb{E}_{q_\phi(z|x)} [\log q_\phi(z|x)] on the right-hand side of Equation 5.4, it can be rewritten as Equation 5.5.

\log p_\theta(x) = \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x,z)] - \mathbb{E}_{q_\phi(z|x)} [\log q_\phi(z|x)] + D_{KL} [q_\phi(z|x) \,\|\, p_\theta(z|x)]
                = \mathcal{L}(\phi, \theta; x) + D_{KL} [q_\phi(z|x) \,\|\, p_\theta(z|x)] \qquad (5.5)

As the KL divergence is non-negative (see Appendix B), the function L appearing in Equation 5.5 is a lower bound on log pθ(x).

log pθ(x) ≥ L(φ,θ;x) (5.6)

In other words, if the left-hand side has to be maximized but the KL divergence minimized, then L(φ, θ; x) has to be maximized as well. Applying Bayes' theorem to the lower bound expression in Equation 5.5 leads to Equation 5.7.

The loss function of the model is the negative of this expression, as the standard built-in gradient descent algorithms search for a minimum instead of a maximum.

\mathcal{L}(\phi, \theta; x) = \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] + \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(z)] - \mathbb{E}_{q_\phi(z|x)} [\log q_\phi(z|x)]
                             = \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - D_{KL} [q_\phi(z|x) \,\|\, p_\theta(z)] \qquad (5.7)

Appendix A shows what the loss function looks like when the distributions are normal. Furthermore, this loss function has the same property as the one appearing in the discussion of generative models in Chapter 2: it consists of a reconstruction term and a built-in regularizing term (the KL divergence).

Now neural networks are incorporated. The prior pθ(z) was assumed to be the unit normal distribution N(0, I). The posterior pθ(z|x) was approximated by qφ(z|x), which is also assumed to be a normal distribution, but now neural networks learn the corresponding parameters φ. This corresponds to the encoding part of the AE and is called the 'inference network'. These networks learn the mean and variance of the posterior, also called µenc and σ²enc. With these parameters, latent samples z have to be drawn from the distribution qφ(z|x). Instead of treating z as a random latent variable, it can be treated as a deterministic variable that depends on an auxiliary variable ε with an independent marginal distribution p(ε). So instead of drawing z from the normal distribution qφ(z|x), the auxiliary variable ε is drawn from N(0, I) and z is calculated as µenc + ε · σenc. This is called the reparameterization trick, first introduced by [25]. Once the latent samples are known, they are used in the calculation of the likelihood pθ(x|z), which was also assumed to be a normal distribution. Just like the approximation of the posterior, it learns its parameters θ with neural networks. This is called the 'generative network' and corresponds to the decoding part of an AE. The parameters are the mean and variance of the likelihood, denoted by µdec and σ²dec. In fact, µdec is the reconstruction of the input x and σdec is a measure for the uncertainty of the reconstruction.
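A minimal TensorFlow sketch of the reparameterization trick, assuming the encoder outputs µenc and log σ²enc:

    import tensorflow as tf

    def sample_z(mu_enc, log_var_enc):
        # Reparameterization trick: z = mu + eps * sigma, with eps ~ N(0, I).
        # Sampling eps instead of z keeps z differentiable with respect to
        # the encoder outputs mu_enc and log sigma^2_enc.
        eps = tf.random.normal(tf.shape(mu_enc))
        return mu_enc + eps * tf.exp(0.5 * log_var_enc)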

5.3 Variational recurrent neural network (VRNN): type I

The VRNN is an example of a VAE using RNNs. There are two types, of which the first is discussed in this section. The first type is straightforward and very similar to the previous section. An LSTM transforms the input data (with dimensions batch size × sequence length × input features) to an intermediate hidden state h, from which µenc and log σ²enc are learned using fully connected layers (no activation function). The encoding parameters are used for sampling the latent states z. The latent samples are transformed to a second intermediate hidden state hz with a fully connected layer (ReLu activation). This hidden state is used to learn µdec. The decoding variance is fixed at 10^{-6} in order to make the reconstruction more effective. The prior is assumed/fixed to be a normal distribution N(0, I). The loss function in Equation 5.7 then reduces to Equation 5.8, using Equation A.5 for the KL divergence. Each latent space dimension is univariate normally distributed. The loss function per frame is the summation of the contributions of the single latent space dimensions, as expressed in Equation A.5. Finally, the total loss function is the average over frames. A scheme of the architecture is given in Figure 5.3.

[Figure 5.3: A schematic representation of the first type of VRNN: x → LSTM → h → fully connected layers → µenc, log σ²enc → sampling → z → fully connected → hz → fully connected → µdec, with σ²dec = 10^{-6}. Blue represents input, and orange represents information needed for calculating the loss function.]

D_{KL} [q_\phi(z|x) \,\|\, p_\theta(z)] = -\frac{1}{2} \left\langle \sum_{i=1}^{z_{dim}} \left( 1 + \log \sigma_{enc,i}^2 - \mu_{enc,i}^2 - \sigma_{enc,i}^2 \right) \right\rangle_{b,t}

\mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] = -\frac{1}{2} \left\langle \sum_{j=1}^{x_{dim}} \left( \log(2\pi\sigma_{dec,j}^2) + \frac{(x_j - \mu_{dec,j})^2}{\sigma_{dec,j}^2} \right) \right\rangle_{b,t}

\mathcal{L}(\mu_{enc}, \mu_{dec}, \sigma_{enc}, \sigma_{dec}; x) = D_{KL} [q_\phi(z|x) \,\|\, p_\theta(z)] - \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] \qquad (5.8)

where \langle \cdot \rangle_{b,t} denotes the average over the batch and time dimensions.
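The loss of Equation 5.8 can be transcribed almost literally; the NumPy sketch below assumes inputs of shape (batch, time, dim) and the fixed decoding variance of 10^{-6}.

    import numpy as np

    def vrnn_type1_loss(x, mu_dec, mu_enc, log_var_enc, var_dec=1e-6):
        # Negative lower bound of Eq. 5.8 for a standard-normal prior.
        # The sums run over the latent/data dimension; the result is
        # averaged over batch and time.
        kl = -0.5 * np.sum(1.0 + log_var_enc - mu_enc**2 - np.exp(log_var_enc), axis=-1)
        log_lik = -0.5 * np.sum(np.log(2.0 * np.pi * var_dec)
                                + (x - mu_dec)**2 / var_dec, axis=-1)
        return np.mean(kl - log_lik)   # loss = KL - E[log p(x|z)]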

5.4 Variational recurrent neural network (VRNN): type II

The second type is more sophisticated and is introduced by [26]. The prior is still assumed to be a normal distribution, but its parameters are learned and depend on the state of an RNN h_{t−1}. The index t−1 denotes that the prior depends on the previous timestep. The same is true for the approximated posterior and the likelihood. The posterior depends on the data x_t (or a transformed version) and on the hidden state h_{t−1}. The likelihood depends on the latent space samples z_t (or a transformed version) and on the hidden state h_{t−1}. For its practical implementation, the model takes two inputs. The input x_t has dimensions batch size × sequence length × input dimension. The other input, x_{t−1}, is the shifted version of x_t. They have the same dimensions; however, the time indices of x_t reach from 1 to 40, while those of x_{t−1} reach from 0 to 39. Figure 5.4 shows a schematic representation. The loss function of the VRNN type II differs from that of type I by the parameters of the prior distribution. For type II, these are µprior and σ²prior, whereas for type I they were respectively a zero matrix and a unit matrix. Hence the loss function for type II is expressed by Equation 5.9.


D_{KL} [q_\phi(z|x) \,\|\, p_\theta(z)] = -\frac{1}{2} \left\langle \sum_{i=1}^{z_{dim}} \left( 1 + \log \sigma_{enc,i}^2 - \log \sigma_{prior,i}^2 - \frac{(\mu_{enc,i} - \mu_{prior,i})^2}{\sigma_{prior,i}^2} - \frac{\sigma_{enc,i}^2}{\sigma_{prior,i}^2} \right) \right\rangle_{b,t}

\mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] = -\frac{1}{2} \left\langle \sum_{j=1}^{x_{dim}} \left( \log(2\pi\sigma_{dec,j}^2) + \frac{(x_j - \mu_{dec,j})^2}{\sigma_{dec,j}^2} \right) \right\rangle_{b,t}

\mathcal{L}(\mu_{enc}, \mu_{dec}, \sigma_{enc}, \sigma_{dec}; x) = D_{KL} [q_\phi(z|x) \,\|\, p_\theta(z)] - \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] \qquad (5.9)
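Only the KL term changes with respect to type I; a NumPy transcription of this term, under the same shape assumptions as before, could read:

    import numpy as np

    def kl_diagonal_gaussians(mu_enc, log_var_enc, mu_prior, log_var_prior):
        # KL term of Eq. 5.9 between two diagonal Gaussians, per (batch, time).
        var_enc, var_prior = np.exp(log_var_enc), np.exp(log_var_prior)
        return -0.5 * np.sum(1.0 + log_var_enc - log_var_prior
                             - (mu_enc - mu_prior)**2 / var_prior
                             - var_enc / var_prior, axis=-1)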

[Figure 5.4: A schematic representation of the second type of VRNN. The inputs x_t and x_{t−1} are transformed by φ_x; the LSTM state h_{t−1} feeds fully connected layers that produce µenc and log σ²enc as well as µprior and σ²prior; the sampled latent states z_t are transformed by φ_z and decoded into µdec, with σ²dec = 10^{-6}. Blue represents input, and orange represents information needed for calculating the loss function.]


Chapter 6

Classification using VRNN

In this chapter each model consists of two parts: the VRNN and a classifier that takes the latent state samples as input. The latter will be referred to as the 'post-VRNN classifier'. No logarithmic scaling is performed on the data fed to the VRNN, as the corresponding results in Chapter 4 were too unsatisfactory. The latent space dimension is set to 20, and the specifics of the post-VRNN classifier used in the beginning of this chapter are given in Table 6.1. These parameters will be optimized at the end of this chapter. The calculations were accelerated by cropping the frames: from Figures 3.7 and 3.8 it is clear that no useful data is present in the bottom half of the frames.

Type layer                                   Output dimension
Input layer                                  batch × 40 × 20
Time distributed fully connected             batch × 40 × 80
LSTM                                         batch × 40 × 80
Time distributed fully connected = output    batch × 40 × 11

Table 6.1: Summary of the architecture of the post-VRNN classifier.

All the layers have a ReLu activation function, except for the last one, which has a softmax activation function. The post-VRNN classifier is trained with the Adam optimizer with a learning rate of 10^{-4} for 50 epochs. The model uses a batch size equal to 25.
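A tf.keras sketch of the classifier in Table 6.1 is given below; it is an illustrative reconstruction, not the exact script used for the experiments.

    from tensorflow.keras import layers, models, optimizers

    def build_post_vrnn_classifier(seq_len=40, latent_dim=20, n_classes=11):
        # Sketch of the post-VRNN classifier of Table 6.1.
        model = models.Sequential()
        model.add(layers.TimeDistributed(layers.Dense(80, activation='relu'),
                                         input_shape=(seq_len, latent_dim)))
        model.add(layers.LSTM(80, return_sequences=True))
        model.add(layers.TimeDistributed(layers.Dense(n_classes, activation='softmax')))
        model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
                      loss='categorical_crossentropy', metrics=['accuracy'])
        return model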

6.1 Results VRNN type I

Table 6.2 contains the dimensional specifics and the activation functions of the VRNN type I used in the beginning of this chapter. For the names and architecture, the reader is referred to Figure 5.3. The model is trained with the Adam optimizer with a learning rate of 10^{-4} for 1000 epochs and uses a batch size equal to 25.

Name                    Output dimension     Used activation
x                       batch × 40 × 1024    None
h                       batch × 40 × 20      ReLu
µenc, log σ²enc & z     batch × 40 × 20      None
hz                      batch × 40 × 20      ReLu
µdec & σdec             batch × 40 × 1024    sigmoid & softplus

Table 6.2: Summary of the dimensions and activation functions of VRNN type I.


6.1.1 10-fold cross-validation

Figure 6.1 shows the VRNN loss history for one of the ten cross-validations. Figures 6.2 and 6.3 show respectively the cross entropy loss and accuracy history of the corresponding post-VRNN classifier. In these figures the data of person 5 is used as validation set. All other cross-validations result in similar figures.

Figure 6.1: VRNN loss using type I. The blue curve represents the history on the training set containing performers 2, 3, 6, 8, 9, 10, 11, 12 and 13. The orange curve represents the history on the validation set containing performer 5.

Figure 6.2: Learning history of the cross entropy loss. Blue represents the loss on the training set containing performers 2, 3, 6, 8, 9, 10, 11, 12 and 13. Orange represents the loss on the validation set containing the data of performer 5.

Figure 6.3: Learning history of the accuracy. Blue represents the accuracy on the training set containing performers 2, 3, 6, 8, 9, 10, 11, 12 and 13. Orange represents the accuracy on the validation set containing the data of performer 5.

With the 10 cross-validations, the following results are obtained on the test set. Figure 6.4 shows the confusion matrix over all frames. The average accuracy is 85.2%. Evidently, the model cannot predict on the first frame as well as on the last frame: at the first frame the gesture has barely been executed. Therefore, Figure 6.5 shows how the accuracy per gesture changes over the sequence length. As mentioned, the accuracy generally starts low and increases. The confusion matrix on the last frame is shown in Figure 6.6. The average accuracy at the last frame is 96.4%. The frequency at which the frames are obtained is 40 Hz; consequently, on the last frame, the gesture has been performed for 1 second.


[Figure 6.4: Normalized confusion matrix (true label vs. predicted label) over all frames. The input data was min-max scaled and the 10-fold cross-validation strategy was applied. A VRNN type I was used to create the latent space.]

[Figure 6.5: Accuracy per frame per gesture. The input data was min-max scaled and the 10-fold cross-validation strategy was applied. A VRNN type I was used to create the latent space.]

[Figure 6.6: Normalized confusion matrix (true label vs. predicted label) on the last frame. The input data was min-max scaled and the 10-fold cross-validation strategy was applied. A VRNN type I was used to create the latent space.]


6.1.2 5-fold cross-validation

Figure 6.7 shows the VRNN loss history for one of the five cross-validations. Figures 6.8 and 6.9 show respectively the cross entropy loss and accuracy history of the corresponding post-VRNN classifier. In these figures the data of person 5 is used as validation set. All other cross-validations result in similar figures.

Figure 6.7: VRNN loss using type I. The blue curve represents the history on the training set containing performers 2, 3, 6 and 8. The orange curve represents the history on the validation set containing the data of performer 5.

Figure 6.8: Learning history of the cross entropy loss. Blue represents the loss on the training set containing performers 2, 3, 6 and 8. Orange represents the loss on the validation set containing the data of performer 5.

Figure 6.9: Learning history of the accuracy. Blue represents the accuracy on the training set containing performers 2, 3, 6 and 8. Orange represents the accuracy on the validation set containing the data of performer 5.

With the 5 cross-validations, the following results are obtained on the test set. Figure 6.10 shows the confusion matrix over all frames. The average accuracy is 68.2%. Evidently, the model cannot predict on the first frame as well as on the last frame: at the first frame the gesture has barely been executed. Therefore, Figure 6.11 shows how the accuracy per gesture changes over the sequence length. As mentioned, the accuracy generally starts low and increases. The confusion matrix on the last frame is shown in Figure 6.12. The average accuracy at the last frame is 79.4%. The frequency at which the frames are obtained is 40 Hz; consequently, on the last frame, the gesture has been performed for 1 second.


[Figure 6.10: Normalized confusion matrix (true label vs. predicted label) over all frames. The input data was min-max scaled and the 5-fold cross-validation strategy was applied. A VRNN type I was used to create the latent space.]

[Figure 6.11: Accuracy per frame per gesture. The input data was min-max scaled and the 5-fold cross-validation strategy was applied. A VRNN type I was used to create the latent space.]

[Figure 6.12: Normalized confusion matrix (true label vs. predicted label) on the last frame. The input data was min-max scaled and the 5-fold cross-validation strategy was applied. A VRNN type I was used to create the latent space.]

6.2 Results VRNN type II

Table 6.3 contains the dimensional specifics and the activation functions of the VRNN type II used in the beginning of this chapter. At the end of this chapter an optimization will be performed. For the names and architecture, the reader is referred to Figure 5.4. The model is trained with the Adam optimizer with a learning rate of 10^{-4} for 1000 epochs. The batch size is equal to 25.

Name                                          Output dimension     Used activation
xt                                            batch × 40 × 1024    None
xt−1                                          batch × 40 × 1024    None
φx(xt)                                        batch × 40 × 20      ReLu
ht                                            batch × 40 × 20      ReLu
µenc, log σ²enc, µprior, log σ²prior & zt     batch × 40 × 20      None
φz(zt)                                        batch × 40 × 20      ReLu
µdec & σdec                                   batch × 40 × 1024    sigmoid & softplus

Table 6.3: Summary of the dimensions and activation functions of VRNN type II.


6.2.1 10-fold cross-validation

Figure 6.13 shows the VRNN loss history for one of the ten cross-validations. Figures 6.14 and 6.15 show respectively the cross entropy loss and accuracy history of the corresponding post-VRNN classifier. In these figures the data of person 5 is used as validation set. All other cross-validations result in similar figures.

Figure 6.13: VRNN loss using type II. The blue curve represents the history on the training set containing performers 2, 3, 6, 8, 9, 10, 11, 12 and 13. The orange curve represents the history on the validation set containing the data of performer 5.

Figure 6.14: Learning history of the cross entropy loss. Blue represents the loss on the training set containing performers 2, 3, 6, 8, 9, 10, 11, 12 and 13. Orange represents the loss on the validation set containing the data of performer 5.

Figure 6.15: Learning history of the accuracy. Blue represents the accuracy on the training set containing performers 2, 3, 6, 8, 9, 10, 11, 12 and 13. Orange represents the accuracy on the validation set containing the data of performer 5.

With the 10 cross-validations, the following results are obtained on the test set. Figure 6.16 shows the confusion matrix over all frames. The average accuracy is 85.6%. Evidently, the model cannot predict on the first frame as well as on the last frame: at the first frame the gesture has barely been executed. Therefore, Figure 6.17 shows how the accuracy per gesture changes over the sequence length. As mentioned, the accuracy generally starts low and increases. The confusion matrix on the last frame is shown in Figure 6.18. The average accuracy at the last frame is 96.1%. The frequency at which the frames are obtained is 40 Hz; consequently, on the last frame, the gesture has been performed for 1 second.


[Figure 6.16: Normalized confusion matrix (true label vs. predicted label) over all frames. The input data was min-max scaled and the 10-fold cross-validation strategy was applied. A VRNN type II was used to create the latent space.]

[Figure 6.17: Accuracy per frame per gesture. The input data was min-max scaled and the 10-fold cross-validation strategy was applied. A VRNN type II was used to create the latent space.]

[Figure 6.18: Normalized confusion matrix (true label vs. predicted label) on the last frame. The input data was min-max scaled and the 10-fold cross-validation strategy was applied. A VRNN type II was used to create the latent space.]


6.2.2 5-fold cross-validation

Figure 6.19 shows the VRNN loss history for one of the five cross-validations. Figures 6.20 and 6.21 show respectively the cross entropy loss and accuracy history of the corresponding post-VRNN classifier. In these figures the data of person 5 is used as validation set. All other cross-validations result in similar figures.

Figure 6.19: VRNN loss using type II. The blue curve represents the history on the training set containing performers 2, 3, 6 and 8. The orange curve represents the history on the validation set containing the data of performer 5.

Figure 6.20: Learning history of the cross entropy loss. Blue represents the loss on the training set containing performers 2, 3, 6 and 8. Orange represents the loss on the validation set containing the data of performer 5.

Figure 6.21: Learning history of the accuracy. Blue represents the accuracy on the training set containing performers 2, 3, 6 and 8. Orange represents the accuracy on the validation set containing the data of performer 5.

With the 5 cross-validations, the following results are obtained on the test set. Figure 6.22 shows the confusion matrix over all frames. The average accuracy is 67.6%. Evidently, the model cannot predict on the first frame as well as on the last frame: at the first frame the gesture has barely been executed. Therefore, Figure 6.23 shows how the accuracy per gesture changes over the sequence length. As mentioned, the accuracy generally starts low and increases. The confusion matrix on the last frame is shown in Figure 6.24. The average accuracy at the last frame is 79.3%. The frequency at which the frames are obtained is 40 Hz; consequently, on the last frame, the gesture has been performed for 1 second.


[Figure 6.22: Normalized confusion matrix (true label vs. predicted label) over all frames. The input data was min-max scaled and the 5-fold cross-validation strategy was applied. A VRNN type II was used to create the latent space.]

[Figure 6.23: Accuracy per frame per gesture. The input data was min-max scaled and the 5-fold cross-validation strategy was applied. A VRNN type II was used to create the latent space.]

[Figure 6.24: Normalized confusion matrix (true label vs. predicted label) on the last frame. The input data was min-max scaled and the 5-fold cross-validation strategy was applied. A VRNN type II was used to create the latent space.]

6.3 Summary

The results are summarized in Table 6.4. The results for type I are similar to those of type II. Furthermore, the model using a 10-fold cross-validation performs better than the one using a 5-fold cross-validation. It seems that the model using the 10-fold cross-validation again suffers from some data leakage.

VRNN type    folds    overall    last frame    best                       worst
I            10       85.2%      96.4%         1, 4, 5, 6, 7, 8, 9, 10    0, 2, 3
I            5        68.2%      79.4%         6, 7, 8, 9                 0, 2, 4, 10
II           10       85.6%      96.1%         1, 4, 5, 6, 7, 8, 9, 10    0, 2, 3
II           5        67.6%      79.3%         1, 5, 6, 7, 8              0, 2, 4, 10

Table 6.4: Overview of the results with the VRNN and the post-VRNN classifier.

The learning curves in Figures 6.8 and 6.20 show that the models are overfitting when the 5-fold cross-validation strategy is applied. For this strategy there is still room for improvement, which will be explored in the next section using Bayesian optimization techniques.


6.4 Hyperparameter optimization

Both types of VRNN give similar results. Only a few hyperparameters of the type II VRNN will be optimized, for the two cross-validation strategies. The architecture of the post-VRNN classifier is slightly changed w.r.t. Table 6.1: some batch normalization layers and dropout layers are added. The new architecture is given in Table 6.5. The activation functions do not change. The VRNN and the post-VRNN classifier are both trained/optimized with the Adam optimizer using a learning rate of 10^{-4} and a batch size equal to 25. The VRNN is trained for 1000 epochs and the post-VRNN classifier for 50 epochs. The models are optimized with the TPE algorithm of the Python hyperopt library, explained at the end of Chapter 2. For both the VRNN and the post-VRNN classifier, the prior placed on the hidden units is a discrete uniform distribution between 10 and 500 with a step size of 10. For the dropout rates, a continuous uniform distribution between 0 and 0.5 is assumed. A sketch of such an optimization is given below.
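An illustrative sketch of such a TPE search with hyperopt is given below. The objective train_and_validate is a hypothetical helper that builds the classifier of Table 6.5 with the sampled hyperparameters and returns the validation loss; the number of evaluations (max_evals) is an assumption.

    from hyperopt import fmin, tpe, hp, Trials

    # Search space matching the priors described above: hidden units drawn from
    # a discrete uniform distribution (10..500, step 10), dropout rates from [0, 0.5].
    space = {
        'units_fc':   hp.quniform('units_fc', 10, 500, 10),
        'dropout_1':  hp.uniform('dropout_1', 0.0, 0.5),
        'units_lstm': hp.quniform('units_lstm', 10, 500, 10),
        'dropout_2':  hp.uniform('dropout_2', 0.0, 0.5),
    }

    def objective(params):
        # train_and_validate is a hypothetical helper: it trains the classifier
        # with these hyperparameters and returns the validation loss.
        return train_and_validate(int(params['units_fc']), params['dropout_1'],
                                  int(params['units_lstm']), params['dropout_2'])

    best = fmin(fn=objective, space=space, algo=tpe.suggest,
                max_evals=50, trials=Trials())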

Type layer
Input layer
Batch normalization
Time distributed fully connected
Dropout layer
Batch normalization
LSTM
Dropout layer
Batch normalization
Time distributed fully connected = output

Table 6.5: Summary of the architecture of the post-VRNN classifier.

First, the model using a VRNN type II with the 10-fold cross-validation is discussed. The optimized hyperparameter values are listed in Table 6.6; the names appearing in this table can be found in Figure 5.4. Table 6.7 shows the optimized hyperparameters of the post-VRNN classifier for each possible validation fold.

hyperparameter       resulting value
dimension ht−1       310
dimension φx(xt)     370
dimension φz(zt)     330

Table 6.6: Overview of the results of the TPE algorithm performed on a VRNN type II using a 10-fold cross-validation strategy (3rd model of Table 6.4).

hyperparameter                               person label of validation fold
                                             2      3      5      6      8      9      10     11     12     13
output dimension 1st fully connected layer   410    490    250    210    410    200    410    300    190    330
1st dropout rate                             0.38   0.44   0.38   0.13   0.38   0.14   0.38   0.01   0.41   0.31
output dimension LSTM                        150    100    20     100    150    70     150    290    30     220
2nd dropout rate                             0.28   0.34   0.21   0.12   0.28   0.08   0.28   0.40   0.10   0.37

Table 6.7: Overview of the results of the TPE algorithm performed on the post-VRNN classifier (type II) using a 10-fold cross-validation strategy (3rd model of Table 6.4).

With this optimization, an accuracy of 90.7% over all frames is achieved, an increase of 5.1 percentage points compared to Table 6.4. The accuracy on the last frame is increased from 96.1% to 96.8%.

Secondly, the model using a VRNN type II with the 5-fold cross-validation is discussed. The optimized hyperparameters are listed in Table 6.8; the names appearing in this table can be found in Figure 5.4. Table 6.9 shows the optimized hyperparameters of the post-VRNN classifier for each possible validation fold.


hyperparameter       resulting value
dimension ht−1       250
dimension φx(xt)     300
dimension φz(zt)     260

Table 6.8: Overview of the results of the TPE algorithm performed on a VRNN type II using a 5-fold cross-validation strategy (4th model of Table 6.4).

hyperparameter                               person label of validation fold
                                             2      3      5      6      8
output dimension 1st fully connected layer   460    210    10     300    220
1st dropout rate                             0.31   0.01   0.14   0.004  0.16
output dimension LSTM                        120    410    90     50     120
2nd dropout rate                             0.22   0.005  0.1    0.32   0.40

Table 6.9: Overview of the results of the TPE algorithm performed on the post-VRNN classifier (type II) using a 5-fold cross-validation strategy (4th model of Table 6.4).

With this optimization, an accuracy of 67% over all frames is achieved, which is not really an increase compared to Table 6.4. The accuracy on the last frame is increased from 79.3% to 80.5%.

6.5 Intermediate results in the latent space

This section discusses the intermediate results obtained in the latent space and how well the optimized VRNN reconstructs the test data.

A closer look at the 20-dimensional latent space reveals that no latent space features are close to 0. As the VRNN is a generative model and hence self-regularizing, it would set unnecessary dimensions to 0. The reconstruction of the test set with this 20-dimensional latent space and the 10-fold cross-validation strategy will now be discussed. The RMSE is used as the error metric between reconstructed and original frames. Figure 6.25 shows the mean of the RMSEs per frame and per gesture. The results are obtained from one of the VRNNs in the 10-fold cross-validation, where the data of person 5 was used as validation set. It is remarkable that the reconstruction error drops at the 10th and 21st frame. Furthermore, the gestures that were classified the most accurately have worse reconstructions.
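As a small sketch, the per-frame mean RMSE of Figure 6.25 could be computed as follows, assuming original and reconstructed test sets of shape (samples, 40, 1024); the helper name is hypothetical.

    import numpy as np

    def mean_rmse_per_frame(x, x_rec):
        # RMSE per frame over the pixel dimension, averaged over the samples.
        rmse = np.sqrt(np.mean((x - x_rec) ** 2, axis=-1))   # (samples, 40)
        return rmse.mean(axis=0)                             # (40,)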

[Figure 6.25: The mean of the RMSE per frame and per gesture (⟨RMSE⟩ versus frame index, one curve per gesture: pinch index, palm tilt, finger slider, pinch pinky, slow swipe, fast swipe, push, pull, finger rub, circle, palm hold). Figure is obtained from the 10-fold cross-validation strategy. The data of person 5 was used as validation set.]


To visualize the latent space, a projection on 2 or 3 dimensions has to be made. Instead of projecting from a higher-dimensional space, a VRNN is built with a latent space dimension equal to 2, and the trajectory that a test sample takes through this latent space is investigated. Each frame of 32 × 32 pixels is thus compressed to 2 dimensions; as a data sample consists of a sequence of 40 frames, its latent space representation is a sequence of 40 pairs of latent space features. Figure 6.26 shows such a result for a VRNN trained with the 5-fold cross-validation strategy, where the data of person 5 was used as validation set.

Figure 6.26: Trajectory of 4 test samples through the latent space. Gestures in which the palm of the hand does not move typically have a less spread trajectory through the latent space. Figure is obtained from the 5-fold cross-validation strategy. The data of person 5 was used as validation set.
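A matplotlib sketch of how such trajectories could be drawn is given below; it assumes the 2-dimensional latent sequences have already been obtained from the encoder, and the function name is hypothetical.

    import matplotlib.pyplot as plt

    def plot_latent_trajectories(z_seqs, labels):
        # z_seqs: iterable of arrays of shape (40, 2), one per test sample, as
        # produced by the encoder of a VRNN with a 2-dimensional latent space.
        for z, label in zip(z_seqs, labels):
            plt.plot(z[:, 0], z[:, 1], marker='.', label=label)
        plt.xlabel('latent dimension 1')
        plt.ylabel('latent dimension 2')
        plt.legend()
        plt.show()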

The easily recognizable gestures are typically large gestures, like push and pull. The gestures that are harder to recognize are typically those in which the palm of the hand does not move (but the fingers may). Palm hold and pinch pinky are examples of such gestures. In other figures similar to Figure 6.26, obtained with the 5-fold cross-validation, the gestures in which the palm of the hand does not move have a less wide trajectory than the other gestures. In Figure 6.26, the gestures palm hold (orange) and pinch pinky (red) have a less spread trajectory than circle (blue) and fast swipe (brown). The latent space representation of a 'large' gesture takes a wide path through the latent space, and a wider path typically corresponds with a better classification. Figure 6.27 is similar to Figure 6.26, but is constructed with the 10-fold cross-validation strategy, with which each gesture is better classified. Palm hold and pinch pinky are better classified with this strategy and have a wider trajectory.

Figure 6.27: Trajectory of 4 test samples through the latent space. Gestures in which the palm of the hand does not move typically have a less spread trajectory through the latent space. Figure is obtained from the 10-fold cross-validation strategy. The data of person 5 was used as validation set.


Chapter 7

Conclusion

In this dissertation, deep neural networks (DNNs) classifying hand gestures in micro-Doppler radar data are compared to classifiers built on the latent space resulting from variational recurrent neural networks (VRNNs). The former networks are discriminative models, while the latter are generative models. Generative models can be used for noisy data augmentation, which acts as a regularizer for classification on a small dataset. Data augmentation is possible because a generative model models the joint distribution of inputs and labels. From this joint distribution the posterior can be inferred, which is modeled directly by discriminative models. From Bayes' rule it follows that the prior belief, missing in discriminative models, provides a built-in regularizing ability. To compare both approaches, a DNN is built based on [1]. Both models have the same architecture and hyperparameters, except for the dropout rates and the optimizer. Instead of the stochastic gradient descent optimizer, the Adam optimizer is used. Where [1] uses dropout rates of 0.4 and 0.5, the Keras DNN uses rates of respectively 0.25 and 0.30. These changes were implemented because the accuracies obtained on the validation sets were otherwise too low. Furthermore, a batch size of 25 is used instead of 16. The 50%-50% train-test split and the cross-validation strategy are again taken from [1]. They split the data set in such a way that each person performing hand gestures for the training set also performs gestures for the test set. The data set contains hand gestures of 10 different persons. In the proposed cross-validation strategy, each validation fold only contains training data of one person, leading to a 10-fold cross-validation. As this strategy combined with the train-test split suffers from some data leakage, an additional 50%-50% train-test split is researched, in which the data of one person is either in the training set or in the test set. A similar cross-validation strategy is applied, but as there are only 5 different persons in the training set, only 5 folds are made.

With the 10-fold cross-validation, the self-built DNN achieves an accuracy of 81.8% over all frames and 91% on the last frame. The corresponding model built in [1] obtains accuracies of respectively 87.2% and approximately 90% (the latter is an estimate based on Figure 7 in [1]). The differences in cross-validation, preprocessing and hyperparameters mentioned in the first paragraph explain the deviating accuracies. The performances of these two models are used as benchmarks: it is hoped that the models based on VRNNs outperform the self-built DNN and, in addition, the results of [1]. The 5-fold strategy cannot be compared with [1], as they did not build such a model. Nevertheless, this strategy is researched, resulting in an accuracy of 64.8% over all frames and 72.7% on the last frame for the DNN.

The VRNN and the post-VRNN classifier are both optimized with the tree-structured Parzen estimator (TPE). The VRNN is optimized for one validation fold, extending the optimized hyperparameters to the other validation folds. The post-VRNN classifier is optimized for each validation fold separately. The preprocessed data, as fed to the DNNs, is used again. An additional min-max scaling is applied to the latent state representation samples. Both the VRNN and the post-VRNN classifier are trained with the Adam optimizer, using a learning rate of 10^-4 and a batch size of 25. The VRNN is trained for 1000 epochs, while the post-VRNN classifier is trained for 50 epochs. The VRNN transforms the data to a 20-dimensional latent space. The results of the latent space models can be summarized as an accuracy of 90.7% over all frames and 96.8% on the last frame for the 10-fold cross-validation, and


respectively 67% and 80.5% for the 5-fold cross-validation. The results are listed in Table 7.1. It is clear that the latent space models outperform the DNNs built in Keras and the model from [1]. In addition, the latent space models use approximately 23 times fewer trainable parameters than the DNNs, which use over 46 million parameters.

model        folds    over all acc.    last frame acc.
DNN            5          64.8%             72.7%
DNN           10          81.8%             90%
post-VRNN      5          67%               80.5%
post-VRNN     10          90.7%             96.8%

Table 7.1: Summary of the obtained results with the DNN and the latent space models.
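The TPE optimization mentioned above could be set up with, for example, the hyperopt library as in the sketch below. The search space, the number of evaluations and the helper train_and_evaluate are hypothetical, not the exact configuration used in this work:

    from hyperopt import fmin, tpe, hp, Trials

    # Hypothetical search space for the post-VRNN classifier.
    space = {
        'units':   hp.choice('units', [32, 64, 128]),
        'dropout': hp.uniform('dropout', 0.0, 0.5),
    }

    def objective(params):
        # Train on the min-max scaled latent samples (Adam, lr 1e-4,
        # batch size 25, 50 epochs) and return the validation error.
        val_accuracy = train_and_evaluate(params)   # hypothetical helper
        return 1.0 - val_accuracy

    trials = Trials()
    best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)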

From the change in accuracy over the sequence length and from the confusion matrices, it can be deduced which hand gestures are easily classified. For these gestures, the accuracies on the first frame start fairly high and do not benefit as much from the dynamic modeling of the LSTM as the other gestures do. The more easily recognized gestures are those where the hand as a whole makes the motion, like push and fast/slow swipe. In the gestures that are hard to recognize, the motion of the fingers is more informative; the fingers have an additional speed w.r.t. the palm of the hand, as in pinch index, which is the most difficult gesture to recognize. The same can be concluded by comparing the results of the 5-fold and 10-fold cross-validation strategies. As the 10-fold strategy could suffer from some data leakage, the 5-fold strategy is more suited to recognize gestures universally. The gestures that do not really benefit from the LSTM are easily classified by the 5-fold strategy (similar accuracies as the 10-fold strategy). The other gestures are less accurately classified (much lower accuracies than with the 10-fold strategy).
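Per-frame accuracy curves and a last-frame confusion matrix, as used in this analysis, can be computed along the lines of the sketch below; the array shapes (200 samples, 40 frames, 11 gesture classes) are hypothetical placeholders:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    # y_true: (samples,) gesture labels; y_pred: (samples, frames) per-frame predictions.
    y_true = np.random.randint(0, 11, size=200)          # stand-in labels
    y_pred = np.random.randint(0, 11, size=(200, 40))    # stand-in predictions

    per_frame_acc = (y_pred == y_true[:, None]).mean(axis=0)  # accuracy per frame index
    cm = confusion_matrix(y_true, y_pred[:, -1])              # confusion on the last frame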

In future work, the dimension of the latent space can be increased to improve the reconstruction. A better reconstruction will lead to better latent space representations. The dimension should be increased until the VRNN starts to set latent space features to 0. The VRNN and the post-VRNN classifier will then need to undergo a new hyperparameter optimization, for which other algorithms can be explored. Furthermore, the folds in the 5-fold cross-validation can be split further. The current folds in the 5-fold strategy are too large, which leads to an overly biased model.


Appendix A

Kullback-Leibler divergence in case of normal distributions

A.1 Univariate normal distributions

Assume p(x) and q(x) are univariate normal distributions: p(x) with mean µ1 and variance σ1^2, and q(x) with mean µ2 and variance σ2^2. Equation A.1 repeats the definition of the Kullback-Leibler divergence for continuous variables.

D_{KL}(p\|q) = -\int p(x)\,\log\left(\frac{q(x)}{p(x)}\right)dx    (A.1)

Inserting the expressions for the normal distributions into Equation A.1 leads to Equation A.2.

\begin{aligned}
-\int p(x)\,\log\frac{\frac{1}{\sqrt{2\pi\sigma_2^2}}\,e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2}}}{\frac{1}{\sqrt{2\pi\sigma_1^2}}\,e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}}}\,dx
&= -\int p(x)\,\log\left(\frac{\sqrt{\sigma_1^2}}{\sqrt{\sigma_2^2}}\right)dx \\
&\quad - \int p(x)\left[-\frac{(x-\mu_2)^2}{2\sigma_2^2} + \frac{(x-\mu_1)^2}{2\sigma_1^2}\right]dx
\end{aligned}    (A.2)

The first term can be evaluated using the normalisation property of a distribution. After reordering the terms, Equation A.3 is obtained.

D_{KL}(p\|q) = \frac{1}{2}\log\left(\frac{\sigma_2^2}{\sigma_1^2}\right)
- \frac{1}{2\sigma_1^2}\int p(x)\,(x-\mu_1)^2\,dx
+ \frac{1}{2\sigma_2^2}\int p(x)\,(x-\mu_2)^2\,dx    (A.3)

The second term in Equation A.3 can be simplified using the definition of variance. By adding and subtracting µ1 in the last integral, the integral is split into three terms, as done in Equation A.4. The first of these integrals is again evaluated using the definition of variance. The second integral is simplified using the normalisation property.

\begin{aligned}
D_{KL}(p\|q) &= \frac{1}{2}\log\left(\frac{\sigma_2^2}{\sigma_1^2}\right) - \frac{1}{2}
+ \frac{1}{2\sigma_2^2}\int p(x)\,(x-\mu_1+\mu_1-\mu_2)^2\,dx \\
&= \frac{1}{2}\log\left(\frac{\sigma_2^2}{\sigma_1^2}\right) - \frac{1}{2}
+ \frac{1}{2\sigma_2^2}\int p(x)\,(x-\mu_1)^2\,dx
+ \frac{1}{2\sigma_2^2}\int p(x)\,(\mu_1-\mu_2)^2\,dx \\
&\quad + \frac{1}{\sigma_2^2}\,(\mu_1-\mu_2)\int p(x)\,(x-\mu_1)\,dx \\
&= \frac{1}{2}\log\left(\frac{\sigma_2^2}{\sigma_1^2}\right) - \frac{1}{2}
+ \frac{\sigma_1^2}{2\sigma_2^2} + \frac{(\mu_1-\mu_2)^2}{2\sigma_2^2}
+ \frac{1}{\sigma_2^2}\,(\mu_1-\mu_2)\int p(x)\,(x-\mu_1)\,dx
\end{aligned}    (A.4)


The last integral in Equation A.4 is zero by the definition of the mean. Hence the Kullback-Leibler divergence between two univariate normal distributions is given by Equation A.5.

D_{KL}(p\|q) = \frac{1}{2}\log\left(\frac{\sigma_2^2}{\sigma_1^2}\right) - \frac{1}{2}
+ \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2}    (A.5)
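As a quick numerical sanity check of Equation A.5 (a sketch added here, not part of the original derivation), the closed form can be compared to a Monte Carlo estimate of the definition in Equation A.1:

    import numpy as np

    def kl_univariate(mu1, var1, mu2, var2):
        # Closed form of Equation A.5.
        return 0.5 * np.log(var2 / var1) - 0.5 + (var1 + (mu1 - mu2) ** 2) / (2 * var2)

    np.random.seed(0)
    mu1, var1, mu2, var2 = 0.0, 1.0, 1.0, 2.0
    x = np.random.normal(mu1, np.sqrt(var1), 1000000)   # samples from p
    log_p = -0.5 * np.log(2 * np.pi * var1) - (x - mu1) ** 2 / (2 * var1)
    log_q = -0.5 * np.log(2 * np.pi * var2) - (x - mu2) ** 2 / (2 * var2)
    print(np.mean(log_p - log_q))                # Monte Carlo estimate of D_KL(p||q)
    print(kl_univariate(mu1, var1, mu2, var2))   # closed form, ~0.3466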

A.2 Multivariate normal distributions

The calculus for the univariate normal distributions can be reused for the multivariate case. The determinant of a matrix A is denoted by |A|. Assume p(x) and q(x) are multivariate normal distributions: p(x) with mean vector µ1 ∈ R^{d×1} and covariance matrix Σ1 ∈ R^{d×d}, and q(x) with mean vector µ2 ∈ R^{d×1} and covariance matrix Σ2 ∈ R^{d×d}.

\begin{aligned}
D_{KL}(p\|q) &= -\int p(x)\,\log\frac{\frac{1}{\sqrt{(2\pi)^d|\Sigma_2|}}\,e^{-\frac{1}{2}(x-\mu_2)^\top\Sigma_2^{-1}(x-\mu_2)}}{\frac{1}{\sqrt{(2\pi)^d|\Sigma_1|}}\,e^{-\frac{1}{2}(x-\mu_1)^\top\Sigma_1^{-1}(x-\mu_1)}}\,dx \\
&= \frac{1}{2}\log\left(\frac{|\Sigma_2|}{|\Sigma_1|}\right)
- \int p(x)\,\frac{1}{2}\left[-(x-\mu_2)^\top\Sigma_2^{-1}(x-\mu_2) + (x-\mu_1)^\top\Sigma_1^{-1}(x-\mu_1)\right]dx
\end{aligned}    (A.6)

To simplify the Kullback-Leibler divergence, some algebra has to be introduced. Starting from the definition of the covariance matrix, an expression for E(xx^T) can be obtained.

\begin{aligned}
\Sigma &= \mathbb{E}\left[(x-\mu)(x-\mu)^\top\right] \\
&= \mathbb{E}(xx^\top) - \mathbb{E}(x\mu^\top) - \mathbb{E}(\mu x^\top) + \mathbb{E}(\mu\mu^\top) \\
&= \mathbb{E}(xx^\top) - \mathbb{E}(x)\,\mu^\top - \mu\,\mathbb{E}(x^\top) + \mu\mu^\top \\
&= \mathbb{E}(xx^\top) - \mu\mu^\top
\end{aligned}    (A.7)

The integrals appearing in Equation A.6 are expectation values of the form x^T A x, which is a scalar. As a scalar can be seen as a 1 × 1 matrix, it is equal to its trace. The trace of a matrix A ∈ R^{n×n} is defined in Equation A.8. The definition of the trace can be extended to higher dimensions, but this is out of the scope of this thesis.

\mathrm{Tr}(A) = \sum_{i=1}^{n} A_{ii}    (A.8)

It is easy to prove that Tr(ABC) = Tr(BCA) = Tr(CAB) with A ∈ R^{m×n}, B ∈ R^{n×p} and C ∈ R^{p×m}.

\begin{aligned}
\mathrm{Tr}(ABC) &= \sum_{i=1}^{m}(ABC)_{ii}
= \sum_{i=1}^{m}\sum_{j=1}^{p}(AB)_{ij}C_{ji}
= \sum_{i=1}^{m}\sum_{j=1}^{p}\sum_{k=1}^{n}A_{ik}B_{kj}C_{ji} \\
&= \sum_{i=1}^{m}\sum_{j=1}^{p}\sum_{k=1}^{n}B_{kj}C_{ji}A_{ik}
= \sum_{i=1}^{m}\sum_{k=1}^{n}(BC)_{ki}A_{ik}
= \sum_{k=1}^{n}(BCA)_{kk} = \mathrm{Tr}(BCA)
\end{aligned}    (A.9)

With this property, it is easy to calculate the expectation value of x^T A x.

\mathbb{E}(x^\top A x) = \mathbb{E}\left[\mathrm{Tr}(x^\top A x)\right]
= \mathbb{E}\left[\mathrm{Tr}(A x x^\top)\right]
= \mathrm{Tr}\left[\mathbb{E}(A x x^\top)\right]
= \mathrm{Tr}\left[A\,\mathbb{E}(x x^\top)\right]    (A.10)

Using Equation A.7 and the cyclic property of the trace operator, Equation A.10 can be simplified.

\mathbb{E}(x^\top A x) = \mathrm{Tr}\left[A\left(\Sigma + \mu\mu^\top\right)\right]
= \mathrm{Tr}(A\Sigma) + \mathrm{Tr}(A\mu\mu^\top)
= \mathrm{Tr}(A\Sigma) + \mu^\top A\mu    (A.11)

Equation A.11 can be used to evaluate the integrals in Equation A.6.

\begin{aligned}
D_{KL}(p\|q) &= \frac{1}{2}\log\left(\frac{|\Sigma_2|}{|\Sigma_1|}\right)
- \frac{1}{2}\mathrm{Tr}(\Sigma_1^{-1}\Sigma_1)
- \frac{1}{2}(\mu_1 - \mu_1)^\top\Sigma_1^{-1}(\mu_1 - \mu_1) \\
&\quad + \frac{1}{2}\mathrm{Tr}(\Sigma_2^{-1}\Sigma_1)
+ \frac{1}{2}(\mu_1 - \mu_2)^\top\Sigma_2^{-1}(\mu_1 - \mu_2)
\end{aligned}    (A.12)


This results in Equation A.13.

D_{KL}(p\|q) = \frac{1}{2}\log\left(\frac{|\Sigma_2|}{|\Sigma_1|}\right) - \frac{d}{2}
+ \frac{1}{2}\mathrm{Tr}(\Sigma_2^{-1}\Sigma_1)
+ \frac{1}{2}(\mu_1 - \mu_2)^\top\Sigma_2^{-1}(\mu_1 - \mu_2)    (A.13)
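Equation A.13 translates directly into a few lines of numpy. The sketch below (with a hypothetical helper kl_mvn, added here for illustration) checks that the one-dimensional case reproduces Equation A.5:

    import numpy as np

    def kl_mvn(mu1, S1, mu2, S2):
        # Equation A.13 for N(mu1, S1) and N(mu2, S2).
        d = mu1.shape[0]
        S2_inv = np.linalg.inv(S2)
        diff = mu1 - mu2
        return 0.5 * (np.log(np.linalg.det(S2) / np.linalg.det(S1)) - d
                      + np.trace(S2_inv @ S1) + diff @ S2_inv @ diff)

    mu1, S1 = np.array([0.0]), np.array([[1.0]])
    mu2, S2 = np.array([1.0]), np.array([[2.0]])
    print(kl_mvn(mu1, S1, mu2, S2))   # ~0.3466, identical to Equation A.5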


Appendix B

Non-negativity of Kullback-Leibler divergence

This section proves the non-negativity of the KL divergence for general distributions. The proof is based on Jensen's inequality, which is proven as well.

B.1 Jensen’s inequality

There are several ways to express Jensen's inequality; this thesis approaches it in the context of probability theory.

Theorem 1. (Jensen’s inequality) If X is a random variable and g is a convex function, then:

g(\mathbb{E}[X]) \le \mathbb{E}[g(X)]

Proof. Consider a convex function g: X → Y and a value x1 ∈ X with x1 = E[X]. As g is convex, each tangent line lies below g itself. This is in particular the case for the tangent line in x1, which has a general expression of the form L(x) = ax + b. The image y1 ∈ Y of x1 ∈ X can be expressed as either g(x1) or L(x1), as the tangent line and the function coincide in the tangent point. For each value x2 ∈ X it follows that g(x2) ≥ L(x2), with equality if x2 = x1. Hence the inequality also holds for the expectation values: E[g(X)] ≥ E[L(X)]. This inequality can be worked out using the general expression of the tangent line.

\begin{aligned}
\mathbb{E}[g(X)] &\ge \mathbb{E}[L(X)] \\
\mathbb{E}[g(X)] &\ge \mathbb{E}[aX + b] \\
\mathbb{E}[g(X)] &\ge a\,\mathbb{E}[X] + b \\
\mathbb{E}[g(X)] &\ge L(\mathbb{E}[X])
\end{aligned}    (B.1)

But by construction of the tangent line, L(E[X]) = g(E[X]). Hence Jensen’s inequality is proven.

\mathbb{E}[g(X)] \ge g(\mathbb{E}[X])    (B.2)

A typical example of Jensen’s inequality is the variance of a random variable X, where the convexfunction is the quadratic.

\begin{aligned}
\mathrm{var}(X) &= \mathbb{E}\left[X^2\right] - \left(\mathbb{E}[X]\right)^2 \ge 0 \\
\mathbb{E}\left[X^2\right] &\ge \left(\mathbb{E}[X]\right)^2
\end{aligned}    (B.3)
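This instance of the inequality is easy to verify numerically (a sketch with arbitrary sample values):

    import numpy as np

    x = np.array([1.0, 2.0, 6.0])   # arbitrary sample of X
    print(np.mean(x ** 2))          # E[X^2] = 13.67
    print(np.mean(x) ** 2)          # (E[X])^2 = 9.0, so E[X^2] >= (E[X])^2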

A generalization of Jensen’s inequality can be deduced by introducing the dependency of random vari-able X to another random variable Z with f : X → Z. In other words Z = f(X). Applying Jensen’sinequality to random variable Z leads to Equation B.4.


\mathbb{E}[g(Z)] \ge g(\mathbb{E}[Z])    (B.4)

By substituting the dependency on X, Equation B.4 changes to Equation B.5.

\mathbb{E}[g(f(X))] \ge g(\mathbb{E}[f(X)])    (B.5)

B.2 Non-negativity of KL divergence

Theorem 2. (Non-negativity of KL divergence) If P and Q are two probability distributions, then:

D_{KL}(P\|Q) \ge 0

The non-negativity can easily be proven using Jensen's inequality. Recall the definition of the KL divergence.

Proof.

D_{KL}(P\|Q) = -\mathbb{E}_{x\sim P}\left[\log\frac{Q(x)}{P(x)}\right]
= \mathbb{E}_{x\sim P}\left[-\log\frac{Q(x)}{P(x)}\right]    (B.6)

The negative logarithmic function is a convex function. Hence Jensen’s inequality can be applied.

\begin{aligned}
\mathbb{E}_{x\sim P}\left[-\log\frac{Q(x)}{P(x)}\right]
&\ge -\log\left(\mathbb{E}_{x\sim P}\left[\frac{Q(x)}{P(x)}\right]\right) \\
&\ge -\log\left(\int P(x)\,\frac{Q(x)}{P(x)}\,dx\right) \\
&\ge -\log\left(\int Q(x)\,dx\right) \\
&\ge -\log(1) \\
&\ge 0
\end{aligned}    (B.7)
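The result can be verified numerically for arbitrary discrete distributions; the sketch below uses random distributions purely for illustration:

    import numpy as np

    np.random.seed(0)
    p = np.random.rand(10); p /= p.sum()   # random discrete distribution P
    q = np.random.rand(10); q /= q.sum()   # random discrete distribution Q
    dkl = np.sum(p * np.log(p / q))        # discrete form of Equation B.6
    print(dkl)                             # always >= 0; zero only if P == Q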


Bibliography

[1] Wang S., Song J., Lien J., Poupyrev I. & Hilliges O. (2016). Interacting with Soli: Exploring Fine-Grained Dynamic Gesture Recognition in the Radio-Frequency Spectrum. Proceedings of the 29th Annual Symposium on User Interface Software and Technology, pages 854-860. Consulted on February 5, 2018 via https://dl.acm.org/citation.cfm?id=2984565

[2] Rohling H., Heuel S. & Ritter H. (2010). Pedestrian detection procedure integrated into an 24 GHz automotive radar. 2010 IEEE Radar Conference, pages 1229-1233. Consulted on May 29, 2018 via https://ieeexplore.ieee.org/document/5494432/

[3] Liu L., Popescu M. & Skubic M. (2011). Automatic Fall Detection Based on Doppler Radar Motion Signature. 2011 5th International Conference on Pervasive Computing Technologies for Healthcare (PervasiveHealth) and Workshops, pages 222-226. Consulted on May 29, 2018 via https://ieeexplore.ieee.org/document/6038799/

[4] Bjorklund S., Johansson T. & Petersson H. (2016). Target classification in perimeter protection with a micro-Doppler radar. 2016 17th International Radar Symposium (IRS). Consulted on May 29, 2018 via https://ieeexplore.ieee.org/document/7497363/

[5] Youngwook K. & Ling H. (2009). Human Activity Classification Based on Micro-Doppler Signatures Using a Support Vector Machine. IEEE Transactions on Geoscience and Remote Sensing, 47, nr. 5, pages 1328-1337. Consulted on May 29, 2018 via https://ieeexplore.ieee.org/document/4801689/

[6] Li J., Lam Phung S. & Hing Chi Tivive F. (2012). Automatic classification of human motions using Doppler radar. The 2012 International Joint Conference on Neural Networks (IJCNN). Consulted on May 29, 2018 via https://ieeexplore.ieee.org/document/6252625/

[7] Lien J., Gillian N., Karagozler E., Amihood P., Schwesig C., Olson E., Raja H. & Poupyrev I. (2016). Soli: Ubiquitous Gesture Sensing with Millimeter Wave Radar. ACM Transactions on Graphics (TOG), 35, nr. 4, Article No. 142.

[8] Wang S. (2016). Gesture Recognition Using Neural Networks with Google's Project Soli Sensor. GitHub. Consulted on February 1, 2018 via https://github.com/simonwsw/deep-soli

[9] Goodfellow I., Bengio Y. & Courville A. (2016). Deep Learning. MIT Press. Consulted on July 25, 2018 via https://github.com/janishar/mit-deep-learning-book-pdf

[10] Ng A. (2011). Lecture 1: Introduction. Coursera: Machine Learning. Consulted on July 16, 2017 via https://www.coursera.org/learn/machine-learning#

[11] LeCun Y. & Bengio Y. (1995). Convolutional Networks for Images, Speech and Time-Series. The Handbook of Brain Theory and Neural Networks. Consulted on May 30, 2018 via https://pdfs.semanticscholar.org/e26c/c4a1c717653f323715d751c8dea7461aa105.pdf

[12] Krizhevsky A., Sutskever I. & Hinton G. E. (2012). ImageNet classification with deep convolutional neural networks. NIPS'12 Proceedings of the 25th International Conference on Neural Information Processing Systems, 25, nr. 1, pages 1097-1105. Consulted on May 30, 2018 via https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

[13] Tejani S. (2016). Machines that can see: Convolutional Neural Networks. GitHub. Consulted for figure on May 24, 2018 via https://shafeentejani.github.io/2016-12-20/convolutional-neural-nets/

[14] Hochreiter S. & Schmidhuber J. (1997). Long Short-Term Memory. Neural Computation, 9, nr. 8, pages 1735-1780. Consulted on May 30, 2018 via https://pdfs.semanticscholar.org/e26c/c4a1c717653f323715d751c8dea7461aa105.pdf

[15] Duchi J., Hazan E. & Singer Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12, pages 2121-2159. Consulted on May 30, 2018 via http://jmlr.org/papers/volume12/duchi11a/duchi11a.pdf

[16] Tieleman T. & Hinton G. (2012). Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning. Consulted on May 30, 2018 via https://www.coursera.org/learn/neural-networks/lecture/YQHki/rmsprop-divide-the-gradient-by-a-running-average-of-its-recent-magnitude

[17] Kingma D. P. & Lei Ba J. (2015). Adam: a method for stochastic optimization. The International Conference on Learning Representations (ICLR). Consulted on May 30, 2018 via https://arxiv.org/abs/1412.6980

[18] Gal Y. & Ghahramani Z. (2016). A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. NIPS'16 Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 1027-1035. Consulted on May 31, 2018 via https://arxiv.org/abs/1512.05287?context=stat

[19] Rasmussen C. E. & Williams C. K. I. (2006). Gaussian Processes for Machine Learning. The MIT Press. http://www.gaussianprocess.org/gpml/chapters/RW.pdf

[20] Snoek J., Larochelle H. & Adams R. P. (2012). Practical Bayesian Optimization of machine learning algorithms. NIPS'12 Proceedings of the 25th International Conference on Neural Information Processing Systems, 25, nr. 2, pages 2951-2959. Consulted on May 31, 2018 via https://arxiv.org/abs/1206.2944

[21] Bergstra J., Bardenet R., Bengio Y. & Kégl B. (2011). Algorithms for Hyper-Parameter Optimization. NIPS'11 Proceedings of the 24th International Conference on Neural Information Processing Systems, pages 2546-2554. Consulted on April 25, 2018 via https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf

[22] Dutch Wikipedia page on the Doppler effect. Consulted for figure on April 4, 2018 via https://nl.wikipedia.org/wiki/Dopplereffect

[23] Parashar K. N., Oveneke M. C. & Rykunov M. (2017). Micro-Doppler feature extraction using convolutional auto-encoders for low latency target classification. 2017 IEEE Radar Conference (RadarConf). Consulted on May 31, 2018 via https://ieeexplore.ieee.org/document/7944488/

[24] Kingma D. P. & Welling M. (2013). Auto-Encoding Variational Bayes. CoRR. Consulted on September 26, 2017 via https://arxiv.org/pdf/1312.6114.pdf

[25] Kingma D. P., Salimans T. & Welling M. (2015). Variational Dropout and the Local Reparameterization Trick. NIPS'15 Proceedings of the 28th International Conference on Neural Information Processing Systems, 28, nr. 2, pages 2575-2583. Consulted on May 31, 2018 via https://arxiv.org/abs/1506.02557

[26] Chung J., Kastner K., Dinh L., Goel K., Courville A. & Bengio Y. (2015). A Recurrent Latent Variable Model for Sequential Data. NIPS'15 Proceedings of the 28th International Conference on Neural Information Processing Systems, 28, nr. 2, pages 2980-2988. Consulted on September 26, 2017 via https://arxiv.org/pdf/1506.02216.pdf
