Leveraging Transfer Learning in Multiple Human Activity Recognition Using WiFi Signal

Sheheryar Arshad∗, Chunhai Feng∗, Ruiyun Yu†, Yonghe Liu∗
∗Department of Computer Science & Engineering, The University of Texas at Arlington, USA

†College of Software, Northeastern University, China
Email: [email protected], [email protected], [email protected], [email protected]

Abstract—Existing works on human activity recognition predominantly consider single-person scenarios, which deviate significantly from the real world where multiple people exist simultaneously. In this work, we leverage transfer learning, a deep learning technique, to present a framework (TL-HAR) that accurately detects multiple human activities, exploiting the CSI of WiFi extracted from 802.11n. Specifically, for the first time we employ packet-level classification and image transformation together with transfer learning to classify the complex scenario of multiple human activities. We design an algorithm that extracts activity-based CSI using the variance of MIMO subcarriers. Subsequently, TL-HAR transforms CSI to images to capture correlation among subcarriers and uses a deep Convolutional Neural Network (d-CNN) to extract representative features for the classification. We further reduce training complexity through transfer learning, which infers knowledge from a pre-trained model. Experimental results confirm the significance of our approach. We show that using transfer learning TL-HAR improves recognition accuracy to 96.7% and 99.1% for single and multiple MIMO links.

Index Terms—Channel State Information, Deep Learning, Transfer Learning, Multiple Human Activity Recognition.

    I. INTRODUCTION

With recent advancements in 802.11, WiFi signals have emerged as the most pervasive signals, widely employed for human activity recognition. The significant presence of WiFi access points (APs) around us unleashes an enormous resource that can be utilized for the development of cyber-physical and Internet of Things (IoT) applications [1]. The ease of access to physical layer parameters has motivated researchers to leverage WiFi as a powerful sensing tool, which can be exploited to precisely recognize multiple human activities [2], [3]. This implies that an AP can accurately acquire a human-scale comprehension of space and activities. In this work, we utilize channel state information (CSI) extracted from 802.11n to design a Transfer Learning based multiple Human Activity Recognition (TL-HAR) system. TL-HAR infers the state of each human by analyzing his/her influence on the CSI of WiFi signals while preserving user privacy and keeping deployment cost low [4].

Being an emerging technique, especially in complex scenarios, WiFi sensing for multiple humans poses many challenges. The most critical is to uniquely classify the influence of each person and activity on WiFi signals. This translates to the problem of extracting generic features for each activity.

Therefore, this work presents a methodology that extracts quality features from multiple-people activities while keeping their spatio-temporal correlation intact [5], [6]. Many research efforts have utilized CSI for activity recognition but have mostly extracted statistical features on each channel independently, both in the time and frequency domains [6], [7].

Perturbations caused by multiple activities are correlated, and therefore CSI extracted from multiple channels needs to be analyzed collectively to avoid loss of critical information [8]. This encouraged us to design a methodology that learns features across multiple subcarriers (channels) simultaneously. In [8], a framework that constructs CSI radio images has been proposed for localization and activity recognition. Owing to unique location-specific activities, the authors hand-engineered color and texture features to improve classification through deep learning. In contrast, our work considers multiple-people activities without specific locations, and thus defining color and texture features may not be fruitful. Therefore, TL-HAR extracts activity-based CSI and learns generic features by transforming CSI packets to images, analyzing all the subcarriers simultaneously.

In particular, this paper proposes two convolutional deep learning architectures, d-CNN (deep convolutional neural network) and i-CNN (inferred convolutional neural network), to learn quality features from CSI images of multiple human activities. Through comprehensive experiments, we show that the proposed scheme is superior to other similar approaches [9] and effectively solves the intrinsic labor-intensive problem of using hand-crafted features. Moreover, it confirms that CNNs can be designed for multiple human activity recognition even with WiFi signals.

We also identify CSI-based activity recognition as a transfer learning problem [10], because CSI is highly dependent on environmental changes and would otherwise require a large data set and extensive training to learn quality features. Importantly, we transfer the knowledge from the Inception V3 model [11] to our problem and enrich feature learning to boost accuracy with far fewer training epochs and less data. This methodology inherently mitigates noise that would otherwise require high-frequency filters to analyze CSI efficiently [7], [12]. Specifically, our algorithm identifies anomalies in a sequence of received packets. These anomalous packets are used as the x-axis and subcarriers as the y-axis to transform perturbations caused by multiple human activities into images. We consequently use a deep learning framework that extracts

quality features to detect activities with high accuracy. To summarize, the main contributions of this paper are as follows:

• We present a transfer learning based multiple human activity recognition (TL-HAR) framework. TL-HAR enriches feature learning to boost accuracy with far fewer training epochs than a standard deep neural network.

• We improve multiple human activity classification by extracting generic features using transformed images and a fifteen-layer deep convolutional neural network (d-CNN).

• We implement CSI-to-image transformation to utilize CNN and transfer learning for WiFi sensing.

• We eliminate the subcarrier-level majority voting and domain-specific feature engineering required to achieve high recognition accuracy. Instead, we obtain better performance through generic features learned with d-CNN and the transfer learning based i-CNN.

• Through detailed experiments we show that using transfer learning TL-HAR can achieve an average accuracy of 96.7% for a single link and 99.1% for multiple links.

The rest of this paper is organized as follows. We first review the related work in Section II. In Section III, we provide a brief CSI overview. The system model of TL-HAR and the methodology used are described in Section IV. The experimental details and performance evaluation are provided in Section V. In Section VI, we conclude the paper with some highlights on future work.

II. RELATED WORK

Human activity recognition (HAR) has been extensively studied in recent years, particularly to improve designs of cyber-physical systems. Traditionally, HAR was accomplished using device-based approaches such as sensors, cell phones and cameras [13]. In [14], the authors present a comparison of HAR approaches based on wearable sensors. They evaluated two standard datasets for testing the performance of multiple deep learning approaches. For the diverse activities included in their evaluation they reported a maximum accuracy of 92.21%. However, device-based approaches are subject to various sensing constraints, including distance, lighting and line of sight. To overcome these shortcomings, we employ passive WiFi sensing to classify multiple-people activities, which is an inherently low-cost, non-invasive and ubiquitous solution [7].

The access to physical layer information of WiFi has increased researchers' interest in using CSI for passive human activity recognition. The applications that employ CSI vary from coarse activities like gait analysis [5] to fine-grained gestures [15]. CSI has also been used for indoor localization [16], occupancy estimation [17] and even vehicular applications [18], [19]. However, many of these works consider a controlled scenario of single-person activity, which makes it difficult to use these systems in real multi-people environments. In [7], the authors presented multiple-people activity recognition; however, they extracted hand-crafted features that significantly depend on domain knowledge. Such a solution may work well for a specific environment or data set but cannot scale up

Fig. 1: WiFi signals reflection scenario

for a generic activity recognition system. Additionally, domain knowledge usually provides shallow features that can only be used for simple activity recognition systems. Hand-engineered features usually require sophisticated machine learning techniques to classify activities and may fail in different contexts.

To overcome these limitations, we present a deep learning method that learns high-quality features incorporating the correlation in the CSI. In [3], the authors proposed a long short-term memory (LSTM) based recurrent neural network to recognize multiple human activities. However, LSTM extracts only temporal correlations and requires subcarrier-level classification, which may remove any correlation present among the subcarriers. This motivated us to implement a methodology that uses a CNN to extract spatial correlation, as done in other applications [20]. Therefore, TL-HAR transforms CSI to images in order to exploit the CNN architecture for our problem scenario. The authors in [8] define sophisticated radio image features and use a deep neural network for localization and activity recognition. They hand-engineered color and texture features in an effort to improve classification accuracy. Our multiple-people activities do not have distinctly different color and texture features because of random locations. Therefore, we do not hand-craft any features and let the CNN learn from the whole-length time series, as otherwise some temporal information would be lost. In [21], the authors use CNN and LSTM in combination to detect unusual behavior, but they utilize computer vision techniques in contrast to our WiFi-based recognition. The drawback of deep learning models is that they require large compute resources, long training times and large data sets to achieve high accuracy [22]. In order to minimize training delays and boost classification accuracy we adhere to transfer learning [10]. Transfer learning uses knowledge from a previously trained model that can be fine-tuned for another application to give promising results, as done by other activity recognition frameworks [23], [24].

    III. CSI OVERVIEW

Human activities and movements produce multipath effects that result in channel perturbations, phase shifting and fading of the signal. To analyze the human impact on channel conditions, we can use either the received signal strength indicator (RSSI) or CSI. RSSI, though easier to obtain, lacks detailed subcarrier-level information and thus is too unstable to be utilized for the complex scenario of multiple people. CSI, on the other hand, provides both the amplitude and phase information for each MIMO subcarrier. Thus, TL-HAR leverages the CSI of WiFi to sense and classify multiple human activities, fully exploiting IEEE 802.11n/ac with MIMO supported by most existing WiFi devices. Fig. 1 shows multiple human activities and the resulting radio frequency reflections and multipath interference, which produce temporal and spatial changes in the CSI.

Let $N_T$ and $N_R$ be the number of transmit and receive antennas and $H_{(i,j)}$ represent the channel frequency response; then the narrowband flat-fading channel between the $i$th and $j$th antenna pair can be expressed as

$Y_{(i,j)} = H_{(i,j)} \ast X_{(i,j)} + N_{(i,j)}$, (1)

In Eq. (1), $X_{(i,j)}$ and $Y_{(i,j)}$ represent the transmitted and received signal vectors, where $i = 1, 2, \cdots, N_T$ and $j = 1, 2, \cdots, N_R$ respectively. $N_{(i,j)}$ is white Gaussian noise, and for a zero-mean distribution we can obtain the approximate CSI as

$\hat{H}_{(i,j)} = \dfrac{Y_{(i,j)}}{X_{(i,j)}} = |\hat{H}_{(i,j)}| \, e^{j\angle \hat{H}_{(i,j)}}$, (2)

where $|\hat{H}_{(i,j)}|$ and $\angle \hat{H}_{(i,j)}$ are the magnitude and phase of the CSI respectively.
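For illustration, a minimal NumPy sketch of the per-subcarrier CSI estimate in Eq. (2) and its magnitude/phase decomposition is given below; the symbol arrays are placeholders for one $(i, j)$ antenna pair.

```python
import numpy as np

# Placeholder symbols for one (i, j) antenna pair; real values come from
# the 802.11n CSI tool rather than being constructed like this.
x = np.array([1 + 1j, 1 - 1j, -1 + 1j])              # transmitted X(i,j)
y = np.array([0.8 + 1.1j, 1.2 - 0.7j, -0.9 + 0.8j])  # received Y(i,j)

h_hat = y / x                 # Eq. (2): CSI estimate per subcarrier
magnitude = np.abs(h_hat)     # |H(i,j)|
phase = np.angle(h_hat)       # angle of H(i,j), in radians
```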

    IV. SYSTEM DESIGN

The TL-HAR system consists of six modules, namely the CSI Collection Module, CSI-to-Image Transformation Module, Models Database, Deep Learning Module, Transfer Learning Module and Activity Classification Module. The overall system architecture of the proposed TL-HAR is shown in Fig. 2, which describes the overall process from CSI collection to activity classification for the multiple-people scenario.

    A. CSI Collection Module

The first step in the implementation of the TL-HAR system is the sensing and collection of CSI. We employ a Linksys dual-band WiFi router with three fixed built-in antennas acting as the transmitter Tx, and a laptop equipped with an Intel 5300 NIC serving as the Rx. This setup enables the TL-HAR system to collect ICMP packets with $S_b$ subcarriers for each Tx-Rx antenna pair. Our system continuously samples ICMP packets at a fixed rate and then aggregates the CSI extracted from these packets to recognize any activity. TL-HAR analyzes multiple Tx-Rx links with their aggregated subcarriers to detect high variance and considers it a potential human activity or activities. In general, CSI is very sensitive to environmental factors, and there is a high probability of noise blended with the CSI, which hinders efficient recognition. Therefore, in TL-HAR, instead of using a single MIMO subcarrier, we propose the variance of aggregated subcarriers as an indicator of abnormal activity, as described below.

1) Abnormal Environment Detection (AED): To analyze CSI for an activity, we propose the AED algorithm, which exploits the variance amongst the filtered subcarriers for activity identification. We implement a second-order low-pass Butterworth filter to remove high-frequency noise from the CSI. For this, we set the packet sampling rate to $F_s = 80$ packets/s and the normalized cut-off frequency to $w_n = 2\pi f / F_s = 0.025\pi$ rad/s. Human activities and movements consist of low-frequency components and produce multipath effects that result in perturbations of each subcarrier depending on the combination of activities, as shown in Fig. 3. However, subcarriers may be influenced differently and undergo correlated variations. Therefore, TL-HAR exploits this correlation using image transformation and uses only the variance of aggregated subcarriers for sensing an activity.
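As a sketch of this filtering step, the second-order low-pass Butterworth filter can be built with SciPy; the 1 Hz cut-off below follows from the stated $w_n = 2\pi f / F_s = 0.025\pi$ with $F_s = 80$ packets/s, and `csi_stream` is a placeholder for one subcarrier's amplitude sequence.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 80.0                                  # packet sampling rate (packets/s)
f_cut = 0.025 * np.pi * fs / (2 * np.pi)   # = 1 Hz, from w_n = 0.025*pi

b, a = butter(N=2, Wn=f_cut, btype='low', fs=fs)  # 2nd-order Butterworth
csi_stream = np.random.randn(6000)                # placeholder amplitude stream
filtered = filtfilt(b, a, csi_stream)             # zero-phase low-pass filtering
```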

Through several experiments, we are convinced that the variance between subcarriers changes significantly in the presence of human activities. This is clearly shown in Fig. 3, with three marked deviations representing three activities. Therefore, AED identifies and extracts CSI using the variance of aggregated subcarriers rather than individual subcarriers. For packet $p$, we can represent the variance between the subcarriers of a single MIMO link as $\nu_p$. This implies that for a total of $P$ received packets, the variances will be $\nu = \{\nu_p, \nu_{p+1}, \cdots, \nu_P\}$. We first remove offsets among the subcarriers by normalizing $\nu_p$ using Eq. (3),

$\omega_p = \dfrac{\nu_p - \mu_p}{\sigma_p}$. (3)

We then compare $\omega_p$ with an experimentally determined threshold $\tau_{th} = C \cdot \mathrm{std}(\omega)$ to identify an anomaly (where $C = 1.5$ is determined empirically). AED considers the environment abnormal if the variance $\omega_p$ is higher than $\tau_{th}$. This sets $z_p$ to 1, and to 0 otherwise, representing a normal environment. We can express this mathematically as

$z_p = \begin{cases} 1, & \omega_p \ge \tau_{th} \\ 0, & \text{otherwise} \end{cases}$ (4)

Even after filtering high-frequency noise, we find a few outliers in the activity-based CSI. To remove false alarms due to these noisy spikes, originating from unknown external factors, we use another level of filtering. Let $s_p$ represent the summation of $z_p$ between $p-\epsilon$ and $p+\epsilon$, where $\epsilon$ is a small interval around the packet with index $p$:

$s_p = \sum_{q=p-\epsilon}^{p+\epsilon} z_q$ (5)

If AED calculates an $s_p$ that is smaller than the empirical threshold $\eta_{th}$, then TL-HAR discards the anomaly and considers it a false alarm due to noise. In our implementation, we assume that any activity lasts at least 1 s and set $\eta_{th} = 5$ and $\epsilon = 1$ s. AED successfully recognizes an abnormal environment with an average accuracy of 97%. The pseudo code for AED is given in Algorithm 1.

Fig. 2: System architecture of TL-HAR (CSI Collection Module with packet-based CSI sampling and abnormal environment detection; CSI-to-Image Transformation Module with CSI alignment, CSI packets aggregation, CSI-to-image conversion, and image normalization and resizing; Models Database; Deep Learning Module (d-CNN) or Transfer Learning Module (i-CNN), chosen according to whether the application is time critical; and Activity Classification Module with a fully connected layer and softmax classification over activities 1..N)

Fig. 3: Three perturbations shown by a single subcarrier (magnitude in dB versus packet index, showing the unfiltered signal, filtered signal and AED detection)

Algorithm 1: Abnormal Environment Detection (AED)

procedure actDetect(data, τth)
    ν ∼ N(µ, σ²)
    for p = 1 to len(ν) do
        ωp = (νp − µ)/σ
        if ωp ≥ τth then zp = 1
        else zp = 0
    end for
    for p = 1 to len(ν) do
        sp = Σ_{q=p−ϵ}^{p+ϵ} zq
        if sp < ηth then zp = 0
        else zp = 1
    end for
end procedure
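For concreteness, a minimal NumPy rendering of Algorithm 1 is sketched below under our reading of the pseudo code; `nu` is the precomputed per-packet variance across the subcarriers of one MIMO link, and the window of 80 packets corresponds to $\epsilon = 1$ s at 80 packets/s.

```python
import numpy as np

def aed(nu, c=1.5, eta_th=5, eps=80):
    """Abnormal Environment Detection sketch.

    nu:     per-packet variance across subcarriers (1-D array)
    c:      empirical multiplier for the threshold tau_th (C = 1.5)
    eta_th: minimum count of flagged packets around p (eta_th = 5)
    eps:    half-window in packets (1 s at 80 packets/s)
    """
    omega = (nu - nu.mean()) / nu.std()       # Eq. (3): remove offsets
    tau_th = c * omega.std()                  # tau_th = C * std(omega)
    z = (omega >= tau_th).astype(int)         # Eq. (4): flag anomalous packets
    # Eq. (5): windowed sum s_p suppresses isolated noisy spikes
    s = np.convolve(z, np.ones(2 * eps + 1, dtype=int), mode='same')
    return np.where(s < eta_th, 0, 1)         # discard false alarms
```

Packet windows flagged by this sketch would then be forwarded to the CSI-to-Image Transformation Module.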

    B. CSI-to-Image Transformation Module

The CSI Collection Module forwards any CSI data that is sensed as an abnormal activity by the AED algorithm.

1) CSI Alignment: This submodule aligns the activity-based CSI streams for all Tx−Rx antenna pairs and removes all packets for which CSI is not calculated by the Intel NIC 5300 tool. For each Tx−Rx pair, the TL-HAR system defines a matrix $H_{an}$ containing the CSI of $N_a$ consecutive packets for the abnormal activity. This implies that each $H_{an}$ is essentially an $N_a \times S_b$ dimensional matrix holding the CSI values of $N_a$ successive data packets for each of the $S_b$ subcarriers. Mathematically, $H_{an}$ can be described by the following equation

$H_{an} = [H_a(s) \mid H_a(s+1) \mid \cdots \mid H_a(s+N_a)]^T$, (6)

where in Eq. (6) $H_a(s)$ represents the CSI values of all the subcarriers of the first packet identified by the AED algorithm, having a dimension of $1 \times S_b$.
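A small sketch of this alignment step under the paper's notation: stacking the $N_a$ per-packet CSI rows yields the $N_a \times S_b$ matrix $H_{an}$ of Eq. (6). The list `csi_rows` is hypothetical; the Intel 5300 tool reports $S_b = 30$ subcarriers per antenna pair.

```python
import numpy as np

S_b, N_a = 30, 200   # subcarriers per link; anomalous packets found by AED
# Hypothetical per-packet CSI rows Ha(s), Ha(s+1), ..., each of length S_b
csi_rows = [np.random.randn(S_b) + 1j * np.random.randn(S_b)
            for _ in range(N_a)]
H_an = np.vstack(csi_rows)   # Eq. (6): N_a x S_b matrix of CSI values
```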

2) CSI-to-Image Conversion: This submodule converts activity-based CSI to images in an effort to use a CNN, whose architecture exploits the spatial as well as temporal information embedded therein. Knowing that an image can be represented by a 2D matrix, we obtain a CSI image by transforming all the anomalous packets identified by AED. To be exact, we take $H_{an}$ and transform it to an image such that the horizontal axis represents the sequence of packets and the vertical axis represents the subcarrier index. Unlike other methods, where anomalous packets need to be truncated and matched, this transformation allows TL-HAR to utilize all the anomalous packets simultaneously while exploiting subcarrier correlation. Mathematically, the magnitude and phase of $H_{an}$ are represented by Eq. (7) and Eq. (8) respectively:

$C_a = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,S_b} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,S_b} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N_a,1} & a_{N_a,2} & \cdots & a_{N_a,S_b} \end{bmatrix}$, (7)

$C_p = \begin{bmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,S_b} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,S_b} \\ \vdots & \vdots & \ddots & \vdots \\ p_{N_a,1} & p_{N_a,2} & \cdots & p_{N_a,S_b} \end{bmatrix}$. (8)

Consequently, we obtain CSI magnitude and phase images for each anomaly, as shown in Fig. 4, by utilizing the transformations $I_a = F_{m2i}\{C_a\}$ and $I_p = F_{m2i}\{C_p\}$. Each entry $(i, j)$ of $I_a$ and $I_p$ represents the value of a particular packet and subcarrier, i.e., a pixel. The implementation details of $F_{m2i}$ are given in [25].

Fig. 4: CSI images (row 1: CSI-amplitude and row 2: CSI-phase) for multiple classes of human activities

3) Image Normalization and Resizing: A CNN learns continuously through training by adding to its weight matrices the product of the learning rate and the gradient error vectors computed from backpropagation. If the input images are not normalized, the ranges of the feature value distributions would differ markedly, and corrections based on the learning rate would thus differ proportionately. In order to avoid complex compensations, we normalize all the images and further resize them to $32 \times 32 \times 3$, so that we can compare performance with the transfer learning based model, which mandates input images of size $32 \times 32 \times 3$.
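A minimal sketch of the conversion, normalization and resizing pipeline, assuming the matrix-to-image mapping $F_{m2i}$ applies a colormap, as the mat2im utility [25] does; the jet colormap and Pillow-based resizing are our assumptions.

```python
import numpy as np
from matplotlib import cm
from PIL import Image

def csi_to_image(C, size=(32, 32)):
    """Map an Na x Sb CSI magnitude matrix to a normalized 32x32x3 image."""
    C = C.T                                           # subcarriers vertical, packets horizontal
    C = (C - C.min()) / (C.max() - C.min() + 1e-9)    # normalize to [0, 1]
    rgb = cm.jet(C)[..., :3]                          # colormap -> RGB, like mat2im
    img = Image.fromarray((rgb * 255).astype(np.uint8)).resize(size)
    return np.asarray(img, dtype=np.float32) / 255.0  # normalized CNN input

image = csi_to_image(np.abs(np.random.randn(200, 30)))  # placeholder C_a
```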

    C. Deep Learning Module

Typical human activities may last from a few to several seconds and thus show characteristic temporal and spatial structure. Recently, CNNs have been actively employed to model this type of structure and consequently learn effective features for the representation of activities. In this work, we propose CNN-based architectures for classifying CSI images that essentially incorporate multiple-people activities with varied and long-term temporal dynamics. We demonstrate that the d-CNN architecture improves the accuracy of multiple human activity recognition without designing hand-engineered features with domain knowledge.

The recent trend of using deep neural systems for modeling complex problems started from CNNs like AlexNet [26]. A typical CNN is designed to resemble the human visual cortex, with multiple linear and nonlinear functions. Linearity is implemented using convolution operations, while non-linearity captures more complex data representations. A simple CNN can be represented by the following equation:

$y = F(\Psi, x) = g_n(W_n g_{n-1}(\cdots(g_2(W_2 g_1(W_1 x + b_1) + b_2))\cdots) + b_n)$. (9)

In Eq. (9), $x$ represents the input, $y$ is the output, $g_i$ is a nonlinear function, $W_i$ represents the convolution matrix of the $i$th layer, $b_i$ is the bias of the $i$th convolution layer, and $\Psi$ represents the set of all tunable parameters, including $W_i$ and $b_i$. The weights and biases of the model are updated with the equations given below:

$\Delta W_i(t+1) = -\dfrac{x\lambda}{r} W_i - \dfrac{x}{n}\dfrac{\partial c}{\partial W_i} + m\,\Delta W_i(t)$, (10)

$\Delta b_i(t+1) = -\dfrac{x}{n}\dfrac{\partial c}{\partial b_i} + m\,\Delta b_i(t)$, (11)

where in Eqs. (10) and (11) $\lambda$, $n$, $r$, $c$ and $m$ represent the regularization parameter, total number of images, learning rate, cost function and momentum respectively. $\lambda$ is introduced to prevent over-fitting to the input images. The learning rate controls how fast the network learns during training, and momentum aids convergence. We set $r = 0.002$ and use all other parameters with the default values set by the Nadam optimizer used in our evaluation [27].
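To make the update rule concrete, a literal NumPy transcription of Eq. (10) for one layer is sketched below; the values of $x$, $\lambda$, $n$ and $m$ are illustrative placeholders, since the text delegates these details to Nadam.

```python
import numpy as np

def update_weights(W, dW_prev, grad_W, x=0.002, lam=1e-4, r=0.002, n=1000, m=0.9):
    """One momentum step per Eq. (10); x, lam, n, m are illustrative values."""
    dW = -(x * lam / r) * W - (x / n) * grad_W + m * dW_prev
    return W + dW, dW

W = np.zeros((3, 3))
W, dW = update_weights(W, dW_prev=np.zeros_like(W), grad_W=np.ones_like(W))
```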

The proposed d-CNN architecture shown in Fig. 5 consists of five different types of layers: (1) convolutional layer, (2) batch normalization layer, (3) activation layer, (4) pooling layer, and (5) softmax layer as a fully connected layer. The abstract details of these layers and the architecture are as follows:

1) Convolutional layer: This layer consists of multiple kernels (filters) that slide across the CSI images. A kernel is essentially a matrix that is convolved with the input images, and the stride controls the extent to which the filter moves across the image. The convolution is performed using Eq. (12), where $x$ is the input image, $h$ is the filter, $N$ is the number of pixels in $x$, and the subscript denotes the $n$th pixel of the input CSI image:

$y_j = \sum_{n=0}^{N-1} x_n h_{j-n}$. (12)
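Eq. (12) is the standard 1-D discrete convolution, which `numpy.convolve` computes directly; the short check below verifies one output sample against the sum.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # input samples x_n
h = np.array([0.5, 0.25])            # filter taps h

y = np.convolve(x, h)                # y_j = sum_n x_n * h_{j-n}  (Eq. 12)
# Check y_1 by hand: x_0*h_1 + x_1*h_0 = 1*0.25 + 2*0.5 = 1.25
assert np.isclose(y[1], x[0] * h[1] + x[1] * h[0])
```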

2) Batch Normalization layer: In [28], the authors found that the change in the distribution of network activations, called internal covariate shift, was one of the main culprits of slow training. To overcome this limitation of d-CNN, the CSI training dataset is partitioned into mini-batches of size 64 and batch normalization is performed as shown in Eq. (13) to improve training and consequently minimize over-fitting.

Fig. 5: d-CNN architecture used for multiple human activity recognition (CSI images resized and normalized from 256×256×3 to 32×32×3, followed by three stages of convolution, normalization, activation and max pooling, and finally a fully connected layer with softmax)

$\mu_B = \dfrac{1}{z}\sum_{i=1}^{z} x_i, \qquad \sigma_B^2 = \dfrac{1}{z}\sum_{i=1}^{z} (x_i - \mu_B)^2,$

$\hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2}}, \qquad y_i \leftarrow \gamma \hat{x}_i + \beta \equiv Bn_{\gamma,\beta}(x_i),$ (13)

where $\gamma$ and $\beta$ are learnable parameters, and $Bn$ and $z$ denote batch normalization and the batch size. A typical layer of d-CNN with batch normalization can be represented by

$y = g(Bn(Wx))$. (14)

3) Activation layer: We incorporate the rectified linear unit (ReLU), i.e., $f(x) = \max(x, 0)$, in our implementation of d-CNN. This is one of the most widely used activations in modern deep learning architectures. The objective of the d-CNN architecture is then to find an optimal parameter set $\Psi$ that, over $J$ inputs, minimizes the empirical loss given by

$\sum_{j=1}^{J} L(y_j, F(\Psi, x_j))$. (15)

In Eq. (15), $L$ denotes the cross-entropy loss for our multiple-activities classification problem, and $x_j$ and $y_j$ represent the $j$th input and output respectively. Considering that all nonlinear and loss functions are differentiable, the network training in Eq. (15) can be computed by an error back-propagation method [29].

4) Pooling layer: This down-sampling layer decreases the computational complexity and over-fitting by reducing the dimensions of the output neurons from the convolutional layer. We incorporate max-pooling in our implementation, which selects the maximum value in each feature map output from the convolutional layer.

5) Softmax layer: This layer is fully connected to all the activations in the previous layer and computes the probability distribution over the $K$ classes using

$p_k = \dfrac{e^{x_k}}{\sum_{k'=1}^{K} e^{x_{k'}}} \quad \forall k \in [1, K]$ (16)

The proposed d-CNN architecture includes three convolutional layers, four batch normalization layers, four activation layers, three max-pooling layers, and one fully connected layer (a fifteen-layer deep CNN). The numbers of filters used in the first, second and third convolutional layers are 8, 16 and 32 respectively. The stride is set to (2,2) for convolution of the CSI images and the max-pooling operations are also set to (2,2). For a training set of size $x_T$ where each image is of size $32\times32\times3$, the output shape after the first convolutional layer of d-CNN is $(x_T, 30, 30, 8)$. It is followed by shapes of $(x_T, 15, 15, 8)$ after the first max-pooling layer, $(x_T, 13, 13, 16)$ after the second convolutional layer, $(x_T, 6, 6, 16)$ after the second max-pooling layer, $(x_T, 4, 4, 32)$ after the third convolutional layer, $(x_T, 2, 2, 32)$ after the third max-pooling layer, and $(x_T, 9)$ after the softmax layer, representing the learned probabilities for each class of the TL-HAR system.
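A Keras sketch that reproduces the reported output shapes is given below. The 3×3 kernels with stride 1 are inferred from the shape sequence (32→30→15→13→6→4→2), and the placement of the fourth batch-normalization/activation pair before the fully connected layer is our assumption.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Activation, BatchNormalization, Conv2D,
                                     Dense, Flatten, MaxPooling2D)
from tensorflow.keras.optimizers import Nadam

model = Sequential()
# Three Conv -> BatchNorm -> ReLU -> MaxPool stages with 8, 16, 32 filters
for i, filters in enumerate([8, 16, 32]):
    kwargs = {'input_shape': (32, 32, 3)} if i == 0 else {}
    model.add(Conv2D(filters, (3, 3), **kwargs))   # 3x3 kernels, stride 1
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(BatchNormalization())            # assumed 4th BN/activation pair
model.add(Activation('relu'))
model.add(Dense(9, activation='softmax'))  # nine activity classes
model.compile(optimizer=Nadam(learning_rate=0.002),
              loss='categorical_crossentropy', metrics=['accuracy'])
# model.summary(): (30,30,8) -> (15,15,8) -> (13,13,16) -> (6,6,16)
#                  -> (4,4,32) -> (2,2,32) -> (9,)
```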

    D. Transfer Learning Module

This module exploits the fact that deep learning architectures need not necessarily be trained from scratch. Training deep neural networks requires extensive compute, data and storage. To combat this, transfer learning utilizes existing trained models and fine-tunes them on the new dataset to achieve reasonably good accuracy. The overall process of transfer learning adopted for the TL-HAR framework is shown in Fig. 6. In this paper, we use a pretrained model named Inception V3 [11], developed by the Google Brain Team for image classification on datasets like ImageNet. This model is widely known and is used in different transfer learning applications. Inception V3 is trained on the 1000-category ImageNet images and performs very well [26] on image classification problems. We present i-CNN, which transfers the knowledge from Inception V3 to our CSI-based image dataset. It allows model creation with significantly reduced training data and time. It may not be efficient for all application domains, but is surprisingly accurate for our CSI-based multi-human activity recognition.

The reason that fine-tuning, i.e., final-layer retraining, works on new CSI images is that the model has been trained aggressively to distinguish between all 1000 classes in ImageNet. We demonstrate that extracting features from Inception V3 and then fine-tuning outperforms a typical small-sized d-CNN and

Fig. 6: i-CNN process showing transfer learning for multiple human activity recognition (a random neural network is trained on ImageNet to obtain a pre-trained network with ImageNet weights, which is then fine-tuned on CSI images to yield the trained network)

can be used in practice to recognize multiple human activities with high accuracy.
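A hedged Keras sketch of the i-CNN fine-tuning step: ImageNet weights are loaded into Inception V3, the convolutional base is frozen, and only a new final classifier is retrained on the CSI images. The 75×75 input is our assumption (the minimum the Keras implementation accepts), so the 32×32 CSI images would be upscaled accordingly.

```python
from tensorflow.keras import Model
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D

base = InceptionV3(weights='imagenet', include_top=False,
                   input_shape=(75, 75, 3))      # assumed input size
base.trainable = False                           # keep ImageNet features fixed

x = GlobalAveragePooling2D()(base.output)
out = Dense(9, activation='softmax')(x)          # retrained final layer
i_cnn = Model(base.input, out)
i_cnn.compile(optimizer='nadam', loss='categorical_crossentropy',
              metrics=['accuracy'])
```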

    E. Activity Classification Module and Models Database

This module employs the softmax layer as a fully connected layer to calculate the probability of each activity class. Using this module, TL-HAR obtains the percentage accuracy for all the activities. Lastly, we use the Models Database to store all the activity-based CSI and associated CSI images identified by AED.

    V. IMPLEMENTATION AND EVALUATION

    A. Experimental Setup

We set up an environment using commercially available off-the-shelf WiFi devices. We use a standard laptop with a plugged-in Intel WiFi NIC 5300 (wireless network adapter) to collect CSI. This serves as the receiver in our implementation. We configure a Linksys EA4500 dual-band router as the AP (transmitter). The AP operates in the 2.4 GHz frequency band and has three embedded antennas. This gives us access to multiple MIMO streams with enhanced spatial diversity. On the software side, we use the Linux Ubuntu 12.04 LTS operating system with a modified kernel [30] and a wireless driver for the Intel 5300 card. This setup enables us to capture CSI exhibiting the characteristics of the wireless link at the subcarrier level. We finally use MATLAB R2018a and the Keras API with TensorFlow, running on a GeForce GTX 1080 Ti GPU, to implement the neural networks for recognition of multiple human activities.

    B. Data Collection

For performance analysis, we collect CSI in a lab setting with 10 students, including males and females. We instruct these subjects about the type, location and duration of the activity to be performed, in order to collect accurate data. We choose three diverse activities to be performed by single and multiple subjects in our evaluation: run, walk and hands movement. We choose two- and three-activity combinations randomly to keep the data collection phase practical. We believe that increasing the number of people and activities will not affect the performance of TL-HAR considerably. The combinations of activities included in our evaluation are Run, Walk, HandsMove, Walk-Walk, Run-Walk, HandsMove-Walk, Walk-HandsMove-Walk, Run-Walk-HandsMove and Walk-Walk-Walk respectively. We set the sampling rate to 80 pkts/s, as we believe that the frequency of human activities is below 10 Hz [31]. We constitute a dataset that includes 240, 294, 270, 78, 150, 162, 174, 168 and 168 samples of Run, Walk, HandsMove, Walk-Walk, Run-Walk, HandsMove-Walk, Walk-HandsMove-Walk, Run-Walk-HandsMove and Walk-Walk-Walk, as detected by the AED algorithm. We collected more data for the single-person scenario, as collecting data for multiple people is time consuming and cumbersome. We split the data evenly to obtain training and testing datasets. To be exact, the training dataset comprises 120, 148, 136, 40, 76, 82, 88, 84 and 84 samples, and the testing dataset consists of 120, 146, 134, 38, 74, 80, 86, 84 and 84 samples respectively.

    C. Classification Accuracy

We evaluate the TL-HAR system from multiple viewpoints. To gain insight, we compare the accuracies obtained from two different modules, namely deep learning and transfer learning. We draw strong inferences about the accuracy of multiple-people activities and the insights of these modules while evaluating the trade-offs. We also compare with the traditional machine learning classifiers presented in our earlier work [9].

1) Accuracy of Deep Learning Module: d-CNN learns features based on the input CSI images and its network architecture. We design a 15-layer architecture for extracting quality features from the multiple-people activities dataset. In doing so, we not only decrease overall complexity by eliminating the preprocessing and subcarrier-level majority voting stages but also achieve very high accuracy for all the activities. The classification results for all the activities are shown in the confusion matrices of Fig. 7.

Fig. 7a shows the probabilities of the true class versus the predicted class. It is worth mentioning that class 1 to class 9 represent the activities in the order Run, Walk, HandsMove, Walk-Walk, Run-Walk, HandsMove-Walk, Walk-HandsMove-Walk, Run-Walk-HandsMove and Walk-Walk-Walk respectively. For example, the probability of correctly predicting

  • � � � � � � � � �

    ���������������

    ����������

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���

    ���

    ���

    ���

    ���

    (a) Normalized confusion matrix.

    � � � � � � � � �

    ���������������

    ����������

    ��� � � � � � � � �

    � ��� � � � � � � �

    � � ��� � � � � � �

    � � � �� � � � � �

    � � � � �� � � � �

    � � � � � �� � � �

    � � � � � � �� � �

    � � � � � � � �� �

    � � � � � � � �� ��

    ��

    ��

    ��

    ��

    ���

    ���

    (b) Confusion matrix.

    Fig. 7: Confusion matrices of d-CNN

the run activity is 99%, and that of walk-walk-walk is 77%. Similarly, Fig. 7b shows the total number of samples included and predicted correctly from the test dataset. For example, for the run activity, out of 120 samples d-CNN correctly classifies 119, and 1 sample is incorrectly classified as walk-walk. In a nutshell, d-CNN provides high accuracy on CSI images without defining any hand-crafted features or additional signal processing modules. This implies the d-CNN architecture is robust in extracting features for multiple-people activities. Also, this method is more scalable, and its architecture can be modified and trained with larger datasets to realize even higher classification accuracy. We also plot classification accuracy and loss against the number of epochs in Fig. 8. From these results, we conclude that d-CNN achieves an accuracy of 90% or more only after 200 epochs. This essentially means that d-CNN needs to be trained for a large number of epochs before it can be used for classification with good accuracy. In order to overcome this, we resort to transfer learning, which can provide better accuracy with only a few training epochs.

2) Accuracy of Transfer Learning Module: To minimize training epochs and still achieve high accuracy, we exploit transfer learning, which uses pre-trained weights to learn an efficient representation of multiple classes. As explained in Section IV-D, we use Inception V3 to get pre-trained weights and fine-tune its architecture using our CSI image dataset.

Fig. 8: Classification results of d-CNN ((a) classification accuracy and (b) classification loss versus the number of epochs, for training and testing)

The classification results obtained from the transfer learning based CNN (i-CNN) are shown in Fig. 9.

Fig. 9a shows that this architecture learns very fast and can achieve a testing accuracy of greater than 90% with as few as 50 training epochs. The testing accuracy is more stable and higher in comparison to the d-CNN accuracy. The average accuracy obtained using i-CNN after 300 epochs is 96.7%. Also, the training loss shown in Fig. 9b has a much steeper slope than that of d-CNN, exhibiting faster weight learning by i-CNN. Fig. 9c shows the probabilities of the true class versus the predicted class, translating to classification accuracies of 99.2%, 97.9%, 98.5%, 97.4%, 86.5%, 97.5%, 100%, 92.9% and 100% for Run, Walk, HandsMove, Walk-Walk, Run-Walk, HandsMove-Walk, Walk-HandsMove-Walk, Run-Walk-HandsMove and Walk-Walk-Walk. We also consider CSI image data based on two MIMO links and find that the average accuracy improves further to 99.1%, with the probabilities for each class shown in Fig. 9d. This essentially proves the power of transfer learning and the benefit of having more spatial data for better classification of multiple human activities.

3) Comparison of Classification Modules: In this section, we compare the performance of the deep learning and transfer learning architectures. We also compare with the accuracies obtained from standard machine learning classifiers. As in [9], we engineered the features and implemented fine KNN (f-KNN), Gaussian Support Vector Machine (g-SVM) and weighted KNN (w-KNN) for the multiple-people activities dataset. We conclude that machine learning models require carefully hand-crafted features, which can be a painstaking procedure. The deep learning model (d-CNN) requires more training epochs and a sophisticated network design in order to achieve high accuracy. The transfer learning based i-CNN model appears most optimal and guarantees the overall best performance, especially for multiple links. The detailed comparison and accuracies of all five classifiers are tabulated in Table I. It shows that i-CNN obtains the highest average accuracy of 96.7% for a single MIMO link when trained for 300 epochs. The performance of d-CNN is comparable to w-KNN but falls short of the f-KNN and g-SVM classifiers. However, d-CNN has the advantage of extracting features on its own, which can outperform other machine learning classifiers for diverse applications. We also include the classification accuracy obtained from two MIMO links. This essentially gives twice the data, with more spatial and correlated information. We observe that for multiple links the d-CNN and i-CNN deep learning classifiers show percentage improvements of 3.37% and 2.48%, in contrast to only 1.49%, 0.86% and 2.41% for f-KNN, g-SVM and w-KNN. This confirms that the proposed deep learning based solutions scale much better to larger datasets. We also compare the accuracy of i-CNN and d-CNN as a function of training epochs. The accuracy plot for i-CNN and d-CNN using different numbers of epochs is shown in Fig. 10. We observe that d-CNN achieves an average accuracy of 78% at 50 epochs and 91.94% when trained for 300 epochs. In contrast, i-CNN obtains average accuracies of 93.5% and 96.7% at 50 and 300 epochs respectively. This validates that d-CNN has a higher percentage improvement of 15.16% in

  • � �� ��� ��� ��� ��� ���

    ����������������

    ��

    ��

    ��

    ��

    ��

    ��

    ��

    ��������������

    �����������������

    ����������������

    (a) Classification acc. of i-CNN.

    � �� ��� ��� ��� ��� ���

    ����������������

    ����

    �������������

    ������������

    (b) Classification loss of i-CNN.

    � � � � � � � � �

    ���������������

    ����������

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���

    ���

    ���

    ���

    ���

    ���

    (c) Confusion matrix (1 link).

    � � � � � � � � �

    ���������������

    ����������

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���� ���� ���� ���� ���� ���� ���� ���� ����

    ���

    ���

    ���

    ���

    ���

    ���

    (d) Confusion matrix (2 links).

    Fig. 9: Classification results (accuracy and loss) and confusion matrices of i-CNN

Table I: Classification accuracy (%) of multiple activities for different links

                      |            Single Link             |           Multiple Links
Activities            | f-KNN  g-SVM  w-KNN  d-CNN  i-CNN  | f-KNN  g-SVM  w-KNN  d-CNN  i-CNN
Run                   |  98.3   98.3  100     99.2   99.2  |  96.7   95.4   95.8  100    100
Walk                  |  95.9   94.5   96.6   90.4   97.9  |  93.8   92.8   93.2   99.0  100
HandsMove             |  97.8  100     94.0   97.8   98.5  |  97.4   97.0   95.5   96.6  100
Walk-Walk             |  84.2   81.6   73.7   86.8   97.4  |  86.8   77.6   85.5   88.2   93.4
Run-Walk              |  83.8   83.8   83.8   90.5   86.5  |  93.9   92.6   94.6   82.4   99.3
HandsMove-Walk        |  92.5   86.3   85.0   92.5   97.5  |  93.1   95.6   89.4   94.4  100
Walk-HandsMove-Walk   |  95.4   96.5   94.2   97.7  100    |  95.4   96.5   94.2  100    100
Run-Walk-HandsMove    |  95.2   97.6   94.0   95.2   92.9  |  98.8   98.8   93.5   95.8   99.4
Walk-Walk-Walk        | 100    100    100     77.4  100    | 100    100    100     98.8  100
Avg.                  |  93.7   93.2   91.3   91.9   96.7  |  95.1   94.0   93.5   95.0   99.1

Fig. 10: Effect of epochs on the classification accuracy of multiple activities for i-CNN and d-CNN (per-activity accuracy for each classifier at 50, 150 and 300 epochs)

contrast to only 3.3% for the i-CNN classifier. However, the downside is that d-CNN requires more training time in comparison to i-CNN, which is more time-efficient. We also show the classification accuracy obtained without fine-tuning the Inception V3 model on our CSI image dataset (wf-CNN) in Fig. 11a. This shows the importance of fine-tuning when using transfer learning; otherwise the accuracy suffers severely. Finally, we include a comparison of the machine learning classifiers against the number of features. We show in Fig. 11b that increasing the number of hand-crafted features does not always increase the accuracy of the system. This validates deep learning models, which learn features from the data automatically and are more effective than machine learning classifiers.

    VI. CONCLUSION

In this paper we present TL-HAR, a transfer learning based multiple human activity recognition framework that accurately classifies multiple activities. TL-HAR captures perturbations caused by human activities through an algorithm that works on the variance of aggregated MIMO subcarriers and extracts the CSI for that duration. The proposed framework transforms CSI perturbations to images in an effort to exploit the subcarrier correlations altogether. We highlighted the limitations of machine learning based methods and their inefficiencies in learning the complex scenario of multiple human activities. We showed that using a

Fig. 11: Classification accuracy with a varying number of epochs and features ((a) accuracy versus number of epochs for i-CNN, d-CNN and wf-CNN; (b) accuracy versus number of integrated features for f-KNN, g-SVM and w-KNN)

deep learning architecture, we can eliminate the pre-processing and majority voting stages that are integral components of machine learning methods for achieving good accuracy. We present d-CNN, a fifteen-layer CNN architecture, that extracts representative features to achieve an accuracy of 91.9% for a single link and 95% for multiple links. However, since d-CNN requires many training epochs, we overcome this using transfer learning with the Inception V3 model. We showed that by using the weights of a pre-trained model and fine-tuning with our CSI image dataset, we can achieve an average accuracy of 96.7% for a single link and 99.1% for multiple links. TL-HAR provides multiple trade-offs for different application types to balance the accuracy and complexity of the overall system.

    REFERENCES

[1] Q. Do, B. Martini, and K.-K. R. Choo, “Cyber-physical systems information gathering: A smart home case study,” Computer Networks, 2018.

[2] N. Yu, W. Wang, A. X. Liu, and L. Kong, “Qgesture: Quantifying gesture distance and direction with wifi signals,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 2, no. 1, p. 51, 2018.

[3] C. Feng, S. Arshad, R. Yu, and Y. Liu, “Evaluation and improvement of activity detection systems with recurrent neural network,” in Communications (ICC), 2018 IEEE International Conference on. IEEE, 2018.

[4] Y. Wang, J. Liu, Y. Chen, M. Gruteser, J. Yang, and H. Liu, “E-eyes: device-free location-oriented activity identification using fine-grained wifi signatures,” in Proceedings of the 20th Annual International Conference on Mobile Computing and Networking. ACM, 2014, pp. 617–628.

[5] W. Wang, A. X. Liu, and M. Shahzad, “Gait recognition using wifi signals,” in Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 2016, pp. 363–373.

[6] W. Wang, A. X. Liu, M. Shahzad, K. Ling, and S. Lu, “Device-free human activity recognition using commercial wifi devices,” IEEE Journal on Selected Areas in Communications, vol. 35, no. 5, pp. 1118–1131, 2017.

[7] C. Feng, S. Arshad, and Y. Liu, “Mais: Multiple activity identification system using channel state information of wifi signals,” in International Conference on Wireless Algorithms, Systems, and Applications. Springer, 2017, pp. 419–432.

[8] Q. Gao, J. Wang, X. Ma, X. Feng, and H. Wang, “Csi-based device-free wireless localization and activity recognition using radio image features,” IEEE Transactions on Vehicular Technology, vol. 66, no. 11, pp. 10346–10356, 2017.

[9] S. Arshad, C. Feng, Y. Liu, Y. Hu, R. Yu, S. Zhou, and H. Li, “Wi-chase: A wifi based human activity recognition system for sensorless environments,” in A World of Wireless, Mobile and Multimedia Networks (WoWMoM), 2017 IEEE 18th International Symposium on. IEEE, 2017, pp. 1–6.

[10] D. H. Hu, V. W. Zheng, and Q. Yang, “Cross-domain activity recognition via transfer learning,” Pervasive and Mobile Computing, vol. 7, no. 3, pp. 344–358, 2011.

[11] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.

[12] K. Ali, A. X. Liu, W. Wang, and M. Shahzad, “Keystroke recognition using wifi signals,” in Proceedings of the 21st Annual International Conference on Mobile Computing and Networking. ACM, 2015, pp. 90–102.

[13] J. Lei, X. Ren, and D. Fox, “Fine-grained kitchen activity recognition using rgb-d,” in Proceedings of the 2012 ACM Conference on Ubiquitous Computing. ACM, 2012, pp. 208–211.

[14] F. Li, K. Shirahama, M. Nisar, L. Köping, and M. Grzegorzek, “Comparison of feature learning methods for human activity recognition using wearable sensors,” Sensors, vol. 18, no. 2, p. 679, 2018.

[15] Y. Ma, G. Zhou, S. Wang, H. Zhao, and W. Jung, “Signfi: Sign language recognition using wifi,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 2, no. 1, p. 23, 2018.

[16] S. Palipana, P. Agrawal, and D. Pesch, “Channel state information based human presence detection using non-linear techniques,” in Proceedings of the 3rd ACM International Conference on Systems for Energy-Efficient Built Environments. ACM, 2016, pp. 177–186.

[17] S. Depatla, A. Muralidharan, and Y. Mostofi, “Occupancy estimation using only wifi power measurements,” IEEE Journal on Selected Areas in Communications, vol. 33, no. 7, pp. 1381–1393, 2015.

[18] S. Duan, T. Yu, and J. He, “Widriver: Driver activity recognition system based on wifi csi,” International Journal of Wireless Information Networks, vol. 25, no. 2, pp. 146–156, 2018.

[19] S. Arshad, C. Feng, I. Elujide, S. Zhou, and Y. Liu, “Safedrive-fi: A multimodal and device free dangerous driving recognition system using wifi,” in 2018 IEEE International Conference on Communications (ICC). IEEE, 2018, pp. 1–6.

[20] X. Wang, L. Gao, S. Mao, and S. Pandey, “Csi-based fingerprinting for indoor localization: A deep learning approach,” 2016.

[21] L. Ding, W. Fang, H. Luo, P. E. Love, B. Zhong, and X. Ouyang, “A deep hybrid learning model to detect unsafe behavior: integrating convolution neural networks and long short-term memory,” Automation in Construction, vol. 86, pp. 118–124, 2018.

[22] S. Di Domenico, M. De Sanctis, E. Cianca, F. Giuliano, and G. Bianchi, “Exploring training options for rf sensing using csi,” IEEE Communications Magazine, vol. 56, no. 5, pp. 116–123, 2018.

[23] D. Cook, K. D. Feuz, and N. C. Krishnan, “Transfer learning for activity recognition: A survey,” Knowledge and Information Systems, vol. 36, no. 3, pp. 537–556, 2013.

[24] J. Wang, Y. Chen, L. Hu, X. Peng, and S. Y. Philip, “Stratified transfer learning for cross-domain activity recognition,” in 2018 IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE, 2018, pp. 1–10.

[25] “mat2im,” 2010. [Online]. Available: https://www.mathworks.com/matlabcentral/fileexchange/26322-mat2im

[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[27] T. Dozat, “Incorporating nesterov momentum into adam,” 2016.

[28] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.

[29] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan, “Better mini-batch algorithms via accelerated gradient methods,” in Advances in Neural Information Processing Systems, 2011, pp. 1647–1655.

[30] D. Halperin, W. Hu, A. Sheth, and D. Wetherall, “Tool release: gathering 802.11n traces with channel state information,” ACM SIGCOMM Computer Communication Review, vol. 41, no. 1, pp. 53–53, 2011.

[31] Y. Zeng, P. H. Pathak, and P. Mohapatra, “Wiwho: wifi-based person identification in smart spaces,” in Proceedings of the 15th International Conference on Information Processing in Sensor Networks. IEEE Press, 2016, p. 4.