IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 21, NO. 7, JULY 2019

EVM-CNN: Real-Time Contactless Heart Rate Estimation From Facial Video

Ying Qiu, Yang Liu, Juan Arteaga-Falconi, Haiwei Dong, Senior Member, IEEE, and Abdulmotaleb El Saddik, Fellow, IEEE

Abstract—With the increase in health consciousness, noninvasive body monitoring has aroused interest among researchers. Because the heart rate (HR) is one of the most important pieces of physiological information, researchers have remotely estimated it from facial videos in recent years. Although progress has been made over the past few years, some limitations remain, such as processing time that grows with accuracy and the lack of comprehensive and challenging datasets for use and comparison. Recently, it was shown that HR information can be extracted from facial videos by spatial decomposition and temporal filtering. Inspired by this, a new framework is introduced in this paper to remotely estimate the HR under realistic conditions by combining spatial and temporal filtering and a convolutional neural network. Our proposed approach shows better performance compared with the benchmark on the MMSE-HR dataset in terms of both average HR estimation and short-time HR estimation. High consistency in short-time HR estimation is observed between our method and the ground truth.

Index Terms—Spatial decomposition, temporal filtering, convolutional neural network.

I. INTRODUCTION

THE human heart rate (HR) reflects both physiological and psychological conditions. HR monitoring has benefits for

human beings in many aspects, such as health care for elders, vital sign monitoring of newborns, and lie detection of criminals. Nowadays, with growing health consciousness, mobile HR monitoring systems are becoming increasingly popular with users. However, all such devices must be in contact with the skin, which is inconvenient and may cause discomfort. Hence, researchers have been devoted to creating a low-cost and noninvasive way to measure HR in recent years. The main concepts are derived from the principles of photoplethysmography (PPG), which can sense the cardiovascular blood volume pulse through variations in transmitted or reflected light [1]. Furthermore, Verkruysse et al. showed that a PPG signal can be measured using a standard digital camera with ambient light as an illumination source [2].

Manuscript received April 27, 2018; revised July 20, 2018 and September 4, 2018; accepted November 8, 2018. Date of publication November 29, 2018; date of current version June 21, 2019. This work was supported by NSERC Grant RGPIN/2018-05600 on digital twins. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Balakrishnan Prabhakaran. (Corresponding author: Haiwei Dong.)

The authors are with the Multimedia Computing Research Laboratory, School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON K1N 6N5, Canada (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2018.2883866

Since then, researchers have made steady progress in digital-camera-based remote heart rate estimation.

However, various challenges remain. Because the face is the least occluded skin region of the human body, extracting a PPG signal from the facial region has become the main approach [3]. Motion artifacts of the subject affect the performance of the measurements and are divided into two types: rigid motion, which includes head tilt and posture changes, and non-rigid motion, which includes facial expressions such as eye blinking and smiling. Illumination variations also add noise to the PPG signal, such as the flash of indoor lights, the variation of light reflected from a computer screen and the internal noise of the digital camera [4]. In addition, other challenges affect this technology, such as weak signal strength and a lack of appropriate datasets.

The approaches proposed by researchers in recent years can be divided into two groups. The methods in the first group select the whole face as the region of interest; most of them then use a filter to reduce the noise caused by motion artifacts and illumination variations. The methods in the other group focus on choosing a region of interest (ROI) of the face and estimating the HR based on this reliable region. To some extent, both kinds of methods have aspects of superiority, but downsides do exist.

Whole-face-region-based approaches usually extract a PPG signal by spatially averaging the face region at the outset and then focus mainly on reducing the impact of interference from noise. Such approaches, e.g., [4] and [5], perform better under non-rigid motion conditions than they do under rigid motion conditions because computing the mean of the whole face nullifies the effect of facial muscle movements. However, the whole face region always includes noise caused by multiple environmental factors. Although various filters are applied to eliminate the noise, the processing time increases with the number of filters being used. Thus, it is difficult to estimate HR instantaneously by using such an approach.

Partial face region selection can help decrease the time spent filtering the signal. Reliable region selection is a challenge in such approaches since it involves considerable indeterminacy, particularly when the subject moves with facial expressions during the experiment. Progress has been made, such as in [6] and [7], with better performance than the whole-face-region-based approach.



Regardless of which of the above methods is considered, there is a common point among them: they all use the power spectral density of the underlying signal to estimate the HR in the last step of the procedure, which requires a clear PPG signal to obtain an accurate result. Many processing steps must be used to obtain such a clear signal, which increases the computing time. To achieve better performance in less time, additional knowledge must be considered. Deep learning has been one of the hottest topics in recent years and has attracted researchers from a variety of fields, who have used it to cope with various tasks, such as [8], [9], and [10]. Inspired by this, HR estimation is considered a regression task in this paper. Because subtle changes in skin color related to cardiac rhythm can be extracted from a digital camera, the average HR within a short time interval can be defined as a label in the regression task, and the extracted color changes of the corresponding video sequence can be fed into the neural network. In the research of Wu et al., the blood flow through the face due to the cardiac cycle is amplified as skin color changes, and the vertical scan of the amplified video sequence reveals a plausible human heart rate [11]. Inspired by this, a new paradigm is proposed for estimating the heart rate by using Eulerian Video Magnification (EVM) to extract face color changes and using deep learning to estimate the heart rate.

The dataset is one of the most important factors that can affect the performance of deep learning. In addition, the dataset can be used to make a fair comparison of different approaches. Although the MAHNOB-HCI dataset [12] is widely used in HR estimation research, several limitations must be considered. The experiment conditions are strictly controlled, and the subjects hardly move and do not show obvious facial expressions. To achieve high generalization ability under realistic conditions, we evaluate our method on the MMSE-HR dataset [13], which is more challenging because of its larger multimodal spontaneous emotion corpus.

In our work, EVM and deep learning are integrated to achieve instantaneous HR estimation. Specifically, EVM is used to extract the feature image that corresponds to the heart rate information within a time interval. A convolutional neural network (CNN) is then applied to estimate HR from the feature image, which is formulated as a regression problem. In summary, the contributions of this paper are as follows:

1) A new paradigm is proposed for HR estimation through a digital camera under realistic conditions. As part of EVM, spatial decomposition and temporal filtering are applied to extract face color changes as the feature image. A wider bandpass filter is used to extract the signal within the general range of the human HR.

2) HR estimation is achieved by using a CNN, where the input is the feature image within a specified time interval and the output is the average HR during that time interval. Our approach addresses the problems of high computational complexity and time cost.

3) To demonstrate the performance of our approach, a comparison between the proposed approach and a benchmark [6] on the same dataset for average HR estimation and short-time HR estimation was conducted.

Our approach achieves higher accuracy than the other methods.

The rest of the paper is organized as follows: In Section II, a review of the state-of-the-art methods for remote HR estimation is presented. The details of the proposed method are described in Section III. Then, in Section IV, the dataset, evaluation metrics and experimental settings are discussed. In Section V, results and analysis are presented. Finally, the conclusions of our work and directions for future work are discussed in Section VI.

II. RELATED WORK

Traditional measurement methods for cardiac activity are contact-based and include extracting the electrocardiography (ECG) signal by using an ECG sensor and the PPG signal by using a pulse oximeter. Although contact-based methods can measure cardiac activity comprehensively with high accuracy, sensors that are attached to human bodies always cause discomfort and inconvenience. Researchers have made progress in creating noninvasive measurement methods over the past few years.

In 2009, Verkruysse et al. showed that a PPG signal can be measured remotely by using ambient light and a simple consumer-level digital camera in movie mode [2]. It was also shown that the green channel features the strongest PPG signal, since hemoglobin light absorption reaches its peak in the green band; the red and blue channels also contain PPG information. In 2010, Poh et al. introduced a non-contact cardiac pulse measurement method with color images that uses blind source separation [14]. They computed the spatial average value of the face region for each channel, whereby three temporal traces were obtained from an RGB video sequence. Independent component analysis was applied to separate the observed raw signals and identify the underlying PPG signal. Finally, they applied the Fast Fourier Transform (FFT) to the selected source signal to find the highest power spectrum within the general human HR frequency band and treated it as the average HR of the video recording. In 2011, Poh et al. improved their work by adding several temporal filters to clean up the PPG signal [5].

In 2012, Wu et al. introduced Eulerian Video Magnification, which can amplify the blood flow in the human face in a video recording [11]. Each pixel of the face region was processed by spatial decomposition and temporal filtering. Then, the resultant frames were multiplied by a magnification factor and added to the original frames. Therefore, the subtle color changes due to the cardiac cycle become visible to the naked eye. In 2013, Balakrishnan et al. showed that HR can be extracted from videos by measuring subtle head motions caused by the Newtonian reaction to the influx of blood at each beat [15]. They used principal component analysis to decompose the trajectories and chose the one whose temporal power spectrum best matched the pulse, from which the average HR was calculated.

After several methods for estimating HR remotely were successfully introduced, researchers began to focus on this topic under challenging conditions. In 2013, De Haan et al. presented an analysis of the limitations of blind source separation on motion problems and proposed a chrominance-based approach that achieved better performance [16].


Fig. 1. Diagram for HR estimation. In the face detection and tracking module, regions of interest are automatically extracted and input into the next module. Spatial decomposition and temporal filtering are applied to obtain the feature image in the feature extraction module. In the HR estimation module, a CNN is used to estimate the HR from the feature image.

Li et al., in 2014, proposed a framework for estimating HR under realistic conditions [4]. They used face tracking to solve the problem of rigid movements and employed an adaptive filter to rectify the impact of illumination variations. They also compared their performance with those of existing methods on the MAHNOB-HCI dataset [12]; their approach showed higher accuracy than the other methods. In 2015, Lam et al. improved blind source separation by randomly choosing pairs of patches from the face, obtained multiple extracted PPG signals, and, finally, used a majority voting scheme to robustly recover the HR [7]. Different from existing methods based on RGB videos, Yang et al. proposed a non-intrusive heart rate estimation system via 3D motion tracking in depth video [17]. In 2018, Prakash and Tucker [18] used a bounded Kalman filter for motion estimation and feature tracking in remote PPG methods. To tackle the blurring and noise due to random head movements, blur identification and denoising are employed for each frame.

Although the accuracy was improved and the systems became more robust, the processing time was still a drawback and was too long for estimating HR instantaneously. Tulyakov et al. used a self-adaptive matrix completion approach to automatically discard the face regions corresponding to noisy features and proposed using only reliable regions to estimate HR [6]. They also introduced a more challenging dataset with target movements and spontaneous facial expressions. Moreover, instantaneous HR can be estimated by using their method.

In 2017, Alghoul et al. compared two methods [19]: one is based on independent component analysis (ICA), and the other is based on EVM. They concluded that the ICA-based method showed better results in general but that the EVM-based method might be more appropriate when motion is involved. Inspired by this, a new framework is introduced that combines EVM and deep learning to estimate HR instantaneously. Instead of applying the full magnification process of EVM, spatial decomposition and temporal filtering are mainly used to extract the feature image, as described in the next section.

III. METHOD

In this section, an overview of the proposed method is presented through three procedures, as illustrated in Fig. 1. The face

in the input video sequence is detected and tracked to extract the ROI. Then, the extracted ROI of each frame is processed by using spatial decomposition and temporal filtering to obtain the feature image. Finally, the feature image is input into a convolutional neural network to estimate the HR.

A. Face Detection and Tracking

Face detection and tracking are used to extract the face regions from the images and keep the size of the face constant within a certain time range. The proposed approach is designed for practical conditions and must automatically select the ROI and feed it into the feature extraction module. Under practical conditions, several factors may impact face region extraction, such as the subject's movements, head rotations and facial expressions. To obtain a steady face region without non-skin pixels, precise facial landmarks are used to define the face region.

A regression approach based on local binary features is applied to detect the bounding box and 68 facial landmarks inside the bounding box [20]. Local binary features for each landmark are first generated by learning a local feature mapping function. All the features are concatenated, and a linear projection is learned by linear regression. This process is repeated stage by stage in a cascaded fashion. To define the ROI, 8 points are used, as shown in Fig. 2. The green rectangle is the face detection result, and the 68 facial landmarks are represented by 68 green points. The blue rectangle is the ROI defined by the coordinates of the 8 points that are denoted by nearby red numbers. The directions of the x- and y-axes are also shown in Fig. 2. The top-left vertex coordinate and the size of the blue rectangle are defined in Eq. (1):

\begin{cases}
X_{LT} = X_{P13} \\
Y_{LT} = \max(Y_{P40}, Y_{P41}, Y_{P46}, Y_{P47}) \\
W_{rect} = X_{P16} - X_{P13} \\
H_{rect} = \min(Y_{P50}, Y_{P52}) - Y_{LT}
\end{cases} \qquad (1)

where X_LT and Y_LT denote the x and y coordinates of the top-left vertex, respectively; X_P13 denotes the x coordinate of Point 13; Y_P40, Y_P41, Y_P46, and Y_P47 denote the y coordinates of Points 40, 41, 46, and 47, respectively; W_rect denotes the width of the blue rectangle; X_P16 and X_P13 denote the x coordinates of Points 16 and 13, respectively; H_rect denotes the height of the blue rectangle; and Y_P50 and Y_P52 denote the y coordinates of Points 50 and 52, respectively.


Fig. 2. Face detection and facial landmarks. The green rectangle denotes the bounding box of the face detection result. Facial landmarks are denoted by 68 green points. The blue rectangle indicates the ROI, which is defined by 8 points that are denoted by the nearby red numbers.

Since the blue rectangle is defined in this way, the ROI always excludes the eye region and the mouth region, regardless of how the subjects rotate their heads. Furthermore, the impact of eye blinking and mouth movements is reduced. Hence, a cheek region without non-facial pixels is defined.
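
To make the geometry of Eq. (1) concrete, the following minimal sketch (in Python with NumPy, not the authors' C++/OpenCV implementation) computes the cheek ROI from an array of the 68 detected landmarks; the mapping of the paper's point numbers to array indices is an assumption made for illustration only.

import numpy as np

def cheek_roi(landmarks):
    # landmarks: (68, 2) array of (x, y) pixel coordinates; the paper's Point 13
    # is assumed to correspond to landmarks[13], and so on (hypothetical indexing).
    x, y = landmarks[:, 0], landmarks[:, 1]
    x_lt = x[13]                               # X_LT = X_P13
    y_lt = max(y[40], y[41], y[46], y[47])     # lowest of the lower-eye points (image y grows downward)
    w = x[16] - x[13]                          # W_rect = X_P16 - X_P13
    h = min(y[50], y[52]) - y_lt               # H_rect ends at the upper-lip points
    return int(x_lt), int(y_lt), int(w), int(h)

# Usage sketch: x0, y0, w, h = cheek_roi(landmarks); roi = frame[y0:y0 + h, x0:x0 + w]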

As the input to the feature extraction module, the ROI rectangle must keep a fixed size within a certain time interval because the feature extraction module must observe the color changes of each fixed pixel over time. To achieve this, the Scalable Kernel Correlation Filter Tracking method [21] is applied to track the ROI rectangle and keep the ROI's size constant within a certain time interval, which reduces the impact of rigid head motion. Then, the tracked cheek region is input into the feature extraction module to obtain the skin color changes during the cardiac cycle.

B. Feature Extraction

Wu et al. consider the time series of color values at any spatial pixel and amplify the variation in a given frequency band of interest [11]. Inspired by this, the blood flow information extracted from a video sequence is exploited in our approach. By using localized spatial pooling and temporal filtering to extract the signal that corresponds to the pulse, Wu et al. multiply the extracted signal by a magnification factor to amplify the facial color changes over time and make them visible to the naked eye. Accordingly, spatial decomposition and temporal filtering are applied in our approach to extract the feature image that contains the signal related to the HR information. An overview of spatial decomposition and temporal filtering is given in Fig. 3. As shown in Fig. 3(a), the input sequence is first decomposed into multiple spatial frequency bands. Then, the lowest-band sequence is reshaped and concatenated to obtain a new image, as shown in Fig. 3(b). Finally, a bandpass filter is used to obtain the feature image that contains the signal related to blood flow, as shown in Fig. 3(c).

Fig. 3. Feature extraction. (a) shows the module that decomposes the input sequence of ROIs into multiple spatial frequency bands, (b) demonstrates the result after reshaping and concatenating the lowest band of (a), which is marked by a black dashed rectangle, and (c) is the feature image obtained by temporal filtering of the concatenated image with a frequency band of interest.


Spatial decomposition for color magnification is implemented by using Gaussian pyramid decomposition. Each frame of the input RGB video is downsampled to a specified level. The last layer of each downsampled frame, which is shown as the part in the black dashed rectangle in Fig. 3(a), is reshaped into one column. All the columns obtained in the previous step are concatenated to obtain a new image. To detect instantaneous HR, a one-second video sequence is input into the feature extraction module in every round rather than the whole video. Hence, the number of columns of the concatenated image in Fig. 3(b) should be equal to the number of frames per second.
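
As a rough illustration of this spatial decomposition step, the sketch below (assuming OpenCV's pyrDown for the Gaussian pyramid and a pyramid level of 4, as reported in Section IV) downsamples each ROI, flattens the lowest band into one column, and stacks one second of columns into the concatenated image. It is a sketch of the described procedure, not the authors' code.

import cv2
import numpy as np

def roi_to_column(roi, pyramid_level=4):
    # Build a Gaussian pyramid by repeated pyrDown and keep only the last level.
    img = roi.astype(np.float32)
    for _ in range(pyramid_level):
        img = cv2.pyrDown(img)          # Gaussian blur followed by 2x downsampling
    return img.reshape(-1, 3)           # flatten the lowest band into one column (pixels x channels)

def columns_to_image(columns):
    # Concatenate one second of columns (len(columns) == frame rate); each row of
    # the result is then the temporal trace of one pixel over that second.
    return np.stack(columns, axis=1)    # shape: (num_pixels, fps, 3)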

Temporal filtering is the use of an ideal bandpass filter to obtain the signal within the frequency band of interest. Different from [11], a wider bandpass filter is used here to cope with typical conditions. A human's HR lies between 45 and 240 bpm; thus, the corresponding frequency band, computed as f_HR = HR/60, is 0.75-4.0 Hz. As explained before, each column of the concatenated image corresponds to a downsampled image of the ROI, and each row of the concatenated image corresponds to the variations of a fixed position (pixel) in the ROI over one second. Therefore, each channel of the concatenated image is transferred to the frequency domain by applying the FFT by rows. Then, a mask of the same size is used to retain the components within the frequency band of interest and set the others to zero. Finally, the Inverse Fast Fourier Transform (IFFT) is performed by rows to transform the signal back into the time domain, and the three channel images are merged to obtain the feature image, as shown in Fig. 3(c). The spatial decomposition and temporal filtering algorithm is named Feature Extraction and summarized in Algorithm 1.

The ROI frame is first downsampled into different bands, and the last level is reshaped to one column, which is stored in a set until the number of elements equals the frame rate. All the columns are then concatenated in order to obtain a new image, and the new image is split into three channels. Each channel is transferred to the frequency domain and multiplied with a mask to zero out the FFT components outside the frequency band of interest. Finally, the three channels are transferred back to the time domain and merged to obtain the feature image.


Fig. 4. Structure of the convolutional neural network, where for each layer, the light blue cube denotes the input, the light pink cube denotes the receptive field, the size of each cube is labeled beside it, the relative numbers of channels are represented by the thickness, and the arrows denote the data flow.

Algorithm 1: Feature Extraction
Input: The original RGB video frames I, pyramid level Pl, frame rate Fps, set of one-column intermediate images C, low-frequency cut-off Fl and high-frequency cut-off Fh of the ideal bandpass filter
Output: A set of feature images S
1:  repeat
2:    for each frame I do
3:      Build a Gaussian pyramid with level Pl
4:      Reshape the last level to one column C0
5:    end for
6:    C ← C0
7:  until size(C) = Fps
8:  repeat
9:    for each set C do
10:     Concatenate all the columns into a new image M
11:     for each channel of M do
12:       Apply the FFT to obtain Mf
13:       Create a mask by using Fl and Fh
14:       Multiply the mask and Mf to get N
15:       Apply the IFFT to N to obtain Ni
16:     end for
17:     Merge the 3 channels to obtain K
18:   end for
19:   S ← K
20: until the whole video ends
21: return S

The feature extraction module thus loops over each one-second set of video frames and stops when the video ends.
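
Continuing the sketch given after the spatial decomposition step, the temporal ideal bandpass filter of Algorithm 1 could be written as follows with NumPy's FFT. The function names are assumptions, and any final resizing of the feature image to the 25 × 25 × 3 CNN input is not detailed in the text and is therefore omitted here.

import numpy as np

def ideal_bandpass(concat, fps, f_low=0.75, f_high=4.0):
    # concat: (num_pixels, fps, 3) array produced by columns_to_image() above.
    concat = concat.astype(np.float32)
    freqs = np.fft.fftfreq(concat.shape[1], d=1.0 / fps)         # frequency of each temporal bin
    keep = (np.abs(freqs) >= f_low) & (np.abs(freqs) <= f_high)  # 0.75-4.0 Hz, i.e., 45-240 bpm
    feature = np.empty_like(concat)
    for c in range(3):                                           # filter each color channel separately
        spectrum = np.fft.fft(concat[:, :, c], axis=1)           # FFT along time (by rows)
        spectrum[:, ~keep] = 0                                   # zero out-of-band components
        feature[:, :, c] = np.fft.ifft(spectrum, axis=1).real    # back to the time domain
    return feature                                               # merged 3-channel feature image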

C. HR Estimation by CNN

HR is estimated from the feature image extracted in the previous procedure. This is achieved by using a regression convolutional neural network. A feature image is input into the network, and the corresponding HR is the output. The pipeline of the neural network is designed as shown in Fig. 4. Considering the frame rate of the video sequence, which determines the number of columns of the feature image, the input image is defined to be of size 25 × 25 × 3. Inspired by MobileNet [22], depthwise separable convolutions are used as the main structure in this part. Depthwise separable convolution is a form of factorized convolution that factorizes a standard convolution into a depthwise convolution and a 1 × 1 pointwise convolution, which can effectively reduce the computational burden and the model size.

TABLE I: CNN BODY ARCHITECTURE


As shown in Fig. 4, the first layer is a full convolution layer, which is denoted "Conv"; in the next layer, a depthwise convolution layer is used to filter each input channel, and a pointwise layer is used to combine the outputs of the depthwise convolution layer; these are denoted "DwConv" and "PwConv", respectively. The first full convolution layer is obtained by applying a kernel of size 5 × 5 × 3 × 96, where 5 × 5 is the height and width of the filter, 3 represents the depth, which corresponds to the number of input channels (RGB), and 96 is the number of output feature maps. In the following depthwise convolution layer, a kernel of size 3 × 3 × 96 is used to filter the input feature map, where 3 × 3 denotes the height and width and 96 is the number of input channels. A hidden parameter of this filter is the depth, which is always equal to 1 due to the property of depthwise convolution. Similarly, the size of the next pointwise filter is 1 × 1 × 96 × 96, where 1 × 1 is the height and width, the first 96 is the depth of the filter and the second 96 is the number of outputs. In the following parts, which are represented by an ellipsis, a similar depthwise separable convolution structure is used. The details of the CNN body architecture are given in Table I. Each convolution layer is followed by batch normalization and a ReLU nonlinearity. Downsampling is achieved by changing the value of the stride. An average pooling layer downsamples the final feature map to 1 × 1 and computes the average value of each channel. It simply performs downsampling along the spatial dimensionality of the given input. By having less spatial information, not only is the computational cost reduced, but the probability of overfitting is also reduced. A fully connected layer is applied to directly connect every neuron in one layer to every neuron in the next layer. A dropout layer is then used to avoid overfitting. The dropout ratio is set to 0.6, which stops 60% of the feature detectors from working during training and improves the generalization capacity of the network.
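
Since the full layer listing of Table I is not reproduced in this transcript, the following hedged PyTorch sketch only mirrors the overall shape described above (a 5 × 5 × 3 × 96 full convolution, depthwise separable blocks with batch normalization and ReLU, global average pooling, a fully connected layer, dropout of 0.6, and a single output neuron). The number and widths of the middle blocks and the hidden fully connected size are assumptions; the authors' implementation used Caffe, not PyTorch.

import torch
import torch.nn as nn

def dw_separable(in_ch, out_ch, stride=1):
    # Depthwise 3x3 convolution followed by a pointwise 1x1 convolution, each with BN + ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class HRNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, 5, padding=2, bias=False),   # full 5x5x3x96 convolution ("Conv")
            nn.BatchNorm2d(96), nn.ReLU(inplace=True),
            dw_separable(96, 96),                         # "DwConv" 3x3x96 + "PwConv" 1x1x96x96
            dw_separable(96, 192, stride=2),              # assumed deeper blocks (Table I not reproduced)
            dw_separable(192, 192),
            nn.AdaptiveAvgPool2d(1),                      # global average pooling over each channel
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(192, 128), nn.ReLU(inplace=True),   # hidden width 128 is an assumption
            nn.Dropout(0.6),                              # dropout ratio 0.6 as stated in the paper
            nn.Linear(128, 1),                            # one neuron: the normalized HR label
        )

    def forward(self, x):                                 # x: (N, 3, 25, 25) feature images
        return self.head(self.features(x))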


Since the HR of each feature image is the only label, the last fully connected layer has one neuron. All the labels are normalized to the range 0-1 before training, where 0 corresponds to 45 bpm and 1 corresponds to 240 bpm. The Euclidean distance is chosen as the loss function for measuring the difference between the predicted value and the ground truth. It is defined in Eq. (2):

L_{Eu} = \frac{1}{2N} \sum_{i=1}^{N} \left\| q_i - p_i \right\|_2^2 \qquad (2)

where L_Eu denotes the Euclidean loss, which computes the sum of the squared differences between the label values q_i and the predicted values p_i, and N indicates the total number of samples.
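
As a small sketch of the label normalization (45 bpm mapped to 0, 240 bpm mapped to 1) and of the loss of Eq. (2), assuming NumPy arrays of normalized labels q and predictions p:

import numpy as np

def normalize_hr(hr_bpm):
    # Map 45-240 bpm onto the 0-1 label range used for training.
    return (hr_bpm - 45.0) / (240.0 - 45.0)

def denormalize_hr(label):
    # Map a network output back to bpm.
    return label * (240.0 - 45.0) + 45.0

def euclidean_loss(pred, label):
    # L_Eu = (1 / 2N) * sum_i ||q_i - p_i||^2
    n = pred.shape[0]
    return float(np.sum((label - pred) ** 2) / (2.0 * n))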

IV. EXPERIMENTS

A. Dataset

To estimate HR instantaneously, our strategy is to extract a feature image for every second of the video sequence and predict the HR from that feature image. In other words, the consecutive video frames of one second of video are input into the feature extraction module, and one corresponding feature image is output and used to predict the HR. Therefore, the number of feature images extracted from a whole video is the total number of frames divided by the frame rate. To estimate the HR under practical conditions, a challenging dataset is used in this paper and is introduced as follows.

The MMSE-HR dataset is a subset of the MMSE database [13] that was collected for challenging HR estimation. There are 40 participants with diverse ethnic ancestries. 102 RGB videos are recorded, and the length of each video is between 30 seconds and 1 minute. The frame rate is 25 fps, and the resolution of each 2D texture RGB image is 1040 × 1392 pixels. The HR is collected by a contact sensor working at a sample rate of 1 kHz. Since our experiment needs the average HR of every second as the label of each feature image, the mean of the 1000 HR values within each second is calculated and used in the HR estimation. Each video of the dataset is input into the modules described in Section III-A and Section III-B, which ultimately produces 5839 feature images for the whole dataset.
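
For illustration, and assuming the 1 kHz sensor stream is available as one HR value per millisecond, the per-second labels could be derived as in the sketch below (this is not the authors' preprocessing code); a 30 s to 1 min video at 25 fps then yields (num_frames // 25) feature images, one per second, each paired with the corresponding label.

import numpy as np

def per_second_labels(hr_signal, seconds):
    # Average the 1000 HR samples inside each second to label each feature image.
    return hr_signal[: seconds * 1000].reshape(seconds, 1000).mean(axis=1)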

To evaluate the HR data diversity of the dataset, the HR distribution is obtained by calculating the proportion for each range, as shown in Fig. 5, where the blue bar represents the total HR distribution of the dataset. For the training and testing tasks for HR estimation, 730 feature images are randomly selected as the testing dataset, and the remaining images are used as the training and validation dataset. The HR distributions of these two subsets are also illustrated in Fig. 5, where the orange bar denotes the training and validation dataset and the gray bar denotes the testing dataset. The mean value of the total HR data is 84.28, and the variance is 355.9. The proportion of the total HR data between 65 bpm and 95 bpm is 73%, and the two subsets show distributions similar to that of the full dataset. For the training and validation dataset, the mean of the HR is 84.32, and the variance is 355.6, while for the testing dataset, the mean of the HR is 84.0, and the variance is 357.6.

Fig. 5. Heart rate proportions from different datasets. The blue bar denotes the HR data from the MMSE-HR dataset [13], the orange bar denotes the HR data from the training and validation dataset, and the gray bar denotes the HR data from the testing dataset.


B. Evaluation Metrics

In our experiments, five evaluation metrics are used and explained below. All the evaluation metrics are based on the HR error, which is the difference between the predicted HR and the ground truth:

H_e(i) = H_p(i) - H_{gt}(i) \qquad (3)

where H_e denotes the measurement error, H_p denotes the predicted HR and H_gt denotes the ground-truth HR.

1) Mean of H_e: The mean of the measurement error is the average value of H_e:

M_e = \frac{1}{N} \sum_{i=1}^{N} H_e(i) \qquad (4)

where M_e denotes the mean of the measurement error and N denotes the number of measurements.

2) Standard Deviation: The standard deviation is used to quantify the amount of variation in a set of data values. In our experiments, it is used to evaluate the dispersion of H_e:

SD_e = \sqrt{\frac{\sum_{i=1}^{N} \left( H_e(i) - M_e \right)^2}{N}} \qquad (5)

where SD_e denotes the standard deviation. SD_e ∈ [0, ∞), where a lower value indicates that the data points of H_e tend to be closer to the mean value.

3) Root-Mean-Square Error: The root-mean-square error is used to measure the differences between the values predicted by an estimator and the observed values. It is sensitive to outliers:

RMSE = \sqrt{\frac{\sum_{i=1}^{N} \left( H_p(i) - H_{gt}(i) \right)^2}{N}} \qquad (6)

where RMSE denotes the root-mean-square error. RMSE ∈ [0, ∞), and a lower value indicates that there are fewer outliers in H_e.


Fig. 6. Training behavior of the neural network. The red curve denotes the training loss, and the green (bold) curve denotes the validation loss. (The loss curves are not smoothed.)

4) Mean Absolute Percentage Error: The mean absolute percentage error is a measure of prediction accuracy, which usually expresses the accuracy as a percentage:

Me_{Rate} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| H_e(i) \right|}{H_{gt}(i)} \qquad (7)

where Me_Rate denotes the mean absolute percentage error. Me_Rate ∈ [0, +∞), where a lower value indicates that the predicted value is closer to the ground-truth value.

5) Pearson's Correlation: Pearson's correlation is used to measure the linear correlation between the predicted HR and the ground-truth value:

\rho = \frac{\sum_{i=1}^{N} \left( H_{gt}(i) - \bar{H}_{gt} \right)\left( H_p(i) - \bar{H}_p \right)}{\sqrt{\sum_{i=1}^{N} \left( H_{gt}(i) - \bar{H}_{gt} \right)^2}\,\sqrt{\sum_{i=1}^{N} \left( H_p(i) - \bar{H}_p \right)^2}} \qquad (8)

where ρ denotes Pearson's correlation, H̄_gt denotes the mean value of the ground truth and H̄_p denotes the mean value of the predicted HR. ρ ∈ [−1, +1], where 1 indicates total positive linear correlation and −1 indicates total negative linear correlation.
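
The five metrics can be computed directly from arrays of predicted and ground-truth HR values. The sketch below is an illustrative NumPy implementation of Eqs. (3)-(8); Me_Rate is returned as a fraction, so multiply by 100 to report it as a percentage as in Tables II-IV.

import numpy as np

def hr_metrics(hp, hgt):
    he = hp - hgt                                        # Eq. (3): HR error per sample
    me = he.mean()                                       # Eq. (4): mean error
    sde = np.sqrt(np.mean((he - me) ** 2))               # Eq. (5): standard deviation of the error
    rmse = np.sqrt(np.mean((hp - hgt) ** 2))             # Eq. (6): root-mean-square error
    me_rate = np.mean(np.abs(he) / hgt)                  # Eq. (7): mean absolute percentage error
    rho = np.corrcoef(hgt, hp)[0, 1]                     # Eq. (8): Pearson's correlation
    return {"Me": me, "SDe": sde, "RMSE": rmse, "MeRate": me_rate, "rho": rho}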

C. Experiment Design

In this part, three experiments are introduced and used to evaluate our approach.

1) Experiment 1: The convolutional neural network is used for HR estimation in our method. In this experiment, the CNN is evaluated on the testing dataset. As discussed previously, 730 feature images extracted from the whole dataset are randomly chosen as the testing data, and the rest of the feature images are used in the training and validation steps. The training process is presented in Fig. 6. The validation curve converges after 1.5 × 10^4 iterations.

2) Experiment 2: In this experiment, the proposed approach is compared with the state-of-the-art methods [6], [4], [16] and evaluated on the challenging MMSE-HR dataset [13]. Following [6], all the video sequences in this dataset are used for average HR estimation. For each video sequence, in each second, the

HR is predicted first by using our approach, and the mean HR is then calculated as our final result.

3) Experiment 3: To evaluate the proposed method's ability to estimate HR instantaneously, comparison experiments on short-time HR prediction are conducted. In this experiment, the proposed approach is compared with the same methods as those considered in Experiment 2, except for [4], which cannot estimate HR instantaneously. Following [6], 20% of the recorded sequences that have very strong HR variation were selected. Each sequence is split into non-overlapping windows of 4, 6 and 8 seconds. Then, the average HR is calculated for each non-overlapping window. Comparison experiments are conducted independently on each window size.
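
A minimal sketch of this windowing protocol, assuming per-second HR series are available for both the prediction and the ground truth:

import numpy as np

def windowed_means(per_second_hr, window):
    # Group per-second HR values into non-overlapping windows of `window` seconds
    # (4, 6 or 8 here) and average within each window; any incomplete tail is dropped.
    n = (len(per_second_hr) // window) * window
    return per_second_hr[:n].reshape(-1, window).mean(axis=1)

# Usage sketch: for w in (4, 6, 8): compare windowed_means(pred, w) with windowed_means(gt, w)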

In addition, all the experiments are conducted under the Windows 10 platform, and the program was developed in C++. The convolutional neural network (Section III-C) is run on the Caffe deep learning framework. The level of the Gaussian pyramid in the spatial decomposition is set to 4, and the low cut-off frequency of the ideal temporal bandpass filter (Section III-B) is set to 0.75 Hz and the high cut-off frequency to 4.0 Hz. The functions applied in the feature extraction module, such as downsampling, FFT and IFFT, are implemented by using OpenCV.

V. RESULTS AND ANALYSIS

A. Visualization of Short-Time HR Estimation

To show the performance of the proposed approach for short-time HR estimation, visualizations of the processing of three challenging sequences with a window size of 4 s are presented. For each video sequence, the mean of the ground-truth values within each 4 s is calculated as the final ground-truth value, and one frame is selected from every 4 s of video as the representative of the subject's facial expression during that time interval, as shown in Fig. 7, where the horizontal axis represents the time interval and the vertical axis represents the HR value.

In the first result in Fig. 7, the ground truth shows a low HR at the beginning, and a sudden increase and decrease appear at the end. In terms of the monotonicity of the ground truth, there are 8 increasing stages and 6 decreasing stages, of which 5 are falsely predicted, but all 5 changes are small, with a difference of less than 5 bpm. In the second result, the ground truth shows large waves throughout the time interval, while 4 changes are falsely predicted, one of which is larger than 5 bpm. In the third result, 4 changes are falsely predicted, two of which are larger than 5 bpm. Overall, as shown in the figure, the predicted results always have a trend similar to that of the ground truth; 69% of the HR changes are correctly predicted, and among the falsely predicted samples, 76.9% have a difference of less than 5 bpm. For these small changes, the predicted results may be influenced by cheek muscle movements, which leads to results with larger changes.

B. Evaluation of the CNN HR Estimator

This result is used to evaluate the CNN heart rate estimator that is obtained in Experiment 1. To demonstrate the differences between the labels and the predicted values, which are defined in Eq. (3), the error is calculated for each sample.


Fig. 7. HR estimation of three sequences with a window size of 4 s. The blue line denotes the ground-truth value, and the orange line denotes the result predicted by using our method. The frames represent the subject's facial expression; the corresponding HR is shown below.

Fig. 8. HR error distribution on the testing dataset. The horizontal axis denotes the range of H_e, and the vertical axis denotes the number of samples. The blue bar represents H_e, the orange bar represents the LF part (45-65 bpm), the gray bar represents the MF part (65-95 bpm) and the yellow bar represents the HF part (95-185 bpm).

Additionally, to evaluate the accuracy of the estimator on various frequency bands, the testing data are divided into three bands according to the HR distribution in the full dataset (Fig. 5). The ranges [45, 65), [65, 95), and [95, 185] are defined as the low-frequency (LF) part, medium-frequency (MF) part and high-frequency (HF) part, respectively. The numbers of samples in each part are 60, 541 and 129, respectively. The result of this experiment is shown in Fig. 8, where the LF, MF, and HF parts are represented by orange, gray and yellow bars, respectively, and H_e is denoted by a blue bar.

As shown in Fig. 8, the errors between the labels and the predicted values are distributed in 9 sections along the horizontal axis, while the vertical axis represents the proportion.

TABLE II: AVERAGE HR PREDICTION: COMPARISON ON MMSE-HR DATASET

The mean of H_e is −1.25, and the variance is 311.25. There are 541 samples with an absolute H_e within 15 bpm, which indicates that 74.13% of the test data are well estimated. In this range, as shown in Fig. 8, most of the samples are from the MF part, which takes up 91.5%, while the LF and HF parts take up 2.4% and 6.1%, respectively. In contrast, for an absolute H_e of more than 15, most of the samples are from either the HF or LF part. Specifically, for H_e < −15, the HF part takes up 88%, while the MF part takes up 12%. For H_e ≥ 15, the LF part takes up 58.8%, while the MF part takes up 41.2%. This phenomenon is due to the scarcity of LF and HF samples in the training data, which leads to lower accuracy for these two parts. For the LF part, 78.3% of the samples are predicted to be much higher than the ground truth, with H_e > 15. For the HF part, 74.4% are predicted to be much lower than the ground truth, with H_e < −15. However, only 8.5% of the MF part is predicted with lower accuracy, i.e., |H_e| > 15. Comparing the percentages of the parts that are predicted with higher HR error indicates that if a frequency part takes up a larger proportion of the dataset, then the predicted results for this part have higher accuracy than the others, which also indicates that the diversity of the HR dataset is important.

C. Average HR Prediction on the MMSE-HR Dataset

This result is obtained from Experiment 2 and is used to compare the average HR prediction performance with those of other methods on the same dataset. The result is evaluated by using the five metrics defined earlier, which are commonly used in existing research. The result is shown in Table II. The performance results for [4], [6] and [16] are taken from [6].

As shown in Table II, compared with existing methods, SD_e

of our result is reduced to 6.85, which shows that the HR errors tend to be closer to their mean value. RMSE is reduced to 6.95, which indicates that the magnitudes of the errors in the predictions are decreased. The results for Me_Rate demonstrate that the prediction accuracy is improved to 6.55% when using our approach. In addition, ρ is improved to 0.98, which is closer to 1; thus, the linear correlation between the ground-truth and predicted HR values is stronger. Overall, in all aspects, our approach outperforms the existing methods [6], [4], [16] on average HR prediction. Comparing each metric value among all the methods shows that as ρ increases, all the other values decrease.

D. Short-Time HR Estimation

This result is obtained from Experiment 3 and is used to evaluate the performance on short-time HR estimation and compare it with those of other methods.


TABLE III: SHORT-TIME HR PREDICTION PERFORMANCE: COMPARISON ON MMSE-HR DATASET

Five metrics are also used here to evaluate the results, which are shown in Table III. The performance measures for [16] are taken from [6].

As shown in Table III(a), (b) and (c), compared with the other methods, for each window size, SD_e, RMSE and Me_Rate

from our approach are all smaller than those from existing methods, which indicates that the predicted HR errors are distributed over a narrower range with higher accuracy. ρ is also improved by our approach for each window size, which shows a stronger linear correlation between the predicted values and the ground truth. De Haan's method [16] has the worst performance in the experiment with a window size of 6 s, with ρ equal to 0.33. Conversely, Tulyakov's method [6] performs best for a window size of 6 s. However, from the results of our method in the experiments with different window sizes, as the window size grows, SD_e, RMSE and Me_Rate decrease while ρ increases, which shows that the performance improves steadily in all aspects. Since our approach estimates HR for every second and calculates the mean value over a pre-defined time interval, the accuracy increases as the time interval grows.

E. Run Time

To calculate the run time, the full framework is divided into two parts. The first part includes video capture, frame extraction, face detection, tracking, ROI cropping and feature image extraction. The second part is the HR estimator. Part 1 runs at 110 fps, and part 2 runs at 290 fps. For the training phase, there are 20 images in a batch, and one iteration takes around 0.6 s. The running times are measured using a conventional laptop with an Intel Core i5-7300HQ CPU and 8.0 GB of RAM. Thus, the proposed approach runs fast enough to be used for real-time HR estimation.

F. Further Discussion

As introduced before, the proposed method is based on ROI extraction; however, the face detection has limitations in some situations.

TABLE IV: AVERAGE HR PREDICTION: COMPARISON RESULTS ON MAHNOB-HCI DATASET

Under realistic conditions, the subject's head can rotate randomly in front of the camera, while in the proposed approach, only faces that rotate within a certain angle range can be detected. Specifically, the subject's head can rotate in yaw from −90 degrees to 90 degrees, in roll from −20 degrees to 20 degrees and in pitch from −20 degrees to 20 degrees. For poses in which the 68 landmarks are not identified, no ROI is extracted and no corresponding HR is estimated. The whole process only works when consecutive faces are detected within 1 s.

We also evaluate the proposed method on the MAHNOB-HCI dataset [12]. There are 27 participants (12 males and 15 females) involved, and 527 videos are recorded, all of which can be used. In our experiments, as in [6], a 30-second interval (frames 306 to 2135) is extracted from each video sequence. The second channel (EXG2) of the ECG signal is used to obtain the ground truth. Half of the video sequences are randomly chosen for extracting the feature images and training. The comparison results are shown in Table IV, where all the video sequences are tested and the results for [4]–[6], [14]–[16] are taken from [6]. Apparently, the proposed approach also shows better performance on the MAHNOB-HCI dataset.

VI. CONCLUSION

In this paper, a new framework is introduced for contactless HR estimation from facial videos under realistic conditions. Different from the traditional HR estimation approach, which usually extracts a signal related to HR and uses power spectral density analysis to estimate the average HR, a convolutional neural network is used to estimate HR in the proposed approach. Instead of applying a series of filters to clean the underlying signal, HR is directly estimated from a feature image that is obtained by using spatial decomposition and temporal filtering, which decreases both the computational complexity and the processing time.

The results of testing the trained model on the testing dataset show that 74.13% of the data in the testing dataset are well estimated, while the rest have large errors, which is due to the lack of high- and low-frequency components in the dataset. In addition, comparison experiments are conducted on the MMSE-HR dataset for both average HR estimation and short-time HR estimation, and the results show that the proposed approach achieves higher accuracy than other methods.

Furthermore, our approach can be improved in the future. For instance, the HR diversity of the dataset can be improved. As


shown in Fig. 5, the low-frequency and high-frequency parts take up no more than 30% of the data, which affects the estimation performance on these two parts. Therefore, a larger dataset with well-proportioned HR data can be collected in future work.

REFERENCES

[1] J. Allen, "Photoplethysmography and its application in clinical physiological measurement," Physiol. Meas., vol. 28, no. 3, pp. R1–R39, 2007.

[2] W. Verkruysse, L. O. Svaasand, and J. S. Nelson, "Remote plethysmographic imaging using ambient light," Opt. Expr., vol. 16, no. 26, pp. 21434–21445, 2008.

[3] M. A. Hassan et al., "Optimal source selection for image photoplethysmography," in Proc. IEEE Int. Conf. Instrum. Meas. Technol., 2016, pp. 1–5.

[4] X. Li, J. Chen, G. Zhao, and M. Pietikainen, "Remote heart rate measurement from face videos under realistic situations," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 4264–4271.

[5] M. Z. Poh, D. J. McDuff, and R. W. Picard, "Advancements in noncontact, multiparameter physiological measurements using a webcam," IEEE Trans. Biomed. Eng., vol. 58, no. 1, pp. 7–11, Jan. 2011.

[6] S. Tulyakov, X. Alameda-Pineda, E. Ricci, L. Yin, J. Cohn, and N. Sebe, "Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2396–2404.

[7] A. Lam and Y. Kuno, "Robust heart rate measurement from video using select random patches," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 3640–3648.

[8] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Process. Lett., vol. 23, no. 10, pp. 1499–1503, Oct. 2016.

[9] X. Lu, Z. Lin, H. Jin, J. Yang, and J. Z. Wang, "Rating image aesthetics using deep learning," IEEE Trans. Multimedia, vol. 17, no. 11, pp. 2021–2034, Nov. 2015.

[10] T. Zhang et al., "A deep neural network-driven feature learning method for multi-view facial expression recognition," IEEE Trans. Multimedia, vol. 18, no. 12, pp. 2528–2536, Dec. 2016.

[11] H. Y. Wu et al., "Eulerian video magnification for revealing subtle changes in the world," ACM Trans. Graph., vol. 31, no. 4, 2012, Art. no. 65.

[12] M. Soleymani, J. Lichtenauer, T. Pun, and M. Pantic, "A multimodal database for affect recognition and implicit tagging," IEEE Trans. Affective Comput., vol. 3, no. 1, pp. 42–55, Jan.–Mar. 2012.

[13] Z. Zhang et al., "Multimodal spontaneous emotion corpus for human behavior analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 3438–3446.

[14] M. Z. Poh, D. J. McDuff, and R. W. Picard, "Non-contact, automated cardiac pulse measurements using video imaging and blind source separation," Opt. Expr., vol. 18, no. 10, pp. 10762–10774, 2010.

[15] G. Balakrishnan, F. Durand, and J. Guttag, "Detecting pulse from head motions in video," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 3430–3437.

[16] G. De Haan and V. Jeanne, "Robust pulse rate from chrominance-based rPPG," IEEE Trans. Biomed. Eng., vol. 60, no. 10, pp. 2878–2886, Oct. 2013.

[17] C. Yang, G. Cheung, and V. Stankovic, "Estimating heart rate and rhythm via 3D motion tracking in depth video," IEEE Trans. Multimedia, vol. 19, no. 7, pp. 1625–1636, Jul. 2017.

[18] S. Prakash and C. S. Tucker, "Bounded Kalman filter method for motion-robust, non-contact heart rate estimation," Biomed. Opt. Expr., vol. 9, no. 2, pp. 873–897, 2018.

[19] K. Alghoul, S. Alharthi, H. Osman, and A. Saddik, "Heart rate variability extraction from videos signals: ICA vs. EVM comparison," IEEE Access, vol. 5, pp. 4711–4719, 2017.

[20] S. Ren, X. Cao, Y. Wei, and J. Sun, "Face alignment at 3000 fps via regressing local binary features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 1685–1692.

[21] A. S. Montero, J. Lang, and R. Laganiere, "Scalable kernel correlation filter with sparse feature integration," in Proc. IEEE Int. Conf. Comput. Vis. Workshop, 2015, pp. 587–594.

[22] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861.

Ying Qiu received the B.Eng. degree in measurement and control engineering from Southwest Jiaotong University, Chengdu, China, in 2016. She is currently working toward the M.Sc. degree at the Multimedia Computing Research Laboratory, University of Ottawa, Ottawa, ON, Canada. Her research interests include machine learning and computer vision.

Yang Liu received the B.Eng. degree in information engineering from Beijing Technology and Business University, Beijing, China, and the M.Sc. degree in electronic and computer engineering from the Multimedia Computing Research Laboratory, University of Ottawa, Ottawa, ON, Canada, in 2014 and 2017, respectively. He is currently a Computer Vision Researcher with SPORTLOGIQ, Montreal, QC, Canada. His research interests include computer vision and computer graphics.

Juan Arteaga-Falconi received the engineering degree in electronics from Politecnica Salesiana University, Cuenca, Ecuador, in 2008, and the M.A.Sc. degree in electrical and computer engineering from the University of Ottawa, Ottawa, ON, Canada, in 2013. He is currently working toward the Ph.D. degree in electrical and computer engineering at the University of Ottawa. From 2008 to 2011, he was with SODETEL Co. Ltd., Cuenca, Ecuador, where he was a Co-Founder and served as General Manager. He joined the Multimedia Computing Research Laboratory, University of Ottawa, in 2012, where he is currently a Teaching Assistant. His research interests are biometrics, signal processing, system security, machine learning and the Internet of Things. Mr. Arteaga-Falconi was a Treasurer in the IEEE ExCom of the Ecuadorian section from 2010 to 2012. He received the 2011 and 2013 SENESCYT Ecuadorian Scholarship for graduate studies.

Haiwei Dong (M'12–SM'16) received the Dr.Eng. degree in computer science and systems engineering from Kobe University, Kobe, Japan, and the M.Eng. degree in control theory and control engineering from Shanghai Jiao Tong University, Shanghai, China, in 2008 and 2010, respectively. He was a Postdoctoral Fellow with New York University, New York City, NY, USA; a Research Associate with the University of Toronto, Toronto, ON, Canada; and a Research Fellow (PD) with the Japan Society for the Promotion of Science, Tokyo, Japan. He is currently a Research Scientist with the University of Ottawa and a Principal Engineer with Huawei Technologies Canada. His research interests include robotics, multimedia, and artificial intelligence.

Abdulmotaleb El Saddik (M'01–SM'04–F'09) received the Dipl.-Ing. and Dr.-Ing. degrees from the Technical University of Darmstadt, Darmstadt, Germany. He is currently a Distinguished University Professor and the University Research Chair with the School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON, Canada. He has authored and co-authored four books and more than 550 publications and chaired more than 40 conferences and workshops. His research focus is on multimodal interactions with sensory information in smart cities. He has received research grants and contracts totaling more than $18 M. He has supervised over 120 researchers. Prof. El Saddik is a Fellow of the Engineering Institute of Canada and a Fellow of the Canadian Academy of Engineering. He has received several international awards, such as the ACM Distinguished Scientist, the IEEE I&M Technical Achievement Award, and the IEEE Canada Computer Medal.