Hand pose estimation with multi-scale network

Appl Intell (2018) 48:2501–2515
https://doi.org/10.1007/s10489-017-1092-z


Zhongxu Hu1 · Youmin Hu1 · Bo Wu1 · Jie Liu1 · Dongmin Han2 · Thomas Kurfess2

Published online: 6 December 2017
© Springer Science+Business Media, LLC 2017

Abstract Hand pose estimation plays an important role in human-computer interaction. Because it is a high-dimensional nonlinear regression problem, the accuracy achieved by existing hand pose estimation methods is still unsatisfactory. With the development of deep neural networks, more and more researchers have begun to adopt methods based on them. We propose a multi-scale convolutional neural network that takes a single depth image of the hand as input. The network is end-to-end and directly calculates the three-dimensional coordinates of the hand joints, and the multi-scale structure improves both the convergence speed and the output accuracy of the network. In addition, an output function for the output layer, called Stair Rectified Linear Units, is used to limit the output value. Our experiments show that optimization methods with momentum are not suitable for hand pose estimation because it is an unstable regression task. Finally, our proposed method achieves state-of-the-art performance on the NYU Hand Pose Dataset.

1 School of Mechanical Science and Engineering, Huazhong University of Science and Technology, Wuhan, China

2 Georgia Institute of Technology, George W. Woodruff School of Mechanical Engineering, Atlanta, GA, USA

Keywords Hand pose estimation · Convolutional neural network · Multi-scale · End-to-end · Stair Rectified Linear Units

1 Introduction

Hand pose estimation is useful in many areas, such as Sign Language Recognition, Human-Computer Interaction, and Augmented Reality, and it has attracted the attention of many people in the field of computer vision. The rising popularity of depth cameras, such as Kinect, has triggered a new wave of depth sensor-based research [1–6]. However, hand pose estimation still faces many challenges, for the following reasons: 1. As the hand joints have multiple degrees of freedom, the estimation is carried out in a high-dimensional space; 2. The shape and color of the fingers are similar and their flexibility is high, which makes them prone to occlude each other and hampers the recognition process; 3. As the hand moves faster, more noise appears in the captured image.

A more common approach to hand pose estimation is to fit a 3D model of the hand to the input depth image [7–11]. However, such methods are complex and generally require multiple assumptions [7]. They also need criteria for assessing the degree of fit between the depth map and the 3D model, and establishing such criteria is not a simple task [9].

With the rise of deep neural networks, more and more researchers have begun to adopt methods based on them [2, 3, 12–21]. There are two main strategies for estimating hand pose from the depth map: one is to subdivide the depth map into several blocks to reconstruct an intermediate representation, from which the positions of the joints are then estimated [1, 22]; the other is to calculate the joint coordinates directly from the depth map [2, 3, 12–14]. Based on recent work in this field, we prefer the second strategy, because the first strategy is only valid when synthetic data are used and the intermediate labels can be retrieved automatically. For real data, accurate calibration of the block division is difficult to achieve. Moreover, compared with the strategy of calculating joint data, the block division strategy requires information about each pixel, so the amount of information needed increases significantly [15].

Our main idea is that the key points of the hand, which can be obtained by direct learning, are a simplified representation of the depth map of the hand, similar to Principal Component Analysis (PCA). In this paper, an end-to-end model based on a deep neural network is proposed for hand pose estimation. Compared with previous work, the main innovations are the following: 1. A multi-scale structure is used; 2. A new kind of output function is proposed; 3. It is verified that training methods with momentum are not applicable to hand pose estimation because the regression is unstable.

The core of our method is a convolutional neural network architecture that contains several branches with different numbers of layers to reduce the dimension of the depth map. Since convolutional neural networks exhibit outstanding performance in image classification and video recognition [23, 24], an intuitive strategy is to replace the classification layer of an image classification network with a regression layer, but this leads to errors because the objective function falls into a local minimum. Previous approaches show that errors can be reduced by combining the convolutional neural network with a prior or an intermediate heat map [12] and by using multi-resolution channels [3]. Unlike these methods, we train a multi-scale convolutional network inspired by the Inception structure [25–27], which differs from the multi-resolution approach that requires image resizing. We use different convolution strides to scale the depth image down in different channels, so that the resizing process can also be learned. Finally, the feature maps of different levels are combined for regression.

To facilitate unified processing of the three-dimensional coordinates of the joints, we normalize the point cloud into a cubic bounding box and then project it to a gray-scale image. To constrain the final values of the joint coordinates, we propose a new output function called Stair Rectified Linear Units (Stair ReLU), which can be thought of as a combination of the ReLU function [28] and the threshold function. Stair ReLU effectively limits the range of the coordinate values and thereby increases the accuracy of regression.

In the training process, we experimented with a variety of optimization algorithms and found that, because hand pose estimation is an unstable regression problem, optimization algorithms with momentum cause the training to collapse.

The paper is organized as follows. Section 2 discusses related work. Section 3 introduces the methods we used. Section 4 presents details of the method and the experimental comparison, and discusses the experimental results. Section 5 contains the conclusion and prospects for future work.

2 Related work

Hand pose estimation is a hot topic in the field of computer vision, and many articles have been published on the subject. Recent work on this subject can be divided into two kinds.

2.1 Fitting model

This kind of method adopts a deformable hand model and aims to fit the model to the input data, so the objective function is the similarity between the input data (i.e., the depth map) and the corresponding model. The fitting process is the process of optimizing the parameters of the hand model. The optimization methods include PSO [29] and Gauss-Seidel [30]. The quality of the hand model and of the objective function determines the ultimate accuracy of the method, but determining the objective function is difficult. Notably, this kind of method takes inter-frame information into account: the previous estimate affects the subsequent ones, so if a deviation appears in a previous estimate, the error accumulates. To avoid accumulated errors, optimization and re-initialization are carried out. However, the problem with such methods is that the computation is too complex [4, 8, 14].

2.2 Learning

Many such methods use random forests and their variants and have demonstrated their effectiveness [4, 31–33]. Xu C. et al. proposed directly using a random forest to calculate the joint coordinates from the depth map [4]. Generally speaking, this method clusters the results of independent votes by spatial pixels into a set of candidate values and then determines the final gesture. In [34], a semi-supervised approach was further proposed, which used transfer learning to compensate for the gap between synthetic and real data. The pre-learning approach eliminated potential errors with the help of predictions obtained from random forest estimates for complex gestures. In [31], Liang H. et al. used a cascade regression algorithm to solve the gesture estimation problem, proposing that the hand root joints be predicted first and the other joints next. With this method, hand constraints can be better utilized during hand pose estimation.

Recently, convolutional neural networks have shown their effectiveness in joint prediction [35]. A. Toshev et al. optimized the posture of the training data with a deep cascade network [13]. Compared with traditional methods, the deep cascade network showed outstanding performance, but its training time was as long as twenty days on a data set of only a few thousand images, which makes it impractical for large databases such as the NYU Hand Pose Dataset [12], which contains more than 70K images. Convolutional neural networks have also been used to generate 2D heat maps instead of regressing the joint parameters directly; the pixel intensity in the heat map represents the probability of that point being a key point. In this way the network can be trained effectively and reach state-of-the-art performance. However, when occlusion occurs, these methods, which use inverse kinematics to infer the 3D posture from the 2D picture, are no longer valid. Tompson [12] and Liu Hao et al. [36] continued the idea of 2D heat maps, divided the network into several stages, and used an image pyramid or multi-angle images to form a multi-resolution network. The idea of dividing the network into several stages was also used in [3, 14, 15]: Sinha A. divided the prediction phase into two parts, namely global regression and local regression; Oberweger divided the network into initial and refinement phases; and Neverova used weak supervision, in which two networks received different inputs and learned from each other. Abdul Rahman Hafiz et al. [37] used complex-valued neural networks (CVNNs) for gesture recognition, with RGB images and depth pictures as inputs, and finally used a two-layer CVNN to obtain the posture of the hand skeleton.

In our work, we built a multi-scale convolutional neural network and demonstrated its effectiveness. Through experiments we found that the network structure is very important, so we discuss the impact of the network structure. Compared with the work of others, our method differs mainly in the following ways: 1. Almost all existing networks have pooling layers, but we removed the pooling layers and changed the convolution stride instead, so that down-sampling can be learned; 2. We use an end-to-end network that is no longer divided into several stages and does not use heat maps; 3. We abandoned the image pyramid but retained the multi-scale structure.

3 Methodology

3.1 Hand model

Hand pose estimation can be considered as the problem of extracting the three-dimensional coordinates of the hand joints from the depth image. Specifically, when the input image contains only the hand, the output is the three-dimensional coordinates of the J joints of the corresponding gesture. We denote the coordinates of the J joints by $\Phi = \{\phi_j\}_{j=1}^{J} \in \Lambda$, where $\Lambda$ represents the 3J-dimensional vector space of the hand joints. To facilitate comparison with previous work [3], we use J = 14 in this paper. The relative positions of these 14 points are shown in Fig. 1; they mainly represent the wrist joints, fingertips, finger joints, and palm joints. Note that the thumb has three joints while the other fingers have only two, because the thumb is more flexible.

Fig. 1 Hand key point locations: the locations of the 14 key points that determine the corresponding gesture

After obtaining the depth map of the hand, we resized it to a 128 × 128 gray image in which the gray value represents the depth value. To facilitate subsequent processing, the gray values of the input image were normalized to the range (0, 128), so that the units along the three dimensions are all equal in length.
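As a rough illustration of this preprocessing, the following sketch rescales the depth values of an already cropped hand image into the range [0, 128] and flattens J joints into a 3J-dimensional vector. It is only a minimal sketch: the function names, the `near` and `far` faces of the bounding cube, and the clipping of out-of-range values are assumptions made for illustration, not the exact procedure used in the experiments.

```python
from typing import List, Sequence, Tuple

SIDE = 128  # image side length and target depth range used in this paper


def normalize_depth(depth: List[List[float]], near: float, far: float) -> List[List[float]]:
    """Linearly map depth values in [near, far] to gray values in [0, SIDE].

    near/far are the assumed front and back faces of the hand's bounding cube;
    values outside that interval are clipped.
    """
    span = far - near
    if span <= 0:
        raise ValueError("far must be greater than near")
    out = []
    for row in depth:
        out.append([min(max((d - near) / span, 0.0), 1.0) * SIDE for d in row])
    return out


def flatten_joints(joints: Sequence[Tuple[float, float, float]]) -> List[float]:
    """Turn J (x, y, z) joint coordinates into one 3*J-dimensional vector."""
    return [c for joint in joints for c in joint]
```

With J = 14, `flatten_joints` yields a 42-dimensional target vector, which is the size of the regression output discussed in the following sections.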

3.2 Data augmentation

To avoid over-fitting, a sufficiently large training set is usually needed. However, annotating the joints of the hand is difficult, which limits the amount of labeled data, so we used data augmentation to expand the data set. In this paper, the input of the network is the gray image of the point cloud. We mainly used three methods: 1. 3D clip. In contrast to ordinary two-dimensional clipping, we achieved three-dimensional random clipping by offsetting the three-dimensional coordinates of the bounding box within a certain range, as shown in Fig. 2. 2. Noise. Even with current depth sensors, including structured-light ones, the captured image is inevitably mixed with noise and missing data, so adding Gaussian noise can improve the robustness of the network. 3. Hollow. For the same reason, hollows are randomly added to the data to improve the robustness of the network to missing data.
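The three augmentation methods can be sketched as follows. This is a simplified illustration on a depth image stored as nested lists: the function names, the background value of 0, the square shape of the hollows, and implementing the 3D clip as a shift of the hand inside a fixed window are all assumptions, not details taken from the experiments.

```python
import random
from typing import List

Image = List[List[float]]


def add_gaussian_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Add zero-mean Gaussian noise to every pixel, clipped to [0, 128]."""
    return [[min(max(v + rng.gauss(0.0, sigma), 0.0), 128.0) for v in row] for row in img]


def add_hollows(img: Image, count: int, size: int, rng: random.Random,
                fill: float = 0.0) -> Image:
    """Blank out `count` random size x size squares to imitate missing depth."""
    out = [row[:] for row in img]
    h, w = len(out), len(out[0])
    for _ in range(count):
        top = rng.randrange(0, max(h - size, 0) + 1)
        left = rng.randrange(0, max(w - size, 0) + 1)
        for r in range(top, min(top + size, h)):
            for c in range(left, min(left + size, w)):
                out[r][c] = fill
    return out


def shift_3d(img: Image, dx: int, dy: int, dz: float, background: float = 0.0) -> Image:
    """Offset the hand by (dx, dy) pixels in the image plane and dz in depth."""
    h, w = len(img), len(img[0])
    out = [[background] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            rr, cc = r + dy, c + dx
            if 0 <= rr < h and 0 <= cc < w and img[r][c] != background:
                out[rr][cc] = min(max(img[r][c] + dz, 0.0), 128.0)
    return out
```

When `shift_3d` is applied, the joint labels of the sample have to be offset by the same (dx, dy, dz), otherwise the augmented image and its annotation no longer match.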

3.3 Network architecture

As mentioned above, hand pose estimation can be considered as a mapping between the depth image and the vector composed of the coordinates of the key points. With the development of neural networks, the problem of handling high-dimensional data can be solved well, and neural networks have been widely used in hand pose estimation because of their excellent performance. However, since hand pose estimation is a complex regression task, previous work adopted cascade or combination methods [33]. This paper mainly discusses how to predict the joint coordinates directly with an end-to-end neural network.

Fig. 2 Data augmentation: the left side of each of the three graphs shows the raw data, and the right side shows the augmented data. a The three-dimensional position of the hand's point cloud is randomly moved. b Gaussian noise is randomly added to the hand point cloud. c Hollows are randomly added

A neural network is a bionic model of the human brain, whose neuron connections are sparse. Inspired by this, researchers believe that a large-scale neural network should be organized in a reasonable way with sparse connections. Such a structure suits neural networks well, especially large ones, because it reduces both over-fitting and the amount of computation. The main goal of the Inception network proposed by a team at Google is to find the optimal sparse structure unit [25–27], which uses convolution kernels to connect highly correlated neurons together under the Hebbian principle. To increase diversity, a very efficient sparse structure is constructed from different branches and different sizes of convolution. In our case, as shown in Fig. 3, the depth map of the hand tends to contain many repetitive local features (such as fingertips), so a convolutional neural network is well suited to feature extraction because its weight sharing reduces the number of weights required. Note that we do not use pooling layers; instead we increase the convolution stride, so that the network learns its own spatial down-sampling. Each layer reduces the size of the feature maps and increases their number, and in the end the feature maps of different resolutions are reshaped into one-dimensional vectors and concatenated in parallel. The objective function is shown in (1), where $f(x_i)$ is the predicted value, $y_i$ is the true value, $\|w\|^2$ is the weight regularization term, and $\lambda$ is the regularization coefficient.

$$ J = \sum_{i}^{N} \left(y_i - f(x_i)\right)^2 + \frac{\lambda}{2}\|w\|^2 \tag{1} $$
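For concreteness, the objective in (1) can be evaluated as in the short sketch below. The function name and the flat list representation of the labels, predictions, and weights are assumptions chosen for illustration.

```python
from typing import Sequence


def regularized_loss(y_true: Sequence[float], y_pred: Sequence[float],
                     weights: Sequence[float], lam: float) -> float:
    """Sum of squared errors plus (lam / 2) * ||w||^2, as in Eq. (1)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    data_term = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    reg_term = 0.5 * lam * sum(w * w for w in weights)
    return data_term + reg_term
```

The regularization term depends only on the weights, so it penalizes large weights regardless of how well the predictions match the labels.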

Note that many researchers use a multi-resolution network architecture [3, 13], which requires building an image pyramid and feeding images of different resolutions into different network channels. The outputs of the different channels are then combined to obtain the result. They have shown that this architecture performs well. This approach has some similarities with our method, which also uses features of different resolutions. The feature maps of our network architecture are visualized in Fig. 4. The lower-resolution part of the network reveals more local characteristics, while the higher-resolution part reveals more overall characteristics, and combining both can effectively improve network performance. The distinction of our work is that we do not need to build an image pyramid in advance; instead, we use convolution to change the scale, so that the resizing process can also be learned. Finally, the network integrates features of different levels.
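To make the stride-based down-sampling concrete, the sketch below computes the spatial size of the feature maps in each branch and the length of the concatenated feature vector. Only the 128 × 128 input and the strides of 2 and 4 are taken from the description of the network; the "same" padding rule, the number of layers per branch, and the channel counts are hypothetical placeholders.

```python
import math
from typing import Sequence, Tuple


def conv_out(size: int, stride: int) -> int:
    """Output side length of a strided convolution with 'same' padding (assumed)."""
    return math.ceil(size / stride)


def branch_vector_length(input_size: int, layers: Sequence[Tuple[int, int]]) -> int:
    """Length of the flattened output of one branch.

    `layers` is a list of (stride, out_channels) pairs, one per convolution.
    """
    size, channels = input_size, 1
    for stride, out_channels in layers:
        size = conv_out(size, stride)
        channels = out_channels
    return size * size * channels


# Hypothetical branch layouts: the depths and channel counts are placeholders.
BRANCHES = [
    [(2, 8), (2, 16), (2, 32), (2, 64)],  # deepest branch, stride 2
    [(4, 16), (4, 64)],                   # stride 4
    [(4, 32)],                            # shallowest branch, stride 4
]

lengths = [branch_vector_length(128, b) for b in BRANCHES]
fused_length = sum(lengths)  # size of the concatenated vector fed to the output layer
```

Because every reduction in resolution comes from a convolution rather than from a fixed pooling or resizing step, the down-sampling in each branch has trainable weights.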

From the perspective of error propagation, the multi-scale structure proposed in this paper can improve the convergence speed. Here we discuss this in terms of formulas. First we define the objective function:

$$ L^N = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{c}\left(y_k^n - z_k^n\right)^2 \tag{2} $$

where N is the number of samples, c is the dimension of the label, $y_k^n$ is the kth dimension of the label $y^n$ of the nth sample, and $z_k^n$ is the kth dimension of the prediction for the nth sample. The goal is to update the weights of the network so that the prediction z moves closer to the true value y, that is, to minimize L.

Fig. 3 The structure of the network: a multi-scale convolutional network is used. The input is a 128 × 128 gray image whose gray value represents the depth value. The size of the convolution kernel is 5 × 5. The stride of the first branch is 2, while the strides of the other two branches are 4, and there is no pooling layer. Except for the final output layer, the activation function of every layer is Leaky ReLU. As the number of layers increases, the size of the feature maps decreases but the number of feature maps increases. The outputs of the different branches are reshaped into one-dimensional vectors and concatenated in parallel. The final output layer is a fully connected layer whose output is a 3J-dimensional vector, where J is the number of joints

Fig. 4 Visualization of the feature maps of different layers: the order is from left to right and from top to bottom. The later layers of the network focus on more local features, while the earlier layers focus on more overall features

For ease of explanation, we consider a single sample. The objective function of the nth sample is:

$$ L^n = \frac{1}{2}\sum_{k=1}^{c}\left(y_k^n - z_k^n\right)^2 \tag{3} $$

The output of layer l is defined as:

$$ x^l = f(u^l), \qquad u^l = W^l x^{l-1} + b^l \tag{4} $$

where f is the activation function, $x^{l-1}$ is the output of layer l − 1 and therefore also the input of layer l, and W and b are the weights and the bias respectively. This is the classic forward propagation, in which the sample is passed on layer by layer until the network outputs a prediction. In backward propagation, the error between the predicted value and the real value is returned to each layer, so that the layers modify their weights accordingly and make the prediction more and more accurate, using either the gradient descent algorithm or one of its variants. The update formula of gradient descent is as follows:

$$ W^l_{new} = W^l_{old} - \eta\frac{\partial L}{\partial W^l_{old}}, \qquad b^l_{new} = b^l_{old} - \eta\frac{\partial L}{\partial b^l_{old}} \tag{5} $$

where η is the learning rate. The gradient descent method uses the error cost function to obtain the gradient of the parameters, so the weight update requires such a gradient for each layer. To obtain the partial derivatives of the error function of a single sample, we define the node sensitivity δ as the rate of change of the error with respect to u, which gives:

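$$ \delta = \frac{\partial L}{\partial u} \tag{6} $$

$$ \frac{\partial L}{\partial b^l} = \frac{\partial L}{\partial u^l}\frac{\partial u^l}{\partial b^l} = \delta^l \tag{7} $$

$$ \frac{\partial L}{\partial W^l} = \frac{\partial L}{\partial u^l}\frac{\partial u^l}{\partial W^l} = \delta^l (x^{l-1})^T \tag{8} $$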

So backward propagation in fact returns the error through the sensitivities of the layers. The following formula is the core of backward propagation:

$$ \delta^l = (W^{l+1})^T \delta^{l+1} \cdot f'(u^l) \tag{9} $$

The above is an overview of the classical BP algorithm; the specific derivation can be found in the Appendix. In a CNN, the weight W is replaced by the convolution kernel k, which yields the weight update formula of the CNN. For conciseness, we still take the classic BP algorithm as an example.
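The relations (4), (8) and (9) can be checked on a toy network by comparing the back-propagated gradient with a finite-difference estimate. The sketch below is only an illustration: the network size, the weights, the Leaky ReLU slope of 0.1, and all function names are assumptions. The output sensitivity is taken as (z − y)·f′(u), which is what differentiating (3) with respect to u gives.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def leaky_relu(u: float, slope: float = 0.1) -> float:
    return u if u > 0 else slope * u


def leaky_relu_grad(u: float, slope: float = 0.1) -> float:
    return 1.0 if u > 0 else slope


def affine(W: Matrix, x: Vector, b: Vector) -> Vector:
    """u = W x + b, as in Eq. (4)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def loss(W0: Matrix, b0: Vector, W1: Matrix, b1: Vector, x: Vector, y: Vector) -> float:
    """Single-sample loss of Eq. (3) for a network with one hidden layer."""
    h = [leaky_relu(u) for u in affine(W0, x, b0)]
    z = [leaky_relu(u) for u in affine(W1, h, b1)]
    return 0.5 * sum((yk - zk) ** 2 for yk, zk in zip(y, z))


def backprop_W0(W0: Matrix, b0: Vector, W1: Matrix, b1: Vector,
                x: Vector, y: Vector) -> Matrix:
    """Gradient of the loss with respect to W0 via Eqs. (8) and (9)."""
    u0 = affine(W0, x, b0)
    h = [leaky_relu(u) for u in u0]
    u1 = affine(W1, h, b1)
    z = [leaky_relu(u) for u in u1]
    # Output sensitivity: dL/du1 = (z - y) * f'(u1).
    d1 = [(zk - yk) * leaky_relu_grad(u) for zk, yk, u in zip(z, y, u1)]
    # Eq. (9): d0 = (W1^T d1) * f'(u0), element-wise.
    d0 = [sum(W1[k][j] * d1[k] for k in range(len(d1))) * leaky_relu_grad(u0[j])
          for j in range(len(u0))]
    # Eq. (8): dL/dW0 = d0 x^T.
    return [[dj * xi for xi in x] for dj in d0]


def numeric_W0(W0: Matrix, b0: Vector, W1: Matrix, b1: Vector,
               x: Vector, y: Vector, eps: float = 1e-6) -> Matrix:
    """Central finite-difference estimate of the same gradient."""
    grad = []
    for i in range(len(W0)):
        row = []
        for j in range(len(W0[0])):
            plus = [r[:] for r in W0]
            minus = [r[:] for r in W0]
            plus[i][j] += eps
            minus[i][j] -= eps
            diff = loss(plus, b0, W1, b1, x, y) - loss(minus, b0, W1, b1, x, y)
            row.append(diff / (2 * eps))
        grad.append(row)
    return grad


W0 = [[0.2, -0.1], [0.05, 0.3]]
b0 = [0.1, -0.2]
W1 = [[0.4, -0.3]]
b1 = [0.05]
x, y = [1.0, 2.0], [1.5]
analytic = backprop_W0(W0, b0, W1, b1, x, y)
numeric = numeric_W0(W0, b0, W1, b1, x, y)
```

The two gradients should agree up to the finite-difference error, which is a quick sanity check that the sensitivities are propagated with the right transpose and the right activation derivative.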

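Suppose we have only one input layer and one output layer:

$$ z = f(u^0) = f(W^0 x + b^0) \tag{10} $$

Sensitivity:

$$ \delta^0 = f'(u^0)\cdot(z^n - y^n) = \Delta_0 \tag{11} $$

$$ \frac{\partial L}{\partial W^0} = \delta^0 x^T = \Delta_0 x^T \tag{12} $$

$$ \frac{\partial L}{\partial b^0} = \delta^0 = \Delta_0 \tag{13} $$

$$ \begin{aligned} z_{new} &= f(W^0_{new}x + b^0_{new}) = f\left((W^0_{old} - \eta\Delta_0 x^T)x + (b^0_{old} - \eta\Delta_0)\right) \\ &= f\left((W^0_{old}x + b^0_{old}) - \eta\Delta_0(x^T x + 1)\right) \end{aligned} $$

Add a hidden layer between the input and output layers:

$$ z = f(u^1) = f(W^1 h^0 + b^1) \tag{14} $$

$$ h^0 = f(u^0) = f(W^0 x + b^0) \tag{15} $$

Sensitivity:

$$ \delta^1 = f'(u^1)\cdot(z^n - y^n) = \Delta_1, \qquad \delta^0 = (W^1)^T\delta^1\cdot f'(u^0) \tag{16} $$

$$ \frac{\partial L}{\partial W^1} = \delta^1 (h^0)^T = \Delta_1 (h^0)^T \tag{17} $$

$$ \frac{\partial L}{\partial W^0} = \delta^0 x^T = (W^1)^T\Delta_1\cdot f'(u^0)\,x^T \tag{18} $$

$$ \frac{\partial L}{\partial b^1} = \delta^1 = \Delta_1 \tag{19} $$

$$ \frac{\partial L}{\partial b^0} = \delta^0 = (W^1)^T\Delta_1\cdot f'(u^0) \tag{20} $$

$$ z_{new} = f(W^1_{new}h^0_{new} + b^1_{new}) = f\left((W^1_{old} - \eta\Delta_1(h^0_{old})^T)h^0_{new} + (b^1_{old} - \eta\Delta_1)\right) \tag{21} $$

For the sake of convenience, the activation function applied is usually ReLU, and the activation functions used in this paper are ReLU and its variants. Then $f'(u^0) = f'(u^1) = 1$, and we assume $\Delta_1 = \Delta_0 = \Delta$.

$$ \begin{aligned} h^0_{new} &= f(W^0_{new}x + b^0_{new}) \\ &= f\left((W^0_{old} - \eta(W^1_{old})^T\Delta_1\cdot f'(u^0)x^T)x + (b^0_{old} - \eta(W^1_{old})^T\Delta_1\cdot f'(u^0))\right) \end{aligned} $$

For the first case:

$$ z_{new} = f\left((W^0_{old}x + b^0_{old}) - \eta\Delta_0(x^T x + 1)\right) = z_{old} - \eta\Delta(x^T x + 1) \tag{22} $$

For the second case, after simplification:

$$ \begin{aligned} h^0_{new} &= (W^0_{old} - \eta(W^1_{old})^T\Delta x^T)x + (b^0_{old} - \eta(W^1_{old})^T\Delta) \\ &= W^0_{old}x + b^0_{old} - \eta(W^1_{old})^T\Delta(x^T x + 1) \\ &= h^0_{old} - \Delta h \end{aligned} $$

$$ \begin{aligned} z_{new} &= (W^1_{old} - \eta\Delta(h^0_{old})^T)(h^0_{old} - \Delta h) + (b^1_{old} - \eta\Delta) \\ &= W^1_{old}h^0_{old} - W^1_{old}\Delta h - \eta\Delta(h^0_{old})^T h^0_{old} + \eta\Delta(h^0_{old})^T\Delta h + b^1_{old} - \eta\Delta \\ &= z_{old} - \left(W^1_{old}\Delta h + \eta\Delta(h^0_{old})^T h^0_{old} - \eta\Delta(h^0_{old})^T\Delta h + \eta\Delta\right) \\ &= z_{old} - \left[W^1_{old}\eta(W^1_{old})^T\Delta(x^T x + 1) + \eta\Delta(h^0_{old})^T h^0_{old} - \eta\Delta(h^0_{old})^T\eta(W^1_{old})^T\Delta(x^T x + 1) + \eta\Delta\right] \\ &= z_{old} - \Delta z \end{aligned} $$

$$ \begin{aligned} \Delta z &= \left[W^1_{old}\eta(W^1_{old})^T\Delta(x^T x + 1) + \eta\Delta(W^0_{old}x + b^0_{old})^T(W^0_{old}x + b^0_{old}) \right. \\ &\qquad \left. - \eta\Delta(W^0_{old}x + b^0_{old})^T\eta(W^1_{old})^T\Delta(x^T x + 1) + \eta\Delta\right] \\ &= \eta\left[W^1_{old}(W^1_{old})^T - \Delta(W^0_{old}x + b^0_{old})^T\eta(W^1_{old})^T\right]\Delta(x^T x + 1) \\ &\qquad + \eta\Delta\left((W^0_{old}x + b^0_{old})^T(W^0_{old}x + b^0_{old}) + 1\right) \end{aligned} $$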
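The two closed-form updates can be verified numerically with one gradient step on a tiny network. The sketch below assumes a scalar output, activations in their linear region (so that f′ = 1, as in the derivation), and arbitrary small weights; all names and values are illustrative, and Δ is taken as the sensitivity z − y that follows from (3) and (6).

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(p * q for p, q in zip(a, b))


def one_layer_step(w0: Vector, b0: float, x: Vector, y: float, eta: float):
    """One gradient step for z = w0.x + b0; returns (z_old, z_new, delta)."""
    z_old = dot(w0, x) + b0
    delta = z_old - y  # sensitivity with f'(u) = 1
    w0_new = [w - eta * delta * xi for w, xi in zip(w0, x)]
    b0_new = b0 - eta * delta
    return z_old, dot(w0_new, x) + b0_new, delta


def two_layer_step(W0: List[Vector], b0: Vector, w1: Vector, b1: float,
                   x: Vector, y: float, eta: float):
    """One gradient step for z = w1.h + b1, h = W0 x + b0; returns (z_old, z_new, dz)."""
    h_old = [dot(row, x) + b for row, b in zip(W0, b0)]
    z_old = dot(w1, h_old) + b1
    delta = z_old - y
    d0 = [w * delta for w in w1]  # (W1)^T * delta
    # Parameter updates of Eq. (5).
    w1_new = [w - eta * delta * hj for w, hj in zip(w1, h_old)]
    b1_new = b1 - eta * delta
    W0_new = [[w - eta * dj * xi for w, xi in zip(row, x)] for row, dj in zip(W0, d0)]
    b0_new = [b - eta * dj for b, dj in zip(b0, d0)]
    h_new = [dot(row, x) + b for row, b in zip(W0_new, b0_new)]
    z_new = dot(w1_new, h_new) + b1_new
    # Closed-form change predicted by the derivation above.
    dh = [eta * w * delta * (dot(x, x) + 1.0) for w in w1]
    dz = (dot(w1, dh) + eta * delta * dot(h_old, h_old)
          - eta * delta * dot(h_old, dh) + eta * delta)
    return z_old, z_new, dz


x, y, eta = [0.5, 1.0, 2.0], 3.0, 0.01
z_old, z_new, delta = one_layer_step([0.02, -0.01, 0.03], 0.1, x, y, eta)
z2_old, z2_new, dz = two_layer_step([[0.02, -0.01, 0.03], [0.01, 0.04, -0.02]],
                                    [0.1, 0.2], [0.05, -0.03], 0.0, x, y, eta)
```

In both cases the output after the step equals the old output minus the predicted change, so the size of that change directly measures how far one update moves the prediction toward the label.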

The entries of W are generally small. In the network proposed in this paper, the values of W are less than 0.1, with a maximum of about 0.05, as shown in Fig. 5. In the initial stage of training, the error Δ is large. When x is greater than W, which is the case in this paper because the input values lie in the range (0, 128), the first term on the right-hand side of the final expression for Δz is less than 0. Similarly, in the early stage of training, when the initial weight $W^0$ is relatively small or the hidden layer is sparse because of the ReLU function, the hidden value $h^0$ is less than or equal to x. Thus Δz is less than $\Delta(x^T x + 1)$, which means that increasing the number of layers reduces the speed of convergence.

In conclusion, in the early stages of training, a smaller number of layers accelerates convergence. In the late stages, the loss becomes smaller, the values of the hidden layers are closer to each other, and the convergence speeds are approximately equal. The multi-scale structure can be seen as a deep network and a shallow network working in parallel: the deep network improves the accuracy, and the shallow network improves the convergence speed, so the multi-scale structure improves the convergence rate while still ensuring accuracy.

3.4 Stair ReLU

Convolutional neural networks have been very successful in classification problems, where the output function is generally sigmoid or softmax. However, when it comes to regressing the hand pose, convolutional neural networks have been less successful. The sigmoid function cannot be applied to the last layer because its output range, Sigmoid(x) ∈ (0, 1), is too small to match the coordinate values. Its range can of course be expanded by multiplying by a constant, but sigmoid tends to saturate toward both ends, which makes training difficult, although it makes the function better suited to classification. If the ReLU function is applied to the last layer, it produces zero values, which is not what we want. An improved version of ReLU, such as Leaky ReLU [38], can be used instead. However, ReLU and Leaky ReLU share the same problem: the output value may fall outside the accepted range. Our output values are the three-dimensional coordinates of the joints; the input images are normalized to 128 × 128, and the depth values of the input images are also normalized to the range (0, 128).


Fig. 5 Distribution of the weights and biases of the network: we calculated the weight and bias distribution of each layer during training. In the network designed in this paper, the weights and the biases are small

Therefore, the three-dimensional coordinates of the joints should lie in the range (0, 128). To solve this problem, we present the Stair ReLU function, shown in Fig. 6.

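The formula is:

$$ f(x) = \begin{cases} \alpha x, & x < 0 \\ x, & 0 \le x < L \\ \beta x + L\cdot(1 - \beta), & x \ge L \end{cases} \qquad f'(x) = \begin{cases} \alpha, & x < 0 \\ 1, & 0 \le x < L \\ \beta, & x \ge L \end{cases} \tag{23} $$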

where L is a constant (in this paper L = 128), and α and β both lie in (0, 1) and decay exponentially to 0. The decay formula is as follows:

$$ \alpha = \alpha_0\cdot\lambda_\alpha^{t}, \qquad \beta = \beta_0\cdot\lambda_\beta^{t} \tag{24} $$

where $\alpha_0$ and $\beta_0$ are the initial values, $\lambda_\alpha$ and $\lambda_\beta$ are the respective decay rates, and t is the decay period. The purpose of introducing decay is to start the convergence with a large gradient, so as to speed up training. To ensure sparseness, $\alpha_0$ is generally a small number, while $\beta_0$ is generally relatively large, which ensures that training starts with a larger gradient. Setting $\beta_0$ to a large number also speeds up convergence. In the later part of training, the signal gain in the central region is large, while the signal gain on both sides is small. In the central region, the gradient of 1 lets the gradient propagate well through most of the neurons while avoiding neuron death. Stair ReLU is a kind of rectified linear unit, so it is simple and fast to compute. As training continues, the activation function closes further, and ultimately the output value is limited.

Fig. 6 Stair ReLU: to ensure sparsity of the network, the initial value of α is small; L is the threshold, whose purpose is to limit the output value; and the initial value of β is large enough to guarantee a large gradient at the beginning of training
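Equations (23) and (24) translate directly into code. The sketch below is a scalar illustration; the function names and the placeholder initial values and decay rates are assumptions, since the concrete settings are not stated in this section.

```python
def stair_relu(x: float, alpha: float, beta: float, limit: float = 128.0) -> float:
    """Stair ReLU of Eq. (23): slope alpha below 0, 1 on [0, limit), beta above."""
    if x < 0:
        return alpha * x
    if x < limit:
        return x
    return beta * x + limit * (1.0 - beta)


def stair_relu_grad(x: float, alpha: float, beta: float, limit: float = 128.0) -> float:
    """Derivative of Stair ReLU with respect to x."""
    if x < 0:
        return alpha
    if x < limit:
        return 1.0
    return beta


def decayed(initial: float, rate: float, t: int) -> float:
    """Exponential decay of Eq. (24): initial * rate ** t."""
    return initial * rate ** t


# Placeholder values; the initial slopes and decay rates are not given here.
ALPHA0, BETA0, RATE_A, RATE_B = 0.1, 0.9, 0.5, 0.5
```

The offset L·(1 − β) makes the function continuous at x = L, and as α and β decay toward 0 the function approaches a hard clamp of the output to the interval [0, L].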

3.5 Optimizer

Stochastic Gradient Descent (SGD) is the most commonly used optimization method. In current usage, SGD usually refers to mini-batch gradient descent, in which the gradient of a mini-batch is calculated and the parameters are updated at every iteration. Its problems are: 1. Selecting the learning rate is difficult; 2. In some cases the loss may be trapped at a saddle point. Inspired by the concept of momentum in physics, momentum methods replace the gradient in the update with a "momentum" term, a velocity accumulated over the course of training and weighted by a coefficient μ. In the early stage of gradient descent, when the descent direction stays the same after each parameter update, a relatively large μ accelerates training well. In the later stage, when the loss is stuck in a local minimum, the momentum term increases the update amplitude and thus helps the loss escape the trap. If the direction of the gradient changes, the momentum term decreases the update amplitude.

However, in our experiments we found that when an optimization algorithm with momentum is applied to hand pose estimation, the network collapses. We believe that the accumulated momentum may produce a large error gradient and cause the final output value to fluctuate wildly, which ultimately prevents the loss from converging. There are two ways to address this. The first is to use an optimization method without momentum, such as RMSProp, which effectively solves the problem of network collapse. The second is to use Stair ReLU in the final output layer, which partially alleviates the fluctuation of the output. The collapse usually occurs late in training, by which time the β value in Stair ReLU has generally decayed to a small value, so violent fluctuations of the output value are suppressed.
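The difference between the two families of optimizers can be seen from their update rules. The sketch below shows the standard textbook forms of SGD with momentum and of RMSProp for a single scalar parameter; the function names and the default values of `rho` and `eps` are common conventions assumed here, not settings reported in this paper.

```python
import math
from typing import Tuple


def momentum_step(w: float, velocity: float, grad: float,
                  lr: float, mu: float) -> Tuple[float, float]:
    """SGD with momentum: the velocity accumulates past gradients."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity


def rmsprop_step(w: float, mean_sq: float, grad: float, lr: float,
                 rho: float = 0.9, eps: float = 1e-8) -> Tuple[float, float]:
    """RMSProp: the step is scaled by a running RMS of the gradient, no velocity."""
    mean_sq = rho * mean_sq + (1.0 - rho) * grad * grad
    return w - lr * grad / (math.sqrt(mean_sq) + eps), mean_sq
```

With momentum, a run of large gradients keeps contributing to later steps through the velocity, whereas RMSProp divides each step by the recent gradient magnitude, so a sudden large gradient is damped rather than accumulated.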

4 Experiments

4.1 Datasets

In this section, we used two data sets, the NYU Hand Pose Dataset and the MNIST database of handwritten digits. The latter was used as an auxiliary data set to further validate the proposed method.

The NYU Hand Pose Dataset contains 72K training samples and 8K test samples, all of which are RGB-D data acquired with a PrimeSense Carmine. The PrimeSense Carmine is a structured-light sensor, so the depth images it captures are defective because of occlusion and other reasons. In this experiment, we only used the depth images. The data set has accurate labeling, and the gestures in it are easily distinguished from each other. The training set includes samples captured from one user, and the test set includes samples captured from two different users. The labels include J = 36 joints, but we only used 14 of them [12].

Other hand data sets were also considered, such as ICL [32] and MSRA [7]. Compared with those data sets, the NYU data set has the following advantages: 1. The difference between poses is greater and the variety of the data set is larger. 2. The number of annotated joint points is greater. Although only 14 points were used in this paper so that we can compare our work with others, more points will be available for future work to make the estimated gestures more accurate. 3. The person is farther away from the camera and thus has a wider range of activities in the images, which better addresses our needs, because the hand pose estimation will be used in virtual reality.

MNIST is a data set of handwritten digits with 60,000 training samples and 10,000 test samples. Because of its portability, it is one of the most commonly used data sets among neural network researchers, who can use it to verify their theories and methods efficiently. It is also used as a benchmark for researchers to compare their work with others [36]. Because of its ease of use, it is used as an auxiliary data set in this paper.

4.2 Hyper-parameter and optimization

The performance of neural networks depends on the hyper-parameters, so we tested our network with different hyper-parameters for the different structural variations. For convenience of comparison, we chose the same hyper-parameters for each method and tuned them to the best configuration. In our experiments, however, the network structure had a greater impact on the performance of the neural networks.


Table 1 Seven kinds of networks

Name   Hidden   Branch   Loss
Net1   0        1        loss1
Net2   1        1        loss2
Net3   2        1        loss3
Net4   3        1        loss4
Net5   4        1        loss5
Net6   4,1      2        loss6
Net7   4,4      2        loss7

Hidden and Branch are the features of each network. The Hidden column lists the number of hidden layers: one value is the number of hidden layers of a single branch, and two values are the numbers of hidden layers of the two branches. The Branch column lists the number of branches

We trained neural networks of different structures. The objective function was set to minimize the distance between the predicted value and the labeled value. Data augmentation, weight regularization, and Dropout on the final fully connected layer were used to prevent over-fitting. Regularization was applied to the convolution layers with a weight of 0.01. The training set was expanded 10 times. The optimization algorithms Adam and RMSProp were compared, and we found that RMSProp was better because Adam caused the network to collapse during training. The mini-batch size was 128, and the number of epochs was 5. Each epoch had 12,000 iterations, for a total of 60,000 iterations.
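To make the optimizer choice concrete, the following is a minimal sketch of the plain RMSProp update rule applied to a single scalar parameter. It is not the training code used in this paper: the learning rate, decay coefficient, epsilon, and the toy quadratic objective are all assumptions chosen for illustration, and the function name `rmsprop_minimize` is hypothetical.

```python
import math

def rmsprop_minimize(grad_fn, w0, lr=0.01, decay=0.9, eps=1e-8, steps=1000):
    """Run plain RMSProp on one scalar parameter and return its final value."""
    w = w0
    mean_sq = 0.0  # running average of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        # Exponential moving average of g^2, controlled by the decay coefficient.
        mean_sq = decay * mean_sq + (1.0 - decay) * g * g
        # The step is the gradient divided by the root of that running average.
        w -= lr * g / (math.sqrt(mean_sq) + eps)
    return w

# Toy objective (assumed): f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = rmsprop_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(round(w_final, 2))
```

Because each step is normalized by the running gradient magnitude, the step size is bounded by roughly `lr / sqrt(1 - decay)` regardless of how large the raw gradient is, and no momentum term accumulates past gradients.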

4.3 Evaluation criterion

Fig. 7 Seven kinds of network: All networks are fully connected. The input is reshaped into a 784-dimensional vector, and the output is a 10-dimensional vector. To facilitate comparison between the hidden layers, the number of neurons in each hidden layer is equal

In order to have an intuitive quantitative evaluation of the predicted results, we employed two evaluation criteria: 1) We calculate the average Euclidean distance between the prediction and the ground truth of each joint in the test set, which gives a macroscopic assessment of the predicted results. 2) A maximum threshold is specified, similar to the evaluation method used in [38]. When every distance between a joint coordinate value in the predicted result and the corresponding value in the ground truth is less than the maximum threshold in one sample, the predicted result of that sample is deemed correct. The proportion of correct samples among the test samples is the accuracy rate. This criterion is more rigorous than the first one and is more sensitive to local bad predictions. It ensures that each value is correctly predicted, because a single erroneous value will cause the entire sample to be erroneous, no matter how good the other values are.
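The two criteria of Section 4.3 can be sketched as follows. This is an illustrative reading rather than the evaluation script used in the paper: each sample is assumed to be a list of (x, y, z) joint tuples, criterion 2 is assumed to compare every coordinate value separately (as the caption of Fig. 11 describes for the 42 output values), and the function names and toy data are hypothetical.

```python
import math

def mean_joint_error(predictions, ground_truth):
    """Criterion 1: mean Euclidean distance over all joints of all samples."""
    total, count = 0.0, 0
    for pred_joints, true_joints in zip(predictions, ground_truth):
        for p, t in zip(pred_joints, true_joints):
            total += math.sqrt(sum((pc - tc) ** 2 for pc, tc in zip(p, t)))
            count += 1
    return total / count

def accuracy_within_threshold(predictions, ground_truth, threshold):
    """Criterion 2: fraction of samples whose worst coordinate error is below threshold."""
    correct = 0
    for pred_joints, true_joints in zip(predictions, ground_truth):
        worst = max(abs(pc - tc)
                    for p, t in zip(pred_joints, true_joints)
                    for pc, tc in zip(p, t))
        if worst < threshold:  # a single bad value fails the whole sample
            correct += 1
    return correct / len(predictions)

# Two toy samples with two joints each (made-up numbers).
pred = [[(0, 0, 0), (1, 1, 1)], [(0, 0, 0), (10, 0, 0)]]
truth = [[(0, 0, 0), (1, 1, 1)], [(3, 4, 0), (10, 0, 0)]]
print(mean_joint_error(pred, truth))              # 1.25
print(accuracy_within_threshold(pred, truth, 5))  # 1.0
print(accuracy_within_threshold(pred, truth, 4))  # 0.5
```

In the toy data only one joint is wrong, so the mean error stays small, yet tightening the threshold from 5 to 4 halves the accuracy. This is the sense in which the second criterion is the stricter one.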

4.4 Comparison

In order to fully verify our theory, we first tested a total of seven kinds of network structure with MNIST [39]. The seven network structures are listed in Table 1 and shown in Fig. 7.

It can be seen from Fig. 8 that as the number of layers increased, the convergence speed slowed down. When a shallow branch was added to the deep network, the convergence speed was greatly accelerated, but if the added branch was not shallow, the convergence rate hardly changed. That is, when networks were connected in parallel, the shallow network determined the convergence speed, and the deep network determined the accuracy rate.

We calculated the average of the input and the average of the output values of the hidden layers of Net5, as shown in Fig. 8. We can see that in the early stages of training, the first hidden layer had a higher value than the input, and the remaining hidden layers had outputs smaller than that of the first hidden layer, even smaller than the input value. The rest of the networks followed the same pattern. When a network had one hidden layer, such as Net2, its convergence speed was similar to that of Net1, which had no hidden layer, although other characteristics of the networks showed obvious differences. This confirms our theory: in the early stage of training, the convergence speed slows down when the value of a hidden layer is smaller than the value of its previous layer; otherwise the convergence speed does not change significantly.

Clearly, when a network deepens, its convergence speed becomes slower, but its accuracy rate becomes higher. In this case, a multi-scale network can improve the convergence speed without affecting the accuracy rate.
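As an illustration of the parallel structure tested on MNIST, the sketch below builds a Net6-style network with one branch of four hidden layers and one branch of a single hidden layer. It is only a forward pass under stated assumptions: this section does not say how the branches are merged, so concatenating the branch outputs before the final layer is assumed here, and the ReLU activation, the hidden width of 16, the weight initialization, and all function names are hypothetical.

```python
import random

def init_layer(n_in, n_out, rng):
    """Random weights (n_out x n_in) and zero biases for one dense layer."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, layer):
    """Affine map W x + b for one layer."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(values):
    return [max(0.0, v) for v in values]

def make_branch(n_in, n_hidden_layers, width, rng):
    """A branch is a stack of hidden layers that all have the same width."""
    layers, size = [], n_in
    for _ in range(n_hidden_layers):
        layers.append(init_layer(size, width, rng))
        size = width
    return layers

def branch_features(x, layers):
    for layer in layers:
        x = relu(dense(x, layer))
    return x

def forward(x, branches, head):
    """Run every branch on the same input, concatenate, apply the output layer."""
    features = []
    for layers in branches:
        features.extend(branch_features(x, layers))
    return dense(features, head)

rng = random.Random(0)
n_in, width, n_out = 784, 16, 10
# Net6-style layout: a deep branch (4 hidden layers) and a shallow branch (1 hidden layer).
branches = [make_branch(n_in, 4, width, rng), make_branch(n_in, 1, width, rng)]
head = init_layer(width * len(branches), n_out, rng)
output = forward([0.5] * n_in, branches, head)
print(len(output))  # 10
```

With this layout the output layer is only one hidden layer away from the input through the shallow branch, while the deep branch still contributes its higher-level features to the same output layer.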

For further verification, two network structures were tested with the NYU data set. The first network, which has no additional branch, is called the no multi-branch network and is shown in Fig. 9. The second one was the multi-scale network shown in Fig. 2, which had several branches with different numbers of layers. The training processes of the two networks are shown in Fig. 10. It can be seen that the multi-scale network converged significantly faster, its convergence process was more stable, and even its final result was better. The results of training the two kinds of network are shown in Fig. 11. We can see that when the network had two different shallow branches connected in parallel with the main branch, the accuracy rate was significantly improved. The reason is that different levels of features were connected to the final fully connected layer when a deep network was connected in parallel with shallow networks; in addition, this increased the number of neurons, which improved the fitting ability of the network.

Fig. 8 The loss and mean value of the seven kinds of networks: a shows the convergence of the loss value during training in seven networks with different structures. It can be seen that if there is only a single branch in the network, the convergence slows down as the number of layers increases. When a single-layer network is connected in parallel with that network, the convergence is significantly faster, but the convergence rate does not change significantly when the network connected in parallel is a multi-layer one rather than a single-layer one. It can be concluded that the convergence rate is determined by the branch with fewer layers. b shows the average of the output values of the neurons in the layers of Net5. We can see that when there is more than one hidden layer, the convergence rate obviously slows down, because the mean value of hidden1 is greater than that of the input and of the other hidden layers. Net2, which has only one hidden layer, and Net1, which has no hidden layer, have similar convergence speeds

Fig. 9 No multi-branch network: This network differs from the one in Fig. 2 in that it has no shallow network branches; the rest is the same

In addition, the Stair ReLU can partially increase the stability of the network's training process and prevent the collapse of the network. In our tests we found that the Stair ReLU could also mildly improve the accuracy rate. By using the Stair ReLU and multi-scale structures, we achieved the highest accuracy of such networks so far.

For further verification, we tested the multi-scale network structure with Leaky ReLU and with Stair ReLU, the no multi-branch network, and the Refine network of Oberweger et al. [3]. The criterion used was the second one described in Section 4.3, and the results are shown in Fig. 11. When the threshold was small, the multi-scale network with Stair ReLU and the Refine network were better, which means that the variance of their error was relatively small and their predictions were more stable. However, when the threshold was large, the three networks designed in this paper were all superior to the Refine network of Oberweger et al., and the results of Stair ReLU and Leaky ReLU in the final output layer were similar, which indicated that Stair ReLU could improve the stability of the network's predictions and suppress large fluctuations.

Fig. 10 The loss of training (horizontal axis: iteration, 0 to 2500; vertical axis: loss, 0 to 6000; curves: no multi-branch, multi-scale): It can be seen that in the early stage of training, the multi-scale network converges faster and more stably. The loss of the no multi-branch network is larger, and the multi-scale network converges to a smaller value, that is, its final result is better. Fluctuations and peaks appear in the early stage of training because the learning rate and the loss are relatively high at the beginning, which makes the weight update step larger, and because the gradient of a single batch is used to update the parameters of the whole network. As the loss decreases and the learning rate decays, and given the optimizer used (RMSProp), the weight update step decreases and the network gets closer and closer to convergence. However, because RMSProp uses a decay coefficient to prevent premature termination of training, the update step does not decrease monotonically. There are also some abnormal samples in the dataset, which cause occasional fluctuations and peaks in the late stage of training, but these have little impact on the training process

Fig. 11 Accuracy (horizontal axis: threshold in pixels, 0 to 80; vertical axis: percentage of examples within threshold, 0 to 100; curves: multi-scale with Stair ReLU, multi-scale with Leaky ReLU, no multi-branch, Oberweger et al.): We evaluated the predicted results of the four networks under criterion 2 given in Section 4.3. The horizontal axis represents the threshold, and the vertical axis represents the accuracy rate. A sample's prediction is deemed correct only when the error between each of the 42 values of the output vector and the corresponding ground truth is less than the threshold. It can be seen that the accuracy rate of the multi-scale network is distinctly higher than the others. However, when the threshold is small, the accuracy rate of the network with Stair ReLU is higher

Now we use criterion 1 to evaluate the overall results of the predictions. As shown in Table 2, the use of the multi-scale structure significantly reduced the prediction error, whereas the Stair ReLU did not reduce it significantly. This is because when most of the predicted values converged to the correct ones, the Stair ReLU and Leaky ReLU behaved the same, but the Stair ReLU could limit values that might exceed their acceptable range, so, as shown in Fig. 11, the fluctuations of its prediction error were gentler.

By comparing our method with the state-of-the-art methods, we find that the error of our method is the lowest, which demonstrates the validity of the multi-scale network structure. Those methods also used neural networks. In pattern recognition, the quality of the extracted features often has a great impact on the outcome. One of the reasons why deep convolutional neural networks are now widely used is that multiple convolution layers allow the network to learn better features, which are then better suited for classification or regression. Tompson et al. used only two convolution layers to extract the features, so the features learned were poor and the final accuracy was not good. There were also few convolution layers in the DeepPrior network used by Oberweger et al., but its accuracy was slightly improved by dividing the network into an initialization phase and a refinement phase. The same problem existed in the multi-task network proposed by Fourure et al., but it was trained on multiple data sets at the same time, which was equivalent to expanding the dataset and hence improved the accuracy. Compared with these methods, the multi-scale network had more convolution layers and combined several networks with different numbers of layers to extract the features, so the extracted features were better, and with data augmentation its final accuracy was the best. Neverova et al. used semi-supervised and weakly-supervised learning, whose effect was equivalent to that of two networks with different layers being trained together. That network thoroughly extracts the features of the images in the data set through weak supervision. To some extent, the application of weak supervision also solved the problem of insufficient learning samples in the data set. Its strategy is thus similar to the method used in this paper, which can also thoroughly extract the features and mitigate the problem of insufficient learning samples, so its final accuracy is also good.

Table 2 Comparison with state-of-the-art methods of joint position estimation on NYU dataset

| Methods                            | NYU (3D) |
|------------------------------------|----------|
| Ours, no multi-scale               | 16.37    |
| Ours, multi-scale with Leaky ReLU  | 14.23    |
| Ours, multi-scale with Stair ReLU  | 14.21    |
| Neverova et al. [15]               | 14.94    |
| DeepPrior, Oberweger et al. [3]    | 19.8 (a) |
| Tompson et al. [12]                | 21.0 (a, b) |
| JTSC, Damien Fourure et al. [35]   | 16.80    |

(a) The corresponding values were estimated from plots if the authors did not provide numerical values
(b) Performance was reported in [3]

4.5 Running time

Fig. 12 Qualitative results: These gestures of different styles are randomly selected, and the prediction results are overlaid on the input images. When the input image is relatively complete, the predictions are excellent. When some part is blocked or missing, the predictions are still good, which means that our method is robust. When more information is missing, such as in the last row, there is a larger error, but the predictions still capture the topology of the hand (recommended to be viewed on a computer)

Our method was implemented with TensorFlow in Python. The computer was equipped with an Intel Core i7-6700K, 16 GB of memory, and an NVIDIA GeForce GTX 1070 graphics card. With GPU acceleration, the network achieves a frame rate of 30 fps in some sequences and an average rate of 20 fps, which meets the requirements of real-time estimation.
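The difference between a frame rate reached on some frames and an average frame rate can be illustrated with a small timing helper. This is a generic sketch, not the benchmark used in this paper: the helper names are hypothetical, `process_frame` stands for any per-frame inference call, and the latencies in the example are made-up numbers.

```python
import time

def measure_latencies(process_frame, frames):
    """Time each call to process_frame and return the per-frame latencies in seconds."""
    latencies = []
    for frame in frames:
        start = time.perf_counter()
        process_frame(frame)
        latencies.append(time.perf_counter() - start)
    return latencies

def frame_rate_summary(latencies_s):
    """Return (average_fps, best_fps) from per-frame latencies in seconds."""
    average_fps = len(latencies_s) / sum(latencies_s)
    best_fps = 1.0 / min(latencies_s)
    return average_fps, best_fps

# Hypothetical latencies for four frames, in seconds.
example_latencies = [0.04, 0.05, 0.0625, 0.05]
avg_fps, best_fps = frame_rate_summary(example_latencies)
print(round(avg_fps, 2), round(best_fps, 2))
```

The average is computed as total frames over total time rather than as the mean of per-frame rates, so slow frames weigh on it in proportion to the time they actually take.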

4.6 Qualitative results

Because the data in the NYU dataset are captured with structured-light sensors, which suffer from missing depth values, the accuracy of the prediction may be affected. However, our results show that missing part of the information does not have much impact on our predictions, as shown in Fig. 12. In other words, our algorithm is robust. When too much information is lost, as shown in the last two columns of the last row, large errors occur, but most of the topological structure of the hand is still maintained.

5 Conclusion and future work

We have evaluated different kinds of network architectures by comparing how well they calculate the 3D coordinates of the hand joints. We proposed a multi-scale network framework in which a deep network and shallow networks are connected in parallel to ensure both high accuracy and a high convergence rate. The proposed network has been verified on two public data sets. For the hand joint regression problem, we proposed a new type of output function applied to the final output layer. This output function not only mildly enhances the stability of the network's training but also slightly improves the accuracy rate. Compared with the existing methods, our method performs better in terms of accuracy and speed, and our network has state-of-the-art performance on the NYU dataset.

In the future, our work will focus on two aspects: one is to maintain the robustness of the network despite missing information by exploiting inter-frame information; the other is to try semi-supervised training of the network, because labeled data are hard to collect.

Acknowledgements This work was supported by the National Key Technology R&D Program of China (No. 2015BAF01B00) and the National Key R&D Program of China (No. 2017YFD0400405).

Appendix

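$$\delta = \frac{\partial L}{\partial u}, \qquad u^{l} = W^{l} x^{l-1} + b^{l} \tag{A.1}$$

For the bias $b$, since $\partial u / \partial b = 1$, the chain rule gives:

$$\frac{\partial L}{\partial b^{l}} = \frac{\partial L}{\partial u^{l}} \frac{\partial u^{l}}{\partial b^{l}} = \delta^{l} \tag{A.2}$$

The partial derivative of the cost function $L$ with respect to the weight $W$ is:

$$\frac{\partial L}{\partial W^{l}} = \frac{\partial L}{\partial u^{l}} \frac{\partial u^{l}}{\partial W^{l}} = \delta^{l} (x^{l-1})^{T} \tag{A.3}$$

The sensitivity differs from layer to layer and can be calculated as:

$$\begin{aligned}
\delta^{l} = \frac{\partial L}{\partial u^{l}} &= \frac{\partial L}{\partial u^{l+1}} \frac{\partial u^{l+1}}{\partial u^{l}} \\
&= \delta^{l+1} \frac{\partial (W^{l+1} x^{l} + b^{l+1})}{\partial u^{l}} \\
&= \delta^{l+1} \frac{\partial (W^{l+1} f(u^{l}) + b^{l+1})}{\partial u^{l}} \\
&= (W^{l+1})^{T} \delta^{l+1} \cdot f'(u^{l})
\end{aligned} \tag{A.4}$$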
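The recursions (A.2) to (A.4) can be checked numerically on a tiny network. The sketch below is an illustration under assumptions not stated in the appendix: tanh is used for $f$, the cost is a squared error (which fixes the sensitivity of the output layer), the product with $f'(u^{l})$ in (A.4) is taken element-wise, and the weights, input, and target are arbitrary made-up numbers.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(params, x0):
    """Return the activations x^l of every layer, with xs[0] the input and f = tanh."""
    xs = [x0]
    for W, b in params:
        u = [s + bi for s, bi in zip(matvec(W, xs[-1]), b)]  # u^l = W^l x^{l-1} + b^l
        xs.append([math.tanh(v) for v in u])
    return xs

def loss(params, x0, y):
    out = forward(params, x0)[-1]
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, y))

def backward(params, x0, y):
    """Return [(dL/dW, dL/db), ...] per layer using the sensitivities delta^l."""
    xs = forward(params, x0)
    # Output-layer sensitivity for the assumed squared-error cost; tanh'(u) = 1 - tanh(u)^2.
    delta = [(o - t) * (1.0 - o * o) for o, t in zip(xs[-1], y)]
    grads = [None] * len(params)
    for l in range(len(params) - 1, -1, -1):
        x_prev = xs[l]                                       # input to this layer
        grad_W = [[d * xp for xp in x_prev] for d in delta]  # (A.3)
        grads[l] = (grad_W, list(delta))                     # (A.2)
        if l > 0:
            W = params[l][0]
            back = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                    for j in range(len(x_prev))]             # (W^{l+1})^T delta^{l+1}
            delta = [bk * (1.0 - xp * xp) for bk, xp in zip(back, x_prev)]  # (A.4)
    return grads

def numeric_grad(params, x0, y, layer, i, j, eps=1e-6):
    """Central finite difference of the loss with respect to one weight."""
    W = params[layer][0]
    old = W[i][j]
    W[i][j] = old + eps
    plus = loss(params, x0, y)
    W[i][j] = old - eps
    minus = loss(params, x0, y)
    W[i][j] = old
    return (plus - minus) / (2.0 * eps)

params = [([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]), ([[0.5, -0.6]], [0.2])]
x0, y = [1.0, 2.0], [0.3]
grads = backward(params, x0, y)
analytic = grads[0][0][0][1]  # dL/dW for the first layer, row 0, column 1
numeric = numeric_grad(params, x0, y, layer=0, i=0, j=1)
print(abs(analytic - numeric) < 1e-6)
```

Agreement between the analytic value and the finite difference indicates that the sensitivity has been propagated from the output layer back to the first layer as (A.4) prescribes.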

References

1. Keskin C, Kirac F, Kara YE, Akarun L (2011) Real time hand pose estimation using depth sensors. In: 2011 IEEE international conference on computer vision workshops (ICCV Workshops). IEEE, pp 1228–1234

2. Supancic JS, Rogez G, Yang Y, Shotton J, Ramanan D (2015) Depth-based hand pose estimation: data, methods, and challenges. In: Proceedings of the IEEE international conference on computer vision, pp 1868–1876

3. Oberweger M, Wohlhart P, Lepetit V (2015) Hands deep in deep learning for hand pose estimation. In: Computer vision winter workshop

4. Xu C, Cheng L (2013) Efficient hand pose estimation from a single depth image. In: Proceedings of the IEEE international conference on computer vision, pp 3456–3462

5. Kirac F, Kara YE, Akarun L (2014) Hierarchically constrained 3D hand pose estimation using regression forests from single frame depth data. Pattern Recogn Lett 50:91–100

6. Li P, Ling H, Li X, Liao C (2015) 3d hand pose estimation using randomized decision forest with segmentation index points. In: Proceedings of the IEEE international conference on computer vision, pp 819–827

7. Qian C, Sun X, Wei Y, Tang X, Sun J (2014) Realtime and robust hand tracking from depth. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1106–1113

8. Sharp T, Keskin C, Robertson D, Taylor J, Shotton J, Kim D, Freedman D (2015) Accurate, robust, and flexible real-time hand tracking. In: Proceedings of the 33rd annual ACM conference on human factors in computing system. ACM, pp 3633–3642

9. Sridhar S, Oulasvirta A, Theobalt C (2013) Interactive markerless articulated hand motion tracking using RGB and depth data. In: Proceedings of the IEEE international conference on computer vision, pp 2456–2463

10. Tzionas D, Srikantha A, Aponte P, Gall J (2014) Capturing hand motion with an RGB-D sensor, fusing a generative model with salient points. In: German conference on pattern recognition. Springer, Cham, pp 277–289

11. Coleca F, State A, Klement S, Barth E, Martinetz T (2015) Self-organizing maps for hand and full body tracking. Neurocomputing 147:174–184

12. Tompson J, Stein M, Lecun Y, Perlin K (2014) Real-time continuous pose recovery of human hands using convolutional networks. ACM Trans Graph (ToG) 33(5):169

13. Toshev A, Szegedy C (2014) Deeppose: human pose estimation via deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1653–1660

14. Sinha A, Choi C, Ramani K (2016) Deephand: robust hand pose estimation by completing a matrix imputed with deep features. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4150–4158

15. Neverova N, Wolf C, Nebout F, Taylor GW (2017) Hand pose estimation through semi-supervised and weakly-supervised learning. Computer Vision and Image Understanding. In press, Corrected Proof

16. Rautaray SS, Agrawal A (2015) Vision based hand gesture recognition for human computer interaction: a survey. Artif Intell Rev 43(1):1–54

17. Hasan H, Abdul-Kareem S (2014) Static hand gesture recognition using neural networks. Artif Intell Rev 1–35

18. Molchanov P, Gupta S, Kim K, Kautz J (2015) Hand gesture recognition with 3D convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 1–7

19. Ozturk O, Aksac A, Ozyer T, Alhajj R (2015) Boosting real-time recognition of hand posture and gesture for virtual mouse operations with segmentation. Appl Intell 43(4):786

20. Tripathi BK (2017) On the complex domain deep machine learning for face recognition. Appl Intell 1–15

21. Dinh DL, Lim MJ, Thang ND, Lee S, Kim TS (2014) Real-time 3D human pose recovery from a single depth image using principal direction analysis. Appl Intell 41(2):473

22. Keskin C, Kırac F, Kara Y, Akarun L (2012) Hand pose estimation and hand shape classification using multi-layered randomized decision forests. In: Computer vision ICCV 2012, pp 852–863

23. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp 1097–1105

24. Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1725–1732

25. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9

26. Szegedy C, Ioffe S, Vanhoucke V, Alemi AA (2017) Inception-v4, inception-ResNet and the impact of residual connections on learning. In: AAAI, pp 4278–4284

27. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2818–2826

28. Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th international conference on machine learning (ICML-10), pp 807–814

29. Melax S, Keselman L, Orsten S (2013) Dynamics based 3D skeletal hand tracking. In: Proceedings of graphics interface 2013. Canadian Information Processing Society, pp 63–70

30. Oikonomidis I, Kyriazis N, Argyros AA (2011) Efficient model-based 3D tracking of hand articulations using Kinect. In: BMVC, vol 1(2), p 3

31. Liang H, Wang J, Sun Q, Liu YJ, Yuan J, Luo J, He Y (2016) Barehanded music: real-time hand interaction for virtual piano. In: Proceedings of the 20th ACM SIGGRAPH symposium on interactive 3D graphics and games. ACM, pp 87–94

32. Tang D, Jin Chang H, Tejani A, Kim TK (2014) Latent regression forest: structured estimation of 3d articulated hand posture. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3786–3793

33. Sun X, Wei Y, Liang S, Tang X, Sun J (2015) Cascaded hand pose regression. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 824–832

34. Tang D, Yu TH, Kim TK (2013) Real-time articulated hand pose estimation using semi-supervised transductive regression forests. In: Proceedings of the IEEE international conference on computer vision, pp 3224–3231

35. Fourure D, Emonet R, Fromont E, Muselet D, Neverova N, Tremeau A, Wolf C (2017) Multi-task, multi-domain learning: application to semantic segmentation and pose regression. Neurocomputing 251:68–80

36. Ge L, Liang H, Yuan J, Thalmann D (2016) Robust 3D hand pose estimation in single depth images: from single-view CNN to multi-view CNNs. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3593–3601

37. Hafiz AR, Al-Nuaimi AY, Amin MF, Murase K (2015) Classification of skeletal wireframe representation of hand gesture using complex-valued neural network. Neural Process Lett 42(3):649–664

38. Taylor J, Shotton J, Sharp T, Fitzgibbon A (2012) The vitruvian manifold: inferring dense correspondences for one-shot human pose estimation. In: 2012 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 103–110

39. LeCun Y, Cortes C, Burges CJ (2010) MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2
