2014 IEEE International Conference on Multimedia and Expo (ICME), Chengdu, China, 14-18 July 2014
LEARNING FLEXIBLE BLOCK BASED LOCAL BINARY PATTERNS FOR UNCONSTRAINED FACE DETECTION
Zhenhua Chai, Yu Zhang, Zhijun Du, Dong Wang
Media Technology Lab, Huawei Technologies Co., Ltd
Heydi Mendez-Vazquez
Advanced Technologies Application Center (CENATAV), Cuba
ABSTRACT
Face detection has been a very active research topic in
recent years. However, when applied to uncontrolled
environments, some systems exhibit poor generalization ability.
Even though a few existing methods achieve promising
results in some challenging situations, they usually incur
a high computational cost. This limits their use on mobile
platforms, which have limited computational resources and
strict power-consumption control. In this paper, a novel facial
representation method for multi-view face detection in
uncontrolled environments is presented. The proposed method,
named Flexible Block based Local Binary Patterns (FBLBP),
has low storage requirements and is fast to compute, while its
performance is comparable with state-of-the-art methods,
as demonstrated on the challenging Face Detection Data set and
Benchmark (FDDB).
Index Terms— face detection, structured ordinal fea-
tures, two stage learning, boosting
1. INTRODUCTION
Face detection has been one of the most active research topics
in computer vision and pattern recognition in the past few
years [1]. This can be attributed to its wide range of
applications, especially as a prerequisite component for other
applications, e.g. smart cameras, biometrics, video
surveillance, digital album management and human computer
interaction (HCI) [2]. The aim of face detection is to find
and locate human faces in digital images (or videos), regardless
of face pose or occlusion. It is still a challenging task to
detect all the faces in all images without a mistake. Some
examples of faces misclassified by a very popular and well
optimized face detector (OpenCV) are shown in Fig. 1.
Fig. 1. The challenge of face detection in real life (including
pose, backlight, low resolution, occlusion and large expression
variations). The haarcascade_frontalface_alt_tree and
haarcascade_profileface detectors (OpenCV 2.4.7) detected
only two faces in these difficult images.
Several algorithms have been proposed to work toward
this goal. Early works mainly focused on statistical learning
based classifiers for frontal face detection. Popular
classifiers such as neural networks (NN) [3], sparse network
of Winnows (SNoW) [4] and support vector machines (SVM)
[5] have been used in the literature for this purpose. Good
results have been achieved by those methods, but their
applications are greatly limited by their speed. Later, a
boosting based method was proposed by Viola and Jones (VJ) [6],
exhibiting excellent performance in both accuracy and speed;
it remains one of the most popular methods today.
The main advantages of the VJ method are: 1) the Haar-like
features it uses can be computed very quickly with the integral
image; 2) the AdaBoost algorithm [7] is used for effective
feature selection from a large feature pool and, at the same
time, for classifier training; 3) a cascade structure is
exploited during detection in order to filter out most non-face
regions in the early stages. In recent years, many efforts have
been made to improve the VJ model along each of these three
directions [2].
The features used to describe faces are considered
one of the most important aspects of the VJ framework for
face detection. Lienhart and Maydt proposed an extended set
of Haar-like features with more orientations [8]. The enriched
feature set can better describe the face information. Zhao
et al. introduced a method named Non-Adjacent Rectangle
(NAR) Haar-like Feature [9] to capture the context
information of face images. However, a single Haar-like feature
is too simple to represent a face image, and a detector
usually needs thousands of them to achieve good
performance. Other works take advantage of co-occurrence
information to make the features more discriminative.
The multi block local binary patterns (MBLBP) [10]
by Zhang et al. and the locally assembled binary (LAB) features
[11] by Yan et al. are two good examples of this kind of
method. Both of them benefit from more complex structures and
achieve better performance than the original Haar-like
features [10, 11].
It should be noted that all the features mentioned
above share a similar attribute: they describe the face
image using simple and robust ordinal comparisons. More
complex features have also been proposed, e.g. speeded up
robust features (SURF) [12], histograms of oriented gradients
(HOG) [13], Gabor wavelets [14] etc. Some of them involve more
sophisticated and time-consuming structures, such as the
deformable part based model (DPM) [13]. Although
interesting results have been achieved, the computational cost
of most of these methods is very high. Thus, in general, they
are not suitable for real-time and mobile applications.
In this work a new method for face detection is proposed.
Following the strategy of the VJ method, a new kind of ordinal
feature is introduced that keeps feature computation fast
while increasing the classification
accuracy. The contributions of this work are twofold: 1) a
more general ordinal feature, named Flexible Block based Local
Binary Patterns (FBLBP), is introduced, and 2) a two-step
weak learner algorithm based on the proposed features is
presented, in order to avoid a combinatorial explosion.
This paper is organized as follows: in section 2 the draw-
backs of related works are analyzed. In section 3, the details
of the proposed method are introduced. Experimental results
on the challenging FDDB [15] are presented in section 4. Fi-
nally, conclusions are given in section 5.
2. RELATED WORK
There are mainly three methods related to our work. The first
of them is the MBLBP [10], which is an extension of the Lo-
cal Binary Patterns operator [16] by varying the block size
involved in local comparisons. The block based ordinal
comparison is more robust to noise than the pixel based local
comparison. Besides, the fusion of multiple block based ordinal
features allows larger scale structural information to be
captured. Thus, MBLBP performs better than LBP for face detection
[10]. However, in MBLBP the feature structure is fixed by design,
and all the neighbor blocks must be connected to the
center block. The resulting features cannot capture the
discriminative ordinal information between two regions at a
distance. Another related method is the joint Haar-like feature
[17], but it uses a different kind of weak learner and only
considers the adjacent block case.
The last method relevant to our work is the multi-scale
structured ordinal features (MSOF) [18]. Even though MSOF
overcomes the weakness of the first two methods mentioned
above by considering non-adjacent blocks, it uses a fixed inter
block distance between the center block and neighbor blocks.
Thus, it is not fully flexible either, as illustrated in Fig. 2
(a).
Fig. 2. The visual differences between a) MSOF and b) the
proposed FBLBP. The distances between the center block and
neighbor blocks in MSOF are all the same while for FBLBP
all the inter block distances and block positions are learned
from the training set.
3. FBLBP CASCADE FOR FACE DETECTION
The proposed face detector follows the VJ strategy. First, a
novel set of more discriminative features is introduced that can
be computed very quickly using integral images. Then, a boosting
algorithm is used to select the most discriminative features
and to construct a binary classifier. A multi-output regression
tree is used as the weak learner in the proposed framework,
and each weak classifier is trained in two steps, iteratively,
starting from a pivot block. Finally, a boosting cascade
structure is used to train the strong classifier. Each step
of the proposal is described in detail below.
3.1. Flexible Block based Local Binary Patterns (FBLBP)
The main motivation for proposing the FBLBP feature is to
overcome the fixed structure problems of MBLBP [10] and MSOF
[18]. Fig. 2 (b) shows that the structure of the proposed
FBLBP is more flexible than existing approaches, because
the inter block distances and the positions of the blocks
under comparison are learned in the training step.
FBLBP can be computed in a similar way to MBLBP and
MSOF. First, the differences between the average intensity
of the center block (also called the pivot block) and the average
intensities of all neighbor blocks are computed. Then, all the
differences are thresholded into binary codes. Finally, all the
binary codes are concatenated to obtain one FBLBP
feature. The details can be found in Fig. 3. All the block based
features can be computed very efficiently by using integral
images [6].
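The three encoding steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the (top, left) block convention and the default threshold of 0 are assumptions made for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row and column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0, dtype=np.float64), axis=1)
    return ii

def block_mean(ii, top, left, size):
    """Average intensity of a size x size block from four table lookups."""
    bottom, right = top + size, left + size
    total = ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]
    return total / (size * size)

def fblbp_code(ii, pivot, neighbors, size, threshold=0.0):
    """Concatenate one bit per neighbor block into an integer code.

    pivot and neighbors are (top, left) corners (hypothetical convention);
    threshold stands in for the learned per-feature threshold.
    """
    pivot_mean = block_mean(ii, pivot[0], pivot[1], size)
    code = 0
    for top, left in neighbors:
        bit = int(pivot_mean - block_mean(ii, top, left, size) >= threshold)
        code = (code << 1) | bit
    return code
```

Once the integral image is built, each block average costs four lookups regardless of the block size, which is why the neighbor blocks can be placed anywhere in the window at no extra cost.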
Fig. 3. FBLBP encoding process. The average intensity value
in each block is compared with the center block, and then the
comparison result is thresholded into a binary value.
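In order to make the block based local comparisons more
discriminative, we split the k-th FBLBP feature into two
subclasses, $\mathrm{FBLBP}_{k,1}$ and $\mathrm{FBLBP}_{k,0}$ ($k \in [1, K]$, where $K$ is the
total number of FBLBP features), by making an extra sign
judgement. The i-th element of $\mathrm{FBLBP}_{k,1}$ or $\mathrm{FBLBP}_{k,0}$ is
obtained by formula (1).
Let $\delta(\cdot)$ be the Kronecker delta, so that when the input is
true the output is 1, and 0 otherwise:
$$
\begin{aligned}
\text{sign} = 1:\quad & \mathrm{FBLBP}_{k,1,i} = \delta\big((\mathrm{AvgInt}_{\mathrm{pivot}} - \mathrm{AvgInt}_{\mathrm{neighbor}_i}) \ge \theta_{k,1}\big) \\
\text{sign} = 0:\quad & \mathrm{FBLBP}_{k,0,i} = \delta\big((\mathrm{AvgInt}_{\mathrm{pivot}} - \mathrm{AvgInt}_{\mathrm{neighbor}_i}) \le \theta_{k,0}\big)
\end{aligned}
\tag{1}
$$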
With this strategy, the size of the resulting feature set
is doubled, which means a richer feature pool is
obtained. Besides, the sign bit makes the FBLBP feature more
discriminative. This thresholding strategy is very similar to
the one used in local ternary patterns (LTP) [19] for face
recognition. However, in LTP θk,1 is set to the negative
of θk,0, and both of them are fixed for all face
regions. In our proposal, both thresholds can be different for
each FBLBP feature, and they are obtained in the learning process
described in the next two subsections.
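The sign-dependent thresholding of formula (1) can be sketched as follows. This is an illustrative reading only: the function names are hypothetical, the threshold is passed in by hand rather than learned, and NumPy is assumed.

```python
import numpy as np

def fblbp_bits(pivot_mean, neighbor_means, sign, theta):
    """Binary elements of one FBLBP subclass.

    sign == 1 keeps differences >= theta, sign == 0 keeps differences <= theta.
    In the paper theta is learned per feature; here it is simply an argument.
    """
    diffs = pivot_mean - np.asarray(neighbor_means, dtype=np.float64)
    bits = diffs >= theta if sign == 1 else diffs <= theta
    return bits.astype(np.uint8)

def bits_to_code(bits):
    """Concatenate the bits (first neighbor = most significant) into an integer."""
    code = 0
    for b in bits:
        code = (code << 1) | int(b)
    return code
```

For the same pivot and neighbor blocks, the two subclasses generally produce different codes, which is how the feature pool doubles in size.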
3.2. Gentleboost Cascade Learning
The flexible structure makes the FBLBP features more dis-
tinctive as described above but at the same time highly redun-
dant. Hence an efficient algorithm is needed to select a subset
of the most significant features. In this paper, GentleBoost
[20] is used due to its robustness to outliers. Given a set of
training samples labeled as $(s_1, y_1), \ldots, (s_N, y_N)$, GentleBoost
can be viewed as a sequential procedure to fit an additive
model $F(s) = \sum_{m=1}^{M} f_m(s)$. At each iteration, the aim is to
select the best weak classifier $f_m(s)$, i.e. the one that minimizes
the weighted squared error under the current sample distribution.
The details can be found in Algorithm 1.
When one stage classifier is obtained, we use bootstrapping
to collect more samples and train the classifier of the
next stage. The training process for the cascade detector
stops when the system performance satisfies our goal. More
details about GentleBoost and the cascade structure can be found
in the literature [6, 20].
Algorithm 1: GentleBoost classifier training
1. Initialize the sample weights $w_j = 1/N$, $j = 1, 2, \ldots, N$, and the model $F(\cdot) = 0$
2. Repeat for $m = 1, \ldots, M$:
(a) Fit the weighted least squares regression model of $Y$ to $S$ by minimizing
$J_{wse} = \sum_{j=1}^{N} w_j (y_j - f_m(s_j))^2$
(b) Update $F(s) \leftarrow F(s) + f_m(s)$
(c) Update $w_j \leftarrow w_j e^{-y_j f_m(s_j)}$ and normalize
3. Output the stage classifier $\mathrm{sign}[F(s)] = \mathrm{sign}\left[\sum_{m=1}^{M} f_m(s)\right]$
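A compact sketch of Algorithm 1 is given below for one stage classifier. It assumes that every candidate feature has already been encoded as a non-negative integer code per sample, and it uses a per-code lookup table as the weak learner; the function names and data layout are illustrative choices, not taken from the paper.

```python
import numpy as np

def fit_lookup_table(codes, y, w, n_bins):
    """Weighted least-squares output for every possible code value."""
    num = np.bincount(codes, weights=w * y, minlength=n_bins)
    den = np.bincount(codes, weights=w, minlength=n_bins)
    return np.divide(num, den, out=np.zeros(n_bins), where=den > 0)

def gentleboost(features, y, n_bins, n_rounds):
    """features: (N, K) integer codes in [0, n_bins); y: labels in {-1, +1}."""
    n_samples, n_features = features.shape
    w = np.full(n_samples, 1.0 / n_samples)          # step 1: uniform weights
    model = []
    for _ in range(n_rounds):                        # step 2
        best = None
        for k in range(n_features):                  # (a) pick the best weak learner
            table = fit_lookup_table(features[:, k], y, w, n_bins)
            err = np.sum(w * (y - table[features[:, k]]) ** 2)
            if best is None or err < best[0]:
                best = (err, k, table)
        _, k, table = best
        model.append((k, table))                     # (b) F <- F + f_m
        w = w * np.exp(-y * table[features[:, k]])   # (c) reweight
        w /= w.sum()                                 #     and normalize
    return model

def predict(model, features):
    """Step 3: sign of the accumulated weak-learner outputs."""
    score = sum(table[features[:, k]] for k, table in model)
    return np.where(score >= 0, 1, -1)
```

The exhaustive loop over all K candidate features is only feasible for a small pool; the two-step search of Section 3.3 exists precisely because the FBLBP pool is far too large for it.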
3.3. A Two-Step Weak Learner
The flexible structure increases the number of features
many times over. Thus, the pivot block becomes more important
than before. Taking the FBLBP feature set with a block size
of 5x5 pixels as an example, for a normalized face image of
24x24 pixels the total number of blocks is 400.
If we assume that each FBLBP feature has eight neighbor
blocks, choosing them freely produces a huge number of FBLBP
features ($C_{400}^{8}$, on the order of $10^{16}$), while under the same
condition the number of original MBLBP features with the same
block size is only 400. This causes a combinatorial explosion,
and an exhaustive search cannot work in practice.
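The counts in this example can be checked with a few lines of Python (`math.comb` requires Python 3.8 or later). The helper name is hypothetical, and the sketch assumes blocks may be placed at every pixel offset that keeps them inside the window.

```python
from math import comb

def count_blocks(window, block):
    """Number of block x block positions inside a window x window image."""
    per_axis = window - block + 1
    return per_axis * per_axis

n_blocks = count_blocks(24, 5)      # 20 * 20 positions for 5x5 blocks
n_flexible = comb(n_blocks, 8)      # free choice of eight neighbor blocks
print(n_blocks, n_flexible)
```

The binomial coefficient exceeds the 400 fixed-layout features by many orders of magnitude, which motivates the greedy two-step search described next.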
Taking this into account, in the proposed method the feature
learning process is divided into two steps, and a greedy
solution is used instead of an optimal exhaustive search. The
first step is to determine the best positions and the best scale
for both the pivot block and the first neighbor block. This
amounts to training one temporary weak classifier on any two
blocks of the same scale, keeping the pair that minimizes the
weighted squared error, $J_{wse}$. The search space is greatly
reduced in this case, and the optimization becomes tractable
on a present-day personal computer. Since our FBLBP feature
is composed of a string of binary values (or encoded as an
integer) and this value is non-metric, we use a multi-output
regression tree [10] as the weak classifier. The output can be
explicitly obtained by equation (2):
$$
f_m(\mathrm{FBLBP}_k) = a_p = \frac{\sum_{j=1}^{N} w_j \, y_j \, \delta(\mathrm{FBLBP}_k^j = p)}{\sum_{j=1}^{N} w_j \, \delta(\mathrm{FBLBP}_k^j = p)}
\tag{2}
$$
where $\mathrm{FBLBP}_k^j$ stands for the k-th FBLBP feature of the
face sample j, and the range of the code value p is determined by
the number of neighbor blocks P in one FBLBP feature ($p \in [1, 2^P]$).
After we get the best combination of two blocks, we set
either of them as the pivot block and the other as the first
neighbor block. This greatly reduces the search space for
the next step, because once the block scale is fixed the total
number of candidate blocks is reduced. In the previous example
of a 24x24 face image, it is less than 576. The
next step is to build one complete weak classifier by adding
more neighbor blocks. In this paper a greedy forward
search is used, and the learning process stops when the
performance begins to decrease. The details are described in
Algorithm 2.
Algorithm 2: Search of candidate blocks
Input: The weights of training samples $w_j$, $j = 1, 2, \ldots, N$
1. Initialize $\mathrm{FBLBP}_k = \emptyset$
2. Find the best pivot and the first neighbor block by minimizing:
   $\{x_p, x_n, err\} = \arg\min \left( \sum_{j=1}^{N} w_j (y_j - f_m(x_p^j \cup x_n^j))^2 \right)$;
   $\mathrm{FBLBP}_k = \mathrm{FBLBP}_k \cup x_n$
3. Then, add more blocks at the same scale:
   While (i < T)
      err_old = err;
      $\{x_n, err\} = \arg\min \left( \sum_{j=1}^{N} w_j (y_j - f_m(\mathrm{FBLBP}_k \cup x_n^j))^2 \right)$;
      if (err > err_old)
         break;
      else
         $\mathrm{FBLBP}_k = \mathrm{FBLBP}_k \cup x_n$;
      end
   end
4. Output $\mathrm{FBLBP}_k$
Besides, we set a parameter T to control the maximum
number of neighbor blocks for each FBLBP feature in order
to avoid overfitting. Other methods, such as backward search and
floating search [21], can also be used in this framework.
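The second step of the weak learner (equation (2) combined with the greedy loop of Algorithm 2) can be sketched as below. The sketch assumes the pivot block and the first neighbor have already been chosen in step one, and that the binary comparison between the pivot and every candidate block has been precomputed as a column of `bits`; these names and this data layout are assumptions for illustration, not the authors' code.

```python
import numpy as np

def weighted_error(codes, y, w, n_bins):
    """Weighted squared error of the lookup table of equation (2)."""
    num = np.bincount(codes, weights=w * y, minlength=n_bins)
    den = np.bincount(codes, weights=w, minlength=n_bins)
    table = np.divide(num, den, out=np.zeros(n_bins), where=den > 0)
    return float(np.sum(w * (y - table[codes]) ** 2))

def greedy_neighbors(bits, y, w, first, max_blocks):
    """bits: (N, B) array of 0/1 comparisons between the pivot and B candidates.

    first is the neighbor chosen in step one; blocks are added one at a time
    while the error does not increase and fewer than max_blocks (T) are used.
    """
    chosen = [first]
    codes = bits[:, first].astype(np.int64)
    err = weighted_error(codes, y, w, 2)
    while len(chosen) < max_blocks:
        best = None
        for b in range(bits.shape[1]):
            if b in chosen:
                continue
            cand = (codes << 1) | bits[:, b].astype(np.int64)
            cand_err = weighted_error(cand, y, w, 2 ** (len(chosen) + 1))
            if best is None or cand_err < best[0]:
                best = (cand_err, b, cand)
        if best is None or best[0] > err:
            break
        err, b, codes = best
        chosen.append(b)
    return chosen, err
```

When the error is measured on the training samples themselves, as in this sketch, appending a bit only refines the partition and cannot increase the weighted squared error, so the bound T is what effectively limits the feature length here.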
4. EXPERIMENTS
In order to show the advantages of the proposed method, we
evaluate the proposed FBLBP cascade detector on the chal-
lenging Face Detection Data set and Benchmark (FDDB)
[15]. FDDB is one of the most widely used face detection
databases. It contains 2845 images with 5171 faces
captured in unconstrained environments. For training, we have
collected more than 20K face images (both frontal and
non-frontal) and background samples from the Internet. During
testing we strictly follow the standard test protocol and use
the testing tools provided with this database. Besides, we use a
subset of the training set for cross validation.
The first selected FBLBP feature is shown in Fig. 4. The
pivot block and the neighbor blocks are non-adjacent
for this most discriminative FBLBP feature. Thus,
the original MBLBP, which only considers adjacent neighbor
blocks, does not fully exploit the discriminative power
of block based LBP. It can also be seen from the figure
that the positions and distances of the selected neighbor blocks
are not fixed as in MSOF. Moreover, the
pivot block of the most discriminative feature is located close
to the eye region, which agrees with the common intuition about
the most discriminative region of the human face.
Fig. 4. The first selected FBLBP feature. The pivot block is
in red while the neighbor blocks are in green.
We also explore how the maximum number of neighbor
blocks, T, affects the detection results. As shown in Fig.
5, the best results are obtained for T = 6.
Fig. 5. Performance on the validation set for different maximum
numbers of neighbor blocks in one FBLBP feature.
Following the FDDB protocol, we have compared the proposed
FBLBP with a number of state-of-the-art techniques:
the original MBLBP, the non-adjacent rectangle (NAR) Haar-like
feature [9], the latest tree structure model (TSM) [13] and
the top 5 academic methods listed on the FDDB website [15],
including (1) Li’s SURF cascade detector from Intel Lab [12];
(2) Jain’s detector [22] and (3) Mikolajczyk’s detector [23],
both of which leverage context information; (4) Subburaman’s
face detector [24], which uses a fast bounding box
estimation; and (5) the VJ model [6] as implemented in the latest
version of OpenCV. The ROC curves for both the continuous and
discrete testing cases are shown in Fig. 6. It can be seen
that our proposed method is comparable to or even better than
the state-of-the-art methods when the number of false positives
is low.
Some of the detected faces in difficult images can be found in
Fig. 7, including those not detected in Fig. 1.
Fig. 6. Performance on FDDB: a) continuous score and b)
discrete score.
Finally, the speed of our detector is tested. About 10 fps
can be achieved on a VGA image on a common
PC with an i5-2400 processor and without explicit optimization
(e.g. SIMD). This is almost at the same level as the SURF
detector [12] on similar hardware, and much faster
than the TSM [13]. Moreover, our minimum detection window
is 24x24, while for both the TSM and SURF detectors it is 40x40.
This means that faces smaller than 40x40 pixels will be missed
completely by those detectors, while our method can detect
more faces in low resolution images, which is important in
some cases (e.g. mobile surveillance). Besides, we believe
that our detector will be even more competitive when better
hardware becomes available.
5. CONCLUSION AND DISCUSSION
In this paper, an efficient and effective ensemble model of
flexible block based local binary patterns (FBLBP) is
proposed for face detection in uncontrolled environments. We
argue that the detection performance improvement over the
original MBLBP can be attributed to the more flexible
structure. Besides, the extra sign bit makes the FBLBP feature
more discriminative. Moreover, the storage requirement is lower
than for the original MBLBP, since we take at most six blocks to
construct a weak classifier and we need fewer stage classifiers
to achieve a training error rate similar to that of MBLBP.
More importantly, a better detection result is obtained
with FBLBP.
As future work, a skin salience model will be considered
for quickly filtering out non-face windows. We believe this will
further speed up the system. Besides, research on how to
automatically set the number of blocks for each FBLBP feature
will also be interesting.
Fig. 7. Some examples of faces detected by the proposed FBLBP detector.
6. REFERENCES
[1] Ming-Hsuan Yang, David J. Kriegman, and Narendra Ahuja, “Detecting faces in images: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 24, no. 1, 2002.
[2] Cha Zhang and Zhengyou Zhang, “A survey of recent advances in face detection,” Tech. Rep., Microsoft Research, June 2010.
[3] H.A. Rowley, S. Baluja, and T. Kanade, “Neural network-based face detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1996, pp. 203–208.
[4] Ming-Hsuan Yang, Dan Roth, and Narendra Ahuja, “A SNoW-based face detector,” in Advances in Neural Information Processing Systems, 2000, pp. 855–861.
[5] E. Osuna, R. Freund, and F. Girosi, “Training support vector machines: an application to face detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1997, pp. 130–136.
[6] Paul Viola and Michael J. Jones, “Robust real-time face detection,” International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[7] Yoav Freund and Robert E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[8] R. Lienhart and J. Maydt, “An extended set of haar-like features for rapid object detection,” in International Conference on Image Processing (ICIP), 2002, vol. 1, pp. 900–903.
[9] Xiaowei Zhao, Xiujuan Chai, Zhiheng Niu, Cherkeng Heng, and Shiguang Shan, “Context modeling for facial landmark detection based on non-adjacent rectangle (NAR) haar-like feature,” Image and Vision Computing, vol. 30, no. 3, pp. 136–146, 2012.
[10] Lun Zhang, Rufeng Chu, Shiming Xiang, Shengcai Liao, and Stan Z. Li, “Face detection based on Multi-Block LBP representation,” in International Conference on Biometrics (ICB).
[11] Shengye Yan, Shiguang Shan, Xilin Chen, and Wen Gao, “Locally assembled binary (lab) feature with feature-centric cascade for fast and accurate face detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–7.
[12] Jianguo Li, Tao Wang, and Yimin Zhang, “Face detection using SURF cascade,” in IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011, pp. 2183–2190.
[13] Xiangxin Zhu and Deva Ramanan, “Face detection, pose estimation, and landmark localization in the wild,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2879–2886.
[14] Zhenhua Chai, Zhenan Sun, H. Mendez-Vazquez, Ran He, and Tieniu Tan, “Gabor ordinal measures for face recognition,” IEEE Transactions on Information Forensics and Security (TIFS), vol. 9, no. 1, pp. 14–26, 2014.
[15] Vidit Jain and Erik Learned-Miller, “FDDB: A benchmark for face detection in unconstrained settings,” Tech. Rep. UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
[16] Timo Ojala, Matti Pietikainen, and Topi Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 24, no. 7, pp. 971–987, 2002.
[17] T. Mita, T. Kaneko, and O. Hori, “Joint haar-like features for face detection,” in IEEE International Conference on Computer Vision (ICCV), 2005, pp. 1619–1626.
[18] Shengcai Liao, Zhen Lei, Stan Z. Li, Xiaotong Yuan, and Ran He, “Structured ordinal features for appearance-based object representation,” in 3rd International Conference on Analysis and Modeling of Faces and Gestures (AMFG07), 2007, pp. 183–192.
[19] Xiaoyang Tan and B. Triggs, “Enhanced local texture feature sets for face recognition under difficult lighting conditions,” IEEE Transactions on Image Processing (TIP), vol. 19, no. 6, pp. 1635–1650, 2010.
[20] Jerome Friedman, Trevor Hastie, and Robert Tibshirani, “Additive logistic regression: a statistical view of boosting,” Annals of Statistics, vol. 28, pp. 2000, 1998.
[21] Stan Z. Li and ZhenQiu Zhang, “Floatboost learning and statistical face detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 26, no. 9, pp. 1112–1123, 2004.
[22] V. Jain and E. Learned-Miller, “Online domain adaptation of a pre-trained cascade of classifiers,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 577–584.
[23] Krystian Mikolajczyk, Cordelia Schmid, and Andrew Zisserman, “Human detection based on a probabilistic assembly of robust part detectors,” in ECCV 2004, vol. 3021 of Lecture Notes in Computer Science, pp. 69–82, 2004.
[24] Venkatesh Bala Subburaman and Sebastien Marcel, “Fast bounding box estimation based face detection,” in ECCV Workshop on Face Detection: Where we are, and what next, Sep. 2010.