
SUBMITTED TO IEEE TRANS ON IMAGE PROCESSING, MANUSCRIPT, 2017

Hallucinating Face Image by Regularization Models in High-Resolution Feature Space

Jingang Shi, Xin Liu, Yuan Zong, Chun Qi, Member, IEEE, and Guoying Zhao∗, Senior Member, IEEE

Abstract—In this paper, we propose two novel regularization models, patch-wise and pixel-wise respectively, which efficiently reconstruct a high-resolution (HR) face image from a low-resolution (LR) input. Unlike conventional patch-based models, which depend on the assumption of local geometry consistency between the LR and HR spaces, the proposed method directly regularizes the relationship between the target patch and the corresponding training set in the HR space. It thus avoids the tough problem of preserving local geometry across resolutions. Taking advantage of kernel functions for efficiently describing intrinsic features, we further conduct the patch-based reconstruction model in a high-dimensional kernel space to capture nonlinear characteristics. Meanwhile, a pixel-based model is proposed to regularize the relationship of pixels in the local neighborhood, which can be employed to enhance fuzzy details in the target HR face image. It privileges the reconstruction of pixels along the dominant orientation of the structure, which is useful for preserving high-frequency information on complex edges. Finally, we combine the two reconstruction models into a unified framework. The output HR face image is obtained by an iterative optimization procedure. Experimental results demonstrate that the proposed face hallucination method achieves superior performance compared with state-of-the-art methods.

Index Terms—Face hallucination, super-resolution, regularization framework, manifold learning, kernel method.

I. INTRODUCTION

DURING the past decades, face analysis has received great interest and achieved impressive success, especially in applications under real-world environments. However, in many real-world applications it often suffers from low-resolution problems caused by the limitations of digital imaging devices, which can severely affect the performance of most face analysis algorithms. It is desirable to break the limitations of low-quality surveillance camera systems and recover the lost high-frequency details for practical applications.

Jingang Shi, Xin Liu and Yuan Zong are with the Center for Machine Vision and Signal Analysis, University of Oulu, Finland (e-mail: [email protected]; [email protected]; [email protected]).

Yuan Zong is also with the Key Laboratory of Child Development and Learning Science of Ministry of Education, Research Center for Learning Science, Southeast University, Nanjing, China.

Chun Qi is with the School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China (e-mail: [email protected]).

Guoying Zhao is with the School of Information and Technology, Northwest University, China, and also with the Center for Machine Vision and Signal Analysis, University of Oulu, Finland (e-mail: [email protected]).

This work was supported by the Academy of Finland, the Tekes Fidipro Program (Grant No. 1849/31/2015), Tekes Project (Grant No. 3116/31/2017), Infotech, the National Natural Science Foundation of China (Grant Nos. 61772419 and 61572395) and the Tekniikan Edistamissaatio Foundation. The authors wish to acknowledge CSC-IT Center for Science, Finland, for computational resources.

The asterisk indicates the corresponding author.

Thus, we intend to approximate HR images from LR ones using the super-resolution (SR) technique [1]. Generally, SR algorithms can be divided into two categories: interpolation-based methods [2]–[4] and learning-based methods [5]–[37]. The interpolation-based methods usually regularize the reconstruction process using certain prior knowledge, while the learning-based methods aim to learn the relationship between LR and HR images from a set of training pairs. For large magnification factors, the learning-based methods significantly outperform the interpolation-based ones. Due to this advantage, learning-based methods have attracted more attention in recent research on practical applications. Chang et al. [10] utilized manifold learning methods to approximate the relationship of local patches. Yang et al. [16] introduced sparse coding methods to learn dual compact dictionaries, which are effective for representing and reconstructing HR patches. Timofte et al. [38], [39] employed regression-based methods to transform input LR patches into HR ones; a set of mapping functions is learned from the training set to describe the reconstruction procedure. Dong et al. [5]–[7] exploited internal similarities of natural images and utilized repetitive patterns to refine HR images. Recently, deep learning methods [40]–[42] have also been developed to generate HR images from LR ones in an end-to-end manner. The learning-based methods can be formulated for natural images, or can be designed for particular tasks, such as face hallucination, medical image super-resolution and document image super-resolution. We focus on learning-based face hallucination techniques in this paper.

According to the special characteristics of face images, Baker and Kanade [9] first proposed a learning-based face SR approach for reconstructing HR face images, which learns a prior on the distribution of image gradients from the training set. The specific SR procedure for face images is also called "face hallucination" in their literature. Liu et al. [12] performed face hallucination in two steps: global reconstruction and residual compensation. The first step generates the global face image, which keeps the characteristics of the face image, using a probabilistic method. The second step then produces a patch-wise residual image to compensate the result of the first step. Following [12], many classical data models have been utilized to describe the relationship between LR and HR face images, including PCA [11], Locality Preserving Projection (LPP) [13], Tensor Factorization [14], Non-negative Matrix Factorization (NMF) [16], Canonical Correlation Analysis (CCA) [15] and Two-dimensional Canonical Correlation Analysis (2DCCA) [17]. However, the reconstruction quality



Fig. 1. The SR reconstruction process of traditional patch-based methods. The reconstruction weights of the HR patch are directly estimated from the corresponding LR patch, which requires the consistency of combination weights between the LR and HR spaces.

of global face images highly depends on the given training set in these methods. It requires a large number of training images for better results. Although residual compensation is utilized to alleviate the artifacts, ringing effects often take place in the high-frequency areas of face images.

Recognizing the fact that global reconstruction tends to induce artificial results, Ma et al. [19] proposed a patch-based reconstruction method instead of the complicated two-step model. In their method, a face image is considered a highly structured object and divided into patches according to the position prior. The position information serves as a constraint, while the HR image patches are reconstructed by a combination of training patches at the same position. With the assumption that image patches share the same combination relationship in the LR and HR spaces, the combination weights can be learned in the LR space via a constrained least squares problem. A similar idea was further developed in [31] for the multiview face hallucination problem. The additional position constraint substantially enhances the performance of face hallucination and has been utilized by many researchers [20]–[30]. Subsequent methods devised reasonable regularization terms to achieve more stable and accurate solutions, such as an ℓ1-norm minimization term [20] [24], a constrained ℓp-norm minimization term [25] and a fused Lasso-based smoothness term [30]. Some researchers reconstructed the target HR image by transforming the original face image into a latent subspace [18] or the Discrete Cosine Transform (DCT) domain [22]. Manifold learning methods were further considered in [26] [27] to preserve the local geometry of image patches. The authors of [28] and [32] learned linear projection models from the training set in order to estimate the details of HR images. Zeng and Huang [33] expanded the training data to improve the quality of face hallucination results. Recently, deep learning has also been adopted for the face hallucination problem [34]–[37]. Due to the advantages of deep learning, these methods produce fine performance with rich details; meanwhile, more computational resources are required in the training phase of the deep neural networks.

Generally, the common idea of patch-based methods is to represent the input LR patch as a combination of LR training patches with a suitable reconstruction model. The reconstruction weights are then directly utilized to recover the target HR

Fig. 2. The illustration of the proposed patch-based method. It does not regularize the relationship of image patches in the LR space. The HR patches are projected into an RKHS, where a linear combination implies a nonlinear relationship in the HR space. The proposed patch-based model serves as a regularization in the final framework to optimize the target HR image.

patch with the corresponding HR training set. An illustration is shown in Fig. 1. To preserve the consistency of reconstruction weights, these methods rely on the basic assumption that LR and HR patches share the same local geometry. Actually, due to the one-to-many relationship between an LR patch and the corresponding HR ones, the combination weights in the LR space can hardly reflect the true geometry of the HR space. This problem has also been noted in the literature [1], [18], [26], [27]. Previously, researchers attempted to improve the consistency of combination weights to achieve reasonable results. Although these algorithms devise further regularization to mitigate the mismatch of local geometry, the correspondence between the LR and HR spaces is hard to satisfy completely. The weights obtained in the LR space can therefore be considered a sub-optimal solution, which gives only a rough approximation of the true value. Plausible HR images can no doubt be reconstructed by directly utilizing the corresponding weights from the LR space; however, artifacts will appear in the target HR image since the assumption of sharing the same reconstruction weights does not hold.

To address the above disadvantages, we abandon the assumption of sharing the same weights across spaces of various dimensions. In the proposed method, the relationship between the target patch and the training images is considered only in the HR space, which avoids dealing with the tough problem of preserving local geometry across different resolutions. Furthermore, most previous algorithms utilize a reconstruction framework that represents each local patch as a linear combination of training patches. Nevertheless, a linear relationship is too limited to describe the reconstruction process when the number of training patches is large. Consequently, it is a good choice to leverage nonlinearity to better represent the target patch. To this end, we embed the HR patches into an infinite-dimensional reproducing kernel Hilbert space (RKHS) via a kernel function [43]. The HR patches are assumed to have a linear relationship in the RKHS, which implies a nonlinear relationship in the original HR space. Fig. 2 gives a concise illustration of the proposed regularization model, which is different from the traditional model shown in Fig. 1.

In the patch-based model, the whole HR image is usuallydivided into overlapped patches with moderate size (e.g.,



Fig. 3. The flowchart of the proposed face hallucination method. The details of image texture for the target HR image can be well reconstructed by the patch-based model, while the pixel-based model further enhances the local structure (e.g. edges and corners). The target HR image is finally obtained by an iterative optimization process.

12 × 12). However, this patch size is a bit large for the reconstruction of local edge structure. The pixels along an edge form only a small portion of such a patch, while the remaining pixels mainly describe smooth texture. Since all the pixels in the patch are weighted equally, this induces a blurring effect on the local structure. In other words, the pixels around the edge require further consideration to enhance the high-frequency information. Thus, we propose an additional pixel-wise regularization to constrain the distribution of pixels. For each pixel in the HR image, we regularize the relationship of neighboring pixels to describe the local structure, while the edge details are enhanced by similar structures in the measured samples. The steering kernel [44] is employed to estimate the orientation of the local structure in the neighborhood. The pixels on the dominant orientation are assigned larger weights in the pixel-based model, which is useful for suppressing artifacts on complex edges. Finally, the above reconstruction models are incorporated into a global regularization framework, which produces the target HR image by iterative optimization. A flowchart of the proposed method is shown in Fig. 3. The contributions of this paper are summarized as follows:

• Due to the difficulty of preserving local geometry consistency, the assumption of sharing the same weights in spaces of various dimensions is abandoned in the proposed method. Instead, the patch-based model is built directly in the HR space, regardless of the combination relationship in the LR space.

• Motivated by the fact that kernel functions can capture the nonlinear similarity of features, the HR patches are mapped into an RKHS, which helps to efficiently represent the nonlinear relationship for the reconstruction.

• A pixel-based model is presented to emphasize the influence of the pixels along the dominant orientation of the structure, which preserves details in local edge regions.

• Experimental results show that the proposed method produces fine performance under various experimental conditions. The results also illustrate the robustness of the proposed face hallucination method.

Fig. 4. An example of the degradation model with a blurring filter of size 7 × 7 and a down-sampling operator with factor 4. All the pixels in the receptive field (red rectangle) contribute to the generation of the blue pixel in the LR patch, while the red pixels exceed the scope of the corresponding HR patch (green rectangle). Thus, the degradation from the HR patch to the LR one cannot be seen as a linear process.

II. RELATED WORK

The target of face image hallucination is to estimate an HR face image from the corresponding LR one, where the low-quality face image is degraded by smoothing, down-sampling and noising processes. The whole LR face image generation process can be represented as:

Y = DBX + ε    (1)

where Y and X are the LR and HR face images respectively, B represents the blurring filter, D represents the down-sampling process and ε denotes the additive noise. The objective of reconstruction is to solve the above inverse problem in order to recover the original HR face image X.
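As a concrete illustration of (1), the following numpy sketch simulates the degradation pipeline on a toy image. The Gaussian shape of the blur, its size and sigma, the down-sampling factor of 4 and the noise level are all illustrative assumptions: the model only specifies generic operators B, D and ε.

```python
import numpy as np

def degrade(hr, factor=4, ksize=7, sigma=1.6, noise_std=2.0, rng=None):
    """Simulate Y = DBX + eps: blur the HR image, down-sample it, add noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    # B: a separable Gaussian blur (an assumed instance of the blurring filter)
    ax = np.arange(ksize) - (ksize - 1) / 2.0
    g = np.exp(-ax**2 / (2 * sigma**2))
    g /= g.sum()
    pad = ksize // 2
    padded = np.pad(hr.astype(np.float64), pad, mode="edge")
    # convolve rows, then columns
    blurred = np.apply_along_axis(lambda r: np.convolve(r, g, mode="valid"), 1, padded)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, g, mode="valid"), 0, blurred)
    # D: decimation by `factor`
    lr = blurred[::factor, ::factor]
    # eps: additive noise
    return lr + rng.normal(0.0, noise_std, lr.shape)

hr = np.random.default_rng(1).uniform(0, 255, (96, 96))
lr = degrade(hr)
print(lr.shape)  # (24, 24)
```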

Recently, plenty of algorithms [19]–[31] recover the HR face image from the corresponding LR one under the patch-based reconstruction framework due to its effectiveness. Generally, the patch-based methods first divide the whole LR image Y into overlapped patches. Then, the corresponding HR patches can be reconstructed by utilizing the LR and HR training pairs. The final output image is obtained by averaging all the hallucinated HR patches. For each LR patch y, it will be represented by a linear combination of the LR training samples {yi}pi=1:

y = y1ω1 + y2ω2 + … + ypωp    (2)

where ωi is the combination coefficient for the corresponding training sample. The HR patch x can be estimated by substituting the LR training patches {yi}pi=1 with the HR ones {xi}pi=1:

x = x1ω1 + x2ω2 + … + xpωp    (3)
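The two-step logic of (2) and (3) can be sketched on random toy data as follows; the plain least-squares solve stands in for whichever regularized model a particular method uses, and all sizes and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p, d_lr, d_hr = 8, 9, 144              # 8 training pairs, 3x3 LR patches, 12x12 HR patches
Y_train = rng.normal(size=(d_lr, p))   # columns y_i: LR training patches
X_train = rng.normal(size=(d_hr, p))   # columns x_i: aligned HR training patches
y = rng.normal(size=d_lr)              # input LR patch

# Step 1: solve (2) for the weights in the LR space (least squares here;
# actual methods add regularization terms to pick a particular solution)
w, *_ = np.linalg.lstsq(Y_train, y, rcond=None)

# Step 2: reuse the same weights in (3) to synthesize the HR patch
x = X_train @ w
print(x.shape)  # (144,)
```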

However, these methods have two main disadvantages:

• The above reconstruction model assumes that image patches maintain similar local geometry in the LR and HR spaces; thus, the LR and corresponding HR patches share the same linear combination coefficients for reconstruction. Nevertheless, the LR and HR patches actually fail to satisfy this manifold consistency assumption. Define the function ϕi(·)



to describe the degradation procedure from the HR patch to the LR one:

yi = ϕi(xi) (4)

Equation (2) can be rewritten as:

ϕ(x) = ϕ1(x1)ω1 + ϕ2(x2)ω2 + … + ϕp(xp)ωp    (5)

Though the degradation of the whole face image can be represented as a linear process according to (1), the generation of an LR patch from the HR one is actually nonlinear due to the large receptive field of each LR pixel. An illustration is shown in Fig. 4. For most pixels in the LR patch, the range of the receptive field exceeds the corresponding HR patch, so the function ϕi(·) depends on additional pixels outside the target area. What is worse, the function ϕi(·) varies for different patches xi because of the influence of these additional pixels in the receptive field. Apparently, it is hard to derive (3) from (5) when the expression of ϕi(·) is not explicit.

If we loosen this strict condition and consider the function ϕi(·) as a simple linear operator, the degradation of image patches can be represented as yi = Pxi (P ∈ Rm×n, m < n). Thus, equation (2) can be rewritten as:

Px = Px1ω1 + Px2ω2 + … + Pxpωp    (6)

To derive (3) from (6), the matrix P ∈ Rm×n must satisfy rank(P) = n. However, the number of rows is less than the number of columns in P, which implies rank(P) ≤ m < n. Thus, it is not reasonable to reconstruct the HR patch under the manifold consistency assumption. In the proposed method, we do not consider the relationship of image patches in the LR space, but regularize the target patch in the HR space directly.

• In (3), it is assumed that each image patch can be reconstructed by a linear combination of patches in the training set. However, a linear model has difficulty describing the relationship between the target patch and the corresponding training patches. In the patch-based methods, face images are often divided into very small patches, such as only 3 × 3 pixels in the LR case. The dimension of the image patches is much less than the number of training samples, so the solution for the combination coefficients {ωi}pi=1 is not unique. Previous approaches usually impose various regularization terms to obtain a unique solution. In this case, the resulting {ωi}pi=1 are highly influenced by the regularization prior instead of the original features of the image patch. Thus, besides utilizing the regularization prior, we also project the HR patches into an infinite-dimensional RKHS for further reconstruction. Due to the increase of feature dimension, the optimization of the combination relationship becomes more stable. Meanwhile, the nonlinear relationship of the HR image patches is also captured in the RKHS, which describes the image features better than a linear model in the original HR space.
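The rank argument in the first point can be checked numerically. Under the simplifying linear assumption yi = Pxi with a random P (m < n, purely illustrative), any vector in the null space of P can be added to an HR patch without changing its LR observation, so the LR patch cannot determine the HR patch:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 9, 144                      # LR patch dimension < HR patch dimension
P = rng.normal(size=(m, n))        # linearized degradation, rank(P) <= m < n

x1 = rng.normal(size=n)
# rows of Vh beyond index m span the null space of P
null_basis = np.linalg.svd(P)[2][m:]
# a different HR patch with the same LR projection
x2 = x1 + 5.0 * null_basis[0]

print(np.allclose(P @ x1, P @ x2))   # True: identical LR observations
print(np.allclose(x1, x2))           # False: different HR patches
```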

III. THE PROPOSED ALGORITHM

The reconstruction model (1) of the underlying HR face image X is severely under-determined, which means that each LR face image corresponds to infinitely many HR estimations. To alleviate this problem, it is essential to incorporate a further regularization prior into the whole reconstruction procedure. With the additional prior, it is possible to obtain a unique solution for the HR face image X. The reconstruction model can be represented as:

X = arg min_X ‖Y − DBX‖₂² + γR(X)    (7)

where R(X) represents the regularization models and γ is the corresponding parameter that balances the reconstruction error and the regularization term. In (7), the first term ensures the consistency of the estimated HR face image with the observed LR one, while the additional prior further regularizes the local details of the HR face image. According to the characteristics of face images, we propose two effective regularization models to constrain the reconstruction process. The details of these two priors are described in the following.
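A minimal sketch of solving a problem of the form (7) by gradient descent on toy data. The operator DB is replaced by a random matrix A and R(X) by a simple Tikhonov term ‖X‖₂², both illustrative stand-ins for the actual operators and the priors proposed in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)
hr_dim, lr_dim = 64, 16            # flattened 8x8 HR and 4x4 LR images (toy sizes)
A = rng.normal(size=(lr_dim, hr_dim)) / np.sqrt(hr_dim)   # stands in for DB
X_true = rng.normal(size=hr_dim)
Y = A @ X_true                     # noiseless toy observation

gamma, eta = 0.001, 0.1
X = np.zeros(hr_dim)
for _ in range(1000):
    # gradient of ||Y - A X||_2^2 + gamma * ||X||_2^2 (Tikhonov placeholder for R)
    grad = 2 * A.T @ (A @ X - Y) + 2 * gamma * X
    X -= eta * grad

print(np.linalg.norm(Y - A @ X) / np.linalg.norm(Y))
```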

A. Regularizing local texture by HR image patch

Let InL and InH represent the LR and HR training images respectively, where n = 1, 2, …, N indexes the samples in the training set. As shown in Fig. 2, each face image is divided into overlapped patches according to the position. Denote lnp and hnp as the LR and HR image patches centered at position p. The whole training set can be classified into M groups {Lp, Hp}Mp=1, where Lp = [l1p, …, lNp] and Hp = [h1p, …, hNp] represent the corresponding LR and HR patches located at the same position. With the position constraint, the training pairs {Lp, Hp} can be utilized to recover the local details of the LR patch at the corresponding position p, whose effectiveness has been proved in [19].

For an input LR face Y, it is also divided into overlapped patches for reconstruction. We represent the LR image patch centered at position p as yLp, while the corresponding HR patch xHp is to be estimated. In the previous algorithms, the common way to reconstruct the HR part is to assume that the HR and LR patches share the same geometry structure. The optimization problem can be solved by minimizing the following function:

J(ωp) = ‖yLp − Σi ωip lip‖₂² + λΩ(ωp)    (8)

where ωp is the vector that contains all the combination coefficients ωip and λ balances the two terms. The first term represents the reconstruction error in the LR space, and the second term enforces the constraint on the combination relationship.

In order to avoid the inconsistency of the LR and HR manifolds, we directly consider the relationship between the target patch and the training set in the HR space. The optimization objective can be written as:

J(ωp, xHp) = ‖xHp − Σi ωip hip‖₂² + λΩ(ωp)    (9)

In equation (9), the reconstruction weights are no longer estimated in the LR space, which avoids the inaccuracy of the manifold consistency assumption.
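Since (9) involves both the weights and the unknown HR patch, the two can be optimized alternately. The toy sketch below assumes a ridge penalty for Ω and a damped update for the patch; both are placeholders for the actual regularization and data term of the full framework:

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim = 50, 144                  # N HR training patches of dimension 12*12
H = rng.normal(size=(dim, N))     # columns are the training patches h_i at one position
x = rng.normal(size=dim)          # initial guess of the target HR patch
lam = 0.1

residuals = []
for _ in range(10):
    # fix x, minimize (9) over the weights with Omega(w) = ||w||_2^2 (a ridge
    # placeholder; the paper's actual Omega is the kernel similarity penalty)
    w = np.linalg.solve(H.T @ H + lam * np.eye(N), H.T @ x)
    residuals.append(np.linalg.norm(x - H @ w))
    # fix the weights; in the full framework x is also pulled toward the data
    # term of (7), which we mimic here by a damped step toward H w
    x = 0.5 * x + 0.5 * (H @ w)

print(residuals[0] > residuals[-1])
```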



In order to overcome the shortcoming of the above linear combination model between HR image patches, we further introduce a kernel function to induce an infinite-dimensional RKHS for capturing the nonlinear similarity of patches. All the image patches are projected into the infinite-dimensional kernel-mapped space, while the target is to find an optimal linear combination of the nonlinear features. A linear relationship in such an RKHS corresponds to a nonlinear relationship among the HR patches in the original space. Since the nonlinear structures of the patches are taken into account, reasonable HR patches can be estimated more effectively.

Let φ : Rd → RF (d F) be a nonlinear mappingoperator, which projects the data from the original featurespace to an infinite dimensional RKHS space. The kernelfunction k(xi,xj) = φ(xi)

Tφ(xj) describes the nonlinearsimilarity between two features xi and xj in the RKHS.By using the operator φ, we are able to project the featuredata from the original HR space into RKHS, where the targetHR patch and the corresponding HR training patches can berepresented as:

x_p^H → φ(x_p^H),  H_p → φ(H_p) = [φ(h_p^1), …, φ(h_p^N)]   (10)

Thus, we substitute the kernel-mapped features into the objective function (9). It can be rewritten as:

J(ω_p, x_p^H) = ‖φ(x_p^H) − Σ_i ω_p^i φ(h_p^i)‖_2^2 + λ Ω(ω_p)   (11)

To avoid a biased solution, the regularization term Ω(ω_p) is required to constrain the combination in the RKHS. Inspired by [45], better performance is obtained if locally similar samples are privileged to receive larger combination coefficients in the reconstruction. The similarity constraint is imposed in the RKHS, which can efficiently reflect the real characteristics of HR patches. Thus, we define a kernel-based similarity constraint which measures the distance of HR patches in the RKHS. The objective function is formulated as:

J(ω_p, x_p^H) = ‖φ(x_p^H) − Σ_i ω_p^i φ(h_p^i)‖_2^2 + λ ‖d_p ⊙ ω_p‖_2^2,  s.t. Σ_i ω_p^i = 1   (12)

where ⊙ denotes the Hadamard product, the constraint ensures the shift-invariant requirement, and the vector d_p = [d_p^1, …, d_p^N]^T serves as a penalty which describes the kernel similarities between the target HR patch and the corresponding training samples in the kernel-mapped space. The kernel similarity distance can be represented as:

d_p^i = ‖φ(x_p^H) − φ(h_p^i)‖_2
      = √( φ(x_p^H)^T φ(x_p^H) + φ(h_p^i)^T φ(h_p^i) − 2 φ(x_p^H)^T φ(h_p^i) )
      = √( k(x_p^H, x_p^H) + k(h_p^i, h_p^i) − 2 k(x_p^H, h_p^i) )   (13)

Thus, if a training sample φ(h_p^i) is similar to the target HR patch φ(x_p^H), the penalty d_p^i will be small, which induces a larger value for the corresponding weight ω_p^i. The above regularization term explicitly encourages the locality property in the patch-based reconstruction in the kernel-mapped space. As a result, only a few coefficients of the solution take significant values, while most of the weight coefficients are close to zero. In practice, we merely consider the weights of the P = 200 nearest neighboring samples and set the coefficients of the other samples to zero. When the number of training samples is less than P, all the samples are considered in the regularization.
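As a concrete illustration, the kernel distance of (13) simplifies under a Gaussian kernel, since then k(x, x) = 1 for every feature. The following sketch (function name and array shapes are our own assumptions, not from the paper) computes d_p for one target patch against a set of training patches:

```python
import numpy as np

def kernel_distance(x, H, sigma=0.01):
    """Kernel-space distance of Eq. (13) under a Gaussian kernel
    k(a, b) = exp(-sigma * ||a - b||^2), for which k(a, a) = 1.
    x: target HR patch, shape (d,); H: N training patches, shape (N, d)."""
    k_xh = np.exp(-sigma * np.sum((H - x) ** 2, axis=1))  # k(x, h_i) for all i
    # d_i = sqrt(k(x, x) + k(h_i, h_i) - 2 k(x, h_i)) = sqrt(2 - 2 k(x, h_i))
    return np.sqrt(np.maximum(2.0 - 2.0 * k_xh, 0.0))
```

The P nearest neighbors mentioned above would then simply be the indices of the P smallest entries of the returned vector.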

Define R_p as the matrix that extracts the HR image patch centered at position p from the entire image. We then consider the whole HR image X as a variable, which leads to the following optimization problem:

J(ω_p, X) = ‖φ(R_p X) − φ(H_p) ω_p‖_2^2 + λ ‖d_p ⊙ ω_p‖_2^2,  s.t. 1^T ω_p = 1   (14)

where 1 is a column vector of ones. Incorporating (14) into the reconstruction model (7), the resulting optimization problem is represented as:

(ω_p, X) = arg min_{ω_p, X} ‖Y − DBX‖_2^2 + α Σ_p ( ‖φ(R_p X) − φ(H_p) ω_p‖_2^2 + λ ‖d_p ⊙ ω_p‖_2^2 ),
 s.t. 1^T ω_p = 1,  p = 1, 2, …, M   (15)

where the parameter α measures the contribution of the corresponding regularization term.

B. Regularizing local structure along edges

The patch-based model recovers most high-frequency textures of the face image. However, the local structure requires further consideration to suppress artifacts on complex edges. We therefore characterize the relation between neighboring pixels to enhance the edge structure, which is not considered in the patch-based prior.

For the pixel x_p at the pth position of the estimated HR face image X, a 5 × 5 square window N_p is selected around the target pixel as its neighborhood. In order to describe the property of the local structure, we search for the K = 50 most correlative nearest neighbors at the same position in the HR training images for reference. The selected measured patches share a similar structure for the texture near the edge, which can be utilized to optimize the local details of the target patch. Let t_p^i be the pixel at position p of the ith correlative neighbor. The target pixel x_p is assumed to be estimated by linear regression:

x_p = Σ_i t_p^i β_p^i   (16)

To obtain a unique solution, a locality-constrained ℓ2-norm is utilized to regularize the regression:

β_p = arg min_{β_p} | x_p − Σ_i β_p^i t_p^i |^2 + μ ‖g_p ⊙ β_p‖_2^2   (17)

where β_p contains all the regression coefficients β_p^i, the vector g_p = [g_p^1, …, g_p^K] describes the distance of pixels at position p with g_p^i = | x_p − t_p^i |, and μ is the regularization


parameter. Apparently, a single pixel cannot well describe the local structure, so it is essential to further take the neighboring pixels in the local neighborhood into account. We suppose that the details of edges in the target patch can be enhanced by the regression of the K selected measured samples, while all the neighboring pixels share the same regression coefficients with the central pixel. The simplest approach is to gather the loss functions of all the pixels in the neighborhood. Similar to (17), the objective function can be represented as:

β_p = arg min_{β_p} Σ_{q∈N_p} ( | x_q − Σ_i β_p^i t_q^i |^2 + μ ‖g_q ⊙ β_p‖_2^2 )   (18)

where the data term describes the fidelity of the reconstruction and the regularization term encourages the pixels from the measured samples with the most similar textures to have large regression coefficients. Specifically, the regularization term in (18) can be rewritten as Σ_{q∈N_p} ‖g_q ⊙ β_p‖_2^2 = ‖g_{N_p} ⊙ β_p‖_2^2, where the elements of the vector g_{N_p} are g_{N_p}^i = √( Σ_{q∈N_p} (g_q^i)^2 ). The element g_{N_p}^i describes the total distance over all the pixels between the target sample and the ith measured sample. The pixels from similar measured samples induce small distances g_{N_p}^i in the weighting vector g_{N_p}, which enlarges the corresponding coefficients β_p^i in the vector β_p. In (18), the optimization function considers all the pixels in the neighborhood equally. Actually, pixels located near the edge need to be privileged, since they have a much stronger influence on reconstructing the local structure. To address this, we introduce the steering kernel [44] to adaptively measure the importance of each neighboring pixel. The objective function (18) can be rewritten as a weighted least-squares optimization problem:

β_p = arg min_{β_p} Σ_{q∈N_p} ( | x_q − Σ_i β_p^i t_q^i |^2 + μ ‖g_q ⊙ β_p‖_2^2 ) w_{pq}^K   (19)

where w_{pq}^K describes the similarity between the pixel x_p and the neighboring pixel x_q. It can be represented by:

w_{pq}^K = ( √det(C_q) / (2π h^2) ) exp( −(z_q − z_p)^T C_q (z_q − z_p) / (2 h^2) )   (20)

where h is the smoothing parameter, and the 2 × 1 vectors z_p and z_q are the 2-D coordinates of the pixels x_p and x_q, respectively. C_q represents the symmetric gradient covariance matrix in the vertical and horizontal directions, which relates to the dominant orientation of the gradients in the local neighborhood [44]. To simplify the objective function (19), we can further represent it as

β_p = arg min_{β_p} (χ_p − I_p β_p)^T W_p^K (χ_p − I_p β_p) + μ ‖G β_p‖_2^2   (21)

where the vector χ_p = X(N_p) represents the concatenation of all the neighboring pixels around position p in the target HR image, I_p = [I_H^1(N_p), …, I_H^K(N_p)] represents the K nearest correlative references from the HR training set, and the similarity matrix W_p^K and the weighting matrix G are diagonal matrices with w_{pq}^K (q ∈ N_p) and g^i = √( Σ_{q∈N_p} w_{pq}^K (x_q − t_q^i)^2 ) on the diagonal, respectively.

By differentiating the objective function (21) with respect to β_p, the solution for the regression coefficients can be derived analytically:

to βp, the solution of the regression coefficient can be derivedanalytically:

βp = (IpTWK

p Ip + µGTG)−1ITpW

Kp χp (22)

Once the regression coefficients β_p have been estimated, the linear regression (16) can be rewritten as:

x_p = e_p^T I_p ( I_p^T W_p^K I_p + μ G^T G )^{-1} I_p^T W_p^K χ_p   (23)

where e_p is a column vector whose central element (corresponding to position p) equals one and whose remaining elements equal zero. Equation (23) can be simplified as x_p = a_p^T χ_p, where the vector a_p = ( e_p^T I_p ( I_p^T W_p^K I_p + μ G^T G )^{-1} I_p^T W_p^K )^T and each element of a_p represents the weight of the corresponding neighboring pixel in estimating the target pixel x_p. To regularize the local structure of the whole HR image, we minimize the approximation error over all pixels, i.e. min Σ_p ‖x_p − a_p^T χ_p‖_2^2. This minimization can be written in the concise form min ‖X − AX‖_2^2, where the matrix A regularizes the relationship of pixels for global optimization. Specifically, A(p, q) represents the contribution of pixel x_q to the estimation of pixel x_p:

A(p, q) = a_p^q if q ∈ N_p, and A(p, q) = 0 otherwise.   (24)
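Assembling A from the per-pixel weight vectors is a simple scatter operation; the sketch below (names and the dense-matrix representation are our own assumptions, and a sparse matrix would be preferable at image scale) mirrors the rule in (24):

```python
import numpy as np

def build_A(n_pixels, neighborhoods, a_vectors):
    """Assemble the pixel-relation matrix A of Eq. (24).
    neighborhoods[p]: indices q in the window N_p of pixel p;
    a_vectors[p]: weights a_p, aligned with neighborhoods[p]."""
    A = np.zeros((n_pixels, n_pixels))
    for p in range(n_pixels):
        for q, a in zip(neighborhoods[p], a_vectors[p]):
            A[p, q] = a  # contribution of pixel q to pixel p
    return A
```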

Incorporating the above regularization model into (15), the final optimization function can be obtained as:

(ω_p, X) = arg min_{ω_p, X} ‖Y − DBX‖_2^2 + η ‖X − AX‖_2^2 + α Σ_p ( ‖φ(R_p X) − φ(H_p) ω_p‖_2^2 + λ ‖d_p ⊙ ω_p‖_2^2 ),
 s.t. 1^T ω_p = 1,  p = 1, 2, …, M   (25)

where the parameter η describes the importance of the proposed pixel-based model.

C. Summary of the proposed algorithm

The final output HR face image X can be obtained by solving the above optimization problem (25). However, the objective function is nonconvex, so it is difficult to obtain the globally optimal solution directly. Fortunately, a locally optimal solution can be obtained by an iterative procedure. To deal with the problem in (25), we alternately optimize over ω_p and X while keeping the other fixed. If ω_p is fixed, the optimal X can be obtained by minimizing the following function:

f(X) = ‖Y − DBX‖_2^2 + η ‖X − AX‖_2^2 + α Σ_p ( ‖φ(R_p X) − φ(H_p) ω_p‖_2^2 + λ ‖d_p ⊙ ω_p‖_2^2 )   (26)


The minimization of (26) can be addressed iteratively by using the gradient descent method. The iterates are updated by:

X^{n+1} = X^n − τ ∇f(X^n)   (27)

where τ is the step size of the gradient descent. The gradient of the function (26) can be further described as:

∇f(X) = 2 (DB)^T (DBX − Y) + 2η (I − A)^T (I − A) X + α Σ_p ∇J_p(X)   (28)

The kernel function in J_p(X) projects the image patches from the original HR space to a high-dimensional kernel space that better represents the local relationship. In order to simplify (28), we first need to define a concrete kernel function. The Gaussian kernel k(x_i, x_j) = exp(−σ ‖x_i − x_j‖_2^2) is utilized in the proposed method, since it transforms the features to infinite dimension, which helps to capture the nonlinear characteristics. In this case, the gradient ∇J_p(X) can be further described as:

∇J_p(X) = ∇ ‖φ(R_p X) − φ(H_p) ω_p‖_2^2 + λ Σ_i (ω_p^i)^2 ∇ ‖φ(R_p X) − φ(h_p^i)‖_2^2
        = −2 ∇ [φ(R_p X)]^T [φ(H_p) ω_p] − 2λ Σ_i (ω_p^i)^2 ∇ [φ(R_p X)]^T [φ(h_p^i)]
        = −2 Σ_i [ω_p^i + λ (ω_p^i)^2] ∇ e^{−σ ‖R_p X − h_p^i‖_2^2}
        = 4σ Σ_i [ω_p^i + λ (ω_p^i)^2] e^{−σ ‖R_p X − h_p^i‖_2^2} R_p^T (R_p X − h_p^i)   (29)
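The final line of (29) depends on X only through the extracted patch R_p X, so each patch contributes a kernel-weighted sum of differences against its training patches. A sketch of that per-patch gradient (argument names are our own assumptions) is:

```python
import numpy as np

def grad_Jp_patch(x_patch, H, w, sigma=0.01, lam=0.001):
    """Per-patch gradient of Eq. (29), expressed w.r.t. the patch R_p X.
    x_patch: (d,) extracted patch; H: (N, d) HR training patches;
    w: (N,) combination weights omega_p."""
    diff = x_patch - H                               # R_p X - h_p^i, row-wise
    k = np.exp(-sigma * np.sum(diff ** 2, axis=1))   # Gaussian kernel values
    coef = 4.0 * sigma * (w + lam * w ** 2) * k      # scalar factor per sample
    return coef @ diff                               # (d,) gradient
```

The R_p^T in (29) then scatters this patch-level gradient back to image coordinates; the sketch stops at the patch level.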

When the value of X is fixed, the optimization of equation (25) reduces to minimizing:

J_p(ω_p) = ‖φ(R_p X) − φ(H_p) ω_p‖_2^2 + λ ‖d_p ⊙ ω_p‖_2^2,  s.t. 1^T ω_p = 1,  p = 1, 2, …, M   (30)

The solution of ωp can be derived from (30) analytically:

ω_p = ( Z^{-1} 1 ) / ( 1^T Z^{-1} 1 )   (31)

where the matrix Z can be obtained by:

Z = ( φ(R_p X) 1^T − φ(H_p) )^T ( φ(R_p X) 1^T − φ(H_p) ) + λ (diag(d_p))^2   (32)

where diag(•) forms a diagonal matrix from the corresponding vector. Considering the characteristics of the Gaussian kernel, each element of the matrix Z can be concretely represented by:

z_ij = 1 − e^{−σ ‖R_p X − h_i‖_2^2} − e^{−σ ‖R_p X − h_j‖_2^2} + e^{−σ ‖h_i − h_j‖_2^2} + λ (d_p^i)^2   if i = j
z_ij = 1 − e^{−σ ‖R_p X − h_i‖_2^2} − e^{−σ ‖R_p X − h_j‖_2^2} + e^{−σ ‖h_i − h_j‖_2^2}               otherwise   (33)
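Equations (31)-(33) admit a direct implementation: build Z from kernel evaluations, solve one linear system, and normalize. A sketch under our own naming assumptions:

```python
import numpy as np

def solve_weights(x_patch, H, d, lam=0.001, sigma=0.01):
    """Combination weights omega_p via Eqs. (31)-(33).
    x_patch: (d,) patch R_p X; H: (N, d) HR training patches;
    d: (N,) kernel similarity distances d_p of Eq. (13)."""
    k_xh = np.exp(-sigma * np.sum((H - x_patch) ** 2, axis=1))  # k(R_p X, h_i)
    pair = np.sum((H[:, None, :] - H[None, :, :]) ** 2, axis=2)
    k_hh = np.exp(-sigma * pair)                                # k(h_i, h_j)
    Z = 1.0 - k_xh[:, None] - k_xh[None, :] + k_hh              # Eq. (33)
    Z = Z + lam * np.diag(d ** 2)                               # penalty on the diagonal
    ones = np.ones(H.shape[0])
    v = np.linalg.solve(Z, ones)                                # Z^{-1} 1
    return v / (ones @ v)                                       # Eq. (31)
```

In practice Z can be ill-conditioned for near-duplicate patches, so a small extra ridge on the diagonal (beyond λ(d_p^i)^2) is a common safeguard.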

The objective function is solved by alternately optimizing (26) and (30). Thus, an initial estimate of the two objectives is required to start the iterative process. We first utilize the LR image features in place of the corresponding HR features in (30), which gives a rough estimate of the initial value of ω_p. Before initialization, the LR images are magnified to the size of the HR ones by bicubic interpolation to obtain more high-frequency information. We likewise take the interpolation result of Y as the initial estimate of the HR face image X^(0).

With the initial values of the two objectives, we optimize the output HR image X by minimizing (26). Once this is done, the combination coefficients ω_p at each position are obtained by solving (30). We repeat the iterative procedure, updating the above objectives, until the optimization problem (25) converges to a local minimum. Finally, the optimal solution X is taken as the output HR face image. The details of the complete face hallucination procedure are summarized in Algorithm 1.

Algorithm 1 The proposed face hallucination algorithm.
Input: LR testing image Y, LR/HR training images.
1: Initialization: Estimate the initial HR face image X^(0) and the combination coefficients {ω_p^(0)}_{p=1}^M at each local position.
2: for t = 1 to T do
3:   repeat
4:     Construct the relationship of local patches in the HR feature space according to (14).
5:     Learn the local edge structure via the weight matrix A^(t) according to (24).
6:     Update X^(t) through (26) while fixing {ω_p^(t−1)}_{p=1}^M.
7:     Update {ω_p^(t)}_{p=1}^M through (30) while fixing X^(t).
8:   until convergence
9: end for
Output: The final output HR face image X*.
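The alternating structure of Algorithm 1 can be captured in a few lines. The skeleton below takes the two per-step solvers as callables, since their bodies depend on the training data (the function names and this factoring are our own assumptions):

```python
def alternate(step_X, step_w, X0, w0, T=10):
    """Alternating optimization skeleton of Algorithm 1.
    step_X(X, w): one update of X by minimizing Eq. (26);
    step_w(X): closed-form weights from Eq. (30);
    X0, w0: initial estimates; T: maximum number of outer iterations."""
    X, w = X0, w0
    for _ in range(T):
        X = step_X(X, w)  # fix weights, descend on X
        w = step_w(X)     # fix X, solve for weights
    return X
```

A convergence check on the objective (25) could replace the fixed iteration count, matching the repeat/until loop above.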

IV. EXPERIMENTAL RESULTS

The experiments are conducted on the CAS-PEAL [46] and FERET [47] face databases to demonstrate the performance of the proposed method. For the CAS-PEAL database, we choose a subset that contains 1040 frontal face images with normal expression. We randomly choose 40 images for testing, while the other 1000 images are utilized for training. For the FERET database, 600 images corresponding to 200 different persons are selected from the ba, bj and bk subsets. The ba subset contains frontal images with regular expression, the bj subset consists of images with alternative expression, and the bk subset suffers from illumination variation. Among them, 30 images of 10 individuals are selected for testing, and the other images serve as the training set. All the face images are aligned by the positions of


the two eyes and cropped to 128 × 128 pixels. The HR images are blurred using a 7 × 7 Gaussian filter with standard deviation 0.85 and down-sampled to 32 × 32 pixels to produce the LR images. In the experiments, the intensity values of the images are normalized to the range from 0 (black) to 1 (white) for convenience. We compare the results of the proposed method with several state-of-the-art baselines (i.e., LSR [19], WASR [25], LINE [26], SRLSP [28], VDSR [41] and LCGE [37]) in order to validate its effectiveness.
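The degradation pipeline described above (7 × 7 Gaussian blur with standard deviation 0.85, then 4× down-sampling from 128 × 128 to 32 × 32) can be reproduced with a short script; the border handling below is our own assumption, since the paper does not specify it:

```python
import numpy as np

def gaussian_kernel(size=7, std=0.85):
    """Normalized 2-D Gaussian kernel with the given support and std."""
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2.0 * std ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def degrade(hr, scale=4):
    """Blur an HR image with the 7x7 Gaussian, then down-sample by `scale`.
    Edge replication is used at the borders (an assumption)."""
    k = gaussian_kernel()
    pad = k.shape[0] // 2
    padded = np.pad(hr, pad, mode='edge')
    out = np.empty_like(hr, dtype=float)
    for i in range(hr.shape[0]):
        for j in range(hr.shape[1]):
            out[i, j] = np.sum(padded[i:i + 7, j:j + 7] * k)
    return out[::scale, ::scale]
```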

A. Parameter settings

Several parameters are required to be fixed in the proposed method. For the patch-based model, HR patches are set to 12 × 12 pixels with an overlap of four pixels. The penalty parameter λ in (25) and the parameter σ of the Gaussian function are set to 0.001 and 0.01, respectively. In the pixel-based model, the parameter μ is fixed to 0.5 and the smoothing parameter h is set to 2. We set the regularization parameters η = 0.05 and α = 0.01 in the optimization function (25). The maximum number of iterations is set to T = 10.

B. Analysis of Assembling Regularization Models

In the proposed method, two regularization models are assembled to optimize the target HR image. In this subsection, we discuss the influence of the regularization parameters α and η, which balance the regularization terms in (25). The experiment also shows that the combination of the two regularization models is efficient in enhancing the final SR results.

[Figure: plots of average PSNR (dB) and SSIM versus parameter η; the annotated gains are 0.42 dB in PSNR and 0.0051 in SSIM.]

Fig. 5. The effects of parameter η on the hallucinated face images. (a) PSNR. (b) SSIM.

[Figure: plots of average PSNR (dB) and SSIM versus parameter α; the annotated gains are 0.96 dB in PSNR and 0.0095 in SSIM.]

Fig. 6. The effects of parameter α on the hallucinated face images. (a) PSNR. (b) SSIM.

We randomly choose 10 testing images from the CAS-PEAL face database for this experiment. The average PSNR and SSIM [48] values are utilized to evaluate the influence of the parameters. We tune the parameter η from 0 to 0.1 in steps of 0.01 and the parameter α from 0 to 0.05 in steps of 0.005, respectively. Fig. 5 and Fig. 6 plot the average PSNR and SSIM values as η and α vary. We observe that the best results are obtained at η = 0.05 and α = 0.01. When η = 0, the output HR image is regularized merely by the patch-based model; similarly, the pixel-based model alone produces the HR image when α = 0. Thus, the above experimental results also compare the effects of the two regularization models: the pixel-based model improves the reconstruction quality by 0.42 dB in PSNR and 0.0051 in SSIM, while the patch-based model enhances the average quantitative indicators by 0.96 dB in PSNR and 0.0095 in SSIM. This illustrates that the combination of the two regularization models produces superior results for face hallucination.

C. Experiments on the CAS-PEAL database

In this subsection, the proposed approach is compared with the state-of-the-art methods LSR [19], WASR [25], LINE [26], SRLSP [28], VDSR [41] and LCGE [37]1 to evaluate its effectiveness. Some representative hallucinated results are shown in Fig. 7. LSR [19] produces blurring effects on the eyes, nostrils and mouth, and ringing effects appear around the contour of the face. WASR [25] takes advantage of sparse representation and recovers more details than LSR [19], but staircase noise appears obviously along the margin of the face. LINE [26] alleviates the jaggy artifacts near the contour, but the fine details are not well recovered, and the textures around the eyes and mouth are over-smoothed. SRLSP [28] enhances the local texture on the face image, but produces some aliasing artifacts near areas with high-frequency details; slight blocking artifacts also appear on the nose under strong illumination. VDSR [41] and LCGE [37] produce face images with more reasonable details by deep learning, but these algorithms sometimes cause noticeable artifacts near the eye regions. The proposed approach suppresses the noise on the face image and avoids the ringing effects near the contour, while producing sharper details in the eyeball and mouth regions. In particular, the proposed method successfully reconstructs the details of double eyelids and eye bags in some representative results, whereas the comparison methods fail to recover such fine high-frequency information on the face image.

In Fig. 8, we further compare the PSNR and SSIM values of the reconstructed HR face images for the above algorithms. The proposed method achieves the best performance for most testing images. The average PSNR and SSIM values of the various methods are shown in Table I. Compared with the second best method [37], the proposed method gains

1The source codes of LINE [26], SRLSP [28] and LCGE [37] are available on their authors' webpages.


Fig. 7. SR results on the CAS-PEAL database. (a) Input LR image. (b) LSR [19]. (c) WASR [25]. (d) LINE [26]. (e) SRLSP [28]. (f) VDSR [41]. (g) LCGE [37]. (h) The proposed method. (i) Original HR image. (The apparent distinctions between the compared algorithms and the proposed one are marked in the red rectangle. Please refer to the electronic version and zoom in for better comparison.)

[Figure: per-image PSNR (dB) and SSIM curves of the seven methods over the 40 CAS-PEAL testing images.]

Fig. 8. PSNR and SSIM values on the CAS-PEAL database. (a) PSNR. (b) SSIM.

TABLE I
THE AVERAGE PSNR AND SSIM VALUES FOR THE CAS-PEAL DATABASE. TOP: PSNR (dB), BOTTOM: SSIM.

        LSR     WASR    LINE    SRLSP   VDSR    LCGE    Proposed
PSNR    29.67   30.69   30.95   31.06   31.12   31.13   31.73
SSIM    0.9177  0.9290  0.9317  0.9333  0.9345  0.9350  0.9404

an improvement of 0.60 dB in PSNR and 0.0054 in SSIM. The experimental results further illustrate the advantage of the proposed method in the quantitative comparison.

D. Experiments on the FERET database

We also conduct experiments on the FERET database in order to show the robustness of the proposed method. The testing face images are acquired under significantly different illuminations and facial expressions, which allows us to evaluate the proposed algorithm under various conditions. Some comparison results are presented in Fig. 9. The proposed approach produces the best visual quality, both in terms of flat textures and

[Figure: per-image PSNR (dB) and SSIM curves of the seven methods over the 30 FERET testing images.]

Fig. 10. PSNR and SSIM values on the FERET database. (a) PSNR. (b) SSIM.

TABLE II
THE AVERAGE PSNR AND SSIM VALUES FOR THE FERET DATABASE. TOP: PSNR (dB), BOTTOM: SSIM.

        LSR     WASR    LINE    SRLSP   VDSR    LCGE    Proposed
PSNR    31.74   32.19   32.19   32.34   32.45   32.59   32.83
SSIM    0.8941  0.9005  0.8995  0.9002  0.9088  0.9057  0.9088

details in high-frequency areas (e.g. eyes, teeth and mouth), while the other approaches produce noticeable artifacts on the face image. Fig. 10 presents the corresponding PSNR and SSIM values for the comparison of the different methods, and the average quantitative comparisons are shown in Table II. Compared with LCGE [37], our method achieves an increase of 0.24 dB in PSNR and 0.0031 in SSIM. The experimental results demonstrate that our method is able to handle variations in illumination and facial expression.


Fig. 9. SR results on the FERET database. (a) Input LR image. (b) LSR [19]. (c) WASR [25]. (d) LINE [26]. (e) SRLSP [28]. (f) VDSR [41]. (g) LCGE [37]. (h) The proposed method. (i) Original HR image. (The apparent distinctions between the compared algorithms and the proposed one are marked in the red rectangle. Please refer to the electronic version and zoom in for better comparison.)

Fig. 12. SR results on the real-world image. (a) Input LR image. (b) LSR [19]. (c) WASR [25]. (d) LINE [26]. (e) SRLSP [28]. (f) VDSR [41]. (g) LCGE [37]. (h) The proposed method. (The apparent distinctions between the compared algorithms and the proposed one are marked in the red rectangle. Please refer to the electronic version and zoom in for better comparison.)

Fig. 11. The real-world image from Internet.

E. Experiments on the real-world images

In this subsection, we conduct experiments on real-world images for comparison. Fig. 11 shows a real-world image collected from the Internet; the marked human faces are hallucinated in the experiment. The LR faces are manually aligned through the positions of the two eyes and standardized to the size of 32 × 32. Since the individuals in the image are all from America, we utilize the FERET face database as the training set for reconstructing the HR face images. Fig. 12 presents the hallucinated results of the various methods. From the experimental results, we can see that LSR [19], WASR [25] and LINE [26] are very sensitive to noise, and many aliasing artifacts appear on the


TABLE III
THE AVERAGE PSNR AND SSIM VALUES FOR THE CELEBA DATABASE. TOP: PSNR (dB), BOTTOM: SSIM.

        LSR     WASR    LINE    SRLSP   VDSR    LCGE    Proposed
PSNR    33.23   33.55   33.57   33.69   33.57   33.51   34.25
SSIM    0.9167  0.9207  0.9208  0.9227  0.9273  0.9220  0.9297

reconstructed face image, especially in the regions containing complex details. Furthermore, the textures of the eyes appear mixed with the edge of the eyelid in some SR results. The methods of [28], [41] and [37] are able to reduce the noisy effects in the SR results. However, their reconstructions still somewhat suffer from artifacts; it is apparent that the details around the eye area are not very clear. The proposed method produces very reasonable results: it suppresses the noise on the face image and recovers fine details in the SR results, achieving better performance than the other methods.

In order to further validate the effectiveness of the various algorithms in a real-world environment, we conduct an extensive experiment on the CelebA database [49]. The images of the CelebA dataset are all obtained from the Internet and include different lighting conditions and facial expressions from various environments. In the experiment, we select 1000 face images with near-frontal pose as training samples, while another 50 near-frontal face images serve as the testing set. The HR face regions are cropped and resized to 128 × 128 pixels. Then the 32 × 32 LR face images are obtained by the blurring and down-sampling procedures. Fig. 13 presents some representative SR results for comparison. The proposed method produces fine details, especially in the regions of the eyes and mouth. The PSNR and SSIM values for all the testing images are shown in Fig. 14, and the average PSNR and SSIM values are shown in Table III. Compared with the other algorithms, the proposed method achieves the best performance in both visual quality and quantitative indicators. The experimental results illustrate that the proposed method is capable of dealing with face images in real-world environments.

[Figure: per-image PSNR (dB) and SSIM curves of the seven methods over the 50 CelebA testing images.]

Fig. 14. PSNR and SSIM values on the CelebA database. (a) PSNR. (b) SSIM.

F. Impact of training set size

In this subsection, we conduct experiments on the CAS-PEAL database to discuss the influence of the training set size. In the former experiment, we selected 40 images from the database for testing, while all the remaining 1000 images served as the training set. To study the relationship between training set size and reconstruction quality, we decrease the size of the training set from 1000 to 200 in steps of 200, while the 40 testing images remain fixed for comparison. We then perform the face hallucination experiments and compare the reconstruction results for training sets of various sizes. Some representative hallucinated images are shown in Fig. 15. The proposed method preserves sharp details and produces reasonable HR images with all the training sets; the visual quality changes only slightly with the size of the training set.

TABLE IV
THE AVERAGE PSNR AND SSIM VALUES FOR VARIOUS TRAINING SET SIZES.

Size        200     400     600     800     1000
PSNR (dB)   30.82   31.37   31.54   31.66   31.73
SSIM        0.9315  0.9370  0.9391  0.9399  0.9404

[Figure: per-image PSNR (dB) and SSIM curves over the 40 CAS-PEAL testing images for training sets of 200, 400, 600, 800 and 1000 images.]

Fig. 16. PSNR and SSIM values with various amounts of training samples. (a) PSNR. (b) SSIM.

Fig. 16 presents the PSNR and SSIM values of the 40 testing images for the various training sets. The quantitative indicators gradually increase with the number of training samples. Notably, the proposed method achieves fine results even when the training set contains only 200 image samples. The experimental results show that the proposed algorithm is insensitive to the size of the training set.

The average PSNR and SSIM values are also presented in Table IV. The advantage of the proposed method is remarkable: compared with the state-of-the-art methods shown in Table I (with 1000 training samples), it achieves superior quantitative indicators with only 400 samples. The experimental results further demonstrate that the proposed algorithm is capable of producing satisfactory results with a small number of training samples.

G. Comparison on Resource-limited Training Set

Recently, many algorithms in the computer vision area aim to train complex models from massive amounts of data. Though increasing the training data does improve performance, real-world applications often have to cope with a lack of data when the training resources are limited.


Fig. 13. SR results on the CelebA database. (a) Input LR image. (b) LSR [19]. (c) WASR [25]. (d) LINE [26]. (e) SRLSP [28]. (f) VDSR [41]. (g) LCGE [37]. (h) The proposed method. (i) Original HR image. (The apparent distinctions between the compared algorithms and the proposed one are marked in the red rectangle. Please refer to the electronic version and zoom in for better comparison.)

Fig. 15. SR results on the CAS-PEAL database with various amounts of training samples. (a) Input LR image. (b) 200 training images. (c) 400 training images. (d) 600 training images. (e) 800 training images. (f) 1000 training images. (g) Original HR image. (Please refer to the electronic version and zoom in for better comparison.)

Fig. 17. SR results on the CAS-PEAL database with only 100 training samples. (a) Input LR image. (b) LSR [19]. (c) WASR [25]. (d) LINE [26]. (e) SRLSP [28]. (f) VDSR [41]. (g) LCGE [37]. (h) The proposed method. (i) Original HR image. (The apparent distinctions between the compared algorithms and the proposed one are marked in the red rectangle. Please refer to the electronic version and zoom in for better comparison.)


TABLE V
THE AVERAGE PSNR AND SSIM VALUES ON THE RESOURCE-LIMITED TRAINING SET. TOP: PSNR (dB), BOTTOM: SSIM.

        LSR     WASR    LINE    SRLSP   VDSR    LCGE    Proposed
PSNR    29.01   29.39   29.39   29.35   30.08   30.15   30.19
SSIM    0.8997  0.9072  0.9065  0.9058  0.9250  0.9230  0.9211

Thus, we devise an experiment on the CAS-PEAL database to study the robustness of the various algorithms. We fix the LR testing images as previously mentioned, while only 100 images are randomly selected from the whole data set for training; that is, the number of training images decreases from 1000 to 100. Fig. 17 shows some representative SR results of the various algorithms. With resource-limited training data, the visual quality of the hallucinated images is apparently degraded for all the methods, and serious artifacts are produced on the output HR images. Though some noisy effects appear in the results of the proposed method, it consistently preserves the ability to generate plausible high-frequency details, especially in the eye and mouth regions. Meanwhile, the algorithms [19], [25], [26], [28], [41] tend to induce ringing effects and unexpected visual artifacts on the face image. The noisy and ringing effects can be suppressed by [37], but its results appear over-smoothed and lose local details. We further compare the PSNR and SSIM values for all the testing images in Fig. 18, and the average PSNR and SSIM values are shown in Table V. The methods [41], [37] and the proposed method achieve higher quantitative indicators than the other approaches, while the proposed method also produces more vivid details than [41] and [37], which is important for the discriminability of the human visual system. The experimental results further illustrate that the proposed algorithm has the advantage of producing superior results with resource-limited training data.


Fig. 18. PSNR and SSIM values on the resource-limited training set. (a) PSNR. (b) SSIM.

H. Computational Cost

In order to evaluate the convergence of the proposed algorithm, we present the quantitative indicators over the iterations of the proposed algorithm on the CAS-PEAL database. Fig. 19 shows the average PSNR and SSIM values for the testing images as the iteration number increases. From the experimental results, we can see that the proposed algorithm achieves superior results as the iteration number increases, while the PSNR and SSIM measures become stable after 10 iterations. To balance reconstruction quality and efficiency, the maximum number of iterations is set to 10 in the proposed method.
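The stopping rule described above (a fixed iteration budget, with the option of halting earlier once the iterates stabilize) can be sketched generically; the helper below is our illustration and uses a toy Newton update in place of the paper's reconstruction step:

```python
def iterate_until_stable(step, x0, max_iters=10, tol=1e-4):
    """Run x_{k+1} = step(x_k), stopping at max_iters (the paper fixes 10)
    or earlier once successive iterates change by less than tol."""
    x = x0
    for k in range(1, max_iters + 1):
        x_next = step(x)
        if abs(x_next - x) < tol:
            return x_next, k
        x = x_next
    return x, max_iters

# Toy stand-in for one reconstruction update: Newton's step for sqrt(2).
value, iters = iterate_until_stable(lambda x: 0.5 * (x + 2.0 / x), 1.0)
print(round(value, 6), iters)  # prints 1.414214 4
```

Early stopping of this kind is what makes the reduced-iteration variant discussed below (two iterations instead of ten) a straightforward speed/quality trade-off.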


Fig. 19. The quantitative indicators with different iteration numbers. (a) PSNR. (b) SSIM.

We also compare the computational time of the proposed algorithm with the other methods. LSR [19], WASR [25], LINE [26], SRLSP [28] and the proposed algorithm are implemented in MATLAB on a computer with a 3.2 GHz CPU and 8 GB of memory. VDSR [41] and LCGE [37] are run on a single Tesla K80 GPU using Caffe, while the enhancement part in [37] is performed in C++ on the CPU. The deep learning based methods (VDSR [41] and LCGE [37]) require about eight hours and two days, respectively, to train the deep network on the GPU, while the other methods need no pre-training procedure. The average running times of LSR [19], WASR [25], LINE [26], SRLSP [28], VDSR [41], LCGE [37] and the proposed method are 10.6 s, 24.5 s, 17.1 s, 9.3 s, 2.1 s, 72.8 s and 35.5 s, respectively. Though the proposed method is not superior in computational time, it produces superior results at a reasonable time cost. The proposed method also has the advantage of producing HR results without a time-consuming training process. Furthermore, it is feasible to improve the efficiency of the proposed method by reducing the number of iterations. According to Fig. 19, the proposed method achieves 31.47 dB in PSNR and 0.9387 in SSIM after the second iteration. In that case, the computational cost would drop to about one-fifth of the current time cost, which would then beat most of the existing methods.

I. Effectiveness of utilizing RKHS

In the following, we design an experiment to show the effectiveness of using the RKHS in the reconstruction. In this experiment, we omit the projection of the patches into the RKHS and compare the results with those of the proposed method. Specifically, the reconstruction model without the RKHS is obtained by setting φ(x) = x in (25); the details of the derivation are given in the Supplemental Material. The experiments are conducted on the CAS-PEAL database, with the 40 testing images fixed for comparing visual quality and quantitative indicators. Some representative hallucinated results are presented in Fig. 20. The proposed algorithm obtains vivid results for the details of the two-layer eyelids. The PSNR and SSIM values for all testing images are shown in Fig. 21. It is apparent that the reconstruction in the RKHS produces better results than the algorithm in the original HR space. The average PSNR and SSIM values over all testing images are 31.47 dB and 0.9385 for the reconstruction algorithm without the RKHS. Compared with the results in Table I, the proposed method achieves an improvement of 0.26 dB in PSNR and 0.0019 in SSIM, respectively. The experimental results demonstrate that the reconstruction in the RKHS is effective for obtaining superior HR images for face hallucination.

Fig. 20. Comparison for the effectiveness of utilizing RKHS. (a) Without RKHS. (b) The proposed method. (Please refer to the electronic version and zoom in for better comparison.)
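The contrast between φ(x) = x and the full RKHS model is the familiar kernel trick: inner products in the implicit feature space are evaluated through a kernel function without ever forming φ explicitly. The sketch below is our illustration (the function names and the Gaussian kernel choice are assumptions; the paper's model (25) is not reproduced here):

```python
import numpy as np

def linear_kernel(X, Y):
    """phi(x) = x: plain inner products in the original patch space,
    i.e. the non-RKHS baseline of this experiment."""
    return X @ Y.T

def gaussian_kernel(X, Y, sigma=1.0):
    """Inner products in an implicit high-dimensional RKHS via the
    kernel trick; phi is never formed explicitly."""
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-sq / (2.0 * sigma ** 2))

patches = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
K = gaussian_kernel(patches, patches)
print(np.allclose(np.diag(K), 1.0))  # each patch is maximally similar to itself
```

Because every step of a patch-wise reconstruction model can be phrased in terms of such Gram matrices, swapping the kernel switches between the linear baseline and the nonlinear RKHS variant compared in Fig. 21.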


Fig. 21. PSNR and SSIM values for the experimental results. (a) PSNR. (b) SSIM.

J. Effectiveness of the patch-based and pixel-based models

In this subsection, we respectively give the reconstruction results for the patch-based and pixel-based models to show the effectiveness of each regularization term. The experiments are conducted on the CAS-PEAL database, and the results of the various regularization models are obtained by alternately adjusting the parameters η and α in (25). When we set η = 0, the output HR image is produced by the patch-based regularization model alone; when α is set to 0, only the pixel-based model is utilized to obtain the output HR image. Some representative SR results are shown in Fig. 22, and the PSNR and SSIM values of all testing images are presented in Fig. 23. The algorithm with the patch-based regularization term alone achieves average quantitative indicators of 31.36 dB in PSNR and 0.9355 in SSIM, while the average PSNR and SSIM values are 30.81 dB and 0.9314 for the algorithm with the pixel-based regularization term alone. According to the experimental results, the patch-based model produces superior results to the pixel-based one when they are utilized separately. However, the combined regularization terms consistently outperform either separate regularization term, as shown in Fig. 23. Thus, both regularization models play an important role in the proposed method.

Fig. 22. Face hallucination results for utilizing various regularization models. (a) Pixel-based regularization model. (b) Patch-based regularization model. (c) The proposed method. (Please refer to the electronic version and zoom in for better comparison.)
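The ablation above, in which setting η = 0 or α = 0 isolates one regularizer, can be illustrated with a toy quadratic objective that admits a closed-form minimizer. This is our stand-in for the structure of the combined model, not the paper's exact objective (25); `D`, `x_patch` and `x_pixel` are hypothetical placeholders for the degradation operator and the two intermediate estimates:

```python
import numpy as np

def fuse_estimates(D, y, x_patch, x_pixel, alpha, eta):
    """Minimize ||y - D x||^2 + alpha ||x - x_patch||^2 + eta ||x - x_pixel||^2.
    eta = 0 keeps only the patch-based pull, alpha = 0 only the pixel-based
    pull, mirroring the ablation in this subsection."""
    n = D.shape[1]
    # Normal equations of the quadratic objective.
    A = D.T @ D + (alpha + eta) * np.eye(n)
    b = D.T @ y + alpha * x_patch + eta * x_pixel
    return np.linalg.solve(A, b)
```

When both auxiliary estimates agree with the data term, the fused solution reproduces them exactly; when they disagree, the weights α and η control how the final HR estimate trades off the two regularizers.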


Fig. 23. Comparison for the patch-based and pixel-based models. (a) PSNR. (b) SSIM.

K. The performance of face recognition

In this subsection, we evaluate the performance of the above face hallucination algorithms in a face recognition application. The experiments are conducted on the Multi-PIE database [50]. We select a subset of session 04 which contains frontal face images with neutral expression under 20 different illumination conditions (Camera: 05_1, Recording Number: 01). For each individual, five face images are randomly selected as the testing set, while the rest of the images are used as the training set for the recognition experiments. The HR face images are cropped and normalized to 64×64 pixels, while the corresponding LR images are down-sampled to 16×16 pixels to perform the experiments at a low resolution. To evaluate the effect of face hallucination, the LR testing images are hallucinated to a resolution of 64×64 pixels using the training set. The face recognition experiments are then performed on the hallucinated HR face images. Since the experiments focus on comparing the recognition ability of the various face hallucination methods, we employ PCA to obtain the eigenfaces and utilize the 1NN classifier for recognition. Using the simplest recognition algorithm helps demonstrate that the improvement in recognition rate owes to the face hallucination methods rather than to other sophisticated classification algorithms. To better compare the experimental performance, we also provide the recognition rates on the degraded LR images and the original HR images as baselines. The recognition results of the various algorithms are shown in Table VI. The recognition rate of the proposed method is 81.90%, the highest among the face hallucination approaches, and about 13% higher than the result on the degraded LR images without face hallucination. We also notice that bicubic interpolation slightly decreases the recognition rate because it produces artificial effects on the reconstructed HR images. The experiments demonstrate that the proposed algorithm has the ability to reconstruct faithful details and improve recognition performance.

TABLE VI
THE FIRST-RANK RECOGNITION RESULTS FOR VARIOUS FACE HALLUCINATION ALGORITHMS

LR        Bicubic   LSR       WASR      LINE
68.57%    68.14%    78.96%    80.09%    79.57%

SRLSP     VDSR      LCGE      Proposed  HR
79.65%    81.56%    79.13%    81.90%    87.19%
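The PCA-plus-1NN pipeline used above (eigenfaces followed by nearest-neighbour matching) can be sketched compactly. The helpers below are our illustration on a tiny synthetic example, not the paper's experimental code:

```python
import numpy as np

def pca_fit(X, k):
    """Eigenfaces: rows of X are flattened training faces; returns the
    mean face and the top-k principal directions."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def nn_classify(probe, mean, basis, train_feats, labels):
    """Project a probe face onto the eigenface subspace and return the
    label of the nearest training face (1NN)."""
    f = basis @ (probe - mean)
    dists = np.linalg.norm(train_feats - f, axis=1)
    return labels[int(np.argmin(dists))]

# Tiny synthetic example with two well-separated "identities".
X = np.array([[10.0, 0.0, 0.0, 0.0],
              [10.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 10.0, 0.0],
              [0.0, 1.0, 10.0, 0.0]])
labels = [0, 0, 1, 1]
mean, basis = pca_fit(X, 2)
feats = (X - mean) @ basis.T
print(nn_classify(np.array([9.5, 0.5, 0.0, 0.0]), mean, basis, feats, labels))  # prints 0
```

In the actual experiment the rows would be the hallucinated 64×64 faces flattened to 4096-dimensional vectors, so a better hallucination method yields projections closer to the correct identity's training faces.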

V. CONCLUSION

In this paper, we propose two novel regularization models to deal with the LR face hallucination problem. The patch-based model regularizes the relationship between the target patch and the training patches in the HR space, which is effective for recovering local texture. Different from previous methods, which build the regularization in the LR space, it avoids the difficulty of preserving local geometry consistency. Furthermore, the HR patches are projected into the RKHS, which helps to capture the nonlinear relationships in the reconstruction. The pixel-based model is devised to compensate for local details, especially in the regions of the eyes, nostrils and mouth. It privileges the pixels along the dominant orientation of the structure, which helps reconstruct the local edge structure. The whole HR image is finally optimized by combining the above two regularization models. Experimental results illustrate that the proposed method produces superior results in comparison with state-of-the-art baselines under various conditions. Future work will focus on improving the proposed method to solve the multiview face hallucination problem with large pose variation.

REFERENCES

[1] N. Wang, D. Tao, X. Gao, X. Li, and J. Li, "A comprehensive survey to face hallucination," International Journal of Computer Vision, vol. 106, no. 1, pp. 9–30, 2014.

[2] X. Zhang and X. Wu, "Image interpolation by adaptive 2-D autoregressive modeling and soft-decision estimation," IEEE Trans. on Image Processing, vol. 17, no. 6, pp. 887–896, 2008.

[3] J. Sun, Z. Xu, and H.-Y. Shum, "Gradient profile prior and its applications in image super-resolution and enhancement," IEEE Trans. on Image Processing, vol. 20, no. 6, pp. 1529–1542, 2011.

[4] K. Zhang, X. Gao, D. Tao, and X. Li, "Single image super-resolution with non-local means and steering kernel regression," IEEE Transactions on Image Processing, vol. 21, no. 11, pp. 4544–4556, 2012.

[5] W. Dong, L. Zhang, G. Shi, and X. Wu, "Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization," IEEE Trans. on Image Processing, vol. 20, no. 7, pp. 1838–1857, 2011.

[6] W. Dong, G. Shi, Y. Ma, and X. Li, "Image restoration via simultaneous sparse coding: Where structured sparsity meets Gaussian scale mixture," International Journal of Computer Vision, vol. 114, no. 2-3, pp. 217–232, 2015.

[7] W. Dong, L. Zhang, G. Shi, and X. Li, "Nonlocally centralized sparse representation for image restoration," IEEE Transactions on Image Processing, vol. 22, no. 4, pp. 1620–1630, 2013.

[8] K. Zhang, D. Tao, X. Gao, X. Li, and Z. Xiong, "Learning multiple linear mappings for efficient single image super-resolution," IEEE Transactions on Image Processing, vol. 24, no. 3, pp. 846–861, 2015.

[9] S. Baker and T. Kanade, "Hallucinating faces," in Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, 2000, pp. 83–88.

[10] H. Chang, D. Yeung, and Y. Xiong, "Super-resolution through neighbor embedding," in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2004, pp. 275–282.

[11] J. S. Park and S. W. Lee, "An example-based face hallucination method for single-frame, low-resolution facial images," IEEE Trans. on Image Processing, vol. 17, no. 10, pp. 1806–1816, 2008.

[12] C. Liu, H.-Y. Shum, and W. T. Freeman, "Face hallucination: Theory and practice," International Journal of Computer Vision, vol. 75, no. 1, pp. 115–134, 2007.

[13] Y. Zhuang, J. Zhang, and F. Wu, "Hallucinating faces: LPH super-resolution and neighbor reconstruction for residue compensation," Pattern Recognition, vol. 40, no. 11, pp. 3178–3194, 2007.

[14] K. Jia and S. Gong, "Generalized face super-resolution," IEEE Transactions on Image Processing, vol. 17, no. 6, pp. 873–886, 2008.

[15] H. Huang, H. He, X. Fan, and J. Zhang, "Super-resolution of human face image using canonical correlation analysis," Pattern Recognition, vol. 43, no. 7, pp. 2532–2543, 2010.

[16] J. Yang, J. Wright, T. Huang, and Y. Ma, "Image super-resolution via sparse representation," IEEE Trans. on Image Processing, vol. 19, no. 11, pp. 2861–2873, 2010.

[17] L. An and B. Bhanu, "Face image super-resolution using 2D CCA," Signal Processing, vol. 103, pp. 184–194, 2014.

[18] B. Li, H. Chang, S. Shan, and X. Chen, "Aligning coupled manifolds for face hallucination," IEEE Signal Processing Letters, vol. 16, no. 11, pp. 957–960, 2009.

[19] X. Ma, J. Zhang, and C. Qi, "Hallucinating face by position-patch," Pattern Recognition, vol. 43, no. 6, pp. 2224–2236, 2010.

[20] C. Jung, L. Jiao, B. Liu, and M. Gong, "Position-patch based face hallucination using convex optimization," IEEE Signal Processing Letters, vol. 18, no. 6, pp. 367–370, 2011.

[21] H. Huang and N. Wu, "Fast facial image super-resolution via local linear transformations for resource-limited applications," IEEE Trans. on Circuits and Systems for Video Technology, vol. 21, no. 10, pp. 1363–1377, 2011.

[22] W. Zhang and W.-K. Cham, "Hallucinating face in the DCT domain," IEEE Trans. on Image Processing, vol. 20, no. 10, pp. 2769–2779, 2011.

[23] Y. Hu, K.-M. Lam, G. Qiu, and T. Shen, "From local pixel structure to global image super-resolution: A new face hallucination framework," IEEE Trans. on Image Processing, vol. 20, no. 2, pp. 433–445, 2011.

[24] J. Shi, X. Liu, and C. Qi, "Global consistency, local sparsity and pixel correlation: A unified framework for face hallucination," Pattern Recognition, vol. 47, no. 11, pp. 3520–3534, 2014.

[25] Z. Wang, R. Hu, S. Wang, and J. Jiang, "Face hallucination via weighted adaptive sparse regularization," IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 5, pp. 802–813, 2014.

[26] J. Jiang, R. Hu, Z. Wang, and Z. Han, "Face super-resolution via multilayer locality-constrained iterative neighbor embedding and intermediate dictionary learning," IEEE Transactions on Image Processing, vol. 23, no. 10, pp. 4220–4231, 2014.

[27] J. Jiang, R. Hu, Z. Wang, Z. Han, and J. Ma, "Facial image hallucination through coupled-layer neighbor embedding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 9, pp. 1674–1684, 2016.

[28] J. Jiang, C. Chen, J. Ma, Z. Wang, Z. Wang, and R. Hu, "SRLSP: A face image super-resolution algorithm using smooth regression with local structure prior," IEEE Transactions on Multimedia, vol. 19, no. 1, pp. 27–40, 2017.

[29] L. Liu, C. P. Chen, S. Li, Y. Y. Tang, and L. Chen, "Robust face hallucination via locality-constrained bi-layer representation," IEEE Transactions on Cybernetics, to be published.

[30] J. Jiang, J. Ma, C. Chen, X. Jiang, and Z. Wang, "Noise robust face image super-resolution through smooth sparse representation," IEEE Transactions on Cybernetics, vol. 47, no. 11, pp. 3991–4002, 2017.


[31] X. Ma, H. Huang, S. Wang, and C. Qi, "A simple approach to multiview face hallucination," IEEE Signal Processing Letters, vol. 17, no. 6, pp. 579–582, 2010.

[32] R. A. Farrugia and C. Guillemot, "Face hallucination using linear models of coupled sparse support," IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4562–4577, 2017.

[33] X. Zeng, H. Huang, and C. Qi, "Expanding training data for facial image super-resolution," IEEE Transactions on Cybernetics, vol. 48, no. 2, pp. 716–729, 2018.

[34] X. Yu and F. Porikli, "Ultra-resolving face images by discriminative generative networks," in ECCV, 2016, pp. 318–333.

[35] S. Zhu, S. Liu, C. C. Loy, and X. Tang, "Deep cascaded bi-network for face hallucination," in ECCV, 2016, pp. 614–630.

[36] Q. Cao, L. Lin, Y. Shi, X. Liang, and G. Li, "Attention-aware face hallucination via deep reinforcement learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 690–698.

[37] Y. Song, J. Zhang, S. He, L. Bao, and Q. Yang, "Learning to hallucinate face images via component generation and enhancement," in International Joint Conference on Artificial Intelligence (IJCAI), 2017.

[38] R. Timofte, V. De Smet, and L. Van Gool, "Anchored neighborhood regression for fast example-based super-resolution," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1920–1927.

[39] ——, "A+: Adjusted anchored neighborhood regression for fast super-resolution," in Asian Conference on Computer Vision. Springer, 2014, pp. 111–126.

[40] C. Dong, C. C. Loy, K. He, and X. Tang, "Image super-resolution using deep convolutional networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 2, pp. 295–307, 2016.

[41] J. Kim, J. Kwon Lee, and K. Mu Lee, "Accurate image super-resolution using very deep convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1646–1654.

[42] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising," IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.

[43] T. Hofmann, B. Scholkopf, and A. J. Smola, "Kernel methods in machine learning," The Annals of Statistics, pp. 1171–1220, 2008.

[44] H. Takeda, S. Farsiu, and P. Milanfar, "Kernel regression for image processing and reconstruction," IEEE Trans. on Image Processing, vol. 16, no. 2, pp. 349–366, 2007.

[45] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, "Locality-constrained linear coding for image classification," in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2010, pp. 3360–3367.

[46] W. Gao, B. Cao, S. Shan, X. Chen, D. Zhou, X. Zhang, and D. Zhao, "The CAS-PEAL large-scale Chinese face database and baseline evaluations," IEEE Trans. on Systems, Man and Cybernetics, Part A: Systems and Humans, vol. 38, no. 1, pp. 149–161, 2008.

[47] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss, "The FERET evaluation methodology for face-recognition algorithms," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 10, pp. 1090–1104, 2000.

[48] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.

[49] Z. Liu, P. Luo, X. Wang, and X. Tang, "Deep learning face attributes in the wild," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3730–3738.

[50] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, "Multi-PIE," Image and Vision Computing, vol. 28, no. 5, pp. 807–813, 2010.

Jingang Shi received the B.S. degree and Ph.D. degree both from the School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi, China. Since 2017, he has been a postdoctoral researcher at the Center for Machine Vision Research and Signal Analysis, University of Oulu, Finland. His current research interests mainly include image restoration and face analysis.

Xin Liu is currently a Ph.D. candidate with the Center for Machine Vision and Signal Analysis, University of Oulu, Finland. He received the B.S. and M.S. degrees in Computer Science, in 2003 and 2007, respectively. His research interests include human behavior analysis, 3D computer vision, image restoration, and object detection.

Yuan Zong received the B.S. and M.S. degrees in electronics engineering both from Nanjing Normal University, Nanjing, Jiangsu, China, in 2011 and 2014, respectively. He is currently working toward the Ph.D. degree in the Key Laboratory of Child Development and Learning Science of Ministry of Education, Southeast University, Nanjing, Jiangsu, China. From October 2016 to September 2017, he was a visiting student at the Center for Machine Vision and Signal Analysis, University of Oulu, Finland. His research interests include affective computing, pattern recognition, and computer vision.

Chun Qi received the Ph.D. degree from Xi'an Jiaotong University, Xi'an, China, in 2000. He is currently a Professor and Ph.D. supervisor at the School of Electronics and Information Engineering, Xi'an Jiaotong University. His current research interests mainly include image processing, pattern recognition and signal processing.

Guoying Zhao is currently a Professor with the Center for Machine Vision and Signal Analysis, University of Oulu, Finland, where she has been a senior researcher since 2005 and an Associate Professor since 2014. She received the Ph.D. degree in computer science from the Chinese Academy of Sciences, Beijing, China, in 2005. In 2011, she was selected to the highly competitive Academy Research Fellow position. She was a Nokia visiting professor in 2016. She has authored or co-authored more than 160 papers in journals and conferences.

Her papers have currently over 6900 citations in Google Scholar (h-index35). She is co-publicity chair for FG2018, has served as area chairs forseveral conferences and is associate editor for Pattern Recognition, IEEETransactions on Circuits and Systems for Video Technology, and Image andVision Computing Journals. She has lectured tutorials at ICPR 2006, ICCV2009, and SCIA 2013, authored/edited three books and seven special issues injournals. Dr. Zhao was a Co-Chair of many International Workshops at ECCV,ICCV, CVPR, ACCV and BMVC. Her current research interests include imageand video descriptors, facial-expression and micro-expression recognition, gaitanalysis, dynamic-texture recognition, human motion analysis, and personidentification. Her research has been reported by Finnish TV programs,newspapers and MIT Technology Review.