Web Mining Towards Age Estimator


  • 8/2/2019 Web Mining Towards Age Estimator


    Web Image Mining Towards Universal Age Estimator

    Bingbing Ni
    National University of Singapore
    4 Engineering Drive 3, Singapore 117576
    [email protected]

    Zheng Song
    National University of Singapore
    4 Engineering Drive 3, Singapore 117576
    [email protected]

    Shuicheng Yan
    National University of Singapore
    4 Engineering Drive 3, Singapore 117576
    [email protected]

    ABSTRACT

    In this paper, we present an automatic web image mining system towards building a universal human age estimator based on facial information, which is applicable to all ethnic groups and various image qualities. First, a large (391k) yet noisy human aging image dataset is crawled from the photo sharing website Flickr and the Google image search engine based on a set of human age related text queries. Then, within each image, several human face detectors of different implementations are used for robust face detection, and all the detected faces with multiple responses are considered as the multiple instances of a bag (image). An outlier removal step with Principal Component Analysis further refines the image set to about 220k faces, and then a robust multi-instance regressor learning algorithm is proposed to learn the kernel-regression based human age estimator under scenarios with possibly noisy bags. The proposed system has the following characteristics: 1) no manual human age labeling process is required, and the age information is automatically obtained from the age related queries, 2) the derived human age estimator is universal owing to the diversity and richness of Internet images and thus has good generalization capability, and 3) the age estimator learning process is robust to the noise existing in both Internet images and the corresponding age labels. This automatically derived human age estimator is extensively evaluated on three popular benchmark human aging databases, and without taking any images from these benchmark databases as training samples, age estimation accuracies comparable with the state-of-the-art results are achieved.

    Categories and Subject Descriptors

    I.4.9 [Computing Methodologies]: Image Processing and Computer Vision - Applications

    General Terms

    Algorithms, Performance, Experimentation

    Keywords

    Internet Vision, Age Estimation, Multi-instance Regression

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    MM'09, October 19-24, 2009, Beijing, China. Copyright 2009 ACM 978-1-60558-608-3/09/10 ...$10.00.

    [Figure 1: An illustration of the purpose of this study, i.e., to utilize web image resources for learning a universal age estimator. Example labels shown for Google Image and Flickr Image inputs to the Universal Age Estimator: Asian, 40; European, 30; African, 18; Baby, 5; Mid-Age, 40; Senior, 80; Rotated, 60; Occluded, 20.]

    1. INTRODUCTION

    Image based human age estimation has wide potential applications, e.g., demographic data collection for supermarkets or other public areas, age-specific human computer interfaces, age-oriented commercial advertisement, and human identification based on old ID-photos. For these applications, a large set of human face images with ground-truth age labels is generally required for learning an effective human age estimator.

    Previous research on human age estimation can be roughly divided into two categories according to whether the age estimation task is considered as a regression problem or a multi-class classification problem. Many efforts have been devoted to the human age estimation problem in the past few years. Kwon et al. [12] proposed a human age classification method based on cranio-facial development theory and skin wrinkle analysis, where the human faces are classified into three groups, namely, babies, young adults and senior adults. Hayashi et al. [9] proposed to use the wrinkle and geometry relationships between different parts of a face to classify the age information into groups at five-year intervals. Lanitis et al. [13] adopted Active Appearance Models (AAM) [4] to extract the combined shape and texture information for human age estimation. Geng et al. [7] proposed to model the statistical properties of aging patterns, where each aging pattern characterizes the aging process of one person. Yan et al. proposed a method called Ranking with Uncertain Labels for age estimation by introducing a semidefinite programming (SDP) formulation for regression problems with uncertain nonnegative labels [21]. Yan et al. later introduced a patch kernel method based on Gaussian Mixture Models (GMM) for age regression, where the best result on the FG-NET database to date was reported [22]. Guo et al. [8] introduced an age manifold learning scheme for extracting face aging features and designed a locally adjusted robust regressor for the prediction of human ages. Recently, Fu and Huang [6] developed a discriminant subspace learning method for age estimation by exploring the sequential patterns from face images with aging features.

    These approaches have achieved satisfactory human age estimation accuracies on certain benchmark human aging datasets, e.g., the FG-NET [1] and UIUC [6] databases. There exist, however, two difficulties which essentially hamper the research and applications in this area:

    1. Most previous algorithmic evaluations were performed on relatively small dataset(s), mainly due to the difficulties in collecting a large dataset with precise human age ground-truths. Moreover, each human aging database usually covers only one human ethnic group, and for certain ages, e.g., senior ages, the samples are rare. All these essentially limit the generalization capability of the learnt human age regressor to general face images from real applications.

    2. All previous research on image based human age estimation is founded on the assumption that the face images have been cropped out and reasonably aligned. For practical applications, rough face detection has been considered a well-solved problem, but precise face cropping is still far from satisfactory, which results in the so-called face misalignment issue. A practical solution to bridge the gap between the possibly misaligned faces and the requirement of precise face cropping for age estimation is critical to guarantee the algorithmic robustness and effectiveness in real applications.

    As shown in Fig. 1, the main purpose of this research is to drive human age estimation research towards real scenarios. Instead of being limited to well-cropped faces and a single human ethnic group, the system should be able to take general face images for training a universal human age estimator, which is applicable to all ages, all ethnic groups and various image qualities. Meanwhile, the ultimate goal of this research is to design a fully automatic and real-time system which takes general face images as inputs. Our proposed solution for these targets is motivated by the following observations:

    1. Though it is practically difficult or even impossible to collect a large face image set with precise age ground-truths, the Internet provides very rich resources of face images with possible age information encoded within the surrounding texts. Popular image search engines such as Google image search and photo sharing websites such as Flickr can provide a huge number of images for even a single age-related query (e.g., "15 years old"), where usually thousands of correct samples from different human ethnic groups are available. There have been some attempts at such web image mining; for example, Yanai and Barnard [23] proposed a web image mining method for discriminative visual concept selection.

    2. When taking general images as inputs, an inevitable problem is how to robustly learn a human age estimator based on possibly misaligned faces. The misalignment problem could be alleviated by the recently proposed patch based representation [22], which has been shown to be very effective in the human age regression problem based on misaligned faces. Also, if the training images are obtained from the Internet based on age-related queries, another inevitable problem is how to handle the possibly multiple faces detected within one image and the possibly incorrect label of the image. All these motivate us to present an algorithm for robust multi-instance based human age regressor learning.

    The main contributions of this work are three-fold: 1) we present a web image mining scheme to harness Internet images for collecting a large and diverse face database with nearly-correct age labels, and then use it for learning a universal human age estimator; 2) we propose a novel learning algorithm to robustly derive a human age estimator based on images with multiple face instances and possibly noisy labels; and 3) we develop a fully automatic system for training image collection, age regressor learning, and final age estimation, which does not rely on any kind of human interaction and thus has great potential in real scenarios. A system overview is illustrated in Fig. 2 and the three major components of the proposed research are summarized as follows:

    1. Internet Aging Image Collecting

    The Internet aging image collection is performed by automatically crawling images from image search engines or photo sharing websites based on a set of age related text queries, e.g., "15-years-old", "age-15" and "15th-birthday" for the age of 15 years. In this step, generally more than 10k images can be downloaded for each age, but a large portion of these images do not contain any face instance or contain only face instances of other ages.

    2. Noisy Image and Label Filtering

    In this work, we propose to adopt a pipeline approach to retain good face instances for model training and to filter out as many noisy images or face instances as possible. First, we conduct parallel face detection based on multiple face detectors to improve the probability of obtaining well-aligned face instances for each image, and then the face instances that overlap with at least one face instance from a different detector are retained as good samples for model training. Then, Principal Component Analysis [10] is applied for each age and those face instances with large reconstruction errors are filtered out. This pre-screened dataset is then input into the proposed robust multi-instance regression algorithm to learn a universal human age estimator.

    3. Robust Multi-instance Regression

    In this step, we are given a set of training face images with multiple face instances within each image, and the task is to learn a human age regressor. This problem can be considered as a specific multi-instance learning problem, but previous multi-instance learning algorithms cannot be directly applied to it since there may exist noisy image labels in the training data. In this work, we present a robust multi-instance learning algorithm to tackle this problem with the awareness of label outliers.

    Note that multi-instance learning has been a widely studied research topic in the past few years. Keeler et al. [11] first proposed the multi-instance learning concept when dealing with the hand-printed numeral detection problem, motivated by the observation that there might exist more than one numeral in a single image. After that, many researchers proposed various related schemes such as DD [14], EM-DD [24] and citation-kNN [20] to tackle this problem. The multi-instance learning concept was also incorporated into both boosting and support vector machine algorithms, yielding the so-called MIL-boosting [19] and MIL-SVM [2] algorithms. There also exist several algorithms proposed for the multiple-instance multiple-label learning problem [3, 25]. Besides these classification problems, Ray et al. [15] recently proposed a multi-instance regression framework to deal with the regression problem, which is the work most closely related to ours. This algorithm does not consider the noisy label issue, however, and therefore its robustness cannot be guaranteed.

    [Figure 2: The system overview for learning universal age estimator based on automatic web image mining. Diagram labels: Web Image Resources, Universal Aging Images, Face Detector Array, Raw Face Database, Noise Filter, Pre-screened Database, Robust MIR Training, Universal Age Estimator, New Image, years.]

    The rest of this paper is organized as follows. Section 2 introduces the Internet aging image collecting and noise pre-screening process. The robust multi-instance regression algorithm is elaborated in Section 3 and the experimental results are demonstrated in Section 4. Section 5 concludes this paper along with a discussion of future work.

    2. INTERNET AGING IMAGE DATABASE

    2.1 Internet Aging Image Collecting

    A universal age estimator relies heavily on a large training image set with age ground-truths. Human facial images are generally very easy to obtain from different sources; their precise age labels, however, are not easy to obtain, especially when a universal age estimator is expected for all ethnic groups. Generally there are two ways to obtain such age labels. In the first way, the real age labels are known, provided directly by the persons in the images or recorded when capturing the images. The FG-NET [1] and UIUC [6] aging databases belong to this category. In the second way, the age labels are obtained from human estimations; the database used in [9] belongs to this category.

    Recent years have witnessed an explosion of social media content shared online, e.g., on Flickr. In these community-contributed media repositories, users may upload personal media data with informative titles and annotate them with descriptive tags. For human face related images, the human age information is often naturally contained in the media titles and tags, e.g., titles like "Mother's 50th Birthday" and tags like "15-years". Also for popular image search engines, e.g., Google image search, a large number of images are available based on age related queries. Although the titles/tags and queries are often irrelevant to the retrieved images, a considerable portion of the images do carry correct age information. Motivated by this observation, we propose to crawl aging images from the Internet (we use Flickr and Google image search in this work) based on a set of age related text queries. In this work, we construct the query list using age related templates such as "xx-years-old", "age-xx" and "xxth-birthday", and finally about 10k to 20k images could be downloaded for each age from 1 year old to 80 years old. Note that we do not explicitly use the ages over 80 since ages above 80 are generally hard to distinguish, as indicated in [22].
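The query-list construction described above can be sketched as follows. This is a minimal illustration, not the crawler used in the paper: the function names are hypothetical, and the handling of ordinal suffixes ("1st", "2nd", "3rd") is an assumption, since the text only gives the template "xxth-birthday".

```python
from typing import Dict, List


def ordinal(n: int) -> str:
    """Return n with an English ordinal suffix (assumed extension of 'xxth')."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def age_queries(age: int) -> List[str]:
    """Instantiate the three templates named in the text for one age."""
    return [f"{age}-years-old", f"age-{age}", f"{ordinal(age)}-birthday"]


def build_query_list(min_age: int = 1, max_age: int = 80) -> Dict[int, List[str]]:
    """Map each age in [min_age, max_age] to its text queries."""
    return {age: age_queries(age) for age in range(min_age, max_age + 1)}


if __name__ == "__main__":
    queries = build_query_list()
    print(queries[15])
```

Each returned query string would then be submitted to the image search engine or photo sharing site, and every downloaded image would inherit the age of the query that retrieved it as its (possibly noisy) bag label.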

    [Figure 3: An exemplary result from parallel face detection. Diagram labels: Input Image; Face Detector Array (Detector 1, Detector 2, ..., Detector N_D); Bag of Face Instances.]

    2.2 Parallel Face Detection for Robustness

    We have the following observations on the crawled aging images: 1) many images do not contain faces, 2) even in those images which contain faces, the faces may not be in frontal view, and 3) the faces may be occluded. Thus accurate and robust face detection is a prerequisite for the subsequent age estimator learning process. On one hand, the images without faces shall be removed, and on the other hand, the faces shall be detected in a robust way. There are many face detectors with reasonably good performance, but generally no detector can be guaranteed to be perfect. However, the parallel detection results from several face detectors can provide multiple and complementary detection results for each image, which naturally matches the multi-instance learning scenario. We thus adopt this scheme and use multiple state-of-the-art face detectors for detecting all possible faces in each image. The detectors we used are different variations of Adaboost [18] based detectors. The diagram of this parallel face detection scheme is illustrated in Fig. 3.

    2.3 Within-age-category Noise Filtering

    After the parallel face detection step, most images without faces are removed, but there may still exist noise among the detected faces, e.g., false alarms, faces with large spatial misalignments, and faces with incorrect age labels. This noise can be further reduced by using certain statistical approaches. In our work, we adopt two ways to remove potentially false face detections. Firstly, we match the detection results from the parallel face detectors, namely, only those detected faces which overlap significantly with faces from other detectors are retained (e.g., with more than 90% overlapping area). This step removes most of the false alarm faces. Secondly, we perform Principal Component Analysis (PCA) [10] within each age category and then remove those faces with large reconstruction errors based on the retained components, i.e., those "faces" that are greatly different from the majority of the detected faces within an age category. This step is capable of filtering out some of the "poor" faces such as occluded, rotated or significantly deformed ones, since these noisy faces usually exhibit quite different appearances compared with normal near-frontal faces.
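The first filtering step, cross-detector agreement, can be sketched as below. This is an illustrative sketch only: the text does not specify how the overlap ratio is computed, so intersection-over-union is assumed here, and the box format `(x, y, width, height)` and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, width, height).

    Using IoU as the overlap measure is an assumption of this sketch.
    """
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def filter_by_agreement(detections, threshold=0.9):
    """Keep detections confirmed by a box from a *different* detector.

    detections: list of (detector_id, box) pairs for one image.
    """
    kept = []
    for i, (det_i, box_i) in enumerate(detections):
        for j, (det_j, box_j) in enumerate(detections):
            if i != j and det_i != det_j and iou(box_i, box_j) > threshold:
                kept.append((det_i, box_i))
                break
    return kept


if __name__ == "__main__":
    dets = [
        (0, (10, 10, 100, 100)),   # detector 0
        (1, (12, 11, 100, 100)),   # detector 1, nearly the same box
        (0, (300, 300, 50, 50)),   # isolated response, likely a false alarm
    ]
    print(filter_by_agreement(dets))
```

The surviving boxes for one image form its bag of face instances; the PCA-based reconstruction-error filter described in the text would then be applied per age category on top of this step.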


    3. ROBUST UNIVERSAL AGE ESTIMATOR

    3.1 Robust Multi-instance Age Estimator

    After the noise filtering step, the remaining aging dataset, which may still contain noisy images and labels, has the following characteristics:

    1. There might exist multiple face instances within a single image, which come from the parallel face detection or from multiple true human faces within an image.

    2. For each individual image, there is no guarantee that there exists at least one face instance that actually has the given age label.

    The second property makes the problem of learning an age estimator essentially different from well-known multi-instance learning problems, where a common assumption is that within each bag there exists at least one positive instance with the bag label. Therefore we need to develop a new regressor learning framework for multi-instance regression with noisy labels.

    3.1.1 Robust Multi-instance Regression Formulation

    Before introducing the detailed formulation, we first introduce the main terminology. Instead of explicitly enforcing that each image, referred to as a bag in the context of multi-instance learning, has at least one face instance with the given age, we impose a soft constraint, that is, we allow some bags to contribute no face instances with the given ages. Denote each bag as $B_i$, $i = 1, \ldots, M$, where $M$ is the total number of bags (images). We also unpack the bags and index the face instances at the instance level: the label of the $j$th instance is denoted as $y_j$ (regardless of whether it is correct or not), and its feature vector representation is denoted as $x_j \in \mathbb{R}^d$, where $d$ is the feature dimension; the details of the features are introduced in Section 3.3. For each bag, the number of involved instances is denoted as $|B_i| = \#\{x_j \mid x_j \in B_i\}$. The total number of instances is denoted as $N$, namely $\sum_i |B_i| = N$.

    The target of this work is to learn a human age regressor,

    $$f(x, a) : x \in \mathbb{R}^d \rightarrow y \in \mathbb{R}, \tag{1}$$

    where $x$ is the input feature vector of the face instance and $a$ is the parameter vector to estimate. In the following we present the detailed formulation for multi-instance regressor learning with possibly noisy labels.

    First, we define $p_j$ as the parameter measuring the probability that a face instance $x_j$ inherits its parent bag's label. As $p_j \in [0, 1]$, we rewrite it as

    $$p_j = e^{-c_j^2}, \tag{2}$$

    with $c_j$ being a real value. Here we denote $c = [c_1, c_2, \ldots, c_N]^T$. Then the overall probability of having at least one face instance with the label of the $i$th bag is

    $$P_{B_i} = 1 - \prod_{x_j \in B_i} (1 - p_j). \tag{3}$$

    A direct observation from this formula is that if one instance takes the probability of 1, the entire probability of the bag is 1.
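As a small numerical illustration of Eqns. (2) and (3), the sketch below computes the instance probability from $c_j$ and combines the instances of one bag. The function names are hypothetical and the code is only meant to make the noisy-OR structure of the bag probability concrete.

```python
import math


def instance_prob(c_j):
    """Eqn. (2): p_j = exp(-c_j^2), which lies in (0, 1] for any real c_j."""
    return math.exp(-c_j ** 2)


def bag_prob(c_values):
    """Eqn. (3): probability that at least one instance carries the bag label."""
    prod = 1.0
    for c_j in c_values:
        prod *= 1.0 - instance_prob(c_j)
    return 1.0 - prod


if __name__ == "__main__":
    # One instance with c_j = 0 has p_j = 1, so the bag probability is 1.
    print(bag_prob([0.0, 3.0]))
    # Two uncertain instances combine to a higher bag probability than either alone.
    print(instance_prob(1.0), bag_prob([1.0, 1.0]))
```

The parameterization through $c_j$ keeps $p_j$ inside $[0, 1]$ without explicit constraints, which is what allows plain gradient descent on $c$ later on.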

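    The problem of multi-instance regressor learning with noisy labels is then defined as the optimization

    $$\min Q(c, a) = \sum_{j=1}^{N} p_j \big(y_j - f(x_j, a)\big)^2 + \lambda \sum_{j \neq j'} w_{jj'} \big(f(x_j, a) - f(x_{j'}, a)\big)^2 - \gamma \sum_{i=1}^{M} \log P_{B_i}, \tag{4}$$

    which can be further expanded as

    $$Q(c, a) = \sum_{j=1}^{N} e^{-c_j^2} \big(y_j - f(x_j, a)\big)^2 + \lambda \sum_{j \neq j'} w_{jj'} \big(f(x_j, a) - f(x_{j'}, a)\big)^2 - \gamma \sum_{i=1}^{M} \log\Big(1 - \prod_{x_j \in B_i} \big(1 - e^{-c_j^2}\big)\Big). \tag{5}$$

    The matrix $W = [w_{jj'}]$ is defined as

    $$w_{jj'} = \begin{cases} 1, & \text{if } x_{j'} \in N_{k_1}(j), \\ 0, & \text{otherwise,} \end{cases} \tag{6}$$

    where $N_{k_1}(j)$ is the set of $k_1$ nearest neighbors of the face instance $x_j$ measured by Euclidean distance in the feature vector space; in this work $k_1$ is set to 7.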

    Note that: 1) the first term in the objective function measures the data fitting capability, namely, the error between the labels and the estimations from the derived age estimator; 2) the second term measures the smoothness of the regression function, i.e., samples with similar feature representations should also have similar predicted labels; 3) the third term is the negative log-likelihood of the probability for the $i$th bag to have at least one face instance with the label of the bag, which penalizes the cases without any face instance selected; and 4) $\lambda$ and $\gamma$ are two parameters for controlling the tradeoff among these three parts. For the second (smoothness) term, we do not multiply by the indicator variables $c_j$ and $c_{j'}$, since this regularity is applicable to all face instances and is not restricted to those with the largest probabilities $p_j$. This means that the so-called negative face instances within a bag are also used for deriving the regressor, which enhances the robustness of the regressor learning process.

    The function $f(x, a)$ is the final age regression mapping function. In this work, we use the kernel regression method for estimating $y$ for a face instance represented as $x$, with the Gaussian kernel $k(x_1, x_2) = \exp\{-\|x_1 - x_2\|^2 / \sigma_1^2\}$, where $\sigma_1$ is a tunable parameter for measuring feature similarity. First, we select a set of $M$ reference face instances from the entire face instance set, denoted as $X = [x_1, \ldots, x_M]$. The age estimation model from $x$ to its age label $y$ is then formulated as

    $$y = f(x, a) = \frac{\sum_{m=1}^{M} a_m k(x, x_m)}{\sum_{m=1}^{M} k(x, x_m)}, \tag{7}$$

    where $a = [a_1, \ldots, a_M]^T$ is the parameter vector. Note that the reference point number $M$ and the parameter $\sigma_1$ are set empirically in all our experiments.
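The kernel regression model of Eqn. (7) can be written out directly. The sketch below uses plain Python lists for feature vectors; the function names and the toy one-dimensional data are hypothetical, and the kernel width is called `sigma1` here as a naming assumption.

```python
import math


def gaussian_kernel(x1, x2, sigma1):
    """k(x1, x2) = exp(-||x1 - x2||^2 / sigma1^2)."""
    sq_dist = sum((u - v) ** 2 for u, v in zip(x1, x2))
    return math.exp(-sq_dist / sigma1 ** 2)


def predict_age(x, references, a, sigma1):
    """Eqn. (7): kernel-weighted average of the parameters a_m."""
    weights = [gaussian_kernel(x, ref, sigma1) for ref in references]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("x is too far from every reference point for this sigma1")
    return sum(a_m * w for a_m, w in zip(a, weights)) / total


if __name__ == "__main__":
    references = [[0.0], [1.0]]   # two toy reference instances
    a = [10.0, 30.0]              # parameter vector, one entry per reference
    print(predict_age([0.5], references, a, sigma1=1.0))
```

Because the weights are normalized, every prediction is a convex combination of the entries of $a$, so the estimated age always lies between the smallest and largest $a_m$.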

    3.1.2 Optimization Procedure

    There exist two sets of variables in the problem formulation, namely $a$ and $c$, and the objective function is not convex. Therefore a global optimum cannot be guaranteed, and we adopt an alternating optimization scheme to minimize this objective function, namely, we alternate the optimization with respect to $a$ and $c$ by fixing the other parameter set in each iteration step. For fixed $c$, since the third term depends only on $c$, the whole objective function becomes a quadratic function of $a$. Here, we denote $y = [y_1, y_2, \ldots, y_N]^T$, $F = [f_{nj}] \in \mathbb{R}^{N \times M}$ with $f_{nj} = k(x_n, x_j) / \sum_{m=1}^{M} k(x_n, x_m)$, and $L = D - W$ with $D$ a diagonal matrix defined as $D_{jj} = \sum_{j'} w_{jj'}$, where $W$ is defined in Eqn. (6). Then the objective function can be reformulated in matrix form as

    $$\min Q(c, a) = (y - Fa)^T \Lambda_c (y - Fa) + 2\lambda (Fa)^T L (Fa) - \gamma \sum_{i=1}^{M} \log P_{B_i}, \tag{8}$$

    where $\Lambda_c$ denotes the diagonal matrix whose diagonal elements are derived from $c$, namely, $\Lambda_c(j, j) = e^{-c_j^2}$ and $\Lambda_c(j, j') = 0$ for $j \neq j'$. To minimize this objective function, we set its derivative with respect to $a$ to zero, which gives

    $$F^T (\Lambda_c + 2\lambda L) F a = F^T \Lambda_c y. \tag{9}$$

    Therefore the optimal $a$ can be obtained as

    $$a = \big(F^T (\Lambda_c + 2\lambda L) F\big)^{\dagger} F^T \Lambda_c y, \tag{10}$$

    where $\dagger$ denotes the pseudo-inverse of a matrix.

    For a fixed $a$, the objective function with respect to $c$ is nonlinear and we use the gradient descent method for the optimization. The partial derivative is calculated as

    $$\frac{\partial Q}{\partial c_j} = -2 c_j e^{-c_j^2} \big(f(x_j, a) - y_j\big)^2 + 2\gamma \, \frac{1 - P_{B(j)}}{P_{B(j)}} \cdot \frac{c_j e^{-c_j^2}}{1 - e^{-c_j^2}}, \tag{11}$$

    where $B(j)$ denotes the parent bag of the instance $x_j$. The above two steps iterate until convergence, namely, we stop when the consecutive changes of $a$ and $c$ are smaller than a predefined threshold, set as $10^{-4}$ in this work.
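The $c$-update can be illustrated with a small, self-contained sketch. It evaluates the part of the objective that depends on $c$ (the data fitting term and the bag log-likelihood term) and its gradient as in Eqn. (11), for fixed residuals $y_j - f(x_j, a)$. Several things here are assumptions of the sketch rather than details from the paper: the weight on the bag term is named `gamma`, the learning rate is arbitrary, bags are given as lists of instance indices, and the factor $(1 - P_{B(j)}) / (1 - e^{-c_j^2})$ is computed as the product over the other instances in the bag, which is algebraically the same but avoids a division by zero at $c_j = 0$. The closed-form $a$-update of Eqn. (10) is not shown.

```python
import math


def objective_c(c, residuals, bags, gamma):
    """Terms of Q that depend on c: sum_j p_j r_j^2 - gamma * sum_i log P_Bi."""
    fit = sum(math.exp(-cj ** 2) * r ** 2 for cj, r in zip(c, residuals))
    log_term = 0.0
    for bag in bags:
        prod = 1.0
        for j in bag:
            prod *= 1.0 - math.exp(-c[j] ** 2)
        log_term += math.log(1.0 - prod)  # assumes P_Bi > 0, i.e. finite c
    return fit - gamma * log_term


def gradient_c(c, residuals, bags, gamma):
    """Partial derivatives dQ/dc_j following Eqn. (11)."""
    grad = [0.0] * len(c)
    for bag in bags:
        q = {j: 1.0 - math.exp(-c[j] ** 2) for j in bag}   # 1 - p_j
        prod_all = 1.0
        for j in bag:
            prod_all *= q[j]
        p_bag = 1.0 - prod_all
        for j in bag:
            p_j = math.exp(-c[j] ** 2)
            others = 1.0                                   # (1 - P_B) / (1 - p_j)
            for k in bag:
                if k != j:
                    others *= q[k]
            grad[j] = (-2.0 * c[j] * p_j * residuals[j] ** 2
                       + 2.0 * gamma * c[j] * p_j * others / p_bag)
    return grad


def descent_step(c, residuals, bags, gamma, lr=0.05):
    """One plain gradient descent step on c (lr is an illustrative choice)."""
    g = gradient_c(c, residuals, bags, gamma)
    return [cj - lr * gj for cj, gj in zip(c, g)]


if __name__ == "__main__":
    c = [0.5, 1.2, 0.8]
    residuals = [1.0, 3.0, 0.5]      # y_j - f(x_j, a) for a fixed a
    bags = [[0, 1], [2]]             # instance indices per bag
    print(objective_c(c, residuals, bags, 2.0))
    c_new = descent_step(c, residuals, bags, 2.0)
    print(objective_c(c_new, residuals, bags, 2.0))
```

In this toy run the instance with the largest residual (index 1) has its $c_j$ pushed up, so its probability $p_j$ of inheriting the bag label drops, which is the intended down-weighting of likely label outliers.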

    3.1.3 Convergence Analysis

    The optimization problem in Eqn. (8) is non-convex due to the non-convexity of the objective function, and hence we cannot guarantee that the solution is globally optimal. Instead, we show that the iterative procedure converges to a local optimum. For the optimization with respect to $a$, the closed-form solution guarantees that the objective function does not increase. For the optimization with respect to $c$, the gradient descent method also guarantees that the objective function does not increase. We therefore have

    $$Q(c^t, a^t) \geq Q(c^t, a^{t+1}) \geq Q(c^{t+1}, a^{t+1}), \tag{12}$$

    where $a^t$ denotes the solution for $a$ derived in the $t$th iteration, and likewise for $c^t$. Therefore, the objective function is non-increasing. Also, the objective function value is non-negative, which means that the objective function has a lower bound of 0. We can then conclude that the objective function converges to a local optimum according to Cauchy's criterion for convergence [17].

    3.2 Post-processing with Feature Refinement

    The original feature vector for the face instance does not necessarily provide good discriminating power for distinguishing the age information, and a supervised dimensionality reduction process may offer better discriminating power. We do not put the dimensionality reduction within the aforementioned robust multi-instance regression formulation, since it would bring heavy computation cost due to the much larger size of the projection matrix, denoted as $P \in \mathbb{R}^{d \times m}$ where $m$ is the final desired feature dimension, compared with the parameter vectors $a$ and $c$.

    In this subsection, we follow the work in [22] and adopt a supervised learning process for enhancing the discriminating power of the feature vector after we have obtained the values for $c$. More specifically, we first select the top-$k_2$ ($k_2$ is set to 100 in this work) face instances with the largest $p_j$ for each age. To simplify the notation, we still denote these selected high-probability face instances as $x_j$ and their age labels as $y_j$. The criterion guiding the pursuit of the projection matrix is that, in the derived low-dimensional feature space, face instances with similar age labels should also be similar in feature space. Let the label similarity matrix $W^s = [w^s_{jj'}]$ be defined as

    $$w^s_{jj'} = e^{-\|y_j - y_{j'}\|^2 / \sigma_2^2}, \tag{13}$$

    where $\sigma_2$ is set to 1 in this work. The projection matrix is then obtained by the following optimization,

    $$\min_{P^T P = I} \sum_{j \neq j'} \|P^T x_j - P^T x_{j'}\|^2 \, w^s_{jj'}. \tag{14}$$

    Denote $X^s$ as the feature matrix for the selected face instances; then the optimal $P$ consists of the eigenvectors corresponding to the top-$m$ largest eigenvalues of the matrix $X^s L^s X^{sT}$, where $L^s$ is the Laplacian matrix of the matrix $W^s$. We then project all the training, testing, and reference face instances of the regressor into this low-dimensional space by $P$.

    3.3 Face Instance Representation via Patches

    For the automatically crawled and cropped face instances, spatial misalignments may exist, and thus a robust feature representation is critical for the final age estimation performance. The feature vector $x$ representing the face instance is derived based on the work [22], namely, each face instance is finally represented as a so-called supervector.

    First, the DCT2 features are extracted from small patches (in this work we use $8 \times 8$ pixel patches) to form the feature vector $z \in \mathbb{R}^{\tilde{d}}$, and then based on all the patches extracted in an overlapping manner, we build a universal background model with Gaussian Mixture Models (GMM) as follows,

    $$p(z; \Theta) = \sum_{k=1}^{K} w_k \, \mathcal{N}(z; \mu_k, \Sigma_k), \tag{15}$$

    where $\Theta = \{w_1, \mu_1, \Sigma_1, \ldots\}$; $w_k$, $\mu_k$ and $\Sigma_k$ are the weight, mean and covariance matrix of the $k$th Gaussian component, respectively, and $K$ is the total number of Gaussian components. The density is a weighted linear combination of $K$ uni-modal Gaussian densities, namely,

    $$\mathcal{N}(z; \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{\tilde{d}/2} |\Sigma_k|^{1/2}} \, e^{-\frac{1}{2} (z - \mu_k)^T \Sigma_k^{-1} (z - \mu_k)}. \tag{16}$$

    We obtain a maximum likelihood parameter set for the GMM by using the Expectation-Maximization (EM) approach as in [5].

    For each face instance, we derive the instance-specific GMM by adapting the mean vectors of the global GMM and retaining the mixture weights and covariance matrices. Assuming that the extracted patch set from the face instance is denoted as $Z = \{z_i\}_{i=1}^{H}$, the MAP adaptation process is as follows,

    $$p(k \mid z_i) = \frac{w_k \, \mathcal{N}(z_i; \mu_k, \Sigma_k)}{\sum_{k'=1}^{K} w_{k'} \, \mathcal{N}(z_i; \mu_{k'}, \Sigma_{k'})}, \tag{17}$$

    $$n_k = \sum_{i=1}^{H} p(k \mid z_i), \tag{18}$$

    $$E_k(Z) = \frac{1}{n_k} \sum_{i=1}^{H} p(k \mid z_i) \, z_i, \tag{19}$$

    $$\hat{\mu}_k = \alpha_k E_k(Z) + (1 - \alpha_k) \mu_k, \tag{20}$$

    where $\alpha_k = n_k / (n_k + r)$ and $r$ is a parameter on priors. The above MAP adaptation process is based on conjugate priors for the means, and is useful because it interpolates smoothly between the hyper-parameters $\mu_k$ and the maximum likelihood parameters $E_k(Z)$. If a Gaussian component has a high probabilistic count $n_k$, then $\alpha_k$ approaches 1 and the adapted parameters emphasize the new sufficient statistics; otherwise, the adapted parameters are determined by the global model.
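The mean-only MAP adaptation of Eqns. (17)-(20) can be sketched in a few lines. This sketch makes assumptions that the text does not state: covariance matrices are taken to be diagonal (stored as per-dimension variances), the function names are hypothetical, and the value of the prior parameter `r` in the usage example is arbitrary.

```python
import math


def diag_gaussian(z, mean, var):
    """Density of a Gaussian with diagonal covariance (an assumption here)."""
    log_p = 0.0
    for zi, mi, vi in zip(z, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (zi - mi) ** 2 / vi)
    return math.exp(log_p)


def map_adapt_means(patches, weights, means, variances, r):
    """Adapt the global GMM means to one face instance (Eqns. 17-20)."""
    num_comp, dim = len(weights), len(means[0])
    n = [0.0] * num_comp                                # Eqn. (18)
    first = [[0.0] * dim for _ in range(num_comp)]      # sum_i p(k|z_i) z_i
    for z in patches:
        lik = [w * diag_gaussian(z, m, v)
               for w, m, v in zip(weights, means, variances)]
        total = sum(lik)
        if total == 0.0:
            continue  # patch has no numerical support under the model
        for k in range(num_comp):
            post = lik[k] / total                       # Eqn. (17)
            n[k] += post
            for t in range(dim):
                first[k][t] += post * z[t]
    adapted = []
    for k in range(num_comp):
        if n[k] == 0.0:
            adapted.append(list(means[k]))              # keep the global mean
            continue
        alpha = n[k] / (n[k] + r)
        adapted.append([alpha * (first[k][t] / n[k])    # Eqn. (19) inside
                        + (1.0 - alpha) * means[k][t]   # Eqn. (20)
                        for t in range(dim)])
    return adapted


if __name__ == "__main__":
    weights = [0.5, 0.5]
    means = [[0.0], [10.0]]
    variances = [[1.0], [1.0]]
    patches = [[1.0]] * 4            # four identical toy "patches"
    print(map_adapt_means(patches, weights, means, variances, r=4.0))
```

In the toy example the first component receives essentially all four patches, so with `r = 4` its mean moves halfway from 0 towards the patch mean 1, while the second component, which sees almost no data, stays at its global mean.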

    Suppose we have two face instances with the exacted patch setas Za and Zb, then, from the GMM MAP adaptation process in(17-20), we can obtain two adapted GMMs for them, denoted asga and gb. Consequently, each face instance is represented by aspecific GMM distribution model, and a natural similarity measurebetween them is the Kullback-Leibler divergence,

    D(ga||gb) =

    ga(z)log

    ga(z)

    gb(z)

    dz. (21)

    The Kullback-Leibler divergence itself does not satisfy the con-ditions for a metric, but there exists an upper bound from the log-sum inequality,

    D(ga||gb) Kk=1

    wkD(N(z; ak, k)||N(z;

    bk, k)),

    where ak denotes the adapted mean of the kth component fromface instance a, and likewise for bk. Based on the assumption thatthe covariance matrices are unchanged during the MAP adaptationprocess, the right side of the above inequality is equal to

    d(Za, Zb) =1

    2

    Kk=1

    wk(ak

    bk)T1k (

    ak

    bk). (22)

    It is easy to prove that d(Za, Zb) is a metric function, and can beconsidered as the Euclidean distance between two supervectors inanother high-dimensional feature space,

    (Za) = [

    w12

    12

    1 a1 ; ;

    wK

    2 12

    K aK], (23)

    and then d(Za, Zb) = (Za) (Zb)2.

    Then we can represent each face instance using $x = \phi(Z)$, given that $Z$ is the extracted patch set. If the patch DCT2 feature is $d$-dimensional and the number of components used is $K$, then the representation length is $dK$, which is normally very large. In this work, we perform PCA to reduce this original dimension to 2000, for the sake of computational tractability.
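A small numerical check can make the supervector construction concrete. The sketch below assumes diagonal covariance matrices (stored as per-dimension variances) so that it needs only the standard library; the function names are hypothetical and the toy numbers are not from the paper.

```python
import math


def supervector(weights, adapted_means, variances):
    """Stack sqrt(w_k / 2) * Sigma_k^(-1/2) * mean_k over all components.

    `variances[k]` holds the diagonal of Sigma_k (diagonal covariance assumed).
    """
    vec = []
    for w, mean, var in zip(weights, adapted_means, variances):
        scale = math.sqrt(w / 2.0)
        vec.extend(scale * m / math.sqrt(v) for m, v in zip(mean, var))
    return vec


def divergence_bound(weights, means_a, means_b, variances):
    """Weighted quadratic form of Eq. (22) for diagonal covariances."""
    total = 0.0
    for w, ma, mb, var in zip(weights, means_a, means_b, variances):
        total += w * sum((a - b) ** 2 / v for a, b, v in zip(ma, mb, var))
    return 0.5 * total


def squared_euclidean(u, v):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


weights = [0.4, 0.6]
variances = [[1.0, 4.0], [0.25, 1.0]]
means_a = [[1.0, 2.0], [0.0, -1.0]]
means_b = [[0.0, 2.0], [1.0, 1.0]]

d_direct = divergence_bound(weights, means_a, means_b, variances)
d_super = squared_euclidean(supervector(weights, means_a, variances),
                            supervector(weights, means_b, variances))
```

Both quantities evaluate to 2.6 for these toy values, which illustrates why distances between face instances can be computed with ordinary vector operations once the supervectors are built.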

    4. EXPERIMENTAL RESULTS

    4.1 Database Construction

    We crawled 391,176 images (also known as face bags) from Flickr.com and the Google image search engine based on a set of age related queries for each year from 1 to 80, and detected 586,595 face instances by using the parallel face detection scheme. After the pre-screening of the initial face instances, 77,021 images (bags) with 219,892 face instances are left, i.e., about 62.5% (roughly 2/3) of the face instances (false alarms, misaligned or poor quality faces) are removed. Note that the derived face dataset is significantly larger than the existing aging datasets, e.g., FG-NET [1] (1002 face images), MORPH-1 (1690 face images) and MORPH-2 [16] (55,608 face images).

    Fig. 4 shows several typical samples before the pre-screening step and Fig. 6 displays the sample face instances automatically cropped from the face image dataset before and after pre-screening.

    Several conclusions can be drawn from these observations:

    1. In most raw images, multiple face instances are cropped due to the multiple detector process, which essentially leads to a multi-instance problem for learning the universal age estimator.

    2. For some images, the bag age label is incorrect. In this case, the original multiple instance learning algorithm is prone to failure. Therefore a robust multi-instance learning algorithm, which can handle noisy labels, is necessary for pursuing a universal age estimator.

    3. There are also many poor quality faces and false alarm detections output from the face detectors. Poor quality faces include non-frontal faces and occluded faces. Most of these inappropriate detections can be pre-screened out by the multiple detector strategy. This results in a very clean dataset containing face instances only.

    4. A small portion of the filtered instances are true faces. These true faces are also removed as they are detected by only one face detector.

    5. There exist a significant number of non-face images related to the age keywords, e.g., pet, old building, wine, birthday cake, and tomb, which are easily filtered out by the face detectors.
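The observations above imply a simple agreement rule: a detection survives pre-screening only if more than one detector responds to it. The following is a rough sketch of such a rule, not the authors' procedure. Matching detections by intersection-over-union, the threshold `min_iou=0.5`, the box format `(x1, y1, x2, y2)` and the detector names are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def prescreen(detections, min_iou=0.5):
    """Keep a box only if a box from a different detector overlaps it enough.

    `detections` maps a detector name to its list of boxes for one image.
    """
    kept = []
    for name, boxes in detections.items():
        for box in boxes:
            confirmed = any(
                iou(box, other) >= min_iou
                for other_name, other_boxes in detections.items()
                if other_name != name
                for other in other_boxes
            )
            if confirmed:
                kept.append((name, box))
    return kept


detections = {
    "detector_a": [(0, 0, 10, 10), (50, 50, 60, 60)],  # second box is unconfirmed
    "detector_b": [(1, 1, 11, 11)],
}
kept = prescreen(detections)
```

The surviving boxes from different detectors then form the multiple instances of one bag; the isolated box is dropped even if it happens to be a true face, which matches observation 4.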

    Figure 5: Age label statistics of the downloaded images before (left light color bars) and after (right dark color bars) pre-screening. Horizontal axis: age value (0 to 80); vertical axis: number of images (bags), 0 to 16000.

    An illustration of the age statistics for the dataset before and after pre-screening is shown in Fig. 5.

    4.2 Algorithmic Evaluations

    In this subsection, we systematically evaluate the algorithmic convergence, robustness and age estimation accuracy of our proposed robust multi-instance regressor learning algorithm.


    Figure 4: Some sample images from the raw face database with detected face regions. Each column denotes a type of detection result, including (from left to right): (a) Correct Label: all the face instances (single or multiple) within the image inherit the bag age label; (b) Mixed Label: part of the face instances inherit the bag age label and other detected face instances correspond to other ages (noisy instances); (c) Incorrect Label: the bag age label is incorrect (the age labels for the images are 20, 10, 50, 60, 20, 20 from top to bottom); (d) Poor Quality: poor quality face instances due to rotation, illumination variation, occlusion or photo fadedness; (e) False Alarm: images contain false detections; (f) Non-face: age-relevant images which however contain no face instances. Note that different colors of the detection rectangles indicate the results from different detectors.

    4.2.1 Experiment Setup

    Our proposed algorithm is trained on the constructed Internet aging database (cropped faces after the pre-screening step). The aforementioned image patch based features are used to represent each face instance. For all the experiments, the cropped and pre-screened faces are first re-scaled to the size of 80 x 80 pixels, and histogram equalization is then performed on the re-scaled faces. The DCT2 features are then extracted based on 8 x 8 image patches, and the universal background model is trained with 512 Gaussian components. The adaptation rate $r$ is set to 1.0 in this work. By performing PCA on the derived supervector, 2000 feature dimensions are preserved, and after the post-processing step we further project the feature vector into a 500-dimensional space. Note that the first dimensionality reduction is for computational efficiency, while the second is for improving discriminative power. In the evaluations, we mainly focus on whether the robust multi-instance learning can improve the algorithmic performance. Many different regressors could be implemented for such an evaluation; in this work, we use the Gaussian kernel based kernel regression method for designing the age regressor. The comparison experiments are conducted between our proposed robust multi-instance regression method with noisy labels (RMIR) and the direct Gaussian kernel regression (GKR) without considering label noise. For both methods, the number of reference centers ($M$) and the kernel parameter are set to be empirically optimal. For RMIR, its two additional parameters are fixed empirically in all the experiments.
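The first stages of this setup (histogram equalization followed by patch-wise DCT features) can be sketched with the standard library alone. Several points are assumptions rather than statements from the paper: the face is taken to be already re-scaled to 80 x 80 and stored as rows of 8-bit integers, "DCT2" is read as an orthonormal two-dimensional DCT-II, and patches are taken without overlap (`stride=PATCH`). The function names are hypothetical.

```python
import math

PATCH = 8  # patch side length stated in the text


def equalize_histogram(image):
    """Histogram-equalize an 8-bit grayscale image given as a list of rows."""
    flat = [p for row in image for p in row]
    hist = [0] * 256
    for p in flat:
        hist[p] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    total = len(flat)
    if total == cdf_min:  # constant image: nothing to stretch
        return [list(row) for row in image]
    lut = [max(0, round((c - cdf_min) * 255 / (total - cdf_min))) for c in cdf]
    return [[lut[p] for p in row] for row in image]


# Cosine table and normalization factors of the orthonormal DCT-II basis.
_COS = [[math.cos((2 * x + 1) * u * math.pi / (2 * PATCH)) for x in range(PATCH)]
        for u in range(PATCH)]
_SCALE = [math.sqrt(1.0 / PATCH)] + [math.sqrt(2.0 / PATCH)] * (PATCH - 1)


def dct2(patch):
    """Orthonormal 2-D DCT-II of a PATCH x PATCH block, flattened row by row."""
    coeffs = []
    for u in range(PATCH):
        for v in range(PATCH):
            s = sum(patch[x][y] * _COS[u][x] * _COS[v][y]
                    for x in range(PATCH) for y in range(PATCH))
            coeffs.append(_SCALE[u] * _SCALE[v] * s)
    return coeffs


def patch_features(image, stride=PATCH):
    """Slide a PATCH x PATCH window over the image and collect DCT vectors."""
    height, width = len(image), len(image[0])
    feats = []
    for top in range(0, height - PATCH + 1, stride):
        for left in range(0, width - PATCH + 1, stride):
            patch = [row[left:left + PATCH] for row in image[top:top + PATCH]]
            feats.append(dct2(patch))
    return feats


face = [[x for x in range(80)] for _ in range(80)]  # synthetic 80 x 80 gradient
equalized = equalize_histogram(face)
features = patch_features(equalized)
```

With non-overlapping patches an 80 x 80 face yields 100 patch vectors of 64 coefficients each; a smaller stride would produce a denser patch set for the GMM adaptation step.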


    Figure 6: Sample face instances from the raw aging image database. Note that the images with masks are removed by the pre-screening step, and some true faces are also removed as they are detected by only one face detector.

    The convergence of the objective function for our proposed robust multi-instance regression method is visualized in Fig. 7. The training process is performed on the constructed Internet aging database. As can be seen, the optimization process normally converges after about 20 iterations.

    Figure 7: The convergence process of our proposed robust multi-instance regression learning algorithm on the constructed Internet aging database. Horizontal axis: iteration number (2 to 20); vertical axis: objective function value (1.41 to 1.435, x 10^10).

    4.2.2 Within-database Experiments

    The within-database experiments cannot directly be conducted on our constructed Internet aging database, since precise age labels are not provided. We therefore perform these evaluations on two benchmark aging datasets, FG-NET [1] and MORPH-1 [16], for which the ground-truth age labels are available. Each dataset is randomly partitioned into a training set and a testing set: for FG-NET, 600 images are randomly selected as the training set and the remaining 402 images are taken as the testing set; for MORPH-1, the training/testing partition is 800 : 890. Since these datasets have only one face instance for each image, the following scheme is used to simulate the noisy multiple instance case: for each image in the training set, we randomly add another instance to construct a two-instance bag. Note that the added instance does not necessarily share the age label of the bag. Then, for all the training bags, we randomly add a certain level of label noise, from 0% to 40%. We then run both our proposed robust multi-instance regression algorithm and the GKR on all the instances from the training bags.
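The simulation scheme can be sketched as follows. The paper does not spell out how a noisy label is drawn, so this sketch assumes that a fixed fraction of bags gets its label replaced by a different age drawn uniformly from `age_range`; that range, the seed handling and the function name `make_noisy_bags` are illustrative assumptions.

```python
import random


def make_noisy_bags(samples, noise_level, age_range=(0, 69), seed=0):
    """Build two-instance bags and corrupt a fraction of the bag labels.

    `samples` is a list of (instance, age) pairs with at least two entries.
    Returns a list of (bag, label, is_noisy) tuples in the order of `samples`.
    """
    rng = random.Random(seed)
    bags = []
    for i, (instance, age) in enumerate(samples):
        # Pair each image with one randomly chosen other training instance.
        j = rng.choice([idx for idx in range(len(samples)) if idx != i])
        bags.append([[instance, samples[j][0]], age, False])
    num_noisy = round(noise_level * len(bags))
    for idx in rng.sample(range(len(bags)), num_noisy):
        true_age = bags[idx][1]
        wrong_age = true_age
        while wrong_age == true_age:  # make sure the label really changes
            wrong_age = rng.randint(*age_range)
        bags[idx][1] = wrong_age
        bags[idx][2] = True
    return [tuple(bag) for bag in bags]


samples = [(f"face_{i}", 20 + i) for i in range(10)]
bags = make_noisy_bags(samples, noise_level=0.4)
```

Sweeping `noise_level` over 0.0, 0.1, 0.2, 0.3 and 0.4 reproduces the range of noise levels used in the experiment.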

    The comparison results in terms of the mean absolute error (MAE) on the testing set are shown in Fig. 8. The MAE is defined as:

    $$\mathrm{MAE} = \sum_{j=1}^{N_t} |y_j - g_j| / N_t, \quad (24)$$

    where $y_j$ and $g_j$ are the estimated and ground-truth age labels respectively, and $N_t$ is the number of testing samples. As can be seen, our proposed method remains robust as the age label noise level increases, whereas the GKR regressor fails in the presence of label noise.
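As a quick illustration of the definition in (24), the MAE is the average of the absolute differences between estimated and ground-truth ages. The helper name and the toy ages below are made up for the example.

```python
def mean_absolute_error(estimates, ground_truth):
    """Mean absolute error between estimated and ground-truth ages (in years)."""
    if not estimates or len(estimates) != len(ground_truth):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(y - g) for y, g in zip(estimates, ground_truth)) / len(estimates)


# Errors of 2, 0 and 5 years average to 7/3 years.
mae = mean_absolute_error([10, 25, 40], [12, 25, 35])
```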

    4.2.3 Cross-database Experiments

    In this subsection, we evaluate the algorithmic performance under the cross-database scenario, namely, the age estimator is trained on one database and then tested on another database. The algorithms are trained on our collected Internet aging dataset and the obtained regressors are then tested on the benchmark datasets, including FG-NET (1002 images), MORPH-1 (1690 images) and MORPH-2 (55,608 images). Note that MORPH-1 and MORPH-2 contain face images from most ethnic groups. The age statistics and the regression accuracies in terms of the mean absolute error (MAE) on these testing datasets are summarized in Table 1, where the baseline results from GKR are also reported for comparison. As can be seen, the MAEs of our proposed method are satisfactory (9.49, 7.42, and 8.60 for these three datasets respectively), and our proposed robust multi-instance learning based regressor outperforms the baseline GKR based regressor significantly (18.64, 16.60, and 10.80). The poor performance of GKR is mainly caused by the presence of the noisy labels in the training dataset.

    Figure 8: Comparison of the mean absolute errors (MAEs) (on the testing set) using different methods on the FG-NET (left) and MORPH-1 (right) datasets. Each panel plots the mean absolute error (year) against the label noise level (0% to 40%) for GKR (All Instances), RMIR, and the lower bound. The lower bound means the mean absolute error obtained by training a GKR regressor with the incorrect face instances excluded.

    Figure 9: The top-10 ranked face instances (left) vs. the bottom-10 ranked face instances (right) for the age labels 0, 10, 20, 30, 40, 50, 60, 70, 80 from the top row to the bottom, respectively. The rankings of the face instances are based on the values of $p_j$ derived from our RMIR algorithm.

    To further validate the generalization capability of the regressor learnt from our proposed robust multi-instance regressor learning method on the large-scale Internet aging database, we also evaluate the cross-database age estimation accuracies between the FG-NET and MORPH-1 datasets, namely, we use the FG-NET dataset for training and the MORPH-1 dataset for testing, and vice versa. The results reported in Table 1 show that these small-size datasets lack the generalization capability (MAEs of 12.46 and 10.37, respectively), while our Internet aging database performs much better (7.42 and 9.49). Note that the best result ever reported using FG-NET as the training set and MORPH-1 as the testing set is 8.07 in [7], which is also worse than our reported result (i.e., 7.42) obtained by learning the universal age estimator from the Internet aging database.

    The results shown in Table 1 exhibit certain error distributions. For almost all cases (except when the FG-NET dataset is used for training), the mid-age range presents lower errors, while the young and old age ranges have larger errors. This is due to the age distribution of the training data, i.e., for the MORPH-1 and MORPH-2 datasets, there are more samples in the mid-age range. For our IAD dataset, more valid samples (with large $p_j$) are also in the mid-age range after RMIR training. For the FG-NET dataset, more samples are in the young age range, which results in smaller testing errors in that range.
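A per-range breakdown like the one discussed above can be computed by grouping test samples into ten-year bins of their ground-truth age. This is a sketch of that bookkeeping only; the function name and the toy ages are hypothetical and unrelated to the reported numbers.

```python
from collections import defaultdict


def mae_by_decade(estimates, ground_truth):
    """Group absolute errors by the decade of the ground-truth age.

    Returns {decade_start: (num_samples, mae)} ordered by decade.
    """
    errors = defaultdict(list)
    for y, g in zip(estimates, ground_truth):
        errors[(int(g) // 10) * 10].append(abs(y - g))
    return {start: (len(errs), sum(errs) / len(errs))
            for start, errs in sorted(errors.items())}


# Ground-truth ages 3, 15, 18 and 42 fall into the 0-9, 10-19 and 40-49 bins.
table = mae_by_decade([5, 15, 20, 40], [3, 15, 18, 42])
```

Reporting the sample count next to each bin makes it easy to see when a low or high error is supported by only a handful of test faces.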

    To illustrate the effectiveness of our proposed robust multi-instance learning method visually, we show in Fig. 9 the top ranked 10 faces (based on the probability $p_j$ derived from our RMIR algorithm) and the bottom ranked 10 faces from the training results on the Internet aging database for comparison. We can observe that the majority of the bottom ranked faces are face instances with incorrect age labels or of poor image quality, while the top ranked faces are more consistent with the given age labels, which validates the effectiveness of our formulation in removing image outliers.
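The ranking used for this visualization amounts to sorting the instances of one age label by their probability and taking both ends of the list. A minimal sketch, assuming each instance is stored as an `(instance_id, p_j)` pair with made-up values:

```python
def top_and_bottom(instances, k=10):
    """Return the k highest- and k lowest-probability instances.

    `instances` is a list of (instance_id, p_j) pairs for a single age label.
    Both returned lists are ordered from high to low probability.
    """
    ordered = sorted(instances, key=lambda item: item[1], reverse=True)
    return ordered[:k], ordered[-k:]


instances = [("a", 0.9), ("b", 0.1), ("c", 0.5), ("d", 0.7)]
top, bottom = top_and_bottom(instances, k=2)
```

Applying this per age label (0, 10, ..., 80) gives one row of top-ranked and one row of bottom-ranked faces for each label.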

    Table 1: Age distribution statistics and mean absolute errors (MAEs, in years) of our Robust Multi-Instance Regression algorithm (RMIR) and the GKR regressor on the three testing datasets. Note that "IAD-Train" means we use the Internet aging database as the training set and similarly "FG-NET-Train" means the FG-NET database is used as the training set. A "-" marks an empty cell.

             No. of samples             MAE (FG-NET)                   MAE (MORPH-1)                 MAE (MORPH-2)
                                        IAD-Train       Morph-1-Train  IAD-Train       FG-NET-Train  IAD-Train
    Range    FG-NET   MORPH-1  MORPH-2  RMIR    GKR     GKR            RMIR    GKR     GKR           RMIR    GKR
    0-9      371      0        0        10.98   21.98   13.95          -       -       -             -       -
    10-19    339      343      7483     8.15    20.69   5.89           9.52    21.98   6.30          8.70    18.14
    20-29    144      763      15364    6.05    15.69   5.09           6.62    18.03   9.99          5.45    14.18
    30-39    79       428      15511    7.92    8.96    12.35          5.32    12.21   17.28         6.07    8.23
    40-49    46       124      12265    13.42   6.40    19.54          10.74   7.28    26.45         12.23   5.76
    50-59    15       25       3643     22.75   9.54    28.06          17.49   5.70    34.59         20.30   7.21
    60-69    8        7        324      29.96   14.69   35.49          34.77   10.67   53.92         29.96   13.09
    70-79    0        0        16       -       -       -              -       -       -             40.48   23.86
    > 80     0        0        0        -       -       -              -       -       -             -       -
    Average  -        -        -        9.49    18.64   10.37          7.42    16.60   12.46         8.60    10.80

    5. CONCLUSIONS AND FUTURE WORK

    In this paper, we aimed to utilize the rich Internet media resources for automatically constructing a universal human age estimator. The main contributions of this work are as follows. First, a large (391k) human aging image database was crawled via a set of popular age related queries. Then, after parallel detection and noise removal, a clean database with about 220k face instances was obtained. Finally, a robust multiple instance regressor learning method was developed for handling both noisy images and labels, which led to a strong universal age estimator, applicable to all ethnic groups and various image qualities. An interesting direction for future study is to develop an incremental learning algorithm for learning a multi-instance regressor with noisy labels, which is practically valuable for web-scale data mining.

    6. ACKNOWLEDGMENTS

    We thank Mr. Yantao Zheng and Dr. Jinhui Tang for the help in collecting online image data. This research is done for CSIDM Project No. CSIDM-200803 partially funded by a grant from the National Research Foundation (NRF) administered by the Media Development Authority (MDA) of Singapore. This work is also supported by NRF/IDM Program, under research Grant NRF2008IDM-IDM004-029.

    7. REFERENCES

    [1] The FG-NET aging database: http://sting.cycollege.ac.cy/~alanitis/fgnetaging.html.

    [2] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In Neural Information Processing Systems, 2002.

    [3] Y. Chen, J. Bi, and J. Wang. Multiple-instance learning via embedded instance selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):1931-1947, 2006.

    [4] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681-685, 2001.

    [5] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1-38, 1977.

    [6] Y. Fu and T. Huang. Human age estimation with regression on discriminative aging manifold. IEEE Transactions on Multimedia, 10(4):578-584, 2008.

    [7] X. Geng, Z. Zhou, and K. Smith-Miles. Automatic age estimation based on facial aging patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12):2234-2240, 2007.

    [8] G. Guo, Y. Fu, C. Dyer, and T. Huang. Image-based human age estimation by manifold learning and locally adjusted robust regression. IEEE Transactions on Image Processing, 17(7):1178-1188, 2008.

    [9] J. Hayashi, M. Yasumoto, H. Ito, and H. Koshimizu. A method for estimating and modeling age and gender using facial image processing. In International Conference on Virtual Systems and Multimedia, pages 439-448, 2001.

    [10] I. Jolliffe. Principal component analysis. Springer-Verlag, New York, 1986.

    [11] J. Keeler, D. Rumelhart, and W. Leow. Integrated segmentation and recognition of hand-printed numerals. In Neural Information Processing Systems, pages 557-563, 1990.

    [12] Y. Kwon and N. Lobo. Age classification from facial images. Computer Vision and Image Understanding, 74(1):1-21, 1999.

    [13] A. Lanitis, C. Draganova, and C. Christodoulou. Comparing different classifiers for automatic age estimation. IEEE Transactions on Systems, Man and Cybernetics, Part B, 34(1):621-628, 2004.

    [14] O. Maron and T. Lozano-Pérez. A framework for multiple-instance learning. In Neural Information Processing Systems, pages 570-576, 1998.

    [15] S. Ray and D. Page. Multiple instance regression. In International Conference on Machine Learning, pages 425-432, 2001.

    [16] K. Ricanek and T. Tesafaye. Morph: A longitudinal image database of normal adult age-progression. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 341-345, March 2006.

    [17] W. Rudin. Principles of Mathematical Analysis, 3rd Edition. McGraw-Hill, 1978.

    [18] P. Viola and M. Jones. Robust real-time face detection. In International Conference on Computer Vision, 2001.

    [19] P. Viola, J. Platt, and C. Zhang. Multiple instance boosting for object detection. In Neural Information Processing Systems, 2005.

    [20] J. Wang and J. Zucker. Solving the multiple-instance problem: a lazy learning approach. In International Conference on Machine Learning, pages 1119-1125, 2000.

    [21] S. Yan, H. Wang, X. Tang, J. Liu, and T. Huang. Regression from uncertain labels and its applications to soft-biometrics. IEEE Transactions on Information Forensics and Security, 3(4):698-708.

    [22] S. Yan, X. Zhou, M. Liu, M. Hasegawa-Johnson, and T. S. Huang. Regression from patch-kernel. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8, 2008.

    [23] K. Yanai and K. Barnard. Finding visual concept by web image mining. In International World Wide Web Conference, 2006.

    [24] Q. Zhang and S. Goldman. EM-DD: An improved multiple-instance learning technique. In Neural Information Processing Systems, 2001.

    [25] Z. Zhou and M. Zhang. Multi-instance multi-label learning with application to scene classification. In Neural Information Processing Systems, 2007.
