
    Self-taught Learning: Transfer Learning from Unlabeled Data

    Rajat Raina [email protected]

    Alexis Battle [email protected]

Honglak Lee [email protected]

    Benjamin Packer [email protected]

Andrew Y. Ng [email protected]
Computer Science Department, Stanford University, CA 94305 USA

Abstract

We present a new machine learning framework called self-taught learning for using unlabeled data in supervised classification tasks. We do not assume that the unlabeled data follows the same class labels or generative distribution as the labeled data. Thus, we would like to use a large number of unlabeled images (or audio samples, or text documents) randomly downloaded from the Internet to improve performance on a given image (or audio, or text) classification task. Such unlabeled data is significantly easier to obtain than in typical semi-supervised or transfer learning settings, making self-taught learning widely applicable to many practical learning problems. We describe an approach to self-taught learning that uses sparse coding to construct higher-level features using the unlabeled data. These features form a succinct input representation and significantly improve classification performance. When using an SVM for classification, we further show how a Fisher kernel can be learned for this representation.

    1. Introduction

Labeled data for machine learning is often very difficult and expensive to obtain, and thus the ability to use unlabeled data holds significant promise in terms of vastly expanding the applicability of learning methods. In this paper, we study a novel use of unlabeled data for improving performance on supervised learning tasks. To motivate our discussion, consider as a running example the computer vision task of classifying images of elephants and rhinos. For this task, it is difficult to obtain many labeled examples of elephants and rhinos; indeed, it is difficult even to obtain many unlabeled examples of elephants and rhinos. (In fact, we find it difficult to envision a process for collecting such unlabeled images that does not immediately also provide the class labels.) This makes the classification task quite hard with existing algorithms for using labeled and unlabeled data, including most semi-supervised learning algorithms such as the one by Nigam et al. (2000). In this paper, we ask how unlabeled images from other object classes, which are much easier to obtain than images specifically of elephants and rhinos, can be used. For example, given unlimited access to unlabeled, randomly chosen images downloaded from the Internet (probably none of which contain elephants or rhinos), can we do better on the given supervised classification task?

Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s).

Our approach is motivated by the observation that even many randomly downloaded images will contain basic visual patterns (such as edges) that are similar to those in images of elephants and rhinos. If, therefore, we can learn to recognize such patterns from the unlabeled data, these patterns can be used for the supervised learning task of interest, such as recognizing elephants and rhinos. Concretely, our approach learns a succinct, higher-level feature representation of the inputs using unlabeled data; this representation makes the classification task of interest easier.

Although we use computer vision as a running example, the problem that we pose to the machine learning community is more general. Formally, we consider solving a supervised learning task given labeled and unlabeled data, where the unlabeled data does not share the class labels or the generative distribution of the labeled data. For example, given unlimited access to natural sounds (audio), can we perform better speaker identification? Given unlimited access to news articles (text), can we perform better email foldering of ICML reviewing vs. NIPS reviewing emails?

Like semi-supervised learning (Nigam et al., 2000), our algorithms will therefore use labeled and unlabeled data. But unlike semi-supervised learning as it is typically studied in the literature, we do not assume that the unlabeled data can be assigned to the supervised learning task's class labels. To thus distinguish our formalism from such forms of semi-supervised learning, we will call our task self-taught learning.

There is no prior general, principled framework for incorporating such unlabeled data into a supervised learning algorithm. Semi-supervised learning typically makes the additional assumption that the unlabeled data can be labeled with the same labels as the classification task, and that these labels are merely unobserved (Nigam et al., 2000). Transfer learning typically requires further labeled data from a different but related task, and at its heart typically transfers knowledge from one supervised learning task to another; thus it requires additional labeled (and therefore often expensive-to-obtain) data, rather than unlabeled data, for these other supervised learning tasks.1 (Thrun, 1996; Caruana, 1997; Ando & Zhang, 2005) Because self-taught learning places significantly fewer restrictions on the type of unlabeled data, in many practical applications (such as image, audio or text classification) it is much easier to apply than typical semi-supervised learning or transfer learning methods. For example, it is far easier to obtain 100,000 Internet images than to obtain 100,000 images of elephants and rhinos; far easier to obtain 100,000 newswire articles than 100,000 articles on ICML reviewing and NIPS reviewing, and so on. Using our running example of image classification, Figure 1 illustrates these crucial distinctions between the self-taught learning problem that we pose and previous, related formalisms.

We pose the self-taught learning problem mainly to formalize a machine learning framework that we think has the potential to make learning significantly easier and cheaper. And while we treat any biological motivation for algorithms with great caution, the self-taught learning problem perhaps also more accurately reflects how humans may learn than previous formalisms, since much of human learning is believed to be from unlabeled data. Consider the following informal order-of-magnitude argument.2 A typical adult human brain has about $10^{14}$ synapses (connections), and a typical human lives on the order of $10^9$ seconds. Thus, even if each synapse is parameterized by just a one-bit parameter, a learning algorithm would require about $10^{14}/10^9 = 10^5$ bits of information per second to learn all the connections in the brain. It seems extremely unlikely that this many bits of labeled information are available (say, from a human's parents or teachers in his/her youth). While this argument has many (known) flaws and is not to be taken too seriously, it strongly suggests that most of human learning is unsupervised, requiring only data without any labels (such as whatever natural images, sounds, etc. one may encounter in one's life).

1 We note that these additional supervised learning tasks can sometimes be created via ingenious heuristics, as in Ando & Zhang (2005).

2 This argument was first described to us by Geoffrey Hinton (personal communication) but appears to reflect a view that is fairly widely held in neuroscience.

Figure 1. Machine learning formalisms for classifying images of elephants and rhinos. Images on orange background are labeled; others are unlabeled. Top to bottom: Supervised classification uses labeled examples of elephants and rhinos; semi-supervised learning uses additional unlabeled examples of elephants and rhinos; transfer learning uses additional labeled datasets; self-taught learning just requires additional unlabeled images, such as ones randomly downloaded from the Internet.

Inspired by these observations, in this paper we present largely unsupervised learning algorithms for improving performance on supervised classification tasks. Our algorithms apply straightforwardly to different input modalities, including images, audio and text. Our approach to self-taught learning consists of two stages: First we learn a representation using only unlabeled data. Then, we apply this representation to the labeled data, and use it for the classification task. Once the representation has been learned in the first stage, it can then be applied repeatedly to different classification tasks; in our example, once a representation has been learned from Internet images, it can be applied not only to images of elephants and rhinos, but also to other image classification tasks.

    2. Problem Formalism

In self-taught learning, we are given a labeled training set of $m$ examples $\{(x_l^{(1)}, y^{(1)}), (x_l^{(2)}, y^{(2)}), \ldots, (x_l^{(m)}, y^{(m)})\}$ drawn i.i.d. from some distribution $\mathcal{D}$. Here, each $x_l^{(i)} \in \mathbb{R}^n$ is an input feature vector (the $l$ subscript indicates that it is a labeled example), and $y^{(i)} \in \{1, \ldots, C\}$ is the corresponding class label. In addition, we are given a set of $k$ unlabeled examples $x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(k)} \in \mathbb{R}^n$. Crucially, we do not assume that the unlabeled data $x_u^{(j)}$ was drawn from the same distribution as, nor that it can be associated with the same class labels as, the labeled data.


Figure 2. Left: Example sparse coding bases learned from image patches (14x14 pixels) drawn from random grayscale images of natural scenery. Each square in the grid represents one basis. Right: Example acoustic bases learned by the same algorithm, using 25ms sound samples from speech data. Each of the four rectangles in the 2x2 grid shows the 25ms long acoustic signal represented by a basis vector.

Clearly, as in transfer learning (Thrun, 1996; Caruana, 1997), the labeled and unlabeled data should not be completely irrelevant to each other if unlabeled data is to help the classification task. For example, we would typically expect that $x_l^{(i)}$ and $x_u^{(j)}$ come from the same input type or modality, such as images, audio, text, etc.

Given the labeled and unlabeled training set, a self-taught learning algorithm outputs a hypothesis $h : \mathbb{R}^n \rightarrow \{1, \ldots, C\}$ that tries to mimic the input-label relationship represented by the labeled training data; this hypothesis $h$ is then tested under the same distribution $\mathcal{D}$ from which the labeled data was drawn.
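Concretely, the objects defined above are just two sets of vectors and one label vector. The following minimal NumPy sketch (with made-up sizes and variable names that are ours, not the paper's) only fixes the shapes that the algorithms in Section 3 consume:

```python
import numpy as np

m, k, n, C = 100, 10000, 196, 2    # labeled examples, unlabeled examples, input dimension, classes

X_l = np.random.randn(m, n)              # labeled inputs x_l^(i), one per row
y = np.random.randint(0, C, size=m)      # class labels y^(i), here coded as {0, ..., C-1}
X_u = np.random.randn(k, n)              # unlabeled inputs x_u^(j): no labels, and possibly drawn
                                         # from a different distribution than the labeled data
```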

    3. A Self-taught Learning Algorithm

We hope that the self-taught learning formalism that we have proposed will engender much novel research in machine learning. In this paper, we describe just one approach to the problem.

We present an algorithm that begins by using the unlabeled data $x_u^{(i)}$ to learn a slightly higher-level, more succinct representation of the inputs. For example, if the inputs $x_u^{(i)}$ (and $x_l^{(i)}$) are vectors of pixel intensity values that represent images, our algorithm will use $x_u^{(i)}$ to learn the basic elements that comprise an image. For example, it may discover (through examining the statistics of the unlabeled images) certain strong correlations between rows of pixels, and therefore learn that most images have many edges. Through this, it then learns to represent images in terms of the edges that appear in them, rather than in terms of the raw pixel intensity values. This representation of an image in terms of the edges that appear in it, rather than the raw pixel intensity values, is a higher-level, or more abstract, representation of the input. By applying this learned representation to the labeled data $x_l^{(i)}$, we obtain a higher-level representation of the labeled data also, and thus an easier supervised learning task.

    3.1. Learning Higher-level Representations

Figure 3. The features computed for an image patch (left) by representing the patch as a sparse weighted combination of bases (right). These features act as robust edge detectors.

Figure 4. Left: An example platypus image from the Caltech 101 dataset. Right: Features computed for the platypus image using four sample image patch bases (trained on color images, and shown in the small colored squares) by computing features at different locations in the image. In the large figures on the right, white pixels represent highly positive feature values for the corresponding basis, and black pixels represent highly negative feature values. These activations capture higher-level structure of the input image. (Bases have been magnified for clarity; best viewed in color.)

We learn the higher-level representation using a modified version of the sparse coding algorithm due to Olshausen & Field (1996), which was originally proposed as an unsupervised computational model of low-level sensory processing in humans. More specifically, given the unlabeled data $\{x_u^{(1)}, \ldots, x_u^{(k)}\}$ with each $x_u^{(i)} \in \mathbb{R}^n$, we pose the following optimization problem:

$$\text{minimize}_{b,a} \; \sum_i \left\| x_u^{(i)} - \sum_j a_j^{(i)} b_j \right\|_2^2 + \beta \left\| a^{(i)} \right\|_1 \quad (1)$$
$$\text{s.t. } \|b_j\|_2 \le 1, \; \forall j \in \{1, \ldots, s\}$$

The optimization variables in this problem are the basis vectors $b = \{b_1, b_2, \ldots, b_s\}$ with each $b_j \in \mathbb{R}^n$, and the activations $a = \{a^{(1)}, \ldots, a^{(k)}\}$ with each $a^{(i)} \in \mathbb{R}^s$; here, $a_j^{(i)}$ is the activation of basis $b_j$ for input $x_u^{(i)}$. The number of bases $s$ can be much larger than the input dimension $n$. The optimization objective (1) balances two terms: (i) the first quadratic term encourages each input $x_u^{(i)}$ to be reconstructed well as a weighted linear combination of the bases $b_j$ (with corresponding weights given by the activations $a_j^{(i)}$); and (ii) it encourages the activations to have low $L_1$ norm. The latter term therefore encourages the activations $a$ to be sparse; in other words, for most of their elements to be zero (Tibshirani, 1996; Ng, 2004). This formulation is actually a modified version of Olshausen & Field's, and can be solved significantly more efficiently. Specifically, the problem (1) is convex over each subset of variables $a$ and $b$ (though not jointly convex); in particular, the optimization over activations $a$ is an $L_1$-regularized least squares problem, and the optimization over basis vectors $b$ is an $L_2$-constrained least squares problem. These two convex sub-problems can be solved efficiently, and the objective in problem (1) can be iteratively optimized over $a$ and $b$ alternately, holding the other set of variables fixed (Lee et al., 2007).
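To make the alternating optimization concrete, the following is a minimal NumPy sketch of one way to approximate problem (1): ISTA (iterative soft-thresholding) for the $L_1$-regularized activation step, and a regularized least-squares update with projection onto the unit ball for the bases. The function names and solver choices here are ours for illustration; the paper itself relies on the much faster solvers of Lee et al. (2007).

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the L1 norm: shrink every coordinate toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code(X, B, beta, n_steps=200):
    """Approximate the activation subproblem of (1) (L1-regularized least squares) with ISTA.
    X: (num_examples, n) inputs; B: (s, n) bases, one per row. Returns activations A: (num_examples, s)."""
    L = np.linalg.norm(B @ B.T, 2) + 1e-8      # Lipschitz constant of the (halved) quadratic term's gradient
    A = np.zeros((X.shape[0], B.shape[0]))
    for _ in range(n_steps):
        grad = (A @ B - X) @ B.T               # gradient of (1/2)||X - A B||_F^2 with respect to A
        A = soft_threshold(A - grad / L, beta / (2 * L))
    return A

def learn_bases(X_u, s, beta, n_iter=20):
    """Alternate between the two convex subproblems of (1), holding the other variables fixed."""
    rng = np.random.default_rng(0)
    B = rng.standard_normal((s, X_u.shape[1]))
    B /= np.linalg.norm(B, axis=1, keepdims=True)          # start with ||b_j||_2 = 1
    for _ in range(n_iter):
        A = sparse_code(X_u, B, beta)                      # fix bases, solve for activations
        # Fix activations and update the bases: ridge-regularized least squares followed by
        # projecting each row back into the unit L2 ball (an approximation to the exact
        # L2-constrained solve used in the paper).
        B = np.linalg.solve(A.T @ A + 1e-6 * np.eye(s), A.T @ X_u)
        B /= np.maximum(np.linalg.norm(B, axis=1, keepdims=True), 1.0)
    return B

# Example: learn 64 bases from random placeholder "unlabeled" data (a stand-in for image patches).
bases = learn_bases(np.random.randn(500, 196), s=64, beta=0.5)
```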

As an example, when this algorithm is applied to small 14x14 images, it learns to detect different edges in the image, as shown in Figure 2 (left). Exactly the same algorithm can be applied to other input types, such as audio. When applied to speech sequences, sparse coding learns to detect different patterns of frequencies, as shown in Figure 2 (right).

Importantly, by using an $L_1$ regularization term, we obtain extremely sparse activations: only a few bases are used to reconstruct any input $x_u^{(i)}$; this will give us a succinct representation for $x_u^{(i)}$ (described later). We note that other regularization terms that result in most of the $a_j^{(i)}$ being non-zero (such as that used in the original Olshausen & Field algorithm) do not lead to good self-taught learning performance; this is described in more detail in Section 4.

    3.2. Unsupervised Feature Construction

It is often easy to obtain large amounts of unlabeled data that shares several salient features with the labeled data from the classification task of interest. In image classification, most images contain many edges and other visual structures; in optical character recognition, characters from different scripts mostly comprise different pen strokes; and for speaker identification, speech even in different languages can often be broken down into common sounds (such as phones).

Building on this observation, we propose the following approach to self-taught learning: We first apply sparse coding to the unlabeled data $x_u^{(i)} \in \mathbb{R}^n$ to learn a set of bases $b$, as described in Section 3.1. Then, for each training input $x_l^{(i)} \in \mathbb{R}^n$ from the classification task, we compute features $a(x_l^{(i)}) \in \mathbb{R}^s$ by solving the following optimization problem:

$$a(x_l^{(i)}) = \arg\min_{a^{(i)}} \left\| x_l^{(i)} - \sum_j a_j^{(i)} b_j \right\|_2^2 + \beta \left\| a^{(i)} \right\|_1 \quad (2)$$

This is a convex $L_1$-regularized least squares problem and can be solved efficiently (Efron et al., 2004; Lee et al., 2007). It approximately expresses the input $x_l^{(i)}$ as a sparse linear combination of the bases $b_j$. The sparse vector $a(x_l^{(i)})$ is our new representation for $x_l^{(i)}$.

Using a set of 512 learned image bases (as in Figure 2, left), Figure 3 illustrates a solution to this optimization problem, where the input image $x$ is approximately expressed as a combination of three basis vectors $b_{142}, b_{381}, b_{497}$. The image $x$ can now be represented via the vector $a \in \mathbb{R}^{512}$ with $a_{142} = 0.6$, $a_{381} = 0.8$, $a_{497} = 0.4$. Figure 4 shows such features $a$ computed for a large image. In both of these cases, the computed features capture aspects of the higher-level structure of the input images. This method applies equally well to other input types; the features computed on audio samples or text documents similarly detect useful higher-level patterns in the inputs.
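As a tiny illustration of what this representation looks like in code, the snippet below builds the sparse vector from the example above. The basis matrix is random and purely hypothetical; only the three activation values and the dimensions come from the text, and array indices are 0-based.

```python
import numpy as np

s, n = 512, 196                  # number of learned bases and input dimension (e.g., a 14x14 patch)
B = np.random.randn(n, s)        # hypothetical stand-in for the learned bases b_1, ..., b_512 (columns)

a = np.zeros(s)                  # the sparse feature vector a(x): mostly zeros
a[142], a[381], a[497] = 0.6, 0.8, 0.4   # only three bases are active for this input

x_approx = B @ a                 # x is approximated as 0.6*b_142 + 0.8*b_381 + 0.4*b_497
```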

We use these features as input to standard supervised classification algorithms (such as SVMs). To classify a test example, we solve (2) to obtain our representation $a$ for it, and use that as input to the trained classifier. Algorithm 1 summarizes our algorithm for self-taught learning.

Algorithm 1 Self-taught Learning via Sparse Coding

input: Labeled training set $T = \{(x_l^{(1)}, y^{(1)}), (x_l^{(2)}, y^{(2)}), \ldots, (x_l^{(m)}, y^{(m)})\}$.
  Unlabeled data $\{x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(k)}\}$.
output: Learned classifier for the classification task.
algorithm:
  Using unlabeled data $\{x_u^{(i)}\}$, solve the optimization problem (1) to obtain bases $b$.
  Compute features for the classification task to obtain a new labeled training set $\hat{T} = \{(a(x_l^{(i)}), y^{(i)})\}_{i=1}^m$, where
    $a(x_l^{(i)}) = \arg\min_{a^{(i)}} \| x_l^{(i)} - \sum_j a_j^{(i)} b_j \|_2^2 + \beta \| a^{(i)} \|_1$.
  Learn a classifier $C$ by applying a supervised learning algorithm (e.g., SVM) to the labeled training set $\hat{T}$.
return the learned classifier $C$.
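Algorithm 1 maps naturally onto off-the-shelf tooling. The sketch below is only a rough executable rendering: scikit-learn's DictionaryLearning stands in for the sparse coding solver of Lee et al. (2007), a linear SVM stands in for the SVM/GDA classifiers used later in the paper, and all data, sizes and the regularization weight are placeholders rather than the paper's settings.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_u = rng.standard_normal((2000, 196))   # unlabeled inputs x_u (e.g., 14x14 patches), placeholder data
X_l = rng.standard_normal((200, 196))    # labeled inputs x_l
y = rng.integers(0, 2, size=200)         # class labels y

# Step 1: learn bases b from the unlabeled data (an approximation of problem (1)).
# The paper uses many more bases (e.g., 512); 64 keeps this sketch cheap to run.
coder = DictionaryLearning(n_components=64, alpha=1.0,
                           transform_algorithm="lasso_lars", max_iter=20)
coder.fit(X_u)

# Step 2: compute sparse features a(x_l) for the labeled data by solving problem (2).
A_l = coder.transform(X_l)

# Step 3: train a standard classifier C on the new representation and use it at test time.
clf = LinearSVC().fit(A_l, y)
a_test = coder.transform(rng.standard_normal((1, 196)))
print(clf.predict(a_test))
```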

    3.3. Comparison with Other Methods

It seems that any algorithm for the self-taught learning problem must, at some abstract level, detect structure using the unlabeled data. Many unsupervised learning algorithms have been devised to model different aspects of higher-level structure; however, their application to self-taught learning is more challenging than might be apparent at first blush.

Principal component analysis (PCA) is among the most commonly used unsupervised learning algorithms. It identifies a low-dimensional subspace of maximal variation within unlabeled data. Interestingly, the top $T \le n$ principal components $b_1, b_2, \ldots, b_T$ are a solution to an optimization problem that is cosmetically similar to our formulation in (1):

$$\text{minimize}_{b,a} \; \sum_i \left\| x_u^{(i)} - \sum_j a_j^{(i)} b_j \right\|_2^2 \quad (3)$$
$$\text{s.t. } b_1, b_2, \ldots, b_T \text{ are orthogonal}$$

PCA is convenient because the above optimization problem can be solved efficiently using standard numerical software; further, the features $a_j^{(i)}$ can be computed easily because of the orthogonality constraint, and are simply $a_j^{(i)} = b_j^T x_u^{(i)}$.
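For contrast with sparse coding, the PCA features are plain orthogonal projections. A minimal NumPy sketch (on a random placeholder data matrix) of extracting the top $T$ components and computing $a_j^{(i)} = b_j^T x_u^{(i)}$:

```python
import numpy as np

X_u = np.random.randn(1000, 196)    # placeholder unlabeled data, one example per row
T = 50                              # number of principal components to keep (T <= n)

Xc = X_u - X_u.mean(axis=0)         # center the data
# Rows of Vt are orthonormal principal directions; keep the top T as bases b_1, ..., b_T.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
B = Vt[:T]                          # shape (T, n)

A = Xc @ B.T                        # features a_j^(i) = b_j^T x_u^(i): a purely linear map of the input
```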

When compared with sparse coding as a method for constructing self-taught learning features, PCA has two limitations. First, PCA results in linear feature extraction, in that the features $a_j^{(i)}$ are simply a linear function of the input.3 Second, since PCA assumes the bases $b_j$ to be orthogonal, the number of PCA features cannot be greater than the dimension $n$ of the input.

3 As an example of a nonlinear but useful feature for images, consider the phenomenon called end-stopping (which is known to occur in biological visual perception), in which a feature is maximally activated by edges of only a specific orientation and length; increasing the length of the edge further significantly decreases the feature's activation. A linear response model cannot exhibit end-stopping.


Table 1. Details of self-taught learning applications evaluated in the experiments.

Domain | Unlabeled data | Labeled data | Classes | Raw features
Image classification | 10 images of outdoor scenes | Caltech 101 image classification dataset | 101 | Intensities in 14x14 pixel patch
Handwritten character recognition | Handwritten digits (0-9) | Handwritten English characters (a-z) | 26 | Intensities in 28x28 pixel character/digit image
Font character recognition | Handwritten English characters (a-z) | Font characters (a/A - z/Z) | 26 | Intensities in 28x28 pixel character image
Song genre classification | Song snippets from 10 genres | Song snippets from 7 different genres | 7 | Log-frequency spectrogram over 50ms time windows
Webpage classification | 100,000 news articles (Reuters newswire) | Categorized webpages (from DMOZ hierarchy) | 2 | Bag-of-words with 500 word vocabulary
UseNet article classification | 100,000 news articles (Reuters newswire) | Categorized UseNet posts (from SRAA dataset) | 2 | Bag-of-words with 377 word vocabulary

Table 2. Classification accuracy on the Caltech 101 image classification dataset. For PCA and sparse coding results, each image was split into the specified number of regions, and features were aggregated within each region by taking the maximum absolute value.

Features | 1 region | 4 regions | 9 regions | 16 regions
PCA | 20.1% | 30.6% | 36.8% | 37.0%
Sparse coding | 30.8% | 40.9% | 46.0% | 46.6%
Published baseline (Fei-Fei et al., 2004) | 16%

Sparse coding does not have either of these limitations. Its features $a(x)$ are an inherently nonlinear function of the input $x$, due to the presence of the $L_1$ term in Equation (1).4 Further, sparse coding can use more basis vectors/features than the input dimension $n$. By learning a large number of basis vectors but using only a small number of them for any particular input, sparse coding gives a higher-level representation in terms of the many possible basic patterns, such as edges, that may appear in an input. Section 6 further discusses other unsupervised learning algorithms.

4 For example, sparse coding can exhibit end-stopping (Lee et al., 2007). Note also that even though sparse coding attempts to express $x$ as a linear combination of the bases $b_j$, the optimization problem (2) results in the activations $a_j$ being a non-linear function of $x$.

    4. Experiments

We apply our algorithm to several self-taught learning tasks, shown in Table 1. Note that the unlabeled data in each case cannot be assigned the labels from the labeled task. For each application, the raw input examples $x$ were represented in a standard way: raw pixel intensities for images, the frequency spectrogram for audio, and the bag-of-words (vector) representation for text. For computational reasons, the unlabeled data was preprocessed by applying PCA to reduce its dimension;5 the sparse coding basis learning algorithm was then applied in the resulting principal component space.6 Then, the learned bases were used to construct features for each input from the supervised classification task.7 For each such task, we report the result from the better of two standard, off-the-shelf supervised learning algorithms: a support vector machine (SVM) and Gaussian discriminant analysis (GDA). (A classifier specifically customized to sparse coding features is described in Section 5.)

We compare our self-taught learning algorithm against two baselines, also trained with an SVM or GDA: using the raw inputs themselves as features, and using principal component projections as features, where the principal components were computed on the unlabeled data (as described in Section 3.3).

5 We picked the number of principal components to preserve approximately 96% of the unlabeled data variance.

6 Reasonable bases can often be learned even using a smooth approximation such as $\sum_j \sqrt{a_j^2 + \epsilon}$ to the $L_1$-norm sparsity penalty $\|a\|_1$. However, such approximations do not produce sparse features, and in our experiments, we found that classification performance is significantly worse if such approximations are used to compute $a(x)$. Since the labeled and unlabeled data can sometimes lead to very different numbers of non-zero coefficients $a_i$, in our experiments $\beta$ was also recalibrated prior to computing the labeled data's representations $a(x_l)$.

7 Partly for scaling and computational reasons, an additional feature aggregation step was applied to the image and audio classification tasks (since a single image is several times larger than the individual/small image patch bases that can be learned tractably by sparse coding). We aggregated features for the large image by extracting features for small image patches in different locations in the large image, and then aggregating the features per-basis by taking the feature value with the maximum absolute value. The aggregation procedure effectively looks for the strongest occurrence of each basis pattern within the image. (Even better performance is obtained by aggregating features over a KxK grid of regions, thus looking for strong activations separately in different parts of the large image; see Table 2.) These region-wise aggregated features were used as input to the classification algorithms (SVM or GDA). Features for audio snippets were similarly aggregated by computing the maximum activation per basis vector over 50ms windows in the snippet.
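The per-basis, max-absolute-value pooling described in this footnote is easy to state in code. Below is a minimal NumPy sketch under our own assumptions about how patch features and their positions are provided; the paper's exact pipeline and grid size may differ.

```python
import numpy as np

def aggregate_features(patch_features, positions, image_shape, K=2):
    """Max-absolute-value pooling of per-patch sparse codes over a KxK grid of regions.

    patch_features: array (num_patches, s) of sparse codes a(x) for the patches of one image
    positions:      array (num_patches, 2) of (row, col) patch centers within the image
    Returns an array (K*K, s) of region-wise aggregated features (flattened before classification).
    """
    H, W = image_shape
    s = patch_features.shape[1]
    pooled = np.zeros((K, K, s))
    for feats, (r, c) in zip(patch_features, positions):
        i, j = min(int(K * r / H), K - 1), min(int(K * c / W), K - 1)
        # Keep, per basis, the value with the largest magnitude seen so far in this region.
        replace = np.abs(feats) > np.abs(pooled[i, j])
        pooled[i, j][replace] = feats[replace]
    return pooled.reshape(K * K, s)
```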


Figure 5. Left: Example images from the handwritten digit dataset (top), the handwritten character dataset (middle) and the font character dataset (bottom). Right: Example sparse coding bases learned on handwritten digits.

Table 3. Top: Classification accuracy on 26-way handwritten English character classification, using bases trained on handwritten digits. Bottom: Classification accuracy on 26-way English font character classification, using bases trained on English handwritten characters. The numbers in parentheses denote the accuracy using raw and sparse coding features together. Here, sparse coding features alone do not perform as well as the raw features, but perform significantly better when used in combination with the raw features.

Digits → English handwritten characters
Training set size | Raw | PCA | Sparse coding
100 | 39.8% | 25.3% | 39.7%
500 | 54.8% | 54.8% | 58.5%
1000 | 61.9% | 64.5% | 65.3%

Handwritten characters → Font characters
Training set size | Raw | PCA | Sparse coding
100 | 8.2% | 5.7% | 7.0% (9.2%)
500 | 17.9% | 14.5% | 16.6% (20.2%)
1000 | 25.6% | 23.7% | 23.2% (28.3%)

In the PCA results presented in this paper, the number of principal components used was always fixed at the number of principal components used for preprocessing the raw input before applying sparse coding. This control experiment allows us to evaluate the effects of PCA preprocessing and the later sparse coding step separately, but should therefore not be treated as a direct evaluation of PCA as a self-taught learning algorithm (where the number of principal components could then also be varied).

Tables 2-4 report the results for various domains. Sparse coding features, possibly in combination with raw features, significantly outperform the raw features alone as well as PCA features on most of the domains.

On the 101-way Caltech 101 image classification task with 15 training images per class (Table 2), sparse coding features achieve a test accuracy of 46.6%. In comparison, the first published supervised learning algorithm for this dataset achieved only 16% test accuracy even with computer vision specific features (instead of raw pixel intensities).8

8 Since the time we ran our experiments, other researchers have reported better results using highly specialized computer vision algorithms (Zhang et al., 2006: 59.1%; Lazebnik et al., 2006: 56.4%). We note that our algorithm was until recently state-of-the-art for this well-known dataset, even with almost no explicit computer-vision engineering, and indeed it significantly outperforms many carefully hand-designed, computer-vision specific methods published on this task (e.g., Fei-Fei et al., 2004: 16%; Serre et al., 2005: 35%; Holub et al., 2005: 40.1%).

Table 4. Accuracy on 7-way music genre classification.

Training set size | Raw | PCA | Sparse coding
100 | 28.3% | 28.6% | 44.0%
1000 | 34.0% | 26.3% | 45.5%
5000 | 38.1% | 38.1% | 44.3%

Table 5. Text bases learned on 100,000 Reuters newswire documents. Top: Each row represents the basis most active on average for documents with the class label at the left. For each basis vector, the words corresponding to the largest magnitude elements are displayed. Bottom: Each row represents the basis that contains the largest magnitude element for the word at the left. The words corresponding to other large magnitude elements are displayed.

Design | design, company, product, work, market
Business | car, sale, vehicle, motor, market, import

vaccine | infect, report, virus, hiv, decline, product
movie | share, disney, abc, release, office, movie, pay

Figure 5 shows example inputs from the three character datasets, and some of the learned bases. The learned bases appear to represent pen strokes. In Table 3, it is thus not surprising that sparse coding is able to use bases (strokes) learned on digits to significantly improve performance on handwritten characters: it allows the supervised learning algorithm to "see" the characters as comprising strokes, rather than as comprising pixels.

For audio classification, our algorithm outperforms the original (spectral) features (Table 4).9 When applied to text, sparse coding discovers word relations that might be useful for classification (Table 5). The performance improvement over raw features is small (Table 6).10 This might be because the bag-of-words representation of text documents is already sparse, unlike the raw inputs for the other applications.11

We envision self-taught learning as being most useful when labeled data is scarce. Table 7 shows that with small amounts of labeled data, classification performance deteriorates significantly when the bases (in sparse coding) or principal components (in PCA) are learned on the labeled data itself, instead of on large amounts of additional unlabeled data.12 As more and more labeled data becomes available, the performance of sparse coding trained on labeled data approaches (and presumably will ultimately exceed) that of sparse coding trained on unlabeled data.

9 Details: We learned bases over songs from 10 genres, and used these bases to construct features for a music genre classification over songs from 7 different genres (with different artists, and possibly different instruments). Each training example comprised a labeled 50ms song snippet; each test example was a 1 second song snippet.

10 Details: Learned bases were evaluated on 30 binary webpage category classification tasks. PCA applied to text documents is commonly referred to as latent semantic analysis (Deerwester et al., 1990).

11 The results suggest that algorithms such as LDA (Blei et al., 2002) might also be appropriate for self-taught learning on text (though LDA is specific to a bag-of-words representation and would not apply to the other domains).


Table 6. Classification accuracy on webpage classification (top) and UseNet article classification (bottom), using bases trained on Reuters news articles.

Reuters news → Webpages
Training set size | Raw | PCA | Sparse coding
4 | 62.8% | 63.3% | 64.3%
10 | 73.0% | 72.9% | 75.9%
20 | 79.9% | 78.6% | 80.4%

Reuters news → UseNet articles
Training set size | Raw | PCA | Sparse coding
4 | 61.3% | 60.7% | 63.8%
10 | 69.8% | 64.6% | 68.7%

Table 7. Accuracy on the self-taught learning tasks when sparse coding bases are learned on unlabeled data (third column), or when principal components/sparse coding bases are learned on the labeled training set (fourth/fifth column). Since Tables 3 and 6 already show the results for PCA trained on unlabeled data, we omit those results from this table. The performance trends are qualitatively preserved even when raw features are appended to the sparse coding features.

Domain | Training set size | SC (unlabeled) | PCA (labeled) | SC (labeled)
Handwritten characters | 100 | 39.7% | 36.2% | 31.4%
Handwritten characters | 500 | 58.5% | 50.4% | 50.8%
Handwritten characters | 1000 | 65.3% | 62.5% | 61.3%
Handwritten characters | 5000 | 73.1% | 73.5% | 73.0%
Font characters | 100 | 7.0% | 5.2% | 5.1%
Font characters | 500 | 16.6% | 11.7% | 14.7%
Font characters | 1000 | 23.2% | 19.0% | 22.3%
Webpages | 4 | 64.3% | 55.9% | 53.6%
Webpages | 10 | 75.9% | 57.0% | 54.8%
Webpages | 20 | 80.4% | 62.9% | 60.5%
UseNet | 4 | 63.8% | 60.5% | 50.9%
UseNet | 10 | 68.7% | 67.9% | 60.8%

Self-taught learning empirically leads to significant gains in a large variety of domains. An important theoretical question is characterizing how the similarity between the unlabeled and labeled data affects the self-taught learning performance (similar to the analysis by Baxter, 1997, for transfer learning). We leave this question open for further research.

12 For the sake of simplicity (and due to space constraints), we performed this comparison only for the domains that the basic sparse coding algorithm applies to, and that do not require the extra feature aggregation step.

5. Learning a Kernel via Sparse Coding

A fundamental problem in supervised classification is defining a similarity function between two input examples. In the experiments described above, we used the regular notions of similarity (i.e., standard SVM kernels) to allow a fair comparison with the baseline algorithms. However, we now show that the sparse coding model also suggests a specific specialized similarity function (kernel) for the learned representations.

The sparse coding model (1) can be viewed as learning the parameter $b$ of the following linear generative model, which posits Gaussian noise $\eta$ on the observations $x$ and a Laplacian ($L_1$) prior over the activations:

$$P\Big(x = \textstyle\sum_j a_j b_j + \eta \;\Big|\; b, a\Big) \propto \exp\big(-\|\eta\|_2^2 / 2\sigma^2\big), \qquad P(a) \propto \exp\Big(-\beta \textstyle\sum_j |a_j|\Big)$$

Once the bases $b$ have been learned using unlabeled data, we obtain a complete generative model for the input $x$. Thus, we can compute the Fisher kernel to measure the similarity between new inputs (Jaakkola & Haussler, 1998). In detail, given an input $x$, we first compute the corresponding features $a(x)$ by efficiently solving optimization problem (2). Then, the Fisher kernel implicitly maps the input $x$ to the high-dimensional feature vector $U_x = \nabla_b \log P(x, a(x) \mid b)$, where we have used the MAP approximation $a(x)$ for the random variable $a$.13 Importantly, for the sparse coding generative model, the corresponding kernel has a particularly intuitive form, and for inputs $x^{(s)}$ and $x^{(t)}$ can be computed efficiently as:

$$K(x^{(s)}, x^{(t)}) = \left( a(x^{(s)})^T a(x^{(t)}) \right) \cdot \left( r^{(s)T} r^{(t)} \right),$$

where $r = x - \sum_j a_j b_j$ represents the residual vector corresponding to the MAP features $a$. Note that the first term in the product above is simply the inner product of the MAP feature vectors, and corresponds to using a linear kernel over the learned sparse representation. The second term, however, compares the two residuals as well.
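Read literally, the kernel above multiplies an inner product of activations by an inner product of residuals. A small NumPy sketch follows, with a random placeholder basis matrix and scikit-learn's SparseCoder standing in for the paper's solver for problem (2):

```python
import numpy as np
from sklearn.decomposition import SparseCoder

rng = np.random.default_rng(0)
n, s = 196, 512
B = rng.standard_normal((s, n))                   # placeholder for learned bases, one basis per row
B /= np.linalg.norm(B, axis=1, keepdims=True)     # ||b_j||_2 = 1

coder = SparseCoder(dictionary=B, transform_algorithm="lasso_lars", transform_alpha=1.0)

def sparse_coding_kernel(x_s, x_t):
    """K(x_s, x_t) = (a(x_s) . a(x_t)) * (r_s . r_t), with r = x - sum_j a_j b_j (Section 5)."""
    X = np.vstack([x_s, x_t])
    A = coder.transform(X)          # MAP activations a(x), i.e., solutions of problem (2)
    R = X - A @ B                   # residual vectors
    return float(A[0] @ A[1]) * float(R[0] @ R[1])

k = sparse_coding_kernel(rng.standard_normal(n), rng.standard_normal(n))
```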

We evaluate the performance of the learned kernel on the handwritten character recognition domain, since it does not require any feature aggregation. As a baseline, we compare against all choices of standard kernels (linear, polynomials of degree 2 and 3, RBF) and features (raw features, PCA, sparse coding features). Table 8 shows that an SVM with the new kernel outperforms the best choice of standard kernels and features, even when that best combination was picked on the test data (thus giving the baseline a slightly unfair advantage). In summary, using the Fisher kernel derived from the generative model described above, we obtain a classifier customized specifically to the distribution of sparse coding features.

6. Discussion and Other Methods

In the semi-supervised learning setting, several authors have previously constructed features using data from the same domain as the labeled data (e.g., Hinton & Salakhutdinov, 2006). In contrast, self-taught learning poses a harder problem, and requires that the structure learned from unlabeled data be useful for representing data from the classification task.

13 In our experiments, the marginalized kernel (Tsuda et al., 2002), which takes an expectation over $a$ (computed by MCMC sampling) rather than the MAP approximation, did not perform better.


Table 8. Classification accuracy using the learned sparse coding kernel in the handwritten characters domain, compared with the accuracy using the best choice of standard kernel and input features. (See text for details.)

Training set size | Standard kernel | Sparse coding
100 | 35.4% | 41.8%
500 | 54.8% | 62.6%
1000 | 61.9% | 68.9%

Several existing methods for unsupervised and semi-supervised learning can be applied to self-taught learning, though many of them do not lead to good performance. For example, consider the task of classifying images of English characters (a-z), using unlabeled images of digits (0-9). For such a task, manifold learning algorithms such as ISOMAP (Tenenbaum et al., 2000) or LLE (Roweis & Saul, 2000) can learn a low-dimensional manifold using the unlabeled data (digits); however, these manifold representations do not generalize straightforwardly to the labeled inputs (English characters) that are dissimilar to any single unlabeled input (digit). We believe that these and several other learning algorithms, such as auto-encoders (Hinton & Salakhutdinov, 2006) or non-negative matrix factorization (Hoyer, 2004), might be modified to make them suitable for self-taught learning.

We note that even though semi-supervised learning was originally defined with the assumption that the unlabeled and labeled data follow the same class labels (Nigam et al., 2000), it is sometimes conceived of as "learning with labeled and unlabeled data." Under this broader definition of semi-supervised learning, self-taught learning would be an instance (a particularly widely applicable one) of it.

Examining the last two decades of progress in machine learning, we believe that the self-taught learning framework introduced here represents the natural extrapolation of a sequence of machine learning problem formalisms posed by various authors (starting from purely supervised learning, through semi-supervised learning, to transfer learning), in which researchers have considered problems making increasingly little use of expensive labeled data, and using less and less related data. In this light, self-taught learning can also be described as "unsupervised transfer" or "transfer learning from unlabeled data." Most notably, Ando & Zhang (2005) propose a method for transfer learning that relies on using hand-picked heuristics to generate labeled secondary prediction problems from unlabeled data. It might be possible to adapt their method to several self-taught learning applications. It is encouraging that our simple algorithms produce good results across a broad spectrum of domains. With this paper, we hope to initiate further research in this area.

Acknowledgments

We give warm thanks to Bruno Olshausen, Geoff Hinton and the anonymous reviewers for helpful comments. This work was supported by the DARPA transfer learning program under contract number FA8750-05-2-0249, and by ONR under award number N00014-06-1-0828.

References

Ando, R. K., & Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR, 6, 1817-1853.

Baxter, J. (1997). Theoretical models of learning to learn. In T. Mitchell and S. Thrun (Eds.), Learning to learn.

Blei, D., Ng, A. Y., & Jordan, M. (2002). Latent Dirichlet allocation. NIPS.

Caruana, R. (1997). Multitask learning. Machine Learning, 28.

Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. J. Am. Soc. Info. Sci., 41, 391-407.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Ann. Stat., 32, 407-499.

Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. CVPR Workshop on Generative-Model Based Vision.

Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313, 504-507.

Holub, A., Welling, M., & Perona, P. (2005). Combining generative models and Fisher kernels for object class recognition. ICCV.

Hoyer, P. O. (2004). Non-negative matrix factorization with sparseness constraints. JMLR, 5, 1457-1469.

Jaakkola, T., & Haussler, D. (1998). Exploiting generative models in discriminative classifiers. NIPS.

Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR.

Lee, H., Battle, A., Raina, R., & Ng, A. Y. (2007). Efficient sparse coding algorithms. NIPS.

Ng, A. Y. (2004). Feature selection, L1 vs. L2 regularization, and rotational invariance. ICML.

Nigam, K., McCallum, A., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103-134.

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607-609.

Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290.

Serre, T., Wolf, L., & Poggio, T. (2005). Object recognition with features inspired by visual cortex. CVPR.

Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319-2323.

Thrun, S. (1996). Is learning the n-th thing any easier than learning the first? NIPS.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B., 58, 267-288.

Tsuda, K., Kin, T., & Asai, K. (2002). Marginalized kernels for biological sequences. Bioinformatics, 18.