
Transcript of Siam Talk 2003


    SIAM WORKSHOP MAY 2003

RF/tools

A Class of Two-eyed Algorithms

Leo Breiman, Statistics Department, [email protected]


The Data Avalanches and the Genesis of Smart Algorithms

It's hard to construct a terabyte database if each item has to be entered by hand.

The avalanche was started by the advent of high-speed computing and the electronic entering of data.

Aided and assisted by technology growth such as:

higher resolution and more bandwidth in satellite pictures;

more weather stations gathering more detailed information;

centralized data warehousing for commercial firms.


    The Data Avalanche--A Small Example

San Francisco Chronicle, October 24, 2002:

Since Neurome was founded three years ago, its appetite for powerful computers and more data storage has grown so rapidly that it now uses over 100 gigabytes a week.

"And that's just taking it easy," said Warren Young, a co-founder of the biotech company that processes data from digital images of the brain for pharmaceutical research firms. "Otherwise we'll run out of space."


    Current Avalanches and Problems

Multi-terabyte data bases are becoming common. The interesting data bases are the ones used to answer significant questions.

Astronomy data base of one billion stellar objects--soon to be two billion.

What kinds of objects are there in this data base? Dwarf stars, quasars...? Find objects dissimilar to anything seen before.

NSA data bases of millions of documents.

Which of these are relevant to terrorism in Macedonia?

Merck data base of the molecular bond structure of millions of chemicals.

Which of these are biologically active?

Satellite photos: a terabyte data base.

Has anything changed since the last flyover--and of what nature?

El Nino data base of hundreds of gigabytes over 30-40 years.

How accurately can El Nino activity be predicted?


    Smart Algorithms

Getting answers to interesting questions from a large data base cannot be done by human inspection.

It requires algorithms initially trained by humans.

They have to be fast and accurate. Mistakes are often costly.

In document recognition, if the algorithm does not recognize a relevant document, important information may be missed.

In drug searches, incorrectly labeling too many compounds as potentially active leads to high laboratory costs.


    How Algorithms are Constructed

Algorithms don't start off being smart. They have to be trained by human judgment.

The human has to offer the algorithm examples of what it needs to distinguish between.

Astronomers have to offer examples of novas, dwarf stars, sun-like stars, etc.

To distinguish between relevant and irrelevant documents, the algorithm has to be shown examples of both as judged by human readers.

In the language of machine learning, a training set has to be offered to the algorithm.


Training Sets and Prediction

The most common situation is that there is a single response variable y to be predicted, e.g.

relevant / not relevant; dwarf star, supernova, etc.

The training set consists of a number of examples, each assigned a y-value by human judgment,

plus a set of M measurements on each example, denoted x(1), x(2), ..., x(M).

For instance, in document recognition, the measurements are the counts in the document of the words on a specified list.

Each stellar object recorded in the sky has measurements on its spectrum, shape, extent, etc.

The algorithm then proceeds to make predictions for the objects it has not seen, using only the measurements x on an object to form a predicted y.

Its error rate on the objects whose identity (y) is unknown is called the generalization error (error for short).

The lower the generalization error, the smarter the algorithm.
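To make this setup concrete, here is a minimal sketch of my own (not from the talk): the training set is a matrix X holding the M measurements x(1), ..., x(M) for each example plus the human-assigned labels y, and the error rate on held-out examples estimates the generalization error. The synthetic data, the scikit-learn classifier, and all names are illustrative assumptions standing in for the f77 tools.

```python
# Minimal sketch of a training set and an estimate of generalization error.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                 # 500 examples, M = 10 measurements each
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # labels, standing in for human judgment

# hold out half the examples to play the role of "objects not yet seen"
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

gen_error = np.mean(clf.predict(X_test) != y_test)   # estimated generalization error
print(f"estimated generalization error: {gen_error:.3f}")
```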


Two-eyed Data Algorithms

Two algorithms currently give the most accurate predictions in classification and regression:

1. Support Vector Machines

2. Random Forests

They have their eye on accurate prediction.

Think of nature as a black box:

x --> [ nature ] --> y

In prediction, the eye is not concerned with the inside of the black box.

Given an input x, the goal of the algorithm is to produce a prediction y as close as possible to the y that "nature" produces.

"Nature" = nature + the humans who classify what nature produces.


    The Second Eye

In most scientific applications it is not enough to have accurate predictions--more information is necessary.

Essential information consists of answers to:

What's going on inside the black box?

x --> [ nature ] --> y

For instance:

Analyzing microarray data, with thousands of variables and a sample size usually less than 100, is often posed as a classification problem.

But the goal of the analysis is to find the genes that are important in discriminating between the classes.


A Statistical Principle and Quandary

Inferring a mechanism for the black box is a highly risky and ambiguous venture. Nature's mechanisms are generally complex and cannot be summarized by a relatively simple stochastic model, even as a first approximation.

An important principle:

The better the model fits the data, the more sound the inferences about the black box are.

Suppose there is an algorithm f(x) that outputs an estimate ŷ of the true y for each value of x.

Then a measure of how well f fits the data is given by how close ŷ is to y, i.e. by the size of the prediction error (PE).

The lower the PE, the better the fit to the data.

Quandary: the most accurate prediction algorithms are also the most complex and inscrutable.

E.g. compare trying to understand 100 trees in an ensemble with understanding a single tree structure as in CART.


Two-eyed RF/tools

I am developing a class of algorithms called RF/tools that have both:

1. Excellent accuracy--the best currently available.

2. Good insights into the inside of the box.

There is a:

classification version (RF/cl)

regression version (RF/rg)

soon to be joined by a multiple dependent output version (RF/my).

These tools are free open source (f77) and available at:

www.stat.berkeley.edu/RFtools

with manuals for their use and interfaces to R and S+.

We estimate that there have been almost 3000 downloads since the algorithms were put on the web site.


    Outline of Where I'm Going

    What is RF?

A) The Right Eye (classification machine)

i) excellent accuracy

ii) scales up

iii) handles thousands of variables, many-valued categoricals, extensive missing values, badly unbalanced data sets

iv) gives an internal unbiased estimate of test set error as trees are added to the ensemble

v) cannot overfit


    B) The Left Eye (inside the black box)

    i) variable importance

    ii) outlier detection

    iii) data views via scaling

    From Supervised to Unsupervised

How about data with no class labels, or no regression y-values?

Outliers? Clustering? Missing value replacement?

By a slick device RF/cl can be turned into an algorithm for analyzing unsupervised data.


What is RF? RF = Random Forests

A random forest (RF) is a collection of tree predictors

{ f(x, Θ_k), k = 1, 2, ..., K }

where the Θ_k are i.i.d. random vectors.

The forest prediction is the unweighted plurality of the class votes.

The Law of Large Numbers ensures convergence as K → ∞.

The test set error rates (modulo a little noise) are monotonically decreasing and converge to a limit.

That is: there is no overfitting as the number of trees increases.

The key to accuracy is low correlation and low bias.

To keep bias low, trees are grown to maximum depth.
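A minimal sketch of the prediction rule just stated (mine, not Breiman's f77 code): the forest prediction is the unweighted plurality of the class votes cast by the K trees. The `trees` list, the 0..C-1 class coding, and the scikit-learn-style `.predict` interface are assumptions made for illustration.

```python
# Unweighted plurality vote over the tree predictions f(x, Theta_k).
import numpy as np

def forest_predict(trees, X, n_classes):
    """trees: fitted tree objects with .predict; classes coded 0..n_classes-1."""
    votes = np.zeros((X.shape[0], n_classes))
    for t in trees:                              # each tree casts one vote per case
        pred = t.predict(X).astype(int)
        votes[np.arange(X.shape[0]), pred] += 1
    return votes.argmax(axis=1)                  # class with the most votes wins
```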


Randomization for Low Correlation

To keep correlation low, the current version uses this randomization:

i) Each tree is grown on a bootstrap sample of the training set.

ii) A number m is specified, much smaller than the total number of variables M.

iii) At each node, m variables are selected at random out of the M.

iv) The split used is the best split on these m variables.

The only adjustable parameter in RF is m.

The default value for m is √M.

But RF is not sensitive to the value of m over a wide range.
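A sketch of this randomization under stated assumptions (scikit-learn trees stand in for the f77 code, and the √M default and all names are mine): each tree is grown on a bootstrap sample, and the `max_features=m` option makes the tree consider only m randomly chosen variables at each node, with trees grown to maximum depth.

```python
# Grow an RF-style ensemble: bootstrap sample per tree, m random candidate
# variables at each node (via max_features), trees grown to maximum depth.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=100, m=None, seed=0):
    rng = np.random.default_rng(seed)
    N, M = X.shape
    m = m or max(1, int(np.sqrt(M)))            # default m = sqrt(M) (assumed)
    trees, oob_masks = [], []
    for _ in range(n_trees):
        boot = rng.integers(0, N, size=N)       # bootstrap sample of the training set
        oob = np.setdiff1d(np.arange(N), boot)  # ~1/3 of cases are out-of-bag
        tree = DecisionTreeClassifier(max_features=m, max_depth=None,
                                      random_state=int(rng.integers(1 << 31)))
        tree.fit(X[boot], y[boot])
        trees.append(tree)
        oob_masks.append(oob)
    return trees, oob_masks                     # oob_masks are reused by later sketches
```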


Two Key Byproducts

The out-of-bag test set

For every tree grown, about one-third of the cases are out-of-bag (out of the bootstrap sample). Abbreviated oob.

The oob samples can serve as a test set for the tree grown on the non-oob data.

This is used to:

i) Form unbiased estimates of the forest test set error as the trees are added.

ii) Form estimates of variable importance.


    The node proximities

Since the trees are grown to maximum depth, the terminal nodes are small.

For each tree grown, pour all the data down the tree.

If two data points xn and xk occupy the same terminal node, increase prox(xn, xk) by one.

At the end of forest growing, these proximities form an intrinsic similarity measure between pairs of data vectors.

    This is used to:

    i) Estimate missing values.

    ii) Give informative data views via metric scaling.

    iii) Locate outliers.
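A sketch of the proximity computation (my reconstruction; it assumes the `trees` list from the forest-growing sketch above, with scikit-learn's `.apply` giving the terminal node of each case):

```python
# Accumulate proximity counts: cases sharing a terminal node get +1 per tree.
import numpy as np

def proximity_counts(trees, X):
    N = X.shape[0]
    prox = np.zeros((N, N))
    for tree in trees:
        leaves = tree.apply(X)                   # terminal node id for every case
        for leaf in np.unique(leaves):
            idx = np.where(leaves == leaf)[0]
            prox[np.ix_(idx, idx)] += 1.0        # same terminal node -> increment pair
    return prox                                  # raw counts; normalization comes later
```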


Properties as a classification machine

a) excellent accuracy

b) scales up

c) handles thousands of variables, many-valued categoricals, extensive missing values, badly unbalanced data sets

d) gives an internal unbiased estimate of test set error as trees are added to the ensemble

e) cannot overfit (already discussed)


    Accuracy

My paper "Random Forests", Machine Learning (2001) 45, 5-32, gives comparisons:

The test set accuracy of RF is compared to Adaboost on a number of benchmark data sets. RF is slightly better.

Adaboost is very sensitive to noise in the labels.

RF is not.

Compared to SVMs:

RF is not as accurate as SVMs on pixel image data.

It is superior in document classification.

On benchmark data sets commonly used in Machine Learning, SVMs have error rates comparable to RF.

Based on my present knowledge, RF is competitive in accuracy with the best classification algorithms out there now.

Ditto for regression.


    Scaling up to Large Data Sets

The analysis of RF shows that its compute time is

c · NT · M · N log(N)

NT = number of trees

M = number of variables

N = number of instances

The constant c was estimated with a run on a data set with 15,000 instances and 16 variables.

Using this value leads to the estimate that growing a forest of 100 trees for a data set with 100,000 instances and 1000 variables would take three hours on my 800 MHz machine.
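The arithmetic behind such an estimate can be sketched as below. Only the growth law c · NT · M · N log(N) comes from the talk; the reference timing value and the assumption that both runs use 100 trees are placeholders of mine.

```python
# Extrapolate forest-growing time from a reference run using c*NT*M*N*log(N).
import math

def forest_time(ref_seconds, ref, target):
    """ref and target are (n_trees, n_variables, n_instances) tuples."""
    def work(nt, m, n):
        return nt * m * n * math.log(n)
    c = ref_seconds / work(*ref)                 # solve for the constant c
    return c * work(*target)

# reference: 100 trees, 16 variables, 15,000 instances (timing value assumed)
# target:    100 trees, 1000 variables, 100,000 instances
est = forest_time(ref_seconds=60.0, ref=(100, 16, 15_000),
                  target=(100, 1000, 100_000))
print(f"estimated time: {est / 3600:.1f} hours")
```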

Parallelizing is Trivial

Each tree is grown independently of the outcomes of the other trees. If each of J processors is given the job of growing K trees, there is no need for interprocessor communication until all have finished their runs and the results are aggregated.


    Number of Variables

Unlike other algorithms, where variable selection has to be used if there are more than a few hundred variables, RF thrives on variables: the more the merrier.

RF has been run on genetic data with thousands of variables and no variable selection and has given excellent results.

But there are limits. If the number of noisy variables becomes too large, variable selection will have to be used.

But the threshold for RF is much higher than for non-ensemble methods.


    Handling Categorical Variables

Handling categorical values has always been difficult in many classification algorithms.

For instance, given a categorical variable with 20,000 values, how is it to be dealt with? A customary way is to code it into 20,000 0-1 variables, a nasty procedure which substitutes 20,000 variables for one.

This occurs in practice--in document classification, one of the variables may be a list of 20,000 words.

In two-class problems, RF handles categoricals in the efficient way that CART does--with a fast O(N) algorithm to find the best split of the categoricals at each node.

If there are more than two classes, a fast iterative algorithm is used.
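The talk does not spell the O(N) device out; the classical CART trick for two classes, sketched below under that assumption, is to order the categories by their fraction of class 1 and then search only the splits that cut this ordered list, which takes linear time after sorting. All names in the sketch are mine.

```python
# Two-class categorical split in the CART style: order categories by class-1
# proportion, then evaluate only the "cut the ordered list" splits.
from collections import defaultdict

def best_categorical_split(categories, labels):
    """categories: category value per case; labels: 0/1 class labels."""
    n = defaultdict(int)          # cases per category
    n1 = defaultdict(int)         # class-1 cases per category
    for c, y in zip(categories, labels):
        n[c] += 1
        n1[c] += y
    order = sorted(n, key=lambda c: n1[c] / n[c])   # by proportion of class 1
    total, total1 = len(labels), sum(labels)
    best_set, best_gini, left_n, left_1 = None, float("inf"), 0, 0
    for i, c in enumerate(order[:-1]):              # each ordered cut point
        left_n += n[c]
        left_1 += n1[c]
        right_n, right_1 = total - left_n, total1 - left_1
        gini = sum(size * 2 * (k / size) * (1 - k / size)
                   for size, k in ((left_n, left_1), (right_n, right_1)))
        if gini < best_gini:
            best_gini, best_set = gini, set(order[:i + 1])
    return best_set, best_gini                      # categories sent to the left child
```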


    Replacing Missing Values

    RF has two ways of replacing missing values.

The Cheap Way

Replace every missing value in the mth coordinate by the median of the non-missing values of that coordinate, or by the most frequent value if it is categorical.

The Right Way

This is an iterative process. If the mth coordinate in instance xn is missing, it is estimated by a weighted average over the instances xk with non-missing mth coordinate, where the weight is prox(xn, xk).

The replaced values are used in the next iteration of the forest, which computes new proximities.

The process is automatically stopped when no more improvement is possible or when five iterations are reached.
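A hedged sketch of both schemes for numeric variables (my own numpy transcription; `prox` is assumed to come from a proximity routine like the one sketched earlier, and one call to `fill_by_proximity` is one pass of the iterative process):

```python
# Missing value replacement: the cheap way (column medians) and one pass of
# the right way (proximity-weighted averages over cases with the value present).
import numpy as np

def fill_cheap(X):
    """Replace NaNs in each column by the median of that column's observed values."""
    X = X.copy()
    for m in range(X.shape[1]):
        col = X[:, m]
        col[np.isnan(col)] = np.nanmedian(col)
    return X

def fill_by_proximity(X, prox, miss_mask):
    """One iteration of the 'right way' for numeric coordinates."""
    X = X.copy()
    for case, m in zip(*np.where(miss_mask)):
        donors = ~miss_mask[:, m]                # cases with coordinate m observed
        w = prox[case, donors]
        if w.sum() > 0:
            X[case, m] = np.average(X[donors, m], weights=w)
    return X

# In the full procedure this pass alternates with re-growing the forest (and new
# proximities) until no further improvement, or at most five iterations.
```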


    An Example

The training set for the satellite data has 4434 instances, 36 variables and 6 classes. A test set is available with 2000 instances.

With 200 trees in the forest, the test set error is 9.8%.

Then 20%, 40%, 60% and 80% of the data were deleted as missing (randomly).

Both the cheap fix and the right fix were applied, and the test set error computed.

Test Set Error (%)

Missing %   20%    40%    60%    80%
cheap       11.8   13.4   15.7   20.7
right       10.7   11.3   12.5   13.5

It's surprising that with 80% missing data the error rate only rises from 9.8% to 13.5%.

    I've gotten similar results on other data sets.


    Class Weights

A data set is unbalanced if one or more classes--often the classes of interest--are severely underrepresented.

These will tend to have higher misclassification rates than the larger classes.

In some data sets, even though the data set is not badly unbalanced, some classes may have a larger error rate than the others.

In RF it is possible to set class weights that act "as if" they are increasing the size of any class that gets weight larger than one.

The result is that the class weights can be set to get any desired distribution of errors among the classes.


The OOB Test Set Error Estimate

For every tree grown, about one-third of the cases are oob (out-of-bag: out of the bootstrap sample).

Put these oob cases down the corresponding tree and get a predicted classification for them.

For each case n, take the plurality of the predicted classifications over all the trees for which n was oob to get a test set estimate ŷn for yn.

Averaging the loss over all n gives the oob test set estimate of prediction error.

Runs on many data sets have shown it to be unbiased, with error on the order of that from using a test set of the same size as the training set.

It is computed at user-set intervals in the forest construction process and output to the monitor.
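A sketch of the oob estimate (mine; it assumes the `trees` and `oob_masks` lists from the forest-growing sketch and classes coded 0..C-1): each case is voted on only by the trees for which it was out-of-bag.

```python
# OOB error estimate: plurality vote per case over the trees where it was oob.
import numpy as np

def oob_error(trees, oob_masks, X, y, n_classes):
    N = X.shape[0]
    votes = np.zeros((N, n_classes))
    for tree, oob in zip(trees, oob_masks):
        pred = tree.predict(X[oob]).astype(int)
        votes[oob, pred] += 1                    # vote only where the case was oob
    seen = votes.sum(axis=1) > 0                 # cases oob for at least one tree
    y_hat = votes[seen].argmax(axis=1)
    return np.mean(y_hat != y[seen])             # oob estimate of the error rate
```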


Accuracy vs. Interpretability

Nature forms the outputs y from the inputs x by means of a black box with a complex and unknown interior:

x --> [ nature ] --> y

The current most accurate prediction methods are also complex black boxes:

x --> [ neural nets / forests / support vectors ] --> y

Two black boxes, of which ours seems only slightly less inscrutable than nature's.

My biostatistician friends tell me, "Doctors can interpret logistic regression."

There is no way they can interpret a black box containing fifty trees hooked together.

In a choice between accuracy and interpretability, they'll go for interpretability.


Accuracy vs. Interpretability: A False Canard

Framing the question as a choice between accuracy and interpretability is an incorrect interpretation of what the goal of a statistical analysis is.

The point of a model is to get useful information about the relation between the response and predictor variables, as well as other information about the data structure.

Interpretability is a way of getting information.

But a model does not have to be simple to provide reliable information about the relation between predictor and response variables.

The goal is not interpretability, but accurate information.

RF can supply more and better information about the inside of the black box than any current "interpretable" model.


The Left Eye--Tools for Black Box Inspection

i) Estimating variable importance, i.e. which variables are instrumental in the classification.

ii) Data views via proximities and metric scaling.

iii) Outlier detection via proximities.

iv) A device for doing similar analyses on unsupervised data.


    Variable Importance.

For the nth case in the data, its margin at the end of a run is the proportion of votes for its true class minus the maximum of the proportions of votes for each of the other classes.

To estimate the importance of the mth variable:

i) In the oob cases for the kth tree, randomly permute all values of the mth variable.

ii) Put these new variable values down the kth tree and get classifications.

iii) Compute the margin.

The measure of importance of the mth variable is the average lowering of the margin across all cases when the mth variable is randomly permuted.
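A sketch of this permutation measure (my reconstruction, again leaning on the `trees`/`oob_masks` structures and 0..C-1 class coding assumed in the earlier sketches): compute each case's margin from its oob votes, then see how much the margin drops when variable m is scrambled in the oob cases.

```python
# Permutation importance: average lowering of the per-case margin when the
# mth variable is randomly permuted in the oob cases of each tree.
import numpy as np

def oob_margins(trees, oob_masks, X, y, n_classes, permute_var=None, rng=None):
    N = X.shape[0]
    votes = np.zeros((N, n_classes))
    for tree, oob in zip(trees, oob_masks):
        Xo = X[oob].copy()
        if permute_var is not None:              # scramble variable m among oob cases
            Xo[:, permute_var] = rng.permutation(Xo[:, permute_var])
        pred = tree.predict(Xo).astype(int)
        votes[oob, pred] += 1
    p = votes / np.maximum(votes.sum(axis=1, keepdims=True), 1)
    true_p = p[np.arange(N), y]
    others = p.copy()
    others[np.arange(N), y] = -np.inf
    return true_p - others.max(axis=1)           # margin per case (0 if never oob)

def variable_importance(trees, oob_masks, X, y, n_classes, seed=0):
    rng = np.random.default_rng(seed)
    base = oob_margins(trees, oob_masks, X, y, n_classes)
    return np.array([np.mean(base - oob_margins(trees, oob_masks, X, y, n_classes,
                                                permute_var=m, rng=rng))
                     for m in range(X.shape[1])])
```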


An Example--Hepatitis Data

Data: survival (123) or non-survival (32) of 155 hepatitis patients, with 19 covariates.

Analyzed by Diaconis and Efron in the 1983 Scientific American.

The original Stanford Medical School analysis concluded that the important variables were numbers 6, 12, 14, 19.

The error rate for logistic regression is 17.4%.

Efron and Diaconis drew 500 bootstrap samples from the original data set and looked for the important variables in each bootstrapped data set.

Their conclusion: "Of the four variables originally selected not one was selected in more than 60 percent of the samples. Hence the variables identified in the original analysis cannot be taken too seriously."


    Analysis Using RF

The overall error rate is 14.2%. There is a 53% error in class 1 and 4% in class 2. The variable importances are graphed below:

[Figure: Variable importance, equal weights--margin decrease vs. variable (1-19).]

The three most important variables are 11, 12, 17.

Since the class of interest is non-survival, which with equal weights has a high error rate, the class weight of class 1 was increased to 3.

The run gave an overall error rate of 22%, with a class 1 error of 19% and 23% for class 2.

The variable importances for this run are graphed below:


[Figure: Variable importance, 3:1 weights--margin decrease vs. variable (1-19).]

Variable 11 is the most important variable in separating non-survival from survival.

The standard procedure when fitting data models such as logistic regression is to delete variables.

Diaconis and Efron (1983) state, "...statistical experience suggests that it is unwise to fit a model that depends on 19 variables with only 155 data points available."

Newer methods in Machine Learning thrive on variables--the more the better. The next example is an illustration.


    Microarray Analysis

Random forests was run on a microarray lymphoma data set with three classes, a sample size of 81 and 4682 variables (genes), without any variable selection.

The error rate was low (1.2%).

What was also interesting from a scientific viewpoint was an estimate of the importance of each of the 4682 genes.

The graph below was produced by a run of random forests.

[Figure: Variable Importance Measure 2--importance vs. variable (genes 1-4682).]


An Intrinsic Proximity Measure and Clustering

Since an individual tree is unpruned, the terminal nodes will contain only a small number of instances.

Run all cases in the training set down the tree. If case i and case j both land in the same terminal node, increase the proximity between i and j by one.

At the end of the run, the proximities are normalized by dividing by twice the number of trees in the forest.

To cluster, use the proximity measures.
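The talk does not name a particular clustering algorithm, so the sketch below is one reasonable choice of mine: normalize the proximity counts, turn them into dissimilarities 1 - prox, and apply average-linkage hierarchical clustering from scipy.

```python
# Cluster from RF proximities: dissimilarity = 1 - normalized proximity.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_from_proximities(prox_counts, n_trees, n_clusters=2):
    prox = prox_counts / n_trees                 # scale counts into [0, 1]
    dist = 1.0 - prox                            # small proximity -> large distance
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)   # condensed form expected by linkage
    Z = linkage(condensed, method="average")     # average-linkage hierarchical tree
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```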


Example--Bupa Liver Disorders

This is a two-class biomedical data set consisting of the covariates:

1. mcv: mean corpuscular volume
2. alkphos: alkaline phosphatase
3. sgpt: alanine aminotransferase
4. sgot: aspartate aminotransferase
5. gammagt: gamma-glutamyl transpeptidase
6. drinks: number of half-pint equivalents of alcoholic beverages drunk per day

The first five attributes are the results of blood tests thought to be related to liver functioning.

The 345 patients are classified into two classes by the severity of their liver disorders.

The class populations are 145 and 200 (severe).

The misclassification error rate is 28% in an RF run.

Class 1 has a 50% error rate, with a rate of 12% for class 2.

Setting the weight of class 2 to 1.4 gives rates of 28% and 31% for classes 1 and 2.


    Variable Importance

[Figure: Variable importance, Bupa liver data--margin decrease vs. variable (1-6).]

Blood tests 3 and 5 are the most important, followed by test 4.


    Clustering

Using the proximity measure given by RF to cluster, there are two class #2 clusters.

In each of these clusters, the average of each variable is computed and plotted:

[Figure 3: Cluster variable averages.]

Something interesting emerges.

The class two subjects consist of two distinct groups:

those that have high scores on blood tests 3, 4, and 5;

those that have low scores on those tests.


Scaling Coordinates

The proximities between cases n and k form a matrix {prox(n,k)}.

The values 1 - prox(n,k) are squared distances between vectors x(n), x(k) in a Euclidean space of dimension not greater than the number of cases.

The goal is to project the x(n) down into a low-dimensional space while preserving the distances between them to the extent possible.

In metric scaling, the idea is to approximate the vectors x(n) by the first few scaling coordinates.

These correspond to eigenvectors of a modified prox matrix.

The two-dimensional plots of the ith scaling coordinate vs. the jth often give useful information about the data.

The most useful is usually the graph of the 2nd vs. the 1st.

We illustrate with three examples. The first is the graph of the 2nd vs. 1st scaling coordinates for the liver data.
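Since 1 - prox(n,k) behaves as a squared Euclidean distance, classical metric scaling applies; the sketch below is my reconstruction via the standard double-centering construction (the "modified prox matrix" in the f77 code may differ in detail). `prox` is assumed to be the normalized proximity matrix.

```python
# Classical metric scaling of the RF proximities: double-center the squared
# distances 1 - prox and take the leading eigenvectors as scaling coordinates.
import numpy as np

def scaling_coordinates(prox, n_coords=2):
    N = prox.shape[0]
    D2 = 1.0 - prox                               # squared distances, per the talk
    J = np.eye(N) - np.ones((N, N)) / N           # centering matrix
    B = -0.5 * J @ D2 @ J                         # double-centered inner-product matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_coords]     # largest eigenvalues first
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

# coords[:, 0] vs. coords[:, 1] gives the "2nd vs. 1st scaling coordinate"
# plots used in the examples that follow.
```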


[Figure: Metric scaling, liver data--2nd vs. 1st scaling coordinate, classes 1 and 2.]

The two arms of the class #2 data in this picture correspond to the two clusters found and discussed above.


    Microarray Data.

With 4682 variables, it is difficult to see how to cluster this data. Using proximities and the first two scaling coordinates gives this picture:

[Figure: Metric scaling, microarray data--2nd vs. 1st scaling coordinate, classes 1-3.]

    RF misclassifies one case.

This case is represented by the isolated point in the lower left-hand corner of the plot.


The Glass Data

The third example is the glass data, with 214 cases, 9 variables and 6 classes. Here is a plot of the 2nd vs. the 1st scaling coordinates:

[Figure: Metric scaling, glass data--2nd vs. 1st scaling coordinate, classes 1-6.]

None of the analyses to date have picked up this interesting and revealing structure of the data--compare the plots in Ripley's book.


Outlier Location

Outliers are defined as cases having small proximities to all other cases.

Since the data in some classes are more spread out than in others, outlyingness is defined only with respect to the other data in the same class as the given case.

To define a measure of outlyingness:

i) Compute, for a case n, the sum of the squares of prox(n,k) for all k in the same class as case n.

ii) Take the inverse of this sum--it will be large if the proximities prox(n,k) from n to the other cases k in the same class are generally small. Denote this quantity by out(n).

iii) For all n in the same class, compute the median of the out(n), and then the mean absolute deviation from the median.

Subtract the median from each out(n) and divide by the deviation to give a normalized measure of outlyingness. Values less than zero are set to zero.

Generally, a value above 10 is reason to suspect the case of being an outlier.
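A direct numpy transcription of these steps (mine, not the f77 implementation; the diagonal prox(n,n) terms are excluded, and small constants guard against division by zero):

```python
# Outlyingness: inverse of the within-class sum of squared proximities,
# normalized by the class median and mean absolute deviation.
import numpy as np

def outlyingness(prox, labels):
    raw = np.zeros(len(labels))
    out = np.zeros(len(labels))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        P = prox[np.ix_(idx, idx)].copy()
        np.fill_diagonal(P, 0.0)                         # ignore self-proximity
        raw[idx] = 1.0 / np.maximum((P ** 2).sum(axis=1), 1e-12)
        med = np.median(raw[idx])
        mad = np.mean(np.abs(raw[idx] - med))            # mean absolute deviation
        out[idx] = np.maximum((raw[idx] - med) / np.maximum(mad, 1e-12), 0.0)
    return out          # values above about 10 flag suspected outliers
```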


Outlyingness Graph--Microarray Data

[Figure: Outlyingness measure, microarray data--outlyingness vs. sequence number, classes 1-3.]

There are two possible outliers--one is the first case in class 1, the other is the first case in class 2.


Outlyingness Graph--Pima Indians Data

This data set has 768 cases, 8 variables and 2 classes. It has often been used as an example in Machine Learning research but is suspected of containing a substantial number of outliers.

[Figure: Outlyingness measure, Pima data--outlyingness vs. sequence number, classes 1 and 2.]

If 10 is used as a cutoff point, there are 12 cases suspected of being outliers.


From Supervised to Unsupervised

Unsupervised data consists of N vectors {x(n)} in M dimensions.

Desired:

i) Clustering

ii) Outlier detection

iii) Missing value replacement

All this can be done using RF:

i) Give class label 1 to the original data.

ii) Create a synthetic class 2 by sampling independently from the one-dimensional marginal distributions of the original data.


The Synthetic Second Class

More explicitly:

If the value of the mth coordinate of the original data for the nth case is x(m,n), then a case in the synthetic data is constructed as follows:

i) Its first coordinate is sampled at random from the N values x(1,n).

ii) Its second coordinate is sampled at random from the N values x(2,n), and so on.

The synthetic data set has the distribution of M independent variables, where the distribution of the mth variable is the same as the univariate distribution of the mth variable in the original data.
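A sketch of this construction (mine; the names and the 0/1 label coding are assumptions): each coordinate of a synthetic case is drawn independently, with replacement, from the corresponding column of the original data, so the synthetic class has the same marginals but no dependence structure.

```python
# Build the two-class problem: original data = class 0, synthetic class with
# independently sampled marginals = class 1.
import numpy as np

def make_unsupervised_two_class(X, seed=0):
    rng = np.random.default_rng(seed)
    N, M = X.shape
    X_syn = np.column_stack([rng.choice(X[:, m], size=N, replace=True)
                             for m in range(M)])    # independent marginal sampling
    X_all = np.vstack([X, X_syn])
    y_all = np.concatenate([np.zeros(N, dtype=int), np.ones(N, dtype=int)])
    return X_all, y_all

# Running RF on (X_all, y_all) and keeping the proximities of the original
# cases then supports the clustering, scaling and outlier analyses described earlier.
```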


The Synthetic Two-Class Problem

There are now two classes, where the second class has been constructed by destroying all dependencies in the original unsupervised data.

When this two-class data is run through RF, a high misclassification rate--say over 40%--implies that there is not much dependence structure in the original data.

That is, its structure is largely that of M independent variables--not a very interesting distribution.

But if there is a strong dependence structure between the variables in the original data, the error rate will be low.

In this situation, the output of RF can be used to learn something about the structure of the data. Following are some examples.


Clustering the Glass Data

All class labels were removed from the glass data.

The result was unsupervised data consisting of 214 nine-dimensional vectors, which were labeled class 1. A second, synthetic data set was constructed.

The two-class problem was run through RF. Proximities were computed for class 1 only, and metric scaling projected the class 1 data onto two dimensions.

Here is the outcome:

[Figure: Glass data, unsupervised scaling--2nd vs. 1st scaling coordinate.]

This is a good replica of the original starfish-like projection using the class data.


    Application to the Microarray Data

Recall that the scaling plot of the microarray data showed three clusters--two larger ones in the lower left-hand and right-hand corners and a smaller one in the top middle.

Again, we erased the labels from the data and projected down an unsupervised view:

[Figure: Unsupervised projection of microarray data--2nd vs. 1st scaling coordinate.]

The three clusters are more diffuse but still apparent.


Finding Outliers

An Application to Chemical Spectra

Data graciously supplied by Merck consists of the first 468 spectral intensities in the spectra of 764 compounds.

The challenge presented by Merck was to find small, cohesive groups of outlying cases in this data.

There was excellent separation between the two classes, with an error rate of 0.5%, indicating strong dependencies in the original data.

We looked at outliers and generated this plot:


[Figure: Outlyingness measure, spectra data--outlyingness vs. sequence number.]

This plot gives no indication of outliers.

But outliers must be fairly isolated to show up in the outlier display.


Projecting Down

To search for outlying groups, scaling coordinates were computed. The plot of the 2nd vs. the 1st is below:

[Figure: Metric scaling, spectra data--2nd vs. 1st scaling coordinate.]

    The spectra fall into two main clusters.

There is a possibility of a small outlying group in the upper left-hand corner.

To get another picture, the 3rd scaling coordinate is plotted vs. the 1st.


[Figure: Metric scaling, spectra data--3rd vs. 1st scaling coordinate.]

The group in question is now in the lower left-hand corner, and its separation from the body of the spectra has become more apparent.


    RF/tools--An Ongoing Project

RF/tools are continually being upgraded.

My coworker, Adele Cutler, is writing some very illuminating graphical displays for the information output by RF.

The next version of RF/cl will have the ability to detect interactions between variables--a capability that the microarray people lobbied for.

Plus other improvements, both major and minor.

RF/rg needs upgrading to the same level as RF/cl.

A new program, RF/my, that can use two eyes on situations with multiple dependent outputs will be added next month.

CONCLUSION

There is no data mining program available that can provide the predictive accuracy and understanding of the data that RF/tools does.

And it's free!!
