
Transcript of Siam Talk 2003


    SIAM WORKSHOP MAY 2003

RF/tools

A Class of Two-eyed Algorithms

Leo Breiman, Statistics Department, [email protected]


The Data Avalanches and the Genesis of Smart Algorithms

It's hard to construct a terabyte database if each item has to be entered by hand.

The avalanche was started by the advent of high-speed computing and the electronic entering of data.

Aided and assisted by technology growth such as:

higher resolution and more bandwidth in satellite pictures;

more weather stations gathering more detailed information;

centralized data warehousing for commercial firms.


    The Data Avalanche--A Small Example

San Francisco Chronicle, October 24, 2002:

Since Neurome was founded three years ago, its appetite for powerful computers and more data storage has grown so rapidly that it now uses over 100 gigabytes a week.

"And that's just taking it easy," said Warren Young, a co-founder of the biotech company that processes data from digital images of the brain for pharmaceutical research firms. "Otherwise we'll run out of space."


    Current Avalanches and Problems

Multi-terabyte data bases are becoming common. The interesting data bases are the ones used to answer significant questions.

Astronomy data base of one billion stellar objects--soon to be two billion.

What kinds of objects are there in this data base? Dwarf stars, quasars...? Find objects dissimilar to anything seen before.

NSA data bases of millions of documents.

Which of these are relevant to terrorism in Macedonia?

Merck data base of the molecular bond structure of millions of chemicals.

Which of these are biologically active?

Satellite photos: a terabyte data base.

Has anything changed since the last flyover--and of what nature?

El Nino data base of hundreds of gigabytes over 30-40 years.

How accurately can El Nino activity be predicted?


    Smart Algorithms

Getting answers to interesting questions from a large data base cannot be done by human inspection.

It requires algorithms initially trained by humans.

They have to be fast and accurate. Mistakes are often costly.

In document recognition, if the algorithm does not recognize a relevant document, important information may be missed.

In drug searches, incorrectly labeling too many compounds as potentially active leads to high laboratory costs.


    How Algorithms are Constructed

Algorithms don't start off being smart. They have to be trained by human judgment.

The human has to offer the algorithm examples of what it needs to distinguish between.

Astronomers have to offer examples of novas, dwarf stars, sun-like stars, etc.

To distinguish between relevant and irrelevant documents, the algorithm has to be shown examples of both as judged by human readers.

In the language of machine learning, a training set has to be offered to the algorithm.


Training Sets and Prediction

The most common situation is that there is a single response variable y to be predicted, e.g.

relevant / not relevant; dwarf star, supernova, etc.

The training set consists of a number of examples, each assigned a y-value by human judgment,

plus a set of M measurements on each example, denoted x(1), x(2), ..., x(M).

For instance, in document recognition, the measurements are the counts in the document of the words on a specified list.

Each stellar object recorded in the sky has measurements on its spectrum, shape, extent, etc.

The algorithm then proceeds to make predictions for the objects it has not seen, using only the measurements x on an object to form a predicted y.

Its error rate on the objects whose identity (y) is unknown is called the generalization error (error for short).

The lower the generalization error, the smarter the algorithm.
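To make this setup concrete, here is a minimal sketch of my own (not from the talk): the training set is a matrix X holding the M measurements x(1), ..., x(M) for each example plus the human-assigned labels y, and the error rate on held-out examples estimates the generalization error. The synthetic data, the scikit-learn classifier, and all names are illustrative assumptions standing in for the f77 tools.

```python
# Minimal sketch of a training set and an estimate of generalization error.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                 # 500 examples, M = 10 measurements each
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # labels, standing in for human judgment

# hold out half the examples to play the role of "objects not yet seen"
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

gen_error = np.mean(clf.predict(X_test) != y_test)   # estimated generalization error
print(f"estimated generalization error: {gen_error:.3f}")
```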


Two-eyed Data Algorithms

Two algorithms currently give the most accurate predictions in classification and regression:

1. Support Vector Machines

2. Random Forests

They have their eye on accurate prediction.

Think of nature as a black box:

x --> [ nature ] --> y

In prediction, the eye is not concerned with the inside of the black box.

Given an input x, the goal of the algorithm is to produce a prediction y as close as possible to the y that "nature" produces.

"Nature" = nature + the humans who classify what nature produces.


    The Second Eye

In most scientific applications it is not enough to have accurate predictions--more information is necessary.

Essential information consists of answers to:

What's going on inside the black box?

x --> [ nature ] --> y

For instance:

Analyzing microarray data, with thousands of variables and a sample size usually less than 100, is often posed as a classification problem.

But the goal of the analysis is to find the genes that are important in discriminating between the classes.


A Statistical Principle and Quandary

Inferring a mechanism for the black box is a highly risky and ambiguous venture. Nature's mechanisms are generally complex and cannot be summarized by a relatively simple stochastic model, even as a first approximation.

An important principle:

The better the model fits the data, the more sound the inferences about the black box are.

Suppose there is an algorithm f(x) that outputs an estimate ŷ of the true y for each value of x.

Then a measure of how well f fits the data is given by how close ŷ is to y, i.e. by the size of the prediction error (PE).

The lower the PE, the better the fit to the data.

Quandary: the most accurate prediction algorithms are also the most complex and inscrutable.

E.g. compare trying to understand 100 trees in an ensemble with understanding a single tree structure as in CART.


Two-eyed RF/tools

I am developing a class of algorithms called RF/tools that have both:

1. Excellent accuracy--the best currently available.

2. Good insights into the inside of the box.

There is a:

classification version (RF/cl)

regression version (RF/rg)

soon to be joined by a multiple dependent output version (RF/my).

These tools are free open source (f77) and available at:

www.stat.berkeley.edu/RFtools

with manuals for their use and interfaces to R and S+.

We estimate that there have been almost 3000 downloads since the algorithms were put on the web site.


    Outline of Where I'm Going

    What is RF?

A) The Right Eye (classification machine)

i) excellent accuracy

ii) scales up

iii) handles thousands of variables, many-valued categoricals, extensive missing values, badly unbalanced data sets

iv) gives an internal unbiased estimate of test set error as trees are added to the ensemble

v) cannot overfit


    B) The Left Eye (inside the black box)

    i) variable importance

    ii) outlier detection

    iii) data views via scaling

    From Supervised to Unsupervised

How about data with no class labels, or no regression y-values?

Outliers? Clustering? Missing value replacement?

By a slick device RF/cl can be turned into an algorithm for analyzing unsupervised data.


What is RF? RF = Random Forests

A random forest (RF) is a collection of tree predictors

{ f(x, Θ_k), k = 1, 2, ..., K }

where the Θ_k are i.i.d. random vectors.

The forest prediction is the unweighted plurality of the class votes.

The Law of Large Numbers ensures convergence as K → ∞.

The test set error rates (modulo a little noise) are monotonically decreasing and converge to a limit.

That is: there is no overfitting as the number of trees increases.

The key to accuracy is low correlation and low bias.

To keep bias low, trees are grown to maximum depth.
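A minimal sketch of the prediction rule just stated (mine, not Breiman's f77 code): the forest prediction is the unweighted plurality of the class votes cast by the K trees. The `trees` list, the 0..C-1 class coding, and the scikit-learn-style `.predict` interface are assumptions made for illustration.

```python
# Unweighted plurality vote over the tree predictions f(x, Theta_k).
import numpy as np

def forest_predict(trees, X, n_classes):
    """trees: fitted tree objects with .predict; classes coded 0..n_classes-1."""
    votes = np.zeros((X.shape[0], n_classes))
    for t in trees:                              # each tree casts one vote per case
        pred = t.predict(X).astype(int)
        votes[np.arange(X.shape[0]), pred] += 1
    return votes.argmax(axis=1)                  # class with the most votes wins
```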


Randomization for Low Correlation

To keep correlation low, the current version uses this randomization:

i) Each tree is grown on a bootstrap sample of the training set.

ii) A number m is specified, much smaller than the total number of variables M.

iii) At each node, m variables are selected at random out of the M.

iv) The split used is the best split on these m variables.

The only adjustable parameter in RF is m.

The default value for m is √M.

But RF is not sensitive to the value of m over a wide range.
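A sketch of this randomization under stated assumptions (scikit-learn trees stand in for the f77 code, and the √M default and all names are mine): each tree is grown on a bootstrap sample, and the `max_features=m` option makes the tree consider only m randomly chosen variables at each node, with trees grown to maximum depth.

```python
# Grow an RF-style ensemble: bootstrap sample per tree, m random candidate
# variables at each node (via max_features), trees grown to maximum depth.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=100, m=None, seed=0):
    rng = np.random.default_rng(seed)
    N, M = X.shape
    m = m or max(1, int(np.sqrt(M)))            # default m = sqrt(M) (assumed)
    trees, oob_masks = [], []
    for _ in range(n_trees):
        boot = rng.integers(0, N, size=N)       # bootstrap sample of the training set
        oob = np.setdiff1d(np.arange(N), boot)  # ~1/3 of cases are out-of-bag
        tree = DecisionTreeClassifier(max_features=m, max_depth=None,
                                      random_state=int(rng.integers(1 << 31)))
        tree.fit(X[boot], y[boot])
        trees.append(tree)
        oob_masks.append(oob)
    return trees, oob_masks                     # oob_masks are reused by later sketches
```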


Two Key Byproducts

The out-of-bag test set

For every tree grown, about one-third of the cases are out-of-bag (out of the bootstrap sample). Abbreviated oob.

The oob samples can serve as a test set for the tree grown on the non-oob data.

This is used to:

i) Form unbiased estimates of the forest test set error as the trees are added.

ii) Form estimates of variable importance.


    The node proximities

Since the trees are grown to maximum depth, the terminal nodes are small.

For each tree grown, pour all the data down the tree.

If two data points xn and xk occupy the same terminal node, increase prox(xn, xk) by one.

At the end of forest growing, these proximities form an intrinsic similarity measure between pairs of data vectors.

    This is used to:

    i) Estimate missing values.

    ii) Give informative data views via metric scaling.

    iii) Locate outliers.
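A sketch of the proximity computation (my reconstruction; it assumes the `trees` list from the forest-growing sketch above, with scikit-learn's `.apply` giving the terminal node of each case):

```python
# Accumulate proximity counts: cases sharing a terminal node get +1 per tree.
import numpy as np

def proximity_counts(trees, X):
    N = X.shape[0]
    prox = np.zeros((N, N))
    for tree in trees:
        leaves = tree.apply(X)                   # terminal node id for every case
        for leaf in np.unique(leaves):
            idx = np.where(leaves == leaf)[0]
            prox[np.ix_(idx, idx)] += 1.0        # same terminal node -> increment pair
    return prox                                  # raw counts; normalization comes later
```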


Properties as a classification machine

a) excellent accuracy

b) scales up

c) handles thousands of variables, many-valued categoricals, extensive missing values, badly unbalanced data sets

d) gives an internal unbiased estimate of test set error as trees are added to the ensemble

e) cannot overfit (already discussed)


    Accuracy

My paper "Random Forests", Machine Learning (2001) 45, 5-32, gives comparisons:

The test set accuracy of RF is compared to Adaboost on a number of benchmark data sets. RF is slightly better.

Adaboost is very sensitive to noise in the labels.

RF is not.

Compared to SVMs:

RF is not as accurate as SVMs on pixel image data.

It is superior in document classification.

On benchmark data sets commonly used in Machine Learning, SVMs have error rates comparable to RF.

Based on my present knowledge, RF is competitive in accuracy with the best classification algorithms out there now.

Ditto for regression.


    Scaling up to Large Data Sets

The analysis of RF shows that its compute time is

c · NT · M · N log(N)

NT = number of trees

M = number of variables

N = number of instances

The constant c was estimated with a run on a data set with 15,000 instances and 16 variables.

Using this value leads to the estimate that growing a forest of 100 trees for a data set with 100,000 instances and 1000 variables would take three hours on my 800 MHz machine.
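The arithmetic behind such an estimate can be sketched as below. Only the growth law c · NT · M · N log(N) comes from the talk; the reference timing value and the assumption that both runs use 100 trees are placeholders of mine.

```python
# Extrapolate forest-growing time from a reference run using c*NT*M*N*log(N).
import math

def forest_time(ref_seconds, ref, target):
    """ref and target are (n_trees, n_variables, n_instances) tuples."""
    def work(nt, m, n):
        return nt * m * n * math.log(n)
    c = ref_seconds / work(*ref)                 # solve for the constant c
    return c * work(*target)

# reference: 100 trees, 16 variables, 15,000 instances (timing value assumed)
# target:    100 trees, 1000 variables, 100,000 instances
est = forest_time(ref_seconds=60.0, ref=(100, 16, 15_000),
                  target=(100, 1000, 100_000))
print(f"estimated time: {est / 3600:.1f} hours")
```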

Parallelizing is Trivial

Each tree is grown independently of the outcomes of the other trees. If each of J processors is given the job of growing K trees, there is no need for interprocessor communication until all have finished their runs and the results are aggregated.


    Number of Variables

Unlike other algorithms, where variable selection has to be used if there are more than a few hundred variables, RF thrives on variables: the more the merrier.

RF has been run on genetic data with thousands of variables and no variable selection and has given excellent results.

But there are limits. If the number of noisy variables becomes too large, variable selection will have to be used.

But the threshold for RF is much higher than for non-ensemble methods.


    Handling Categorical Variables

Handling categorical values has always been difficult in many classification algorithms.

For instance, given a categorical variable with 20,000 values, how is it to be dealt with? A customary way is to code it into 20,000 0-1 variables, a nasty procedure which substitutes 20,000 variables for one.

This occurs in practice--in document classification, one of the variables may be a list of 20,000 words.

In two-class problems, RF handles categoricals in the efficient way that CART does--with a fast O(N) algorithm to find the best split of the categoricals at each node.

If there are more than two classes, a fast iterative algorithm is used.
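The talk does not spell the O(N) device out; the classical CART trick for two classes, sketched below under that assumption, is to order the categories by their fraction of class 1 and then search only the splits that cut this ordered list, which takes linear time after sorting. All names in the sketch are mine.

```python
# Two-class categorical split in the CART style: order categories by class-1
# proportion, then evaluate only the "cut the ordered list" splits.
from collections import defaultdict

def best_categorical_split(categories, labels):
    """categories: category value per case; labels: 0/1 class labels."""
    n = defaultdict(int)          # cases per category
    n1 = defaultdict(int)         # class-1 cases per category
    for c, y in zip(categories, labels):
        n[c] += 1
        n1[c] += y
    order = sorted(n, key=lambda c: n1[c] / n[c])   # by proportion of class 1
    total, total1 = len(labels), sum(labels)
    best_set, best_gini, left_n, left_1 = None, float("inf"), 0, 0
    for i, c in enumerate(order[:-1]):              # each ordered cut point
        left_n += n[c]
        left_1 += n1[c]
        right_n, right_1 = total - left_n, total1 - left_1
        gini = sum(size * 2 * (k / size) * (1 - k / size)
                   for size, k in ((left_n, left_1), (right_n, right_1)))
        if gini < best_gini:
            best_gini, best_set = gini, set(order[:i + 1])
    return best_set, best_gini                      # categories sent to the left child
```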


    Replacing Missing Values

    RF has two ways of replacing missing values.

The Cheap Way

Replace every missing value in the mth coordinate by the median of the non-missing values of that coordinate, or by the most frequent value if it is categorical.

The Right Way

This is an iterative process. If the mth coordinate in instance xn is missing, it is estimated by a weighted average over the instances xk with non-missing mth coordinate, where the weight is prox(xn, xk).

The replaced values are used in the next iteration of the forest, which computes new proximities.

The process is automatically stopped when no more improvement is possible or when five iterations are reached.
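A hedged sketch of both schemes for numeric variables (my own numpy transcription; `prox` is assumed to come from a proximity routine like the one sketched earlier, and one call to `fill_by_proximity` is one pass of the iterative process):

```python
# Missing value replacement: the cheap way (column medians) and one pass of
# the right way (proximity-weighted averages over cases with the value present).
import numpy as np

def fill_cheap(X):
    """Replace NaNs in each column by the median of that column's observed values."""
    X = X.copy()
    for m in range(X.shape[1]):
        col = X[:, m]
        col[np.isnan(col)] = np.nanmedian(col)
    return X

def fill_by_proximity(X, prox, miss_mask):
    """One iteration of the 'right way' for numeric coordinates."""
    X = X.copy()
    for case, m in zip(*np.where(miss_mask)):
        donors = ~miss_mask[:, m]                # cases with coordinate m observed
        w = prox[case, donors]
        if w.sum() > 0:
            X[case, m] = np.average(X[donors, m], weights=w)
    return X

# In the full procedure this pass alternates with re-growing the forest (and new
# proximities) until no further improvement, or at most five iterations.
```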


    An Example

The training set for the satellite data has 4434 instances, 36 variables and 6 classes. A test set is available with 2000 instances.

With 200 trees in the forest, the test set error is 9.8%.

Then 20%, 40%, 60% and 80% of the data were deleted as missing (randomly).

Both the cheap fix and the right fix were applied, and the test set error computed.

Test Set Error (%)

Missing %   20%    40%    60%    80%
cheap       11.8   13.4   15.7   20.7
right       10.7   11.3   12.5   13.5

It's surprising that with 80% missing data the error rate only rises from 9.8% to 13.5%.

    I've gotten similar results on other data sets.


    Class Weights

A data set is unbalanced if one or more classes--often the classes of interest--are severely underrepresented.

These will tend to have higher misclassification rates than the larger classes.

In some data sets, even though the data set is not badly unbalanced, some classes may have a larger error rate than the others.

In RF it is possible to set class weights that act "as if" they are increasing the size of any class that gets weight larger than one.

The result is that the class weights can be set to get any desired distribution of errors among the classes.


The OOB Test Set Error Estimate

For every tree grown, about one-third of the cases are oob (out-of-bag: out of the bootstrap sample).

Put these oob cases down the corresponding tree and get a predicted classification for them.

For each case n, take the plurality of the predicted classifications over all the trees for which n was oob to get a test set estimate ŷn for yn.

Averaging the loss over all n gives the oob test set estimate of prediction error.

Runs on many data sets have shown it to be unbiased, with error on the order of that from using a test set of the same size as the training set.

It is computed at user-set intervals in the forest construction process and output to the monitor.
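A sketch of the oob estimate (mine; it assumes the `trees` and `oob_masks` lists from the forest-growing sketch and classes coded 0..C-1): each case is voted on only by the trees for which it was out-of-bag.

```python
# OOB error estimate: plurality vote per case over the trees where it was oob.
import numpy as np

def oob_error(trees, oob_masks, X, y, n_classes):
    N = X.shape[0]
    votes = np.zeros((N, n_classes))
    for tree, oob in zip(trees, oob_masks):
        pred = tree.predict(X[oob]).astype(int)
        votes[oob, pred] += 1                    # vote only where the case was oob
    seen = votes.sum(axis=1) > 0                 # cases oob for at least one tree
    y_hat = votes[seen].argmax(axis=1)
    return np.mean(y_hat != y[seen])             # oob estimate of the error rate
```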


Accuracy vs. Interpretability

Nature forms the outputs y from the inputs x by means of a black box with a complex and unknown interior:

x --> [ nature ] --> y

The current most accurate prediction methods are also complex black boxes:

x --> [ neural nets / forests / support vectors ] --> y

Two black boxes, of which ours seems only slightly less inscrutable than nature's.

My biostatistician friends tell me, "Doctors can interpret logistic regression."

There is no way they can interpret a black box containing fifty trees hooked together.

In a choice between accuracy and interpretability, they'll go for interpretability.


Accuracy vs. Interpretability: A False Canard

Framing the question as a choice between accuracy and interpretability is an incorrect interpretation of what the goal of a statistical analysis is.

The point of a model is to get useful information about the relation between the response and predictor variables, as well as other information about the data structure.

Interpretability is a way of getting information.

But a model does not have to be simple to provide reliable information about the relation between predictor and response variables.

The goal is not interpretability, but accurate information.

RF can supply more and better information about the inside of the black box than any current "interpretable" model.


The Left Eye--Tools for Black Box Inspection

i) Estimating variable importance, i.e. which variables are instrumental in the classification.

ii) Data views via proximities and metric scaling.

iii) Outlier detection via proximities.

iv) A device for doing similar analyses on unsupervised data.


    Variable Importance.

For the nth case in the data, its margin at the end of a run is the proportion of votes for its true class minus the maximum of the proportions of votes for each of the other classes.

To estimate the importance of the mth variable:

i) In the oob cases for the kth tree, randomly permute all values of the mth variable.

ii) Put these new variable values down the kth tree and get classifications.

iii) Compute the margin.

The measure of importance of the mth variable is the average lowering of the margin across all cases when the mth variable is randomly permuted.
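A sketch of this permutation measure (my reconstruction, again leaning on the `trees`/`oob_masks` structures and 0..C-1 class coding assumed in the earlier sketches): compute each case's margin from its oob votes, then see how much the margin drops when variable m is scrambled in the oob cases.

```python
# Permutation importance: average lowering of the per-case margin when the
# mth variable is randomly permuted in the oob cases of each tree.
import numpy as np

def oob_margins(trees, oob_masks, X, y, n_classes, permute_var=None, rng=None):
    N = X.shape[0]
    votes = np.zeros((N, n_classes))
    for tree, oob in zip(trees, oob_masks):
        Xo = X[oob].copy()
        if permute_var is not None:              # scramble variable m among oob cases
            Xo[:, permute_var] = rng.permutation(Xo[:, permute_var])
        pred = tree.predict(Xo).astype(int)
        votes[oob, pred] += 1
    p = votes / np.maximum(votes.sum(axis=1, keepdims=True), 1)
    true_p = p[np.arange(N), y]
    others = p.copy()
    others[np.arange(N), y] = -np.inf
    return true_p - others.max(axis=1)           # margin per case (0 if never oob)

def variable_importance(trees, oob_masks, X, y, n_classes, seed=0):
    rng = np.random.default_rng(seed)
    base = oob_margins(trees, oob_masks, X, y, n_classes)
    return np.array([np.mean(base - oob_margins(trees, oob_masks, X, y, n_classes,
                                                permute_var=m, rng=rng))
                     for m in range(X.shape[1])])
```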


An Example--Hepatitis Data

Data: survival (123) or non-survival (32) of 155 hepatitis patients, with 19 covariates.

Analyzed by Diaconis and Efron in the 1983 Scientific American.

The original Stanford Medical School analysis concluded that the important variables were numbers 6, 12, 14, 19.

The error rate for logistic regression is 17.4%.

Efron and Diaconis drew 500 bootstrap samples from the original data set and looked for the important variables in each bootstrapped data set.

Their conclusion: "Of the four variables originally selected not one was selected in more than 60 percent of the samples. Hence the variables identified in the original analysis cannot be taken too seriously."


    Analysis Using RF

The overall error rate is 14.2%. There is a 53% error in class 1 and 4% in class 2. The variable importances are graphed below:

[Figure: Variable importance, equal weights--margin decrease vs. variable (1-19).]

The three most important variables are 11, 12, 17.

Since the class of interest is non-survival, which with equal weights has a high error rate, the class weight of class 1 was increased to 3.

The run gave an overall error rate of 22%, with a class 1 error of 19% and 23% for class 2.

The variable importances for this run are graphed below:


[Figure: Variable importance, 3:1 weights--margin decrease vs. variable (1-19).]

Variable 11 is the most important variable in separating non-survival from survival.

The standard procedure when fitting data models such as logistic regression is to delete variables.

Diaconis and Efron (1983) state, "...statistical experience suggests that it is unwise to fit a model that depends on 19 variables with only 155 data points available."

Newer methods in Machine Learning thrive on variables--the more the better. The next example is an illustration.


    Microarray Analysis

Random forests was run on a microarray lymphoma data set with three classes, a sample size of 81 and 4682 variables (genes), without any variable selection.

The error rate was low (1.2%).

What was also interesting from a scientific viewpoint was an estimate of the importance of each of the 4682 genes.

The graph below was produced by a run of random forests.

[Figure: Variable Importance Measure 2--importance vs. variable (genes 1-4682).]


An Intrinsic Proximity Measure and Clustering

Since an individual tree is unpruned, the terminal nodes will contain only a small number of instances.

Run all cases in the training set down the tree. If case i and case j both land in the same terminal node, increase the proximity between i and j by one.

At the end of the run, the proximities are normalized by dividing by twice the number of trees in the forest.

To cluster, use the proximity measures.
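The talk does not name a particular clustering algorithm, so the sketch below is one reasonable choice of mine: normalize the proximity counts, turn them into dissimilarities 1 - prox, and apply average-linkage hierarchical clustering from scipy.

```python
# Cluster from RF proximities: dissimilarity = 1 - normalized proximity.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_from_proximities(prox_counts, n_trees, n_clusters=2):
    prox = prox_counts / n_trees                 # scale counts into [0, 1]
    dist = 1.0 - prox                            # small proximity -> large distance
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)   # condensed form expected by linkage
    Z = linkage(condensed, method="average")     # average-linkage hierarchical tree
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```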


Example--Bupa Liver Disorders

This is a two-class biomedical data set consisting of the covariates:

1. mcv: mean corpuscular volume
2. alkphos: alkaline phosphatase
3. sgpt: alanine aminotransferase
4. sgot: aspartate aminotransferase
5. gammagt: gamma-glutamyl transpeptidase
6. drinks: number of half-pint equivalents of alcoholic beverages drunk per day

The first five attributes are the results of blood tests thought to be related to liver functioning.

The 345 patients are classified into two classes by the severity of their liver disorders.

The class populations are 145 and 200 (severe).

The misclassification error rate is 28% in an RF run.

Class 1 has a 50% error rate, with a rate of 12% for class 2.

Setting the weight of class 2 to 1.4 gives rates of 28% and 31% for classes 1 and 2.


    Variable Importance

[Figure: Variable importance, Bupa liver data--margin decrease vs. variable (1-6).]

Blood tests 3 and 5 are the most important, followed by test 4.


    Clustering

Using the proximity measure given by RF to cluster, there are two class #2 clusters.

In each of these clusters, the average of each variable is computed and plotted:

[Figure 3: Cluster variable averages.]

Something interesting emerges.

The class two subjects consist of two distinct groups:

those that have high scores on blood tests 3, 4, and 5;

those that have low scores on those tests.


Scaling Coordinates

The proximities between cases n and k form a matrix {prox(n,k)}.

The values 1 - prox(n,k) are squared distances between vectors x(n), x(k) in a Euclidean space of dimension not greater than the number of cases.

The goal is to project the x(n) down into a low-dimensional space while preserving the distances between them to the extent possible.

In metric scaling, the idea is to approximate the vectors x(n) by the first few scaling coordinates.

These correspond to eigenvectors of a modified prox matrix.

The two-dimensional plots of the ith scaling coordinate vs. the jth often give useful information about the data.

The most useful is usually the graph of the 2nd vs. the 1st.

We illustrate with three examples. The first is the graph of the 2nd vs. 1st scaling coordinates for the liver data.
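Since 1 - prox(n,k) behaves as a squared Euclidean distance, classical metric scaling applies; the sketch below is my reconstruction via the standard double-centering construction (the "modified prox matrix" in the f77 code may differ in detail). `prox` is assumed to be the normalized proximity matrix.

```python
# Classical metric scaling of the RF proximities: double-center the squared
# distances 1 - prox and take the leading eigenvectors as scaling coordinates.
import numpy as np

def scaling_coordinates(prox, n_coords=2):
    N = prox.shape[0]
    D2 = 1.0 - prox                               # squared distances, per the talk
    J = np.eye(N) - np.ones((N, N)) / N           # centering matrix
    B = -0.5 * J @ D2 @ J                         # double-centered inner-product matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_coords]     # largest eigenvalues first
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

# coords[:, 0] vs. coords[:, 1] gives the "2nd vs. 1st scaling coordinate"
# plots used in the examples that follow.
```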


[Figure: Metric scaling, liver data--2nd vs. 1st scaling coordinate, classes 1 and 2.]

The two arms of the class #2 data in this picture correspond to the two clusters found and discussed above.


    Microarray Data.

With 4682 variables, it is difficult to see how to cluster this data. Using proximities and the first two scaling coordinates gives this picture:

[Figure: Metric scaling, microarray data--2nd vs. 1st scaling coordinate, classes 1-3.]

    RF misclassifies one case.

This case is represented by the isolated point in the lower left-hand corner of the plot.


The Glass Data

The third example is the glass data, with 214 cases, 9 variables and 6 classes. Here is a plot of the 2nd vs. the 1st scaling coordinates:

[Figure: Metric scaling, glass data--2nd vs. 1st scaling coordinate, classes 1-6.]

None of the analyses to date have picked up this interesting and revealing structure of the data--compare the plots in Ripley's book.


Outlier Location

Outliers are defined as cases having small proximities to all other cases.

Since the data in some classes are more spread out than in others, outlyingness is defined only with respect to the other data in the same class as the given case.

To define a measure of outlyingness:

i) Compute, for a case n, the sum of the squares of prox(n,k) for all k in the same class as case n.

ii) Take the inverse of this sum--it will be large if the proximities prox(n,k) from n to the other cases k in the same class are generally small. Denote this quantity by out(n).

iii) For all n in the same class, compute the median of the out(n), and then the mean absolute deviation from the median.

Subtract the median from each out(n) and divide by the deviation to give a normalized measure of outlyingness. Values less than zero are set to zero.

Generally, a value above 10 is reason to suspect the case of being an outlier.
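A direct numpy transcription of these steps (mine, not the f77 implementation; the diagonal prox(n,n) terms are excluded, and small constants guard against division by zero):

```python
# Outlyingness: inverse of the within-class sum of squared proximities,
# normalized by the class median and mean absolute deviation.
import numpy as np

def outlyingness(prox, labels):
    raw = np.zeros(len(labels))
    out = np.zeros(len(labels))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        P = prox[np.ix_(idx, idx)].copy()
        np.fill_diagonal(P, 0.0)                         # ignore self-proximity
        raw[idx] = 1.0 / np.maximum((P ** 2).sum(axis=1), 1e-12)
        med = np.median(raw[idx])
        mad = np.mean(np.abs(raw[idx] - med))            # mean absolute deviation
        out[idx] = np.maximum((raw[idx] - med) / np.maximum(mad, 1e-12), 0.0)
    return out          # values above about 10 flag suspected outliers
```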


Outlyingness Graph--Microarray Data

[Figure: Outlyingness measure, microarray data--outlyingness vs. sequence number, classes 1-3.]

There are two possible outliers--one is the first case in class 1, the other is the first case in class 2.


Outlyingness Graph--Pima Indians Data

This data set has 768 cases, 8 variables and 2 classes. It has often been used as an example in Machine Learning research but is suspected of containing a substantial number of outliers.

[Figure: Outlyingness measure, Pima data--outlyingness vs. sequence number, classes 1 and 2.]

If 10 is used as a cutoff point, there are 12 cases suspected of being outliers.


From Supervised to Unsupervised

Unsupervised data consists of N vectors {x(n)} in M dimensions.

Desired:

i) Clustering

ii) Outlier detection

iii) Missing value replacement

All this can be done using RF:

i) Give class label 1 to the original data.

ii) Create a synthetic class 2 by sampling independently from the one-dimensional marginal distributions of the original data.


The Synthetic Second Class

More explicitly:

If the value of the mth coordinate of the original data for the nth case is x(m,n), then a case in the synthetic data is constructed as follows:

i) Its first coordinate is sampled at random from the N values x(1,n).

ii) Its second coordinate is sampled at random from the N values x(2,n), and so on.

The synthetic data set has the distribution of M independent variables, where the distribution of the mth variable is the same as the univariate distribution of the mth variable in the original data.
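A sketch of this construction (mine; the names and the 0/1 label coding are assumptions): each coordinate of a synthetic case is drawn independently, with replacement, from the corresponding column of the original data, so the synthetic class has the same marginals but no dependence structure.

```python
# Build the two-class problem: original data = class 0, synthetic class with
# independently sampled marginals = class 1.
import numpy as np

def make_unsupervised_two_class(X, seed=0):
    rng = np.random.default_rng(seed)
    N, M = X.shape
    X_syn = np.column_stack([rng.choice(X[:, m], size=N, replace=True)
                             for m in range(M)])    # independent marginal sampling
    X_all = np.vstack([X, X_syn])
    y_all = np.concatenate([np.zeros(N, dtype=int), np.ones(N, dtype=int)])
    return X_all, y_all

# Running RF on (X_all, y_all) and keeping the proximities of the original
# cases then supports the clustering, scaling and outlier analyses described earlier.
```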


The Synthetic Two-Class Problem

There are now two classes, where the second class has been constructed by destroying all dependencies in the original unsupervised data.

When this two-class data is run through RF, a high misclassification rate--say over 40%--implies that there is not much dependence structure in the original data.

That is, its structure is largely that of M independent variables--not a very interesting distribution.

But if there is a strong dependence structure between the variables in the original data, the error rate will be low.

In this situation, the output of RF can be used to learn something about the structure of the data. Following are some examples.


Clustering the Glass Data

All class labels were removed from the glass data.

The result was unsupervised data consisting of 214 nine-dimensional vectors, which were labeled class 1. A second, synthetic data set was constructed.

The two-class problem was run through RF. Proximities were computed for class 1 only, and metric scaling projected the class 1 data onto two dimensions.

Here is the outcome:

[Figure: Glass data, unsupervised scaling--2nd vs. 1st scaling coordinate.]

This is a good replica of the original starfish-like projection using the class data.


    Application to the Microarray Data

Recall that the scaling plot of the microarray data showed three clusters--two larger ones in the lower left-hand and right-hand corners and a smaller one in the top middle.

Again, we erased the labels from the data and projected down an unsupervised view:

[Figure: Unsupervised projection of microarray data--2nd vs. 1st scaling coordinate.]

The three clusters are more diffuse but still apparent.


Finding Outliers

An Application to Chemical Spectra

Data graciously supplied by Merck consists of the first 468 spectral intensities in the spectra of 764 compounds.

The challenge presented by Merck was to find small, cohesive groups of outlying cases in this data.

There was excellent separation between the two classes, with an error rate of 0.5%, indicating strong dependencies in the original data.

We looked at outliers and generated this plot:


[Figure: Outlyingness measure, spectra data--outlyingness vs. sequence number.]

This plot gives no indication of outliers.

But outliers must be fairly isolated to show up in the outlier display.


Projecting Down

To search for outlying groups, scaling coordinates were computed. The plot of the 2nd vs. the 1st is below:

[Figure: Metric scaling, spectra data--2nd vs. 1st scaling coordinate.]

    The spectra fall into two main clusters.

There is a possibility of a small outlying group in the upper left-hand corner.

To get another picture, the 3rd scaling coordinate is plotted vs. the 1st.


[Figure: Metric scaling, spectra data--3rd vs. 1st scaling coordinate.]

The group in question is now in the lower left-hand corner, and its separation from the body of the spectra has become more apparent.


    RF/tools--An Ongoing Project

RF/tools are continually being upgraded.

My coworker, Adele Cutler, is writing some very illuminating graphical displays for the information output by RF.

The next version of RF/cl will have the ability to detect interactions between variables--a capability that the microarray people lobbied for.

Plus other improvements, both major and minor.

RF/rg needs upgrading to the same level as RF/cl.

A new program, RF/my, that can use two eyes on situations with multiple dependent outputs will be added next month.

CONCLUSION

There is no data mining program available that can provide the predictive accuracy and understanding of the data that RF/tools does.

And it's free!!
