SAS Enterprise Miner Release 4.3

Post on 03-Jan-2016

A brief overview: analysis of the Donor Recapture Case (Case 3)

Kevin Garsek … Class of 2006

Importing Base Data

• SAS’s main drawback is that if any field in a record is null or blank, the entire record is disregarded

• In this case, if we could not manipulate the data, the number of usable records would drop dramatically

• We can fight back by recoding the data, as shown in the import step

Base SAS Interface Screen

Importing Charity Data

Text Editor

We will use the text editor in Base SAS to import the Charity Case data. To use this editor, simply type as you would in any text editor.

Text Editor

A line by line example of the code that we will use is as follows:

libname charity 'C:\Documents and Settings\Kevin\Desktop\Datamining\charity.1';
Denotes the master folder on your local PC where the raw data is housed.

data charity.raw;
Tells SAS to create a new dataset named raw in the charity library.

infile 'chr\2.dat' missover firstobs=2;
Lets SAS know the individual subfolder in which the data is housed and tells it to import the file into the new dataset. The missover option keeps SAS from moving to the next line when a record is short, and firstobs=2 starts reading at the second line of the file.

input OSOURCE $;
Names the data column OSOURCE; the $ tells SAS that this is character data (if it were left out, SAS would assume the data is numeric).

OSOURCE_D = 0;
Because missing data is prevalent, this creates a new dummy variable named OSOURCE_D and sets its value to 0 for every record.

if trim(OSOURCE) = ""
The trim function removes trailing spaces, and the if opens an if-then statement that compensates for blank data.

then do; OSOURCE = "0";
Sets all missing values in the OSOURCE column to 0.

OSOURCE_D = 1;
Sets the newly created dummy variable to 1 when OSOURCE was blank in the input file.

end;
Ends the do block. All of the code from infile to end can be written on a single line in the text editor.

Importing Charity Data

This slide shows the completed code. The actual code can easily be written in Excel using & concatenation and then pasted into the text editor. Moving the writing process to Excel saves considerable time during this laborious process.
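Put together, the annotated lines above form one DATA step. The sketch below is an assembly of those lines, not a copy of the original screenshot: it reads only the OSOURCE column, the closing run; statement is added here, and the comment about a header row is an assumption about why firstobs=2 is used.

```sas
/* Library pointing at the folder that holds the raw data (path as given on the slide) */
libname charity 'C:\Documents and Settings\Kevin\Desktop\Datamining\charity.1';

data charity.raw;
    /* missover: a short line leaves the remaining variables missing
       instead of pulling values from the next line.
       firstobs=2: start at line 2 (assumed here to skip a header row). */
    infile 'chr\2.dat' missover firstobs=2;
    input OSOURCE $;

    OSOURCE_D = 0;                  /* indicator defaults to "value was present" */
    if trim(OSOURCE) = "" then do;
        OSOURCE = "0";              /* recode the blank to the character value 0 */
        OSOURCE_D = 1;              /* flag that the original value was blank */
    end;
run;                                /* added here; not shown on the slide */
```

The same OSOURCE pattern would be repeated for each additional column, which is the part the slide suggests generating in Excel.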

Importing Charity Data

Once the code is completed, right click in the text editor and select “submit all”. This tells SAS to read through the code in the text editor and execute it. Be prepared: because the data is large, this will take considerable time to complete.

Starting Enterprise Miner from Base SAS module

You should now have a fully working dataset and are ready to open Enterprise Miner by following the subsequent slides.

Binding Data to Program

• This is an exasperating activity

• Even for someone who took a SAS training course in Enterprise Miner

• The documentation is pathetic

• I’ll document each step carefully in case this ever happens to you

Name Project Charity and Drag Input Data Node to Workspace

Bind Data to Project

Right click on tools to get this menu.

Bind Data to Project

Left click on initialization, left click top edit.

Bind Data to Project

Right click select; browse for library RDATA; click ok

Bind Data to Project

Gotcha: you must select RAW and hit Enter even though it is the only data set in RDATA

Change to Larger Sample

Left click Change; we changed the sample size to 10,000 so that low-response items are represented

Success!

Click Variables Tab

Notice that some variables are rejected. This is typically because the column has only one value throughout, e.g. a dummy variable that is always 0 because the input data has no variation.
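One way to check this outside the GUI is to count the distinct levels of each column; a column with a single level carries no information for a model. A minimal sketch, assuming the imported dataset is charity.raw as in the import step:

```sas
/* Report the number of distinct levels per variable.
   noprint suppresses the individual frequency tables,
   so only the level-count summary is shown. */
proc freq data=charity.raw nlevels;
    tables _all_ / noprint;
run;
```

Any variable reported with one level is a candidate for the rejected status seen on the Variables tab.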

Then Bad Things Happen

• Who knows why.

• If I hadn’t taken the course the slides would stop here.

• That’s the only reason I know what to do

• I’ll document this also, in case it happens to you.

Crash Recovery

Right click on top level icon; select explore

Crash Recovery

Open emproj; delete all files with extension .lck; open user subfolder; delete everything in user subfolder

Analysis Resumes

• We’ll have a look at MAILCODE.

• Enterprise Miner has some neat graphical tools that are easy to use.

• The simplest and easiest are part of the data input tool.

A Histogram

Right click item, select “view distribution of MAILCODE” from drop down menu

Histogram of Mailcode

SAS has classified as missing data that R accepted and used!
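To see how many MAILCODE values SAS treats as missing, a frequency table with the MISSING option can be run in Base SAS. This is a sketch that assumes the dataset is charity.raw and that MAILCODE was imported as a character column without the blank recode used for OSOURCE:

```sas
proc freq data=charity.raw;
    /* MISSING gives blank values their own row in the table
       and includes them in the percentages */
    tables MAILCODE / missing;
run;
```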

Must Identify TARGET_D as Target

Right click row item in column “Model Role”, select “Change Model Role” from drop down menu, select “target” from next drop down menu

Histogram of Target

This is what makes the problem hard: extremely low response rate!

Save changes!

Add Data Partition Node

Drag down from tool bar above and connect line by dragging the mouse.

This is What it Does

We will use an 80%/20% training/validation allocation. Close the box, right click, and click “Run” on the drop down menu.
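For comparison, an approximate 80%/20% split can also be written by hand in a DATA step. This is a sketch, not the code the Data Partition node runs: the output names work.train and work.valid and the seed 12345 are assumptions, and because each record is assigned at random the proportions are only approximately 80/20.

```sas
/* Hand-coded analogue of an 80%/20% training/validation split */
data work.train work.valid;
    set charity.raw;
    /* ranuni returns a uniform(0,1) draw; the seed is arbitrary */
    if ranuni(12345) < 0.8 then output work.train;
    else output work.valid;
run;
```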

Design Philosophy

Click the lower tools tab. Note the tools on the left. One drags a tool to the worksheet and connects it with arrows. We’ll now drag and connect regression.

Regression

Chose stepwise selection, validation error. That mimics what we did in R.

Regression

Right hand click on the Regression node and select run

Regression

Regression is highlighted in green while running

Regression

Let’s take a look at the results; SAS has a very different interpretation of important variables than the R analysis

Regression

The error rate is not that bad, but the significant variables are not necessarily easily interpretable.

Regression

Let’s try it again with a few changes to the model selection

Regression

Again, we get results, but nothing easily interpretable.

Regression

Let’s limit the regression to those variables determined by R to be significant. To do this, we will again right click on regression and select open.

Regression

Then go to the variables tab. Right click under the status column for each unneeded variable and set the status to “don’t use”.

Regression

In addition to limiting our variables to those from the R results, we are going to add an interaction as well as a squared variable. The first step is to add the squared term: add a Transform Variables node, right click on the node, and select open.

Regression

From the variables tab, we will right click on DOB and select Transform.

Regression

We will now select square. This will create a new variable, DOB_L1S6, which will then be used in our next regression.

Regression

Our next step is to create an interaction. To do this, go back to the main diagram and double click on regression. This should bring you into the model manager, where you will click on the Interaction Builder icon.

Regression

On this screen, hold the Ctrl key to highlight both Lastgift and Pepstrfl. Next, press the Cross button to create the new interaction variable. The new variable should be added to the available terms window and should be used in subsequent regressions.
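Written by hand, the squared term and the interaction might look like the sketch below. It is not the code Enterprise Miner generates: DOB_SQ is a hypothetical name standing in for DOB_L1S6, PEPSTRFL is assumed to be a categorical flag, the model lists only the terms named on these slides rather than the full set of R-selected variables, and PROC GLM is used as a plain least-squares stand-in for the Regression node.

```sas
/* Add the squared term by hand (DOB_SQ is a hypothetical name) */
data work.charity_terms;
    set charity.raw;
    DOB_SQ = DOB**2;
run;

/* LASTGIFT*PEPSTRFL requests the interaction;
   the class statement tells SAS to build dummy variables for PEPSTRFL */
proc glm data=work.charity_terms;
    class PEPSTRFL;
    model TARGET_D = DOB DOB_SQ LASTGIFT PEPSTRFL LASTGIFT*PEPSTRFL;
run;
quit;
```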

Regression

Results! While the initial bar graph may look complex, this is how SAS handles character data and creates dummy variables.

Regression

As we now look at the table of coefficient estimates, we have interpretable results!

Regression

For those who are interested, the Code tab shows the actual SAS code that you would have to write to program this regression manually.

Regression

Let’s add another level of analysis and try to rid the data of outliers. To do this, add a Filter Outliers node between the Transform Variables and Regression nodes.

Regression

Double click on the Filter Outliers node and then go to the Settings tab. I used the settings shown on this slide, but feel free to experiment for the best outcome. Once you have completed this step, run the regression.

Moving On, Try a Tree

The tree itself is on the next slide.

Does this look familiar?

This is exactly the same as Fig 22, Learning and Validation MSE, of Topic 2, Bias Variance Tradeoff.

Tree

SAS does have some great graphics! Below is the tree, which is typically presentable to a general audience.

Moving On, Try a Neural Net

Net

We will use the defaults for this round of processing. During the run we see the graphic below.

Net

The results. Decent output, but very difficult to present to a general audience.

Assessment Tool

• The assessment tool is supposed to give lift charts.

• Apparently it only does so for binary response.

• The menu item is blank for predictive models.

• The tool is good for easily comparing varying model error rates.

Assessment Tool

When you double click on the node you will see the following:

Tool              Root ASE    Root ASE^2
Tree              4.457445    19.86881593
Regression        4.421218    19.5471686
Neural Network    4.455325    19.84992086

Assessment Tool

As for lift charts, they are unavailable for this analysis …

Done!

• The intention was to illustrate the interface, not to assess SAS’s Enterprise Miner per se.

• With more effort to fix the missing-value problems on input, better results can surely be achieved.

• With more experience, many of the false steps would not have occurred.

Looping and Control

• SAS’s biggest deficiency is the lack of looping and control structures.

• This affects all of SAS, not just Enterprise Miner.

• Any data manipulation, such as fixing missing values, must be done by hand, one variable at a time.

• R has a huge advantage here!