PASW Modeler Tutorial 1&2

download PASW Modeler Tutorial 1&2

of 21

Transcript of PASW Modeler Tutorial 1&2

  • 7/30/2019 PASW Modeler Tutorial 1&2

    1/21

    1

    PASW Modeler Tutorial 1 The Drug Project

    Data Warehousing and Data Mining

    March 2011

    PASW Modeler (formerly Clementine) is the SPSS enterprise-strength data mining workbench. It helps organizations to improve

    customer and citizen relationships through an in-depth understanding of data. Organizations use the insight gained from PASWModeler to retain profitable customers, identify cross-selling opportunities, attract new customers, detect fraud, reduce risk, and

    improve government service delivery. The current version is PASW Modeler 14.

    1 The Drug Project ExerciseBriefing: Imagine that you are a medical researcher compiling data for a study. You have collected data about a set of patients, all of

    whom suffered from the same illness. During their course of treatment, each patient responded to one of five medications. Part of yourjob is to use data mining to find out which drug might be appropriate for a future patient with the same illness.

    1.1 LaunchthePASWModeler:Open the PASW Modeler by going to the Start menuAll ProgramsSPSS IncPASW Modeler 14 PASW Modeler 14.

    Select Open an existing project and double-click on More files. In the Open dialog window, goto the path of C:\ProgramFiles\SPSSInc\PASWModeler14\Demos and double-click on the drug.cpj file to open it. The PASW Modeler should open and

    displays as Figure 1.

    Figure1:

    The

    Drug

    Project

    ControlPanel

    MainPanel

    ModulePanel

    CurrentWorkingSpace

    ProjectSpace

  • 7/30/2019 PASW Modeler Tutorial 1&2

    2/21

    2

    1.2 DisplayingthePropertiesoftheDataTo open a data source, the PASW Modeler provides many options listed in the Sources tab from the Module Panel.

    Here, we will use the Var. File node.

    1. Select the Sources tab from the Module Panel2. Double click on the Var.File node and it will appear in the Main Panel. You can also add a node by single

    left-click on the node in the Module Panel, then single left-click at the place where you want to place that node

    in the Main Panel.

    3. Double click the Var.File node in the Main Panel to open its property window (Figure 2), and Click the button next to the File field. In the Open dialog window, select to open the DRUG1n file that contains

    records of drug information. The Var.File node now should have properties as in Figure 2. The DRUG1n file

    contains records for 7 attributes, termed Age, Sex, BP, Cholesterol, Na, K, and Drug.

    4. Click OK to close the Var.File property window.

    Figure2:Var.FilePropertyTo display the properties of the data, we use a Distribution node.

    1. Select the Distribution node listed in the Graphs tab from the Module Panel, and add it to the Main Panel.2. Establish a link between the DRUG1n node and the Distribution node by right-clicking on the DRUG1n

    node and select the Connect option, then left-clicking on the Distribution node (Figure 3).

    Figure3:Linkbetweentwonodes3. Double-click the Distribution node to open its property window.4. Select Drug for the Field option (Figure 4) to display the distribution of drugs. Click Run

  • 7/30/2019 PASW Modeler Tutorial 1&2

    3/21

    3

    Figure4:DistributionNodeProperty5. You should see a distribution window for the Drug attribute in the DRUG1n file (Figure 5). This window

    illustrates the count of different drugs and their percentages.

    Figure5:DistributionofDrugs6. Click OK to close the window.

    1.3 FindingaRelationshipinNumericDataTo investigate a relationship between sodium (Na) and potassium (K) levels, the most natural way would be to produce a

    point plot. To do this, we create a Plot node and connect it to the Var.File node.

    1. Select the Plot node listed in the Graphs tab from the Module Panel, and add it to the Main Panel.2. Establish a link between the DRUG1n node and the Distribution node by right-clicking on the DRUG1n

    node and select the Connect option, then left-clicking on the Plot node (Figure 6).

  • 7/30/2019 PASW Modeler Tutorial 1&2

    4/21

    4

    Figure6:LinkbetweenDRUG1nandPlot3. Double-click the Plot node to open its property window.4. Select K (Potasium) for the X Field option and select Na (Sodium) for the Y Field option (Figure 7).

    Figure7:PlotNodeProperty5. Click Run. The plot window of the K attribute and Na attribute will be displayed (Figure 8). This appears to be a

    random scattering, with no obviously apparent relationship between the Na and K attributes. However, this graph

    takes no account of which drug was used in each case. Therefore, we need to modify the property of the Plotnode in order to display the correlations between Na and K with respect to different drugs.

  • 7/30/2019 PASW Modeler Tutorial 1&2

    5/21

    5

    Figure8:PlotofKv.Na6. Double-click the Plot node to open its property window.7. Select Drug for the Color option in the Overlay group (Figure 9).

    Figure9:PlotNodeProperty8. Click Run. The plot window of the K attribute and Na attribute with respect to different drugs will be displayed

    (Figure 10). We can observe that a clear pattern emerges in the overlaid plot. The threshold is neither the Na nor

    K field, but in a ration between them.

    9. Click OK to close the window.

  • 7/30/2019 PASW Modeler Tutorial 1&2

    6/21

    6

    Figure10:PlotofKv.Na

    1.4 FindingtheThresholdWe can find the threshold by calculating the ration and examining its distribution. To do so, we need to create a Derive

    node and connect it to the Var.file node.

    1. Select the Derive node listed in the Field Ops tab from the Module Panel, and add it to the Main Panel.2. Establish a link between the DRUG1n node and the Derive node by right-clicking on the DRUG1n node

    and select the Connect option, then left-clicking on the Derive node (Figure 11).

    Figure11:LinkbetweenVar.FileandDeriveNodes3. Double-click the Derive node to open its property window.4. Type string Na_to_K in the Derive filed, and formula Na/K in the Formula area (Figure 12). This will

    create a new filed named Na_to_K containing numbers calculated as Na/K.

  • 7/30/2019 PASW Modeler Tutorial 1&2

    7/21

    7

    5. Click OK to close this property window. The Derive node will be renamed to Na_to_K.

    Figure12:DeriveNodeProperty

    Next, we need to create a Histogram node to display the output from the Derive node.

    1. Select the Histogram node listed in the Graphs tab from the Module Panel, and add it to the Main Panel.2. Establish a link between the Na_to_K node and the Histogram node by right-clicking on the Na_to_K node

    and select the Connect option, then left-clicking on the Histogram node (Figure 13).

    Figure13:LinkbetweenNa_to_KandHistogramNodes3. Double-click the Histogram node to open its property window.4. Select Na_to_K for the Field option, and Drug for the Color option in the Overlay group (Figure 14).

  • 7/30/2019 PASW Modeler Tutorial 1&2

    8/21

    8

    Figure14:HistogramNodeProperty5. Click Run. The histogram window will be display as in Figure 15.

    Figure15:HistogramofNa_to_KThe histogram shows that the distribution of the ratio of Na and K. In addition, the threshold is clear as the column in the

    bars change from multi-coloured to the pure yellow colour at the critical value.

    We can now add a band selection line to this histogram to separate the records before and after the threshold.

    1. Tick the Interactions option from the View menu (Figure 16).2.

    Left-click the Activates band selection option (Figure 17).

  • 7/30/2019 PASW Modeler Tutorial 1&2

    9/21

    9

    3. Place the RED colour line as close as possible to the point at which the bars of the histogram change colour (thethreshold point). (Figure 18).

    4. Right-click at the right side of the threshold line, and select Generate Derive Node for Band option (Figure 19).5. A new Derive node will then be added to the Main Panel. Open its property window, and observe the

    selection condition. Rename this node as band.

    6. Connect this band node to Na_to_K derive node and also add a new histogram node to connect to it (Figure20).

    7. Run the new histogram node and observe the result (Figure 21).

    Figure16:HistogramInteractions Figure17:Activatesbandselections

    Figure18:ThresholdLine Figure19:GenerateDeriveNode

  • 7/30/2019 PASW Modeler Tutorial 1&2

    10/21

    10

    Figure20:Thenewbandnode Figure21:Thenewband

    EndofTutorial1

  • 7/30/2019 PASW Modeler Tutorial 1&2

    11/21

    11

    PASW Modeler Tutorial 2 The Market Basket ProjectData Warehousing and Data Mining

    March 2011

    2.TheMarketBasketProjectBriefing: This example deals with fictitious data describing the contents of supermarket baskets (that is, collections of

    items bought together) plus the associated personal data of the purchaser, which might be acquired through a loyalty card

    scheme. The goal is to discover groups of customers who buy similar products and can be characterized demographically,

    such as by age, income, and so on.

    This example illustrates two phases of data mining:

    Association rule modeling and a web display revealing links between items purchased

    C5.0 rule induction profiling the purchasers of identified product groupsNote: This application does not make direct use of predictive modeling, so there are no accuracy measurements for the

    resulting models and no associated training/test distinction in the data mining process.

    2.1 AccessingthedataOpen the PASW Modeler by going to the Start menuAll ProgramsSPSS IncPASW Modeler 14 PASW

    Modeler 14. Select Open an existing project and double-click on More files. In the Open dialog window, go to the

    path of C:\Program Files\SPSSInc\PASWModeler14\Demos and double-click on the bask.cpj file to open it.In this project, we need to use the data file BASKETS1n.

    1. Select the Var.File node listed in the Sources tab from the Module Panel, and add it to the Main Panel.2. Double click the Var.File node in the Main Panel to open its property window, and Click the button next to

    the File field. In the Open dialog window, select to open the BASKETS1n file that contains records of basket

    information (Figure 22). The BASKETS1n file contains records for 18 attributes, termed cardid, value,

    pmethod, sex, homeown, income, age, fruitveg, freshmeat, dairy, cannedveg, cannedmeat,

    frozenmeal, beer, wine, softdrink, fish, and confectionery.

    3. Click OK to close the Var.File property window.

  • 7/30/2019 PASW Modeler Tutorial 1&2

    12/21

    12

    Figure22:BASKETS1nFileProperty2.2 Findanddisplayassociationsbetweendataattributes.1. Select the Type node listed in the Field Ops tab from the Module Panel, and add it to the Main Panel.2. Establish a link between the BASKET1n node and the Type node by right-clicking on the BASKET1n node

    and select the Connect option, then left-clicking on the Type node (Figure 23).

    Figure23:LinkbetweenBASKET1nandTypeNodes3. Double-click the Type node to open its property window. The Type node provides a way to modify the property

    of data attributes in the source node it connects to. Full detail of the Type node can be found in the Help file by

    pressing F1 from the keyboard (Figure 24).

    Note: You can click the Read Values button to detect value ranges for data attributes in the data source.

  • 7/30/2019 PASW Modeler Tutorial 1&2

    13/21

    13

    Figure24:DescriptionofLabelsinthe"Type"node4. Modify the properties of attributes as in Figure 25.

    a. Set the Type property of cardid to Typeless. This is because that each loyalty card ID occurs only oncein the dataset can therefore be of no use in modeling.

    b. Set the Type property of Sex to Nominal. This is to ensure that the Apriori modeling algorithm will nottreat sex as a flag.

    c. Set the Type property of other attributes to Flag.d. Set the Direction property to None for cardid, value, pmethod, sex, homeown, income, and

    age.

    e. Set the Direction property to Both for the rest attributes.5. Select the Apriori node listed in the Modelling tab from the Module Panel, and add it to the Main Panel.

    Apriori node discovers association rules in the data.

  • 7/30/2019 PASW Modeler Tutorial 1&2

    14/21

    14

    6. Connect the Apriori node to the Type node in the Main Panel (Figure 26).7. Double click the Apriori node to open its property window.8. Click Run. It creates a new model. Double-click this model and you can observe a table that displays detected

    associations between data attributes, which directions are set to Both in step 4, should appear as in Figure 27. These

    rules show a variety of associations between frozen meals, canned vegetables, and beer.

    Figure25:ModifiedPropertiesinthe"Type"node

    Figure26:ConnectionsbetweenBASKETS1n,Type,andApriorinodes

  • 7/30/2019 PASW Modeler Tutorial 1&2

    15/21

    15

    Figure27:AssociationsbetweenDataAttributes9. Select the Web node listed in the Graphs tab from the Module Panel, and add it to the Main Panel.10.Connect the Web node to the Type node to have a visual view of how different data attributes are associated as in

    Figure 28.

    Figure28:ConnectionbetweenWebandTypeNodes11.Double click the Web node to open its property window.12.Select fruitveg, freshmeat, dairy, cannedveg, cannedmeat, frozenmeal, beer, wine, softdrink,

    fish, and confectionery for the Fields, and tick Show true flags only (Figure 29).

  • 7/30/2019 PASW Modeler Tutorial 1&2

    16/21

    16

    Figure29:PropertyWindowofWebnode13.Click Run and a graphical display of associations between data attributes should be generated as Figure 30. Your

    result may look different from Figure 30. This is because the threshold used, which can be set using the scroll bar at

    the bottom of the window.We can observe that three groups of customers stand out

    Those who buy fish and fruits and vegetables, who might be called Healthy eaters

    Those who buy wine and confectionery Those who buy beer, frozen meals, and canned vegetables (Beer, beans, and pizza)

    Figure30:ResultofWebNode

  • 7/30/2019 PASW Modeler Tutorial 1&2

    17/21

    17

    2.3 ProfilingtheCustomerGroupsYou have now identified three groups of customers based on the types of products they buy, but you would also like to

    know who these customers are, their demographic profile. This can be achieved by tagging each customer with a flag for

    each of these groups and using rule induction (C5.0) to build rule-based profiles of these flags.

    1. You must derive a flag for each group. This can be automatically generated using the web display that you justcreated. Using the right mouse button, click the link between fruitveg and fish and select Derive Node (And)

    from the Generate tab (Figure 31). A new node should appear in the Main Panel.

    Figure31:DeriveaNodefromWebnode2. Double click the newly generated node to open its property window, and change the Derive field to

    "Healthy (Figure 32).

  • 7/30/2019 PASW Modeler Tutorial 1&2

    18/21

    18

    Figure32:TheHealthyNode3. Repeat Step 1 and 2 for the link between Wine and Confectionery, and rename the derived node to WineChocs.4. Repeat Step 1 and 2 for the links between cannedveg, beer, and frozenmeal. To derive a node from multiple

    links, you need to

    a. Goto the interaction mode, by selecting Interactions from the View menu.b. Hold the Shift key on the keyboard, and select the links you want.c. Then, select Derive Node (And) option from the Generate menu (Figure 33).A new node will be generated in the Main Panel. Rename it as beer_beans_pizza.

    Figure33:DeriveaNodefromMultipleLinks5. To profile these customer groups, connect the existing Type node to these three newly generated nodes in series,

    and then attach another Type node. (Figure 34).

  • 7/30/2019 PASW Modeler Tutorial 1&2

    19/21

    19

    Figure34:Connectionsforcustomerprofiling6. Double click the new Type node to open its property window.

    a. Set Direction for value, pmethod, sex, homeown, income, and age to In;b. Set Direction for a customer group, which is one of Healthy, WineChocs, and beer_beans_pizza, to

    Target.

    c.

    Set Direction for the rest data attributes to None (Figure 35).7. Click OK to close the property window.

    Figure35:ModifiedTypeNode8. Select the C5.0 node listed in the Modelling tab from the Module Panel, and add it to the Main Panel.9. Double click the C5.0 node to open its property window.10.Set the Output type to Rule set (Figure 36).

  • 7/30/2019 PASW Modeler Tutorial 1&2

    20/21

    20

    Figure36:PropertyWindowofC5.0Node11.Click Run, and a new model will be generated in the Current Working Space area.12.Right click on the new model, and select Browse (Figure 37).

    Figure37:ResultmodelfromC5.013.The result shows a clear demographic profile for this customer group (Figure 38).

  • 7/30/2019 PASW Modeler Tutorial 1&2

    21/21

    21

    Figure38:DemographicProfileforbeer_beans_pizzaCustomerGroupThe same method can be applied to the other customer group flags by selecting them as the output in the second Type

    node. A wider range of alternative profiles can be generated by using Apriori instead of C5.0 in this context; Apriori can

    also be used to profile all of the customer group flags simultaneously because it is not restricted to a single output field.

    EndofTutorial2