IM Easy Mining

description

Data mining

  • IBM InfoSphere Warehouse

    Data Mining with Easy Mining procedures Version 9.5.1

    SH12-6837-02

  • Note: Before using this information and the product it supports, read the information in Notices on page 157.

    This edition applies to Version 9.5.1 of the IBM InfoSphere Warehouse products and to all subsequent releases and modifications until otherwise indicated in new editions.

    Copyright International Business Machines Corporation 2001, 2008. All rights reserved. US Government Users Restricted Rights: Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

  • Contents

    Chapter 1. Overview of the Easy Mining procedures
      Easy Mining procedures for typical mining tasks
      Easy Mining procedures for basic mining steps
      Easy Mining procedures for preprocessing and for utilities

    Chapter 2. Data mining with the Easy Mining procedures
      Quick start sample
        Scenario
        Finding deviations
      Using Easy Mining procedures for typical mining tasks
        Finding deviations (FindDeviations procedure)
        Finding groups with similar characteristics (ClusterTable procedure)
        Finding relationships (FindRules procedure)
        Finding sequential relationships in your data (FindSeqRules procedure)
        Prediction of future behavior (PredictColumn procedure)
        Prediction of an outcome (PredictColValue procedure)
        Finding explanations for specific events (ExplainColValue procedure)
        Most important fields (FindMostImpFields procedure)
      Using Easy Mining procedures for basic mining steps
        Easy Mining procedures for classification mining steps
        Easy Mining procedures for regression mining steps
        Easy Mining procedures for clustering mining steps
        Easy Mining procedures for associations mining steps
        Easy Mining procedures for sequences mining steps
        Exporting models and test results
      Using Easy Mining procedures for preprocessing and for utilities
        Preprocessing procedures
        Utility procedures
      Putting it all together
        Scenario
        Identifying characteristics
        Building a prediction model
      Data mining at a glance
        Data mining goals
        The data mining process
        Data mining functions

    Chapter 3. Easy Mining reference
      Syntax diagrams and parameters
        Using the Easy Mining procedures
        Optional parameter strings
        Easy Mining procedures for typical mining tasks
        Easy Mining procedures for basic mining steps
        Easy Mining procedures for preprocessing and utilities
      Easy Mining conventions and mining field types
        Conventions
        Mining field types

    Notices
      Trademarks

    Contacting IBM
      Product Information
      Accessible documentation
      Comments on the documentation

    Index

  • Chapter 1. Overview of the Easy Mining procedures

    There are Easy Mining procedures for typical mining tasks and for basic mining steps. There are also Easy Mining procedures for preprocessing and for utilities.

    Easy Mining procedures for typical mining tasks

    The Easy Mining procedures for typical mining tasks correspond to the main steps of the data mining process. They are easy to use because any parameter modifications that might be required are done by the mining functions. This means that you do not need to have in-depth knowledge about data mining.

    With the Easy Mining procedures for typical mining tasks, you can solve most of the typical business problems in various application areas, for example, Banking or Manufacturing, without having in-depth data mining skills.

    The following table provides an overview of the available Easy Mining procedures for typical mining tasks.

    Table 1. Overview of the Easy Mining procedures for typical mining tasks

    Mining task                                  Easy Mining procedure
    Finding deviations                           FindDeviations
    Finding groups with similar characteristics  ClusterTable
    Finding relationships                        FindRules
    Finding sequential rules                     FindSeqRules
    Predicting future behavior                   PredictColumn
    Predicting an outcome                        PredictColValue
    Finding explanations for specific events     ExplainColValue
    Finding most important fields                FindMostImpFields

    Easy Mining procedures for basic mining steps

    The Easy Mining procedures for basic mining steps correspond to the SQL API of IM Modeling and IM Scoring. They are easy to use because their syntax is simple and because they concentrate on the more frequently used concepts of SQL/MM. They might even provide better results than the Easy Mining procedures for typical mining tasks because you can modify the parameters yourself.

    However, modifying the parameters yourself means that you need knowledge about the data mining process. For example, you must know how to modify the maximum number of clusters to gain better clustering results.

    With the Easy Mining procedures for basic mining steps, you can create, test, and modify data models. You can later apply these data models to new data to help you make successful business decisions.

    The following table provides an overview of the available Easy Mining procedures for basic mining steps.

  • Table 2. Overview of the Easy Mining procedures for basic mining steps

    Tasks                  Classification        Regression           Clustering        Associations      Sequences
                           mining procedure      mining procedure     mining procedure  mining procedure  mining procedure
    Building models        BuildClasModel        BuildRegModel        BuildClusModel    BuildRuleModel    BuildSeqRuleModel
    Testing models         TestClasModel         TestRegModel         -                 -                 -
    Applying models        ApplyClasModel        ApplyRegModel        ApplyClusModel    ApplyRuleModel    ApplySeqRuleModel
    Exporting models       ExportClasModel       ExportRegModel       ExportClusModel   ExportRuleModel   ExportSeqRuleModel
    Exporting test result  ExportClasTestResult  ExportRegTestResult  -                 -                 -
    Building mining views  BuildClasView         BuildRegView         BuildClusView     BuildRuleView     BuildSeqRuleView

    Easy Mining procedures for preprocessing and for utilities

    Easy Mining procedures for preprocessing and for utilities help you to prepare your data for an Easy Mining procedure or to accomplish administrative tasks, for example, handling error messages, canceling Easy Mining procedures, or working with the trace file.

    The following Easy Mining procedures are available:

    - SplitData
    - GetLastError
    - SetTraceFile
    - GetTraceFile
    - GetCancelTask
    - GetCleanUpTask

  • Chapter 2. Data mining with the Easy Mining procedures

    With the Easy Mining procedures, you can perform data mining efficiently and successfully in a business context without needing in-depth data mining skills. You only need to be familiar with the business area in which you want to apply data mining.

    Quick start sample

    The quick start sample describes a banking scenario. It explains the everyday situation of a bank and the goals that the bank might derive from this situation. It also describes the advantage of finding deviations and how you can find them by using the Easy Mining procedure FindDeviations.

    Scenario

    A bank has quite a lot of information about its customers. For example, it knows the gender, the age, and the profession of its customers. Besides this demographic information, the bank also collects data about the status of its customers. For example, the bank might know how long a person has been a customer of this bank.

    Of course, a bank has much more information about its customers. It might also know, for example, the banking products its customers hold, the average balance of the accounts, or the number of transactions. To keep it simple, only a small amount of information is used in this example.

    The information that the bank collected is stored in a database. You can create tables or views from this information. One row in a table or view contains the complete information about a particular customer. The table or view might look like the table that is shown in Figure 1.

    Figure 1. The input table BANK.CUSTOMERS_MASTERDATA

  • The table that is shown in Figure 1 on page 3 contains one record for each customer. Each customer is identified by a value in the column CLIENT_ID, for example, 00861101. The other columns in the table contain values for age, gender, marital status, profession, and the number of years as client at that bank.

    Creating the database DEMOBANK

    Create the database DEMOBANK by following these steps:

    1. Create the database DEMOBANK by entering the following command:

       db2 "create database DEMOBANK"

    2. Enable the database for Intelligent Miner Modeling and Intelligent Miner Scoring with the recommended configuration parameters for this database by entering the following command:

       idmenabledb DEMOBANK dbcfg

       If the executable file idmenabledb is not in the search path, you can find it in the bin directory of your Intelligent Miner Modeling or Intelligent Miner Scoring installation path.

    3. Create the sample tables for the Easy Mining procedures, including the BANK.CUSTOMERS_MASTERDATA table, by going to the EasyMining subdirectory in the samples directory of your Intelligent Miner Modeling or Intelligent Miner Scoring installation path and entering the following command:

       db2 -tvf SampleTables.db2
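    The three setup steps can be collected in a small script. The following Python sketch is only an illustration: the function names and the samples_dir argument are hypothetical, and by default the script only prints the commands instead of running them. It assumes that db2 and idmenabledb are on the search path when the commands are actually executed.

```python
import subprocess


def setup_commands(database="DEMOBANK"):
    """Return the three setup commands from the steps above, in order."""
    return [
        ["db2", f"create database {database}"],
        ["idmenabledb", database, "dbcfg"],
        ["db2", "-tvf", "SampleTables.db2"],
    ]


def run_setup(samples_dir, database="DEMOBANK", dry_run=True):
    """Print each command; execute it only when dry_run is False.

    samples_dir is a hypothetical argument: the EasyMining samples
    directory that contains SampleTables.db2.
    """
    for command in setup_commands(database):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the script at the first failing step.
            subprocess.run(command, cwd=samples_dir, check=True)


if __name__ == "__main__":
    run_setup(samples_dir=".")  # dry run: only prints the commands
```

    Keeping the commands in a list makes it easy to review them before anything is changed on the database server.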

    Business goals of banks

    The bank might want to find the customers who are in some respect different from the other customers. These customers might have been neglected in the past because they fall outside the standard target groups. However, they might be as profitable as other groups of customers.

    Another goal of the bank might be to find implausible combinations of values that should be corrected in this table. For example, the value employee in the PROFESSION column combined with the value 80 in the AGE column indicates that the record needs to be updated because someone of this age is unlikely to be actively working as an employee.

    This means that the bank is looking for deviations.

    Finding deviations

    You can find deviations by using the Easy Mining procedure FindDeviations.

    Use the following command to run the Easy Mining procedure:

    db2 "call IDMMX.FindDeviations('BANK.DEV_CUSTOMERS', 'BANK.CUSTOMERS_MASTERDATA')"

    Where:

    BANK.DEV_CUSTOMERS is the name of the view that you want to create

    BANK.CUSTOMERS_MASTERDATA is the name of the input table within which you want to find deviations

  • The output view BANK.DEV_CUSTOMERS

    The result of this call is the output view BANK.DEV_CUSTOMERS that is shown in Figure 2.

    The BANK.DEV_CUSTOMERS view contains the columns of the input table BANK.CUSTOMERS_MASTERDATA and the following additional columns:

    CLUSTER_ID
        This column contains the identification of the generated clusters.

    DEV_DEGREE
        This column contains the measure of how much this record deviates from the average of all records in the BANK.DEV_CUSTOMERS view. The higher this value, the greater the deviation.

    The records in the BANK.DEV_CUSTOMERS view that is shown in Figure 2 are sorted in descending order by their deviation degree value. The first two rows in this table have a deviation degree of 7506. This means that these rows represent the most remarkable deviations. These rows include the following values:

    - The values worker and intermediate professions in the PROFESSION column
    - The value F in the GENDER column
    - The values 66 and 64 in the AGE column

    Typically, women at the age of 64 or 66 are already retired. Therefore this information might not be up to date.

    The records with the deviation degree of 3753 represent a group of 4 women at the age of approximately 30 years who are separated or divorced. They are new customers. They are inactive or they have an intermediate profession. Therefore they might not represent an interesting target group for sales of securities. The same is likely to be true for the next group with a deviation degree of 3002.4. This group represents old retired men.
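    A ranking like the one described above can be produced with an ORDER BY on the DEV_DEGREE column. The following Python sketch uses the built-in sqlite3 module as a stand-in for DB2 so that it runs on its own. The table is unqualified, the CLIENT_ID and CLUSTER_ID values and the age of the retired man are invented for illustration, and only the deviation degrees and the characteristics of the first two rows are taken from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the output view BANK.DEV_CUSTOMERS (schema name omitted).
conn.execute(
    "CREATE TABLE DEV_CUSTOMERS ("
    "CLIENT_ID TEXT, AGE INTEGER, GENDER TEXT, PROFESSION TEXT, "
    "CLUSTER_ID INTEGER, DEV_DEGREE REAL)"
)
# Illustrative rows only; identifiers and cluster numbers are made up.
rows = [
    ("C3", 31, "F", "inactive", 2, 3753.0),
    ("C1", 66, "F", "worker", 1, 7506.0),
    ("C4", 70, "M", "retired", 3, 3002.4),
    ("C2", 64, "F", "intermediate professions", 1, 7506.0),
]
conn.executemany("INSERT INTO DEV_CUSTOMERS VALUES (?, ?, ?, ?, ?, ?)", rows)


def top_deviations(connection, limit):
    """Return the records with the highest deviation degree first."""
    # On DB2, FETCH FIRST n ROWS ONLY takes the place of LIMIT.
    query = (
        "SELECT CLIENT_ID, PROFESSION, GENDER, AGE, DEV_DEGREE "
        "FROM DEV_CUSTOMERS ORDER BY DEV_DEGREE DESC, CLIENT_ID LIMIT ?"
    )
    return connection.execute(query, (limit,)).fetchall()


for row in top_deviations(conn, 2):
    print(row)
```

    Sorting by CLIENT_ID as a second key only makes the order of records with equal deviation degrees reproducible.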

    Figure 2. The output view BANK.DEV_CUSTOMERS

  • For more information about the FindDeviations procedure, see Finding deviations (FindDeviations procedure) on page 7.

    How to continue

    The section "Using Easy Mining procedures for typical mining tasks" describes the available Easy Mining procedures for typical mining tasks. It starts with the complete description of the FindDeviations procedure, which is introduced in this chapter as a quick start example. If the information about the FindDeviations procedure in this chapter is sufficient for you, you can skip Finding deviations (FindDeviations procedure) on page 7 and continue with Finding groups with similar characteristics (ClusterTable procedure) on page 9.

    Using Easy Mining procedures for typical mining tasks

    This section describes a set of typical data mining tasks, for example, finding unusual deviations in your data. You can try out these tasks on your own data by using the Easy Mining procedures. This might reveal interesting information that you were not aware of before because it was hidden in your data.

    For your first attempts, you can use the Easy Mining procedures with their default parameter settings. However, by using the default parameter settings, not all of the information that was previously hidden might be revealed. You can get even better results by using additional option strings.

    Each of the following sections is divided into the following subsections:

    When to do it
        This subsection describes a couple of scenarios where you can apply the appropriate Easy Mining procedure.

    How to do it
        This subsection explains the Easy Mining procedure and its results.

    Example
        This subsection illustrates the Easy Mining procedure by means of an example.

    How to go further
        This subsection provides and explains additional parameters. It also provides hints and tips on how to refine the results. It helps you to improve the results that you obtained by applying the basic information that is provided in the previous chapters.

    It is assumed that your database contains tables or views that include many data records. Each data record describes different characteristics of a distinct entity. For example:

    - In the retail business, the distinct entity might represent a customer. The characteristics of the customer might include information about the age and the gender of the customer, the preferred shopping day, the products the customer has bought in the past, or the revenue the retail store has made with the customer in the last years.
    - In the manufacturing business, the distinct entity might represent a particular type of car. The characteristics of this type of car might include information about the production line in which the car was manufactured, the engine type, or the lacquer that was used for this car.

  • Finding deviations (FindDeviations procedure)

    You can find deviations in your data by using the FindDeviations procedure.

    When to do it

    If your database contains customer data, you might want to know whether some customers differ from the majority because they have combinations of characteristics that occur for only very few customers.

    Knowing such deviations in your data can be very useful. For example, some of the data records might represent customers who have unusual buying behavior. There might be customers who deserve special attention because they buy expensive products of high quality.

    What a deviation represents depends on the kind of input data that you are using. For example, deviations can indicate fraudulent behavior because there are customers who have unusually high discount rates. Other unusual combinations can represent inconsistencies in your data. For example, if you identify a customer who is 40 years old and still goes to school, there is probably something wrong with your data.

    For databases that describe characteristics of entities other than customers, the types of deviations that you can find likewise depend on the kind of input data that you are using.

    How to do it

    To find deviations, use the FindDeviations procedure.

    Syntax:

    IDMMX.FindDeviations(<output view>, <input table>)

    Input parameters:

    With the FindDeviations procedure, you must specify the following parameters:

    <output view>
        The name of the view that you want to build.

        The FindDeviations procedure creates a view and a model. The model is stored in the table IDMMX.CLUSTERMODELS under the same name as the generated view. If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view.

        This parameter is of type VARCHAR. Its size is 240.

    <input table>
        The name of the input table or the input view.

        The Easy Mining procedure ignores columns of the input table that are unlikely to be useful for creating a model, for example, key columns.

        This parameter is of type VARCHAR. Its size is 257.

  • Output:

    The FindDeviations procedure creates a view that contains the columns of the input table and the following additional columns:

    DEV_DEGREE
        This column indicates the degree of deviation for each record. The degree of deviation is a number greater than 1. High numbers represent a high degree of deviation.

    CLUSTER_ID
        This column indicates the identification of the cluster that each record belongs to.

    Small clusters are interesting because you are looking for unusual behavior. The cluster ID helps you to interpret the deviation because the typical characteristics of a cluster characterize the deviation.

    To explore the clusters in detail, you can use the table function DM_getClusters of IM Modeling, IBM InfoSphere Intelligent Miner Visualization (referred to in this book as IM Visualization), or any other visualization tool.

    Figure 3 shows the data flow of the FindDeviations procedure.

    Example

    For an example of the IDMMX.FindDeviations procedure, see Quick start sample on page 3.

    How to go further

    Sometimes you might want to specify an additional parameter for an Easy Mining procedure. You can do this by including an optional parameter string in the procedure call. For more information, see Optional parameter strings on page 99.

    Sometimes you might be interested in deviations that are based on only a subset of the columns in the input table. In this case, you can remove one or more fields from the input table.

    Removing fields from the input table:

    Figure 3. Data flow of the FindDeviations procedure

  • You might want to remove one or more fields from the input table because you do not want to use all fields to compute the model.

    To remove one field, you can use the DM_remDataSpecFld option. For example, to remove the column NBR_YEARS_CLI from the input table, you can use the following optional parameter string:

    DM_remDataSpecFld(NBR_YEARS_CLI)

    To remove more fields from the input table, you can use several DM_remDataSpecFld options in an Easy Mining procedure, or you can create a view that contains only the columns that you want to use. Use this view as input for the Easy Mining procedure.

    There are more optional parameter strings. For more information, see Optional parameter strings on page 99.

    Complete procedure call:

    The complete procedure call including the optional parameter string looks like this:

    db2 "call IDMMX.FindDeviations('BANK.DEV_CUSTOMERS', 'BANK.CUSTOMERS_MASTERDATA', 'DM_remDataSpecFld(NBR_YEARS_CLI)')"
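    As an alternative to the DM_remDataSpecFld option, the text above mentions a view that contains only the columns that you want to use. The following Python sketch shows this idea, again with the built-in sqlite3 module as a stand-in for DB2. The view name CUSTOMERS_SUBSET, the helper function, and the column name MARITAL_STATUS are assumptions made for illustration, and the schema name BANK is omitted.

```python
import sqlite3


def create_subset_view(connection, view_name, table_name, columns):
    """Create a view that exposes only the listed columns of a table.

    The names are inserted into the statement as identifiers, so they
    must come from a trusted source.
    """
    column_list = ", ".join(columns)
    connection.execute(
        f"CREATE VIEW {view_name} AS SELECT {column_list} FROM {table_name}"
    )


conn = sqlite3.connect(":memory:")
# Stand-in for BANK.CUSTOMERS_MASTERDATA; MARITAL_STATUS is an assumed name.
conn.execute(
    "CREATE TABLE CUSTOMERS_MASTERDATA ("
    "CLIENT_ID TEXT, AGE INTEGER, GENDER TEXT, MARITAL_STATUS TEXT, "
    "PROFESSION TEXT, NBR_YEARS_CLI INTEGER)"
)
create_subset_view(
    conn,
    "CUSTOMERS_SUBSET",
    "CUSTOMERS_MASTERDATA",
    ["CLIENT_ID", "AGE", "GENDER", "MARITAL_STATUS", "PROFESSION"],
)
cursor = conn.execute("SELECT * FROM CUSTOMERS_SUBSET")
view_columns = [column[0] for column in cursor.description]
print(view_columns)
```

    The resulting view, rather than the original table, would then be passed to the Easy Mining procedure as the input table.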

    Finding groups with similar characteristics (ClusterTable procedure)

    You can find groups with similar characteristics by using the ClusterTable procedure.

    When to do it

    Your database might contain customer data including demographic data, for example:

    - Gender
    - Age
    - Profession
    - Family status

    The information might also include the income or the socio-demographic group of the customer.

    Furthermore, you might have collected other customer information. This information depends on the business that you are in. For example, a retail store might collect the sales transactions of its customers. From this information, you can compute the following results:

    - How often a customer visited your store in a certain time frame
    - How much money a customer spent in total
    - How much money a customer spent on particular product categories, for example, beverages or delicatessen
    - The preferred shopping days of the week

    Other examples are an insurance company that knows the contracts that its customers have signed, or a bank that knows the accounts of its customers and the number of transactions per account.

  • These are only a few examples of the kind of information that data tables can contain. The data tables can also contain data other than customer data. A manufacturing company might collect information about the production of its products, or a retail chain might collect information about its stores.

    If you have data tables that contain this kind of information, you might want to know whether this data set has an inherent structure, or whether it contains groups of objects that are very similar.

    Knowing such groups lets you treat your customers according to the group that they belong to. For example, you can define specific product offerings or marketing campaigns targeted at each important group instead of treating all customers equally.

    The Easy Mining procedure ClusterTable might find a group of customers that prefers healthy food of high quality. This is the appropriate customer group for a promotion of ecologically produced French cheese. It makes little sense to send this group an advertisement about frozen pizza.

    How to do it

    To find groups with similar characteristics, use the ClusterTable procedure.

    Syntax:

    IDMMX.ClusterTable(<output view>, <input table>, <minimum cluster size>, <maximum cluster size>)

    Input parameters:

    With the ClusterTable procedure, you must specify the following parameters:

    <output view>
        The name of the view that you want to build.

        The ClusterTable procedure creates a view and a model. The model represents the characteristics of the clusters. It is stored in the table IDMMX.CLUSTERMODELS under the same name as the generated view. If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view.

        This parameter is of type VARCHAR. Its size is 240.

    <input table>
        The name of the input table or the input view.

        The Easy Mining procedure ignores columns of the input table that are unlikely to be useful for creating a model, for example, key columns.

        This parameter is of type VARCHAR. Its size is 257.

    <minimum cluster size>
        The value to define the minimum percentage of records in a cluster.

        This parameter is of type REAL.

  • <maximum cluster size>
        The value to define the maximum percentage of records in a cluster.

        This parameter is of type REAL.

    The values for the minimum and the maximum size of a cluster are specified in percent. For example, a minimum size of 10.0 means that even the smallest cluster contains at least 10% of the records that are contained in the input table.
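    To see what the percentage values mean in terms of records, the limits can be converted into record counts. The following Python sketch is a plain arithmetic illustration. The function name and the table size of 1000 records are hypothetical, and the rounding (up for the minimum, down for the maximum) is an assumption, not a statement about how the procedure itself rounds.

```python
import math


def cluster_size_bounds(total_records, min_percent, max_percent):
    """Convert percentage limits into minimum and maximum record counts."""
    if not 0 <= min_percent <= max_percent <= 100:
        raise ValueError("expected 0 <= min_percent <= max_percent <= 100")
    # Round the minimum up and the maximum down so that both limits hold.
    smallest = math.ceil(total_records * min_percent / 100)
    largest = math.floor(total_records * max_percent / 100)
    return smallest, largest


# A hypothetical input table with 1000 records and limits of 10% and 35%.
print(cluster_size_bounds(1000, 10.0, 35.0))  # prints (100, 350)
```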

    Output:

    The view created by the ClusterTable procedure contains the columns of the input table and the following columns:

    CLUSTER_ID
        This column contains the identification of the cluster that the record belongs to.

    QUALITY
        This column contains a value that indicates how well the record fits into the cluster.

        The value can range from 0 to 1.
        - 0 means that this record does not fit at all into this cluster.
        - 1 means that this record fits perfectly into this cluster.

        The quality value refers to a single record within the model.

    CONFIDENCE
        This column contains a value that indicates the confidence that the cluster is the best cluster for this record.

        The value can range from 0 to 1.
        - A value close to 0.5 indicates that the record fits another cluster equally well.
        - A value close to 1 indicates that the record does not fit into a different cluster.

        The confidence value refers to a single record within the model.

    To explore the characteristics of each cluster in more detail, you can open the clustering model with IM Visualization or with any other visualization tool that supports PMML. Visualizing the clustering model helps you to assess whether the clustering model is useful for you.

    Figure 4 on page 12 shows the data flow of the ClusterTable procedure.

  • Interpreting the results:

    You can use the quality value to select the best records that belong to a cluster. For example, you might have a limited budget for a mailing campaign. This budget does not allow you to address all customers that belong to a cluster. With the quality value, you can select the most promising customers to address.

    Typically, you do not want clusters that are too large. For example, a cluster that represents 75% of the whole population is not interesting because its characteristics do not differ much from those of the whole population.

    Depending on your business requirements, clusters that are too small also might not be interesting. For example, if you apply clustering for target marketing, there might be a lower limit for the size of a target group for a promotion campaign. If this target group is too small, the costs of the campaign might exceed the expected revenue.

    In other cases, however, you might be interested in very small clusters. These clusters represent the niches. For example, you might be looking for patterns of unusual behavior in your data to detect fraud or other kinds of irregularities. If you are looking for patterns of this kind, use the FindDeviations procedure. See Finding deviations (FindDeviations procedure) on page 7 for more information.
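    The quality-based selection for a limited budget can be expressed as a query that keeps the best-fitting records of one cluster. The following Python sketch uses the built-in sqlite3 module as a stand-in for DB2. The view name CLUSVIEW without a schema, the CLIENT_ID values, and all quality and confidence values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for an output view of the ClusterTable procedure.
conn.execute(
    "CREATE TABLE CLUSVIEW ("
    "CLIENT_ID TEXT, CLUSTER_ID INTEGER, QUALITY REAL, CONFIDENCE REAL)"
)
# Illustrative records only.
conn.executemany(
    "INSERT INTO CLUSVIEW VALUES (?, ?, ?, ?)",
    [
        ("C1", 1, 0.91, 0.97),
        ("C2", 1, 0.55, 0.60),
        ("C3", 2, 0.88, 0.93),
        ("C4", 1, 0.78, 0.85),
    ],
)


def best_records(connection, cluster_id, budget):
    """Return up to `budget` records of a cluster, best quality first."""
    # On DB2, FETCH FIRST n ROWS ONLY takes the place of LIMIT.
    query = (
        "SELECT CLIENT_ID, QUALITY FROM CLUSVIEW "
        "WHERE CLUSTER_ID = ? ORDER BY QUALITY DESC LIMIT ?"
    )
    return connection.execute(query, (cluster_id, budget)).fetchall()


# A budget that covers only two of the three customers in cluster 1.
print(best_records(conn, 1, 2))
```

    Here the budget is expressed as a number of records. A threshold on the QUALITY column would be an equally plausible way to make the selection.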

    Example

    Figure 5 on page 13 shows customer data of a bank. The table contains the following information:

    Demographic information
        Demographic information includes age, gender, marital status, and profession.

    Banking products
        Banking products include savings account and international credit card.

    Customer activities
        Customer activities include number of debit and credit transactions and average balance.

    Figure 4. Data flow of the ClusterTable procedure

  • In a real banking scenario, a table containing customer data probably includes more columns than the example shown in Figure 5.

    Procedure call:

    To discover interesting clusters in the table BANK.CUSTOMERS, you can use the Easy Mining procedure ClusterTable.

    Use the following command to run the Easy Mining procedure:

    db2 "call IDMMX.ClusterTable('BANK.CLUSVIEW', 'BANK.CUSTOMERS', 10, 35)"

    Where:

    IDMMX.ClusterTable is the name of the Easy Mining procedure

    BANK.CLUSVIEW is the name of the view that you want to create

    BANK.CUSTOMERS is the name of the input table

    10 is the value that you specified for the minimum percentage of records in a cluster

    35 is the value that you specified for the maximum percentage of records in a cluster
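    Before the call is sent to the database, its arguments can be checked against the documented parameter sizes. The following Python sketch builds the CALL statement as text. The helper names are hypothetical, and the sketch assumes that the two VARCHAR parameters are passed as single-quoted SQL string literals and that the percentages must satisfy 0 <= minimum <= maximum <= 100.

```python
def sql_literal(text):
    """Quote text as an SQL string literal, doubling embedded quotes."""
    return "'" + text.replace("'", "''") + "'"


def cluster_table_call(output_view, input_table, min_percent, max_percent):
    """Build the CALL statement for IDMMX.ClusterTable after basic checks."""
    # Sizes of the VARCHAR parameters as given in the parameter list.
    if len(output_view) > 240:
        raise ValueError("output view name exceeds 240 characters")
    if len(input_table) > 257:
        raise ValueError("input table name exceeds 257 characters")
    # Assumed plausibility check for the percentage values.
    if not 0 <= min_percent <= max_percent <= 100:
        raise ValueError("expected 0 <= min_percent <= max_percent <= 100")
    return (
        f"call IDMMX.ClusterTable({sql_literal(output_view)}, "
        f"{sql_literal(input_table)}, {min_percent}, {max_percent})"
    )


print(cluster_table_call("BANK.CLUSVIEW", "BANK.CUSTOMERS", 10, 35))
```

    The returned text is the statement that would be placed between the double quotation marks of the db2 command.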

    Output: Figure 6 on page 14 shows the data flow of the ClusterTable procedure based on the example used in this section.

    Figure 5. The input table BANK.CUSTOMERS

  • The output view BANK.CLUSVIEW shown in Figure 7 contains the columns of the input table BANK.CUSTOMERS and, in addition, the columns CLUSTER_ID, QUALITY, and CONFIDENCE. The CLUSTER_ID column contains values from 1 to 4. This means that the Clustering mining function has computed 4 clusters.
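    The relative size of each cluster can be checked directly on the output view by counting the records per CLUSTER_ID. The following Python sketch uses the built-in sqlite3 module as a stand-in for DB2. The ten records and their cluster assignments are invented for illustration, the helper function is hypothetical, and the schema name BANK is omitted.

```python
import sqlite3


def cluster_shares(connection, view_name):
    """Return the percentage of records per cluster, keyed by CLUSTER_ID."""
    # view_name is inserted as an identifier and must be trusted.
    total = connection.execute(f"SELECT COUNT(*) FROM {view_name}").fetchone()[0]
    if total == 0:
        return {}
    query = (
        f"SELECT CLUSTER_ID, COUNT(*) FROM {view_name} "
        "GROUP BY CLUSTER_ID ORDER BY CLUSTER_ID"
    )
    return {
        cluster_id: 100.0 * count / total
        for cluster_id, count in connection.execute(query)
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CLUSVIEW (CLIENT_ID TEXT, CLUSTER_ID INTEGER)")
# Ten illustrative records spread over four clusters.
assignments = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4]
conn.executemany(
    "INSERT INTO CLUSVIEW VALUES (?, ?)",
    [(f"C{index}", cluster) for index, cluster in enumerate(assignments)],
)

shares = cluster_shares(conn, "CLUSVIEW")
print(shares)
# Compare the shares with the limits of 10% and 35% used in the call.
print(all(10 <= share <= 35 for share in shares.values()))
```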

    Figure 6. Data flow of the ClusterTable procedure based on the example used in this section

    Figure 7. The output view BANK.CLUSVIEW

  • The result view BANK.CLUSVIEW shown in Figure 7 on page 14 shows only which record belongs to which cluster. If you want to analyze the characteristics of the clusters, you can open the clustering model BANK.CLUSVIEW that is stored in the IDMMX.CLUSTERMODELS table with IM Visualization or with any other visualization tool that supports PMML.

    Analyzing the characteristics:

    Figure 8 shows the clustering model BANK.CLUSVIEW in IM Visualization. The Graphic View of the Clustering visualizer shows that there are 4 clusters. The largest cluster contains 33.71% of the total population. The smallest cluster contains 13.56% of the total population.

    The pie charts and the bar charts show the distribution of the values of the columns in the clusters compared to the total population.

    In the pie charts, the inner circle represents the population of a cluster. The outer circle represents the total population. For example, the pie chart INT_CREDITCARD in Figure 9 on page 16 shows that only a few customers in cluster 1 own an international credit card compared to the total number of customers.

    Figure 8. The output view BANK.CLUSVIEW displayed by the Clustering visualizer

  • In the bar charts, the outlined histograms represent the distribution of the population of a cluster. The compact histograms represent the total population. For example, the bar chart NO_DEBIT_TRANS in Figure 10 shows that the customers in this cluster are less active compared to the total number of customers.

    Only in the two leftmost histograms of the bar chart NO_DEBIT_TRANS is the relative frequency of debit transactions in cluster 1 higher than in the total population. These histograms indicate the fraction of customers with the lowest number of debit transactions. These customers are the least active with regard to debit transactions.

    Figure 9. The pie chart INT_CREDITCARD

    Figure 10. The bar chart NO_DEBIT_TRANS

    The columns INT_CREDITCARD and NO_DEBIT_TRANS show you that the customers included in this cluster are not the most interesting customers.

    How to go further

    Sometimes you might want to set an additional parameter for an Easy Mining procedure. You can do this by including an optional parameter string in the procedure call. For more information, see Optional parameter strings on page 99.

    There are many ways to partition the contents of a table into homogeneous clusters. The Clustering mining function uses one of these ways. Often the first clustering result does not fully satisfy your business requirements, so you need to refine the results.

    The clustering results depend mainly on the input data. Therefore the main issue is to select the appropriate columns as input for the Clustering mining function. The ClusterTable procedure uses all columns of an input table. To specify particular columns as input data, you can choose between the following options:

    - Removing one or more fields from the input table
    - Defining a column as supplementary for the Clustering mining function

    Removing fields from the input table:

    You can remove one or more fields from the computation of the model. For more information, see Removing fields from the input table on page 8.

    Defining supplementary columns:

    Values of supplementary columns are not used to compute the similarity of records. However, statistics of these columns are computed and included in the generated model for reference purposes.

    To define a column as supplementary, you must set the field usage type of this column to 2. You can set the field usage type with the DM_setFldUsageType option. For example, to define the column PROFESSION as supplementary, you can use the following option string:

        DM_setFldUsageType(PROFESSION,2)

    Where:

    DM_setFldUsageType is the options string parameter

    PROFESSION is the column that you want to define as supplementary column

    2 is the value that denotes a column as supplementary column

    There are more optional parameter strings. For more information, see Optional parameter strings on page 99.

    Complete procedure call:

    The complete procedure call including the optional parameter string looks like this:

        db2 "call IDMMX.ClusterTable(BANK.CLUSVIEW, BANK.CUSTOMERS, 10, 35, DM_setFldUsageType(PROFESSION,2))"

    Finding relationships (FindRules procedure)

    You can find relationships in your data by using the FindRules procedure.

    When to do it

    Your database might include a data table that contains customer data. You might want to find out whether there are relationships between the values in the columns of this table. Such a relationship might indicate, for example, that in a certain percentage of cases a column has a specific value if other columns have a specific combination of values.

    For example, the FindRules procedure might find out that 70% of the male customers who have online access to their account also have a credit card. This is valuable cross-selling information that you can exploit in the next marketing campaign.

    You can also apply the FindRules procedure to retail transaction data. A sales transaction consists of the items that a customer has bought during a visit to the retail store. The information is stored in a table with at least the following columns:

    - The identification of the sales transaction
    - The purchased items

    The items with the same associated transaction ID are bought together. If you apply the FindRules procedure to a transaction table, you might find relationships between the purchased items. They indicate, for example, that in 45% of the cases in which customers buy cereals, they also buy fruit. With this cross-selling information, you can decide where to place the products in your store or which products to discount in the next marketing campaign.

    How to do it

    To find relationships in your data, use the FindRules procedure.

    Syntax: IDMMX.FindRules(, , , , )

    Input parameters:

    To find relationships in your data, you must specify the following parameters for the FindRules procedure:

    The name of the view that you want to build.

    The FindRules procedure creates a view and a model. The model is stored in the table IDMMX.RuleModels under the same name as the generated view. If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view.

    This parameter is of type VARCHAR. Its size is 240.

    The name of the input table or the input view.

    The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns.

    This parameter is of type VARCHAR. Its size is 257.

    The name of the column that contains the group or transaction ID.

    If you specify a column of the input table as the GROUP column, the remaining columns of the input table are used as item columns.

    If you specify NULL for the GROUP column, each record represents a transaction and all columns of the input table are used as item columns.

    This parameter is of type VARCHAR. Its size is 128.

    For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.

    The maximum number of rules to be generated.

    The generated rules are stored in a view.

    This parameter is of type INTEGER.

    A value to measure the validity of the rule.

    The confidence of each rule is greater than or equal to the value that you specified for minimum confidence.

    This parameter is of type REAL.
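    The role of the group column described above can be illustrated with a short Python sketch. It is not how the FindRules procedure is implemented; the function names and the sample rows are hypothetical, and items are written as column=value pairs purely for illustration.

```python
from collections import defaultdict

def transactions_with_group(rows, group_col):
    """Group rows by group_col; every other column contributes 'column=value' items."""
    groups = defaultdict(set)
    for row in rows:
        for col, value in row.items():
            if col != group_col:
                groups[row[group_col]].add(f"{col}={value}")
    return dict(groups)

def transactions_without_group(rows):
    """No group column (NULL): each record is its own transaction,
    and all columns are used as item columns."""
    return [{f"{col}={value}" for col, value in row.items()} for row in rows]

# Hypothetical sample rows in the shape of a (group ID, item) table.
rows = [
    {"CLIENT_ID": 856, "PRODUCT": "safe"},
    {"CLIENT_ID": 856, "PRODUCT": "bank card"},
    {"CLIENT_ID": 395821, "PRODUCT": "safe"},
]

by_client = transactions_with_group(rows, "CLIENT_ID")
per_record = transactions_without_group(rows)
print(by_client)
print(per_record)
```

    With a group column, the two rows of client 856 are merged into one transaction. Without one, each of the three rows is treated as a separate transaction.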

    Output:

    The Associations mining function computes association rules. The set of computed association rules is called the rule model.

    The first part of an association rule is called the rule body. The second part of an association rule is called the rule head.

    If you use the Associations mining function, for example, on customer data of a bank, the rule body and the rule head might represent column-value pairs, for example, online_access=YES. The association rule is interpreted like this:

        If online_access=YES then bankcard=YES

    If you use the Associations mining function, for example, on retail transaction data, the rule body and the rule head might represent articles that occur in retail transactions, for example, chocolate or candies. The association rule is interpreted like this:

        If customers buy chocolate, they also buy candies.

    Association rules include the following attributes:

    Confidence
        The confidence value represents the validity of the rule. A confidence value of 50% means that in 50% of the cases where a particular rule body is present in a group, a particular rule head is also present.

        For example, rule ID 123 in Figure 13 on page 22 indicates that in 57.576% of the cases where customers have a savings account with a building society, they also have a home insurance contract and a popular savings plan.

    Support
        The support value states how many records or how many transactions are covered by a rule. The value for support is expressed as a percentage of the total number of records or transactions.

    Lift
        The lift value indicates how much higher the confidence value is than expected. It is defined as the quotient of the confidence value and the support value of the rule head.

        The support value of the rule head can be considered as the expected value for the confidence. It indicates the relative frequency of the rule head in the whole transaction set.

        The confidence value of the association rule indicates the relative frequency of the rule head among the transactions that contain the items of the rule body. For example, consider the following rule:

            If customers buy chocolate, they also buy candies

        If the confidence of this rule is 30%, and the frequency of the customers who buy candies (which is the support of the rule head) is 10%, then without the rule you would expect 10% of the customers who buy chocolate to also buy candies. Therefore 10% is the expected confidence, and 30% / 10% = 3.0 is the lift of the rule.
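    The three attributes can be computed directly from a set of transactions. The following Python sketch applies the definitions above to a small, invented list of retail transactions. The function name and the data are hypothetical, and the sketch assumes that the rule body occurs in at least one transaction.

```python
def rule_metrics(transactions, body, head):
    """Return (support, confidence, lift) of the rule body -> head as fractions."""
    body, head = set(body), set(head)
    n = len(transactions)
    n_body = sum(1 for t in transactions if body <= t)
    n_head = sum(1 for t in transactions if head <= t)
    n_both = sum(1 for t in transactions if (body | head) <= t)
    support = n_both / n              # share of transactions covered by the rule
    confidence = n_both / n_body      # share of body transactions that contain the head
    lift = confidence / (n_head / n)  # confidence relative to the head's own support
    return support, confidence, lift

# Invented example data: 10 retail transactions.
transactions = [
    {"chocolate", "candies"},
    {"chocolate", "candies", "fruit"},
    {"chocolate", "cereals"},
    {"chocolate"},
    {"candies", "fruit"},
    {"cereals", "fruit"},
    {"cereals"},
    {"fruit"},
    {"cereals", "fruit"},
    {"fruit"},
]

support, confidence, lift = rule_metrics(transactions, {"chocolate"}, {"candies"})
print(f"support={support:.0%} confidence={confidence:.0%} lift={lift:.2f}")
```

    In this data, 2 of 10 transactions contain both items (support 20%), 2 of the 4 chocolate transactions contain candies (confidence 50%), and candies occur in 3 of 10 transactions, so the lift is 0.5 / 0.3, or about 1.67.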

    Figure 11 shows the data flow of the FindRules procedure.

    Example

    The FindRules procedure is very useful in the banking business.

    A bank knows the banking products that its customers own. This information might be stored in the table BANK.CUSTOMER_PRODUCTS. This table contains the following columns:

    - CLIENT_ID
    - PRODUCT

    The table might look like this:

    Figure 11. Data flow of the FindRules procedure

    Because customers can own more than one banking product, the table can contain more than one record for a given client ID. The table in Figure 12 shows that the customer with the client ID 395821 owns one product while the customer with the client ID 856 owns five products.

    The bank might want to know whether there are relationships between the products that a customer owns. You can use the FindRules procedure to find such relationships.

    Procedure call

    Use the following command to run the Easy Mining procedure:

        db2 "call IDMMX.FindRules(BANK.PRODUCT_RULES, BANK.CUSTOMER_PRODUCTS, CLIENT_ID, 100, 30)"

    Where:

    BANK.PRODUCT_RULES is the name of the view to be created, which contains the rules

    BANK.CUSTOMER_PRODUCTS is the name of the input table

    CLIENT_ID is the group column, which identifies a particular customer

    100 is the maximum number of rules to be generated

    30 is the minimum value for confidence

    Results

    The result of calling the IDMMX.FindRules procedure might look like this:

    Figure 12. The input table BANK.CUSTOMER_PRODUCTS

    In Figure 13, rule 122 includes the product home insurance contract in the BODYTEXT column because the home insurance contract represents the body of the rule. The product savings account with a building society represents the head of the rule. Therefore it is included in the HEADNAME column.

    - The support value of 6.187 means that 6.187% of all customers have a home insurance contract and a savings account with a building society.
    - The confidence value of 41.429 means that in 41.429% of the cases customers who have a home insurance contract also have a savings account with a building society.
    - The lift value of 1.21 means that customers who have the item in the rule body are 1.21 times more likely than customers in general to also have the item in the rule head.
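    As a plausibility check, the three values of rule 122 can be related to each other with the definitions of support, confidence, and lift given earlier. The following Python sketch derives the support of the rule body and of the rule head that the quoted values imply. The variable names are illustrative, and because the quoted lift is rounded, the derived values are approximate.

```python
# Values of rule 122 as quoted in the text (all percentages).
support = 6.187       # customers who have both the body and the head product
confidence = 41.429   # body holders who also have the head product
lift = 1.21           # confidence divided by the support of the rule head

# confidence = support / body_support  =>  body_support = support / confidence
body_support = support / confidence * 100
# lift = confidence / head_support     =>  head_support = confidence / lift
head_support = confidence / lift

print(f"implied support of the rule body: {body_support:.2f}%")
print(f"implied support of the rule head: {head_support:.2f}%")
```

    Under these definitions, roughly 15% of all customers have a home insurance contract and roughly 34% have a savings account with a building society.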

    How to go further

    Sometimes you might want to set an additional parameter for an Easy Mining procedure. You can do this by including an optional parameter string in the procedure call. For more information, see Optional parameter strings on page 99.

    When you use the FindRules procedure, you can do the following actions:

    - Setting the name for the item ID column
    - Using name mappings
    - Removing one or more fields from the input table. For more information, see Removing fields from the input table on page 8.

    Setting the name of the item ID column:

    The input table BANK.CUSTOMER_PRODUCTS that is shown in Figure 12 on page 21 contains the following columns:

    CLIENT_ID This column represents the group_ID column.

    Figure 13. The output view BANK.PRODUCT_RULES

    PRODUCT This column represents the item_ID column.

    The PRODUCT column is implicitly specified as item_ID column, because the CLIENT_ID column is specified as group_ID column.

    If you have a table with more than two columns, you can specify a particular column as the item ID column by using the DM_setItemFld option. If you specify a column as item ID column, only this column is considered by the Easy Mining procedure. If you do not specify a column as item ID column, all columns are searched for relationships.

    Example

    If you want to specify the PRODUCT column as the item ID column, the options string looks like this:

        DM_setItemFld(PRODUCT)

    Name mappings:

    Transaction tables, for example, the BANK.CUSTOMER_PRODUCTS table that is shown in Figure 12 on page 21, might contain numbers that represent the product identification instead of product names. Running the FindRules procedure on such tables generates rules that include numbers instead of product names. Such a rule might look like this:

        If [2937], then [5879]

    Where:

    2937 is the identification for the product safe

    5879 is the identification for a savings account with a building society

    This makes the rules difficult to understand. To display the appropriate product names, you can define a name mapping that maps the product ID to the appropriate name in the association rules.

    The mapping from the product ID to the product name must be defined in a table. If such a table exists, you can use the DM_addNmp option to define a name mapping, and the DM_setFldNmp option to apply the name mapping to the PRODUCT field.

    It is assumed that the mapping is defined by the table BANK.PRODUCTS. This table contains the following columns:

    ID This column contains numerical values that correspond to the product IDs of the transaction table BANK.CUSTOMER_PRODUCTS.

    DESCRIPTION This column maps the numerical values to a meaningful description, for example safe or savings account with a building society.

    There are more optional parameter strings. For more information, see Optional parameter strings on page 99.

    Defining a name mapping:

    The following option string defines a name mapping that is called PRODUCT_NAMES:

        DM_addNmp("PRODUCT_NAMES", "BANK.PRODUCTS", "ID", "DESCRIPTION")

    Applying the name mapping:

    The following option string applies the name mapping PRODUCT_NAMES to the PRODUCT column:

        DM_setFldNmp("PRODUCT", "PRODUCT_NAMES")
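    The effect of a name mapping on a rule can be pictured with a small Python sketch. It only mimics the lookup from product ID to description. The dictionary stands in for the mapping table, the function name is hypothetical, and the handling of unknown IDs is an assumption.

```python
def apply_name_mapping(item_ids, mapping):
    """Replace each item ID by its description; IDs without a mapping are kept as text."""
    return [mapping.get(item_id, str(item_id)) for item_id in item_ids]

# Stand-in for the mapping table: ID -> DESCRIPTION.
product_names = {
    2937: "safe",
    5879: "savings account with a building society",
}

body = apply_name_mapping([2937], product_names)
head = apply_name_mapping([5879], product_names)
readable_rule = f"If [{', '.join(body)}], then [{', '.join(head)}]"
print(readable_rule)
```

    The rule If [2937], then [5879] is then displayed with the product descriptions instead of the numbers.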

    Complete procedure call

    The complete procedure call including the optional parameter strings looks like this:

        db2 "call IDMMX.FindRules(BANK.PRODUCT_RULES, BANK.CUSTOMER_PRODUCTS, CLIENT_ID, 100, 30, DM_setItemFld(PRODUCT), DM_addNmp(PRODUCT_NAMES, BANK.PRODUCTS, ID, DESCRIPTION), DM_setFldNmp(PRODUCT, PRODUCT_NAMES))"

    Finding sequential relationships in your data (FindSeqRules procedure)

    You can find sequential relationships in your data by using the FindSeqRules procedure.

    When to do it

    You might have data tables that include records that can be grouped according to a particular key. These groups might represent, for example, sales transactions or orders of customers. They can also represent defects of a product.

    If groups represent sales transactions or orders of customers, the key might be the customer ID. If the groups represent defects of a product, the key might be the serial number of the product.

    You can sort the members of these groups according to another order key. For sales transactions or customer orders, this order key can be the purchase date or the order date. For product defects, this order key can be the date when the defect occurred.

    With the FindRules procedure that is described in Finding relationships (FindRules procedure) on page 18, you can identify the products that are purchased together or the defects that typically occur for one product. With the FindSeqRules procedure, you can additionally find out the sequential order in which the articles are bought or in which the defects occur. With regard to retail data, you can use this information for targeted mailing campaigns. For example, you might want to offer your customers the articles that they are likely to buy soon. With regard to defects, you can manage repair and warranty cases efficiently by proactively fixing defects that are likely to occur in the near future.

    You can also use the FindSeqRules procedure in other business areas.

    How to do it

    You can find sequential relationships in your data by using the FindSeqRules procedure.

    Syntax IDMMX.FindSeqRules(, ,

    [])

    Input parameters

    With the IDMMX.FindSeqRules procedure, you must specify the following parameters:

    The name of the view that you want to build.

    The FindSeqRules procedure creates a view and a model. The model is stored in the table IDMMX.RuleModels under the same name as the generated view. The view contains the following columns:

    - ID
    - HEADSETID
    - HEADSETTEXT
    - BODYSEQID
    - BODYSEQTEXT
    - SUPPORT
    - CONFIDENCE
    - LIFT

    If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view.

    This parameter is of type VARCHAR. Its size is 240.

    The name of the input table or the input view.

    The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns.

    This parameter is of type VARCHAR. Its size is 257.

    The name of the column that contains the sequence ID.

    A sequence contains the item sets that have the same sequence ID.

    The name of the column that contains the group or the transaction ID.

    If you specify a column of the input table as the GROUP column, the remaining columns of the input table with the exception of the sequence column are used as item columns.

    An item set contains items that have the same sequence ID and the same value in the group column.

    The item sets of a sequence are sorted according to the value in the group column.

    This parameter is of type VARCHAR. Its size is 128.

    For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.

    The maximum number of sequence rules to be generated.

    The generated rules are stored in a view.

    This parameter is of type INTEGER.

    A value to measure the validity of the sequence rule.

    The confidence of each sequence rule is greater than or equal to the value that you specified for minimum confidence.

    This parameter is of type REAL.

    Output

    Sequential relationships are represented as sequence rules. Sequence rules describe patterns in sequences. Depending on the business area, sequences might be, for example, purchases of customers or defects of cars over time.

    For example, customers might buy a digital camera and rechargeable batteries. A couple of weeks later, they buy a memory card and, again a couple of weeks later, they buy a photo printer. The sequence rule of this pattern looks like this:

        [digital camera, rechargeable batteries] >>> [memory card] ==> [photo printer]

    where:

    [digital camera, rechargeable batteries] represents an individual item set that is part of the rule body

    >>> represents a temporal ordering of item sets in ascending order

    [memory card] represents an individual item set that is part of the rule body

    ==> splits the sequence rule into a sequence rule body and a sequence rule head

    [photo printer] represents an item set that is included in the sequence rule head

    You can interpret the sequence rule above like this: If customers buy a digital camera with rechargeable batteries during their first purchase and a memory card during their second purchase, they will buy a photo printer during a subsequent purchase.

    Sequence rules include the following attributes:

    Confidence
        The confidence value represents the validity of the rule. A confidence value of 50% means that in 50% of the cases where a particular rule body is present in a sequence, a particular rule head is also present after the item sets of the rule body.

        For example, in the sequence rule above, a confidence value of 50% means that 50% of the customers who bought a digital camera with rechargeable batteries during their first visit and a memory card during any of their subsequent visits bought a photo printer during another subsequent visit.

    Support
        The support value indicates how many sequences are covered by a sequence rule. The support value is expressed as the percentage of the total number of sequences.

        For example, a support value of 2% for the sequence rule above means that 2% of all sequences contain this particular sequence.

    Lift
        The lift value indicates how much the confidence value differs from the expected confidence value.

        The lift value is computed by dividing the confidence value by the support value of the sequence rule head.

        If the support value of the sequence rule head in the example above is 10% and the confidence value of the sequence rule is 50%, the value for lift is 50% / 10% = 5.

        A lift value of 5 means that customers who buy a digital camera and rechargeable batteries during their first visit and a memory card during their second visit are 5 times more likely than average customers to buy a photo printer during a subsequent visit.

    Mean time difference
        This value indicates the mean time difference between the time stamp of the first item set and the time stamp of the last item set in a sequence.

        If the type of the group column is numeric, this value is the mean value of the group values for the sequences.

    Standard deviation of time difference
        This value indicates the standard deviation of the time difference between the time stamp of the first item set and the time stamp of the last item set in a sequence.

        If the type of the group column is numeric, this value is the standard deviation of the group values for the sequences.
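    Support, confidence, and lift of a sequence rule can be sketched in Python as follows. This is a simplified illustration, not the algorithm of the Sequence Rules mining function. The function names and the sample rows are hypothetical, dates are ISO strings so that they sort chronologically, and a pattern is assumed to match when its item sets occur in the given order anywhere in a sequence.

```python
from collections import defaultdict

def build_sequences(rows):
    """rows: (sequence_id, group_value, item) tuples.
    Returns {sequence_id: [item sets ordered by group_value]}."""
    by_seq = defaultdict(lambda: defaultdict(set))
    for seq_id, group_value, item in rows:
        by_seq[seq_id][group_value].add(item)
    return {
        seq_id: [groups[g] for g in sorted(groups)]
        for seq_id, groups in by_seq.items()
    }

def contains(sequence, pattern):
    """True if the item sets of pattern occur in sequence in the same order."""
    pos = 0
    for wanted in pattern:
        # Advance to the next item set that includes all wanted items.
        while pos < len(sequence) and not wanted <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(sequences, pattern):
    """Fraction of sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences.values()) / len(sequences)

# Hypothetical purchases: (customer, purchase date, item).
rows = [
    (1, "2008-01-05", "digital camera"),
    (1, "2008-01-05", "rechargeable batteries"),
    (1, "2008-02-10", "memory card"),
    (1, "2008-03-20", "photo printer"),
    (2, "2008-01-07", "digital camera"),
    (2, "2008-01-07", "rechargeable batteries"),
    (2, "2008-02-01", "memory card"),
    (3, "2008-01-09", "photo printer"),
    (4, "2008-01-11", "digital camera"),
    (5, "2008-02-02", "memory card"),
]
sequences = build_sequences(rows)

body = [{"digital camera", "rechargeable batteries"}, {"memory card"}]
head = [{"photo printer"}]

rule_support = support(sequences, body + head)
confidence = rule_support / support(sequences, body)
lift = confidence / support(sequences, head)
print(rule_support, confidence, lift)
```

    In this data, one of five sequences contains the whole rule (support 20%), two contain the rule body (confidence 50%), and two contain the rule head, which gives a lift of 0.5 / 0.4 = 1.25.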

    Example

    In addition to the client ID and the banking product that are known from the example of the FindRules procedure, the bank also knows the date when customers bought the banking products. The table might look like this:

    Figure 14 shows that one customer can have various banking products that were bought on different dates. Because customers can own more than one banking product, the table can contain more than one record for a given client ID. For example, the customer with the client ID 856 owns 5 products.

    The bank might want to know whether there are sequential relationships between the products that a customer owns. You can use the FindSeqRules procedure to find sequential relationships.

    Use the following command to run the FindSeqRules procedure:

        call IDMMX.FindSeqRules(BANK.PRODUCT_SEQRULES, BANK.CUSTOMER_PRODUCTS2, CLIENT_ID, DATE, 100, 10);

    Where:

    BANK.PRODUCT_SEQRULES is the name of the generated table that contains the sequence rules

    BANK.CUSTOMER_PRODUCTS2 is the name of the input table

    CLIENT_ID is the sequence column that contains the customer ID

    DATE is the group column that contains the date of the purchase

    100 is the maximum number of rules to be generated

    10 is the minimum value for confidence. This means that you want sequence rules with a confidence value of 10% or higher.

    Figure 14. The input table BANK.CUSTOMER_PRODUCTS

    The generated table BANK.PRODUCT_SEQRULES might look like this:

    In Figure 15, the sequence rules are sorted by the lift value in descending order. Rule 79 looks interesting. It states that 26% of the customers who first have a CODEVI savings account and after that a popular savings plan next sign a savings plan with a building society. The lift value indicates that this is 3.2 times more probable for these customers than for customers in general. Therefore, offering a savings plan with a building society to customers with a CODEVI account and a popular savings plan is 3.2 times more likely than in general to lead to a signed contract.

    How to go further

    Sometimes you might want to set an additional parameter for an Easy Mining procedure. You can do this by including an optional parameter string in the procedure call. For more information, see Optional parameter strings on page 99.

    When you use the FindSeqRules procedure, you can do the following tasks:

    - Setting the name for the item ID column. For more information, see Setting the name of the item ID column on page 22.
    - Using name mappings. For more information, see Name mappings on page 23.

    Figure 15. The generated table BANK.PRODUCT_SEQRULES

    Prediction of future behavior (PredictColumn procedure)

    You can predict future behavior by using the PredictColumn procedure.

    When to do it

    Your database might contain customer data. In the tables or views of your database, there might be one column that you are particularly interested in. For example, if you want to know the aggregated revenue of each customer who visited your shop last year, you might be interested in the AGGREGATED REVENUE column. This column is called the target column.

    You might want to know whether there are relationships between the values in the target column AGGREGATED REVENUE and the values of the other columns, so that you can predict the values of the target column from the values of the other columns. If you have new customer data that does not yet contain values in the AGGREGATED REVENUE column, you can predict the estimated revenue of new customers.

    If you have new customer data that does not yet have values in the target column CATEGORY, you can predict the category that the new customers best fit into. This information helps you to plan a marketing campaign for this particular customer category.

    In customer relationship management, you might want to predict which customers are likely to buy certain products. This information helps you to promote cross-selling.

    In the health care industry, you can find relations between symptoms and diseases. With this information, you can predict the potential diseases of new patients.

    How to do it

    To predict future behavior, you can use the IDMMX.PredictColumn procedure.

    Syntax: IDMMX.PredictColumn(, , )

    Input parameters:

    To predict future behavior, you must specify the following parameters for the PredictColumn procedure:

    The name of the view that you want to build.

    The PredictColumn procedure creates a view and a model. Depending on the mining function that is used to build the model, the model is stored in one of the following tables under the same name as the generated view:

    - IDMMX.ClassifModels if the target column is categorical
    - IDMMX.RegressionModels if the target column is numeric

    If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view.

    This parameter is of type VARCHAR. Its size is 240.

    The name of the input table or the input view.

    The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns.

    This parameter is of type VARCHAR. Its size is 257.

    The name of the target column.

    The PredictColumn procedure derives the values in this column from the values of the other columns in the input table. If the values in the target column are categorical, the Classification mining function is used. If the values in the target column are numeric, the Regression mining function is used.

    This parameter is of type VARCHAR. Its size is 128.

    For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.

    Data flow: Figure 16 shows the data flow of the PredictColumn procedure. By applying the PredictColumn procedure to the input table with the specified target column, a model and a view are generated. The view includes the columns of the input table and the columns PREDICTION and CONFIDENCE.

    Output: Based on the input parameters, the PredictColumn procedure creates a view. This view contains the columns of the input table and the following additional columns:

    PREDICTION This column contains the predicted values of the target column. These values are derived from the values of the input table.

    CONFIDENCE This column contains the confidence value of the prediction.

    If the target column is categorical, the confidence value can range from 0 to 1.

    - A value close to 0 indicates a low probability that the prediction is correct.
    - A value close to 1 indicates a high probability that the prediction is correct.

    If the target column is numeric, this column contains only null values.

    With the prediction confidence, you can select the most reliable predictions.

    Figure 16. Data flow of the PredictColumn procedure

    To analyze the prediction model in detail, you can use IM Visualization or any other visualization tool that supports PMML.

    Data flow of the PredictColumn procedure:

    The PredictColumn procedure splits the input data into the following disjoint data sets:

    Training data set
        The training data set is used to compute the prediction model.

    Validation data set
        The quality of the prediction model is measured on the records of the validation data set.

    The model quality indicates how well the model might perform on unknown data. Typically, the model quality is better on the training data than on the validation data because the model might be tuned towards the records of the training data set.

    In the extreme case, it is as if you learned all records of the training data set by heart. You would then have an optimal model quality for the training data set because the predictions would be correct for all of its records. On the other hand, you would not know what to predict for a record of the validation data set unless it had the same values as a record in the training data set. Therefore, to compute the quality of the model, it is better to use data records that were not used in the training phase.
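    The "learning by heart" case can be demonstrated with a few lines of Python. The sketch below is only an illustration of why quality is measured on a separate validation data set. The split ratio, the random shuffling, the class and function names, and the generated records are all assumptions, and nothing here reflects how the PredictColumn procedure actually splits the data or builds its model.

```python
import random
from collections import Counter

def split_records(records, validation_fraction=0.3, seed=42):
    """Shuffle the records and split them into disjoint training and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * validation_fraction)
    return shuffled[:cut], shuffled[cut:]

class Memorizer:
    """A 'model' that learns the training records by heart."""

    def fit(self, records):
        self.known = dict(records)  # features -> label
        # Fallback for unseen records: the most frequent training label.
        self.default = Counter(label for _, label in records).most_common(1)[0][0]
        return self

    def predict(self, features):
        return self.known.get(features, self.default)

def accuracy(model, records):
    """Fraction of records whose label is predicted correctly."""
    hits = sum(model.predict(features) == label for features, label in records)
    return hits / len(records)

# Hypothetical records: (features, label) with unique feature tuples.
records = [((i, (i * 37) % 101), "YES" if i % 3 == 0 else "NO") for i in range(100)]

training, validation = split_records(records)
model = Memorizer().fit(training)
training_accuracy = accuracy(model, training)
validation_accuracy = accuracy(model, validation)
print(training_accuracy, validation_accuracy)
```

    The memorizing model is always right on the training records, but for the validation records it can only fall back to the most frequent label, so its training accuracy overstates how well it would perform on unknown data.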

    Example

    The input data in Figure 17 on page 33 represents a banking scenario. It shows the following information about banking customers:

    Demographic information Demographic information includes age, gender, marital status, or profession.

    Business specific information Business specific information includes the average balance, the number of years a customer is a client of this bank, savings account, or online access to accounts.

    The BANKCARD column in the BANK.BANKCUSTOMERS table indicates whether customers have a bank card. Based on this information, the management of the bank might plan a promotion campaign for bank cards.

    The bank management needs to know which customers are good candidates for a bank card offer. Therefore the bank management wants to find out the characteristics of customers who have a bank card, and the characteristics of customers who do not have a bank card. Good candidates for the promotion are customers who do not yet have a bank card although they have the characteristics of a bank card holder.

    Input parameters: To determine the characteristics of bank card holders and non-bank card holders, you can use the PredictColumn procedure.

    Use the following command to run the Easy Mining procedure:

    db2 "call IDMMX.PredictColumn('BANK.BANKCARD_PRED', 'BANK.BANKCUSTOMERS', 'BANKCARD')"

    Where:

    BANK.BANKCARD_PRED is the name of the result view to be generated

    BANK.BANKCUSTOMERS is the name of the input table

    BANKCARD is the name of the target column

    Figure 17. Data flow of the PredictColumn procedure based on the example in this section

  • Output: Figure 17 on page 33 shows the data flow of the PredictColumn procedure: applying the PredictColumn procedure to the input table BANKCUSTOMERS produces a model and an output view. The model and the output view have the same name, for example, BANK.BANKCARD_PRED. The model is stored in the IDMMX.ClassifModels table. The output view BANK.BANKCARD_PRED that is shown in Figure 17 on page 33 contains the columns of the input table and the following additional columns:

    DATA_SET The values in this column show the data set that this record belongs to, for example, the training data set or the validation data set.

    PREDICTION This column contains the predicted values. These are the values that the model expects for the column BANKCARD. The type of this column is the same as the type of the target column BANKCARD.

    CONFIDENCE This column contains the confidence value for the prediction.

    Selection of candidates for bank card promotions: You can use the BANK.BANKCARD_PRED output view to select candidates for a bank card promotion.

    Potential candidates for the bank card promotion are the customers who are identified as not having a bank card yet although they have the characteristics of bank card holders. In the BANK.BANKCARD_PRED output view that is shown in Figure 17 on page 33, these candidates are represented by the records that have a NO value in the BANKCARD column and a YES value in the PREDICTION column.
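
    The selection described above amounts to a simple filter on the output view. The following Python sketch uses an in-memory SQLite table as a stand-in for BANK.BANKCARD_PRED; the CLIENT_ID column, the sample rows, and the ordering by CONFIDENCE are assumptions made for illustration and are not taken from the product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for the BANK.BANKCARD_PRED output view, reduced to a few columns.
# CLIENT_ID is a hypothetical key column.
conn.execute(
    "CREATE TABLE BANKCARD_PRED ("
    "CLIENT_ID INTEGER, BANKCARD TEXT, PREDICTION TEXT, CONFIDENCE REAL)"
)
conn.executemany(
    "INSERT INTO BANKCARD_PRED VALUES (?, ?, ?, ?)",
    [
        (1, "NO", "YES", 0.91),
        (2, "YES", "YES", 0.88),
        (3, "NO", "NO", 0.75),
        (4, "NO", "YES", 0.64),
    ],
)

# Candidates: no bank card yet, but predicted to be bank card holders.
candidates = conn.execute(
    "SELECT CLIENT_ID, CONFIDENCE FROM BANKCARD_PRED "
    "WHERE BANKCARD = 'NO' AND PREDICTION = 'YES' "
    "ORDER BY CONFIDENCE DESC"
).fetchall()
print(candidates)
```

    Against the real output view, a query of the same shape would presumably name BANK.BANKCARD_PRED and the actual key column of the input table.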

    Identification of characteristics: With the information in the output view BANK.BANKCARD_PRED, you can select the candidates for the bank card promotion. However, if you want to know the characteristics that distinguish bank card holders from non-bank card holders, you can analyze the model BANK.BANKCARD_PRED that is stored in the table IDMMX.ClassifModels by opening it with IM Visualization or any other visualization tool that supports PMML.

    You can determine the characteristics of bank card holders by analyzing the Tree View of the Classification Visualizer in Figure 18 on page 35.

  • The first test ONLINE_ACCESS=YES is performed on the value of the tree node with the node ID 1.1. This tree node is not expanded because it is pruned. The SCORE value of this tree node is YES. This means that there is a good chance that a customer in this node also has a bank card. The PURITY value shows that 83.5% of the customers with online access to their accounts have a bank card. The value in the RECORD COUNT column shows that 22% of the customers have online access to their accounts.

    The tree node with the node ID 1.2 is expanded because it is not pruned. The SCORE value of this tree node is NO. Some descendants of this node are labeled with YES in the Score column, for example, the node 1.2.1.1.2.1.

    By analyzing these nodes and their corresponding paths related to the root node, you can determine the other characteristics that identify a bank card holder.

    The quality of the model: The Confusion Matrix View of the Classification Visualizer in Figure 19 on page 36 shows the number of correct and incorrect predictions for the validation data.

    Figure 18. Tree View of the Classification Visualizer

  • The target column of the model BANK.BANKCARD_PRED has the following predicted values:

    - YES

    - NO

    The confusion matrix in Figure 19 shows that 264 of the non-bank card holders are classified as bank card holders. This is the target group that you are interested in.
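
    As a sketch of what a confusion matrix counts, the following Python fragment tallies hypothetical pairs of actual and predicted values. The pairs are invented for illustration and do not reproduce the numbers in Figure 19.

```python
from collections import Counter

# Hypothetical (actual, predicted) pairs for records of the validation data set.
pairs = [
    ("YES", "YES"), ("YES", "NO"), ("NO", "YES"),
    ("NO", "NO"), ("NO", "YES"), ("YES", "YES"),
]

# Each cell of the confusion matrix is the count of one (actual, predicted) pair.
matrix = Counter(pairs)
for actual in ("YES", "NO"):
    for predicted in ("YES", "NO"):
        print(f"actual={actual} predicted={predicted}: {matrix[(actual, predicted)]}")

# The target group: customers without a bank card who are predicted as holders.
target_group = matrix[("NO", "YES")]
```

    The cell for actual NO and predicted YES corresponds to the group of promotion candidates discussed above.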

    How to go further Sometimes you might additionally want to specify a specific parameter for an Easy Mining procedure. You can do this by including an optional parameter string in an Easy Mining procedure. For more information, see Optional parameter strings on page 99.

    The required quality of a prediction model depends on the application for which you want to use it. Being significantly better than random guessing might be a good result. This applies, for example, if you want to predict cross-selling opportunities.

    In other application areas, a good prediction quality is mandatory, for example, if you want to use a reliable model to predict diseases from symptoms.

    Figure 19. Confusion Matrix View of the Classification Visualizer

    Removing fields from the input table: You might think that the prediction quality always should be as good as possible. In general, this is true. However, sometimes the result might be too good. The reason for a result that is too good can be that there are causal dependencies between the target field and some input fields.

    For example, possessing a bank card leads to an increasing number of debit transactions. Therefore, a high number of debit transactions seems to be a good indicator for possessing a bank card. However, the causal relationship is just the inverse. The possession of a bank card implies a higher number of debit transactions. It does not imply that a high number of debit transactions entails having a bank card, because there are other reasons for a high number of debit transactions.

    You should always examine the tests near the root node of the classification tree. You should check whether there are columns that have causal relationships. Whenever you identify columns with causal relationships, you must remove them from the input fields.

    For more information about removing fields from the input table, see Removing fields from the input table on page 8.

    The depth of a classification tree: You can set more parameters of the Tree Classification mining function by using optional parameter strings. For example, if the target field is categorical, you might want to set the maximum tree depth. For more information, see Setting the depth of a classification tree on page 54.

    There are more optional parameter strings. For more information, see Optional parameter strings on page 99.

    Prediction of an outcome (PredictColValue procedure) You can predict values by using the PredictColValue procedure.

    When to do it You might have a database with customer data. In the tables or views of your database, there might be one column that you are particularly interested in. For example, if you want to know whether your customers have responded to a mailing campaign, you might be interested in the RESPONDED column. If you want to know whether customers have canceled their contracts, you might be interested in the CONTRACT_STATUS column.

    The column that you are particularly interested in is called the target column. The target column CONTRACT_STATUS can have several values, for example, SIGNED, PROLONGED, or CANCELED. Typically, you are not interested in all values of the target column. You might be interested only in one value. For example, to prevent customers from canceling their contracts, you want to know the customers who are likely to cancel a contract. Therefore, you are only interested in the target value CANCELED.

    You might want to know whether there are relationships between the occurrence of the CANCELED value in the target column CONTRACT_STATUS and the values of the other columns such that you can predict from the values of the other columns the likelihood that the target column CONTRACT_STATUS contains the value CANCELED.

  • How to do it

    To predict an outcome, use the PredictColValue procedure.

    Syntax:

    IDMMX.PredictColValue(<output view>, <input table>, <target column>, <target value>)

    Input parameters:

    With the PredictColValue procedure, you must specify the following parameters:

    The name of the view that you want to build.

    The PredictColValue procedure creates a view and a model. The model is stored in the table IDMMX.PredictColValue under the same name as the generated view. If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view.

    This parameter is of type VARCHAR. Its size is 240.

    The name of the input table or the input view.

    The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns.

    This parameter is of type VARCHAR. Its size is 257.

    The name of the column whose values are to be predicted.

    This parameter is of type VARCHAR. Its size is 128.

    For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.

    The name of a value in the target column, for example, YES or NO.

    This parameter is of type VARCHAR. Its size is 1024.

    Output: The PredictColValue procedure creates a view. This view contains the columns of the input table and the following additional columns:

    DATA_SET This column indicates whether the record is used for training or for validation.

    TARGET_VALUE This column contains the target value of the target column, for example, SIGNED, PROLONGED, or CANCELED.

    CONFIDENCE This column contains the estimated confidence that is calculated by the Classification mining function that the target column contains the target value. It indicates the reliability that the prediction is correct.

  • The values in this column can range from 0 to 1.

    - A value close to 0 indicates a low certainty that the prediction is correct.

    - A value close to 1 indicates that the prediction is reliable.

    With the confidence value, you can determine whether an individual prediction is sufficiently reliable for your application.

    PRED_VALUE This column contains the estimated confidence that is calculated by the Regression mining function that the target column contains the target value. It has the same validity range and meaning as the column CONFIDENCE.

    It might be useful to compare the results of different prediction techniques. Which technique selects the better target audience can depend on the size of that audience. For example, for a mailing campaign with one audience size, a target audience selected by the Classification mining function might be more appropriate. For a different audience size, a target audience selected by the Regression mining function might be more appropriate. The values in the CONFIDENCE column and the PRED_VALUE column help you to make this decision.

    For more information, see Determining the best suited model on page 90.
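
    One way to picture this comparison is to rank the records by each score and take the top N for a given audience size. The following Python sketch assumes that the output view exposes the CONFIDENCE and PRED_VALUE columns described above; the rows, the record identifiers, and the helper `top_n` are hypothetical.

```python
# Hypothetical rows of a PredictColValue output view:
# (record id, CONFIDENCE, PRED_VALUE)
rows = [
    ("A", 0.90, 0.70),
    ("B", 0.80, 0.85),
    ("C", 0.60, 0.75),
    ("D", 0.40, 0.20),
]


def top_n(rows, column_index, n):
    """Return the ids of the n rows with the highest value in the given column."""
    ranked = sorted(rows, key=lambda row: row[column_index], reverse=True)
    return [row[0] for row in ranked[:n]]


audience_size = 2
by_classification = top_n(rows, 1, audience_size)  # ranked by CONFIDENCE
by_regression = top_n(rows, 2, audience_size)      # ranked by PRED_VALUE
print(by_classification, by_regression)
```

    The two rankings can select different audiences for the same size, so it can be worth checking both against the validation data before deciding.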

    Data flow of the PredictColValue procedure: The PredictColValue procedure does not use the original target column for the mining runs. To build a classification model, it creates a new categorical target column that includes the following values:

    - <target value>

    - != <target value>

    To build a regression model, the PredictColValue procedure creates a numeric column. This column can contain the value 1 or 0:

    - It contains 1 if the value of the original target column is equal to the target value.

    - It contains 0 if the value of the original target column is different from the target value.
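
    The derivation of the two helper target columns can be sketched in Python as follows. This is an illustration only, not the procedure's internal code: the function name `derive_targets` and the exact label used for the "different from the target value" category are assumptions.

```python
def derive_targets(original_value, target_value):
    """Return (categorical target, numeric target) for one record."""
    is_target = original_value == target_value
    # Assumed label for the "not the target value" category.
    categorical = target_value if is_target else f"!={target_value}"
    # 1 if the original value equals the target value, otherwise 0.
    numeric = 1 if is_target else 0
    return categorical, numeric


statuses = ["SIGNED", "CANCELED", "PROLONGED"]
derived = [derive_targets(status, "CANCELED") for status in statuses]
print(derived)
```

    With CANCELED as the target value, SIGNED and PROLONGED fall into the same category, which reflects that only one value of the target column is of interest.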

    Prediction models are computed on a subset of the input data. The PredictColValue procedure splits the input data into the following disjoint data sets:

    Training data set The training data set is used to compute the prediction model.

    Validation data set The quality of the prediction model is based on the records of the validation data set.

    The model quality indicates how well the model might perform on unknown data. You can use the validation results to select the model that is best suited according to the requirements of your application. For more information, see Splitting tables into training data sets and test data sets on page 83.
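
    A split into two disjoint data sets can be pictured with the following Python sketch. How the PredictColValue procedure actually splits the input data is not described in this section, so the random shuffle, the 70% training fraction, the fixed seed, and the helper name `split_records` are all assumptions.

```python
import random


def split_records(records, training_fraction=0.7, seed=42):
    """Shuffle a copy of the records and cut it into two disjoint parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * training_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(10))  # stand-in for ten input records
training, validation = split_records(records)
print(len(training), len(validation))
```

    Every record lands in exactly one of the two sets, so the validation records are unknown to the model that is computed from the training records.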

    Figure 20 on page 40 shows the data flow of the PredictColValue procedure.

  • Example To make the difference of