Improving Association Rules


Page 1: Improving Association Rules

Final Year Project 2

Development of Data Mining Algorithms for Analysing

Shopping Centre Dataset

by

Name: Khoo See Jun

ID: SN089817

Project Supervisor: Alicia Tang Yee Chong, Dr.


DECLARATION

I hereby declare that this report, submitted to University Tenaga Nasional as a

partial fulfilment of the requirements for the Bachelor of Computer Science (System

and Networking) has not been submitted as an exercise for a degree at any other

university. I also certify that the work described here is entirely my own except for

excerpts and summaries whose sources are appropriately cited in the references.

This report may be made available within the university library and may be

photocopied or loaned to other libraries for the purposes of consultation.

23 February 2015

KHOO SEE JUN

SN089817


APPROVAL SHEET

This thesis entitled:

“Development of Data Mining algorithms for analysing public domain databases”

Submitted by:

KHOO SEE JUN (SN089817)

In fulfilment of the requirements for the degree of Bachelor of Computer Science (System and Networking), College of Information Technology, University Tenaga Nasional, has been accepted.

Supervisor: Alicia Tang Yee Chong, Dr.

Signature: …………………………….

Date:


ABSTRACT

The objective of this project is to develop data mining algorithms for analysing public domain databases. The domain chosen for this project is the shopping centre. My task is to identify and perform an association rule mining task, which involves selecting an appropriate data set, preparing and preprocessing the data, finding rules (including appropriate parameter settings), determining which of the resulting rules are interesting, and working out how the interesting rules could be useful.

This work analyses well-known data mining (DM) techniques in the Weka workbench and reports simulation results on sample data, obtained by applying four selected DM techniques and classifiers in the open-source workbench to Customer Relationship Management (CRM) in a shopping centre.

The design of the data mining process is presented in Chapter 4, which shows the data mining workflow.

The implementation follows. It begins with the data mining process (methodology) and proceeds to building a model set for prediction. The modified code is then compiled using Apache Ant so that it can be used by WEKA. Lastly, the best rules are generated by importing the dataset into WEKA, and the run times of the original code and the modified code are compared in different WEKA builds. The results show that the run time improved with the modified code.


TABLE OF CONTENTS

Page

DECLARATION ii

APPROVAL SHEET iii

ABSTRACT iv

TABLE OF CONTENTS v

LIST OF TABLES ix

LIST OF FIGURES x

CHAPTER 1 INTRODUCTION

1.0 INTRODUCTION

1.1 Project Introduction 1

1.2 Problem Statement 2

1.3 Objectives 2

1.4 Benefits 3

1.5 Project Scope 4

1.6 Expected Outcome 5

1.7 Gantt Chart 5

CHAPTER 2: RESEARCH AND LITERATURE REVIEW


2.0 LITERATURE REVIEW

2.1 Data Mining Techniques 6

2.1.1 Association 6

2.1.2 Classification 6

2.1.3 Prediction 7

2.1.4 Sequential Patterns (Long-term data) 7

2.1.5 Clustering 8

2.1.6 Decisions Trees (J48) 9

2.1.7 Decision Table 10

2.2 Customer Relationship Management (CRM) 10

2.2.1 What is Customer Relationship Management (CRM)? 10

2.2.2 How CRM is Used Today 10

2.2.3 The CRM Strategy 11

2.2.4 The Impact of Technology on CRM 11

2.2.5 The Benefits of CRM 12

2.2.6 Data Mining and Customer Relationship Management 12

2.2.7 Review of Data Mining Tools in CRM 13

2.2.8 Data Mining Tools Applications in CRM 15

2.3 Data Mining Applications 17

2.3.1 Banking/Finance (Financial Data Analysis) 17

2.3.2 Retail/Marketing Industry 18

2.3.3 Telecommunication Industry 19

2.3.4 Biological Data Analysis 20

2.3.5 Medical/Pharma 20

2.3.6 Insurance and Health Care 21


2.3.7 Other Scientific Applications 21

2.3.8 Intrusion Detection 22

2.4 Data Mining Systems 23

2.4.1 Data Mining System Classification 23

2.4.2 Data Mining System Products 24

2.4.3 Choosing Data Mining System 25

2.4.4 Trends in Data Mining 27

2.5 Data Mining Process Model 28

2.5.1 Overview of Data Mining Life Cycle 28

CHAPTER 3: ANALYSIS

3.0 ANALYSIS

3.1 Data Mining for Shopping Centers 31

3.1.1 Free Sample Data for Testing Purpose 32

3.1.2 Related Work 33

3.1.3 Methods 35

3.1.4 Result and Discussion 36

3.1.5 Comparison between Naïve Bayes (NB), Decision Table (DT)

and Decision Tree (J48) 42

3.1.6 Comparison between classifiers with time taken to build a model 44

3.2 Association Rules Apriori Algorithm 44

3.2.1 Apriori Algorithm 44

3.2.2 Limitations of Apriori Algorithm 44


CHAPTER 4: DESIGN

4.0 DESIGN

4.1 Data Mining Process 46

4.1.1 Step One: Translate the business into a data mining problem 48

4.1.2 Step Two: Select appropriate data 48

4.1.3 Step Three: Analyze the data 48

4.1.4 Step Four: Create a Model Set for Prediction 50

4.1.5 Step Five: Fix Problem with the Data 51

4.1.6 Step Six: Transform Data to Bring Information to the Surface 52

4.1.7 Step Seven: Build Models 53

4.1.8 Step Eight: Deploy Models 54

CHAPTER 5: IMPLEMENTATION

5.0 IMPLEMENTATION

5.1 Data Mining Process 56

5.1.1 Translate the business into a data mining problem 57

5.1.2 Select appropriate data 57

5.1.3 Analyze the data 59

5.1.4 Create a Model Set for Prediction 61

5.1.5 Fix Problem with the Data 62

5.1.6 Transform Data to Bring Information to the Surface 63

5.1.7 Build Models 64

5.1.8 Deploy Models 66

5.2 Apriori Algorithm Source Code 67


5.3 Import dataset into WEKA 69

CHAPTER 6: CONCLUSION

6.0 CONCLUSION

6.1 Progress and Outcome 70

6.2 Problems Encountered 71

6.3 Future Planning 71

REFERENCES 72

APPENDIX 77


LIST OF TABLES

Table No. Page

TABLE 3.1 42

TABLE 3.2 44


LIST OF FIGURES

Figure No. Page

Figure 2.1 Clustering (Sample Diagram) 8

Figure 2.2 Decision Tree (J48) 9

Figure 2.3 Data Mining Applications Useful For Companies 15

Figure 2.4 Data Mining System Classification 24

Figure 2.5 Data Mining Process Model 28

Figure 3.1 Sample Data (CSV format) 32

Figure 3.2 Sample Data (Notepad format) 33

Figure 3.3 Block Diagram 34

Figure 3.4 Results returned by the Naïve Bayes classifier. 37

Figure 3.5 The decision table of data analysis 38

Figure 3.6 J48 pruned tree of “sex” analysis 39

Figure 3.7 “Associate Rules” 40

Figure 3.8 Sample Data (CSV format) 41

Figure 3.9 Sample Data (Notepad format) 41

Figure 3.10 “Associate Rules” 42

Figure 4.1 Data Mining is not a linear process 47


Figure 4.2 Sample data in ARFF Viewer 49

Figure 4.3 Data Visualize 49

Figure 4.4 Visualization of data by age and sex 50

Figure 4.5 Data from the past mimics data from the past, present, and future 51

Figure 4.6 Sample data 51

Figure 4.7 Data Mining Model 53

Figure 4.8 Data Mining Process Model 54

Figure 4.9 Data Mining Scoring Process Model 55

Figure 4.10 “Scoring” Prediction 55

Figure 5.1 Appropriate data 57

Figure 5.2 Analyse data in ARFF Viewer 59

Figure 5.3 Data Visualize 59

Figure 5.4 Visualization of data by smoker and drink_level 60

Figure 5.5 Prediction Model 61

Figure 5.6 Fixed dataset 62

Figure 5.7 Transformed data 63

Figure 5.8 Data Mining Model 64

Figure 5.9 Model 1 65

Figure 5.10 Model 2 65


Figure 5.11 Model 3 66

Figure 5.12 Original Code Part 1 67

Figure 5.13 Modified Code Part 1 67

Figure 5.14 Original Code Part 2 68

Figure 5.15 Modified Code Part 2 68

Figure 5.16 Result of original code 69

Figure 5.17 Result of modified code 69


CHAPTER 1

1.0 INTRODUCTION

1.1 Project Introduction

Far too many companies sit on large amounts of good customer data and do nothing with it. They do not realise that this data is a gold mine of insight that can increase customer loyalty, unlock hidden profitability and reduce client churn. Data mining (knowledge discovery) is the process companies use to turn raw data into useful information: computer-assisted software digs through and analyses enormous data sets to extract hidden predictive information from large databases. It is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. With it, businesses can learn more about their customers and develop more effective marketing strategies, as well as increase sales and decrease costs.

Grocery stores are well-known users of data mining techniques. Many supermarkets offer free loyalty cards to customers that give them access to reduced prices not available to non-members. The cards make it easy for stores to track who is buying what, when they are buying it, and at what price. The stores can then use this data, after analyzing it, for multiple purposes, such as offering customers coupons that are targeted to their buying habits and deciding when to put items on sale and when to sell them at full price. Data mining tools predict behaviours and future trends, allowing businesses to make proactive, knowledge-driven decisions. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

1.2 Problem Statement

Most companies waste large amounts of useful customer data by doing nothing with it. They do not know exactly what their customers need or what they themselves are missing. By engaging in data mining, a company can gain greater insight into external conditions, internal processes, its market and its customers. It also gains predictive capabilities that can be used both in strategic planning and in daily interactions. These insights and predictive capabilities can take a company’s business results to the next level by improving its marketing campaign management, up-sell and cross-sell activities, customer retention, risk analysis and fraud detection efforts. This project aims to help a company create powerful strategies, make fast and feasible decisions and achieve competitive advantage in future.

1.3 Objectives

The objectives of this project are:

1. To identify parameters of the algorithm

2. To design new data mining algorithms

3. To develop new data mining algorithms


1.4 Benefits

The benefits of data mining in businesses are:

1. More Money. Money is always a good thing in business. When mined data unearths the kinds of projects past donors contributed to or the types of products customers have purchased in the past, or lets a not-for-profit put a number on statistics for a grant proposal, it can result in serious cash. Once a business knows who the top donors are or what its customers want, it can customize approaches and outreach.

2. Improve Branding and Marketing. Data can reveal a number of things like

what direction the marketing department should take. For example, there might

have been a recent customer survey asking about what services or products

consumers want to see. That kind of information is gold, and a marketing

department can do wonders with it. If a survey or any feedback is being

collected, put it to use.

3. Streamline Outreach. Whether a business depends on e-mail blasts, print ads

or social media, knowing how customers want to be approached is important.

Data that includes relevant e-mail addresses, mailing addresses or social media

pages can help streamline any mailers or outreach. It also saves money,

whether it's in postage or time, by keeping consumer information updated.

4. Tap into New Markets. There are some databases available that businesses can purchase, or the databases might be available to the public free of charge. Business owners can use the databases of others to find out more information about potential consumers and identify any holes in the current tactics. However, when handling outside databases, it's especially important to practice caution. Privacy is a big legal issue, and sometimes it's easy to overstep boundaries.

5. Share and Share Alike. Sharing information is largely illegal, but it all

depends on what the customer has signed. For example, some coalitions may

share information on consumers in order to provide better services. This can be

dangerous grounds, but if it's legally acceptable, some business owners can

access the data of other partner organizations, too. This largely expands the

availability of information and can provide more data--and likely in turn more

accurate data--to improve the bottom line, services and research.

6. Learn from the Past. Data mining past information and comparing it to the

current situation can reveal a lot. Graphs can easily show any troubling sales

years, spikes or other trends that should be taken into consideration. Seeing the

ebb and flow of a business via data can provide insight that otherwise might be

overlooked. For example, a business that knows there's a history of high sales

in July can work on maximizing that month, while giving extra attention to

periods where sales slack.

1.5 Project Scope

Given databases of sufficient size and quality, data mining technology can generate

new business opportunities by providing these capabilities:

1. Selecting an appropriate data set

2. Preparing and pre-processing the data

3. Finding rules and identifying parameters for the algorithm


1.6 Expected Outcome

The expected outcomes of the project are:

1. A critical review of data mining techniques

2. A dataset from a company

3. A design for a new data mining algorithm

1.7 Gantt Chart


CHAPTER 2

2.0 LITERATURE REVIEW

2.1 Data Mining Techniques [6], [10], [11]

2.1.1 Association

Association (or relation) is probably the best-known, most familiar and most straightforward data mining technique. It makes a simple correlation between two or more items, often of the same type, to identify patterns. For example, when tracking people's buying habits, we might identify that a customer always buys cream when they buy strawberries, and therefore suggest that the next time they buy strawberries they might also want to buy cream.
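The strawberries-and-cream example can be made concrete with the two measures most commonly attached to an association rule, support and confidence. The following is a minimal sketch in Python; the function names and the basket data are assumptions made up for illustration and are not taken from the project dataset.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transaction list."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule cannot be scored
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


# Hypothetical shopping baskets, invented for this example.
baskets = [
    frozenset({"strawberries", "cream"}),
    frozenset({"strawberries", "cream", "sugar"}),
    frozenset({"strawberries", "cream", "bread"}),
    frozenset({"bread", "milk"}),
    frozenset({"milk", "sugar"}),
]

rule_support = support(baskets, {"strawberries", "cream"})
rule_confidence = confidence(baskets, {"strawberries"}, {"cream"})
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

In this toy data every basket with strawberries also contains cream, so the rule "strawberries => cream" has the highest possible confidence while its support is limited by how often the pair occurs at all.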

2.1.2 Classification

We can use classification to build up an idea of the type of customer, item, or object

by describing multiple attributes to identify a particular class. For example, we can

easily classify cars into different types (sedan, 4x4, convertible) by identifying

different attributes (number of seats, car shape, driven wheels). Given a new car, we

might assign it to a particular class by comparing the attributes with our known

definition. We can apply the same principles to customers, for example by classifying

them by age and social group.
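As a rough illustration of comparing a new car's attributes with known class definitions, the sketch below assigns the class that shares the most attribute values with the new item. The attribute values, the `CLASS_DEFINITIONS` table and the `classify` function are hypothetical; a real classifier would learn its definitions from labelled data.

```python
# Known class definitions; the attribute values are invented for illustration.
CLASS_DEFINITIONS = {
    "sedan": {"seats": 5, "shape": "three-box", "driven_wheels": 2},
    "4x4": {"seats": 5, "shape": "two-box", "driven_wheels": 4},
    "convertible": {"seats": 2, "shape": "open-top", "driven_wheels": 2},
}


def classify(car):
    """Return the class whose definition shares the most attribute values with `car`."""
    def matches(definition):
        return sum(1 for key, value in definition.items() if car.get(key) == value)

    # On a tie, max() keeps the first class in definition order.
    return max(CLASS_DEFINITIONS, key=lambda name: matches(CLASS_DEFINITIONS[name]))


new_car = {"seats": 2, "shape": "open-top", "driven_wheels": 2}
print(classify(new_car))
```

The same matching idea carries over to customers by swapping the car attributes for fields such as age band and social group.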


2.1.3 Prediction

Prediction is a wide topic that runs from predicting the failure of components or machinery to identifying fraud and even predicting company profits. Used in combination with the other data mining techniques, prediction involves analyzing trends, classification, pattern matching, and relation. By analyzing past events or instances, we can make a prediction about a future event. For example, in credit card authorization, we can combine decision tree analysis of an individual's past transactions with classification and historical pattern matching to identify whether a transaction is fraudulent.

2.1.4 Sequential Patterns (Long-term data)

Sequential patterns are a useful method for identifying trends, or regular occurrences

of similar events. For example, with customer data we can identify that customers buy

a particular collection of products together at different times of the year. On an online

shopping website, we can use this information to automatically suggest that certain

items be added to their shopping cart based on their frequency and past purchasing

history.


2.1.5 Clustering

By examining one or more attributes or classes, we can group individual pieces of data together to form a structured view. Clustering uses one or more attributes as the basis for identifying a cluster of correlating results. It is useful for identifying different information because it correlates with other examples, so we can see where the similarities and ranges agree.

Clustering can work both ways. We can assume that there is a cluster at a certain point and then use our identification criteria to see if we are correct. The graph in Figure 2.1 shows a good example. In this example, a sample of sales data compares the age of the customer to the size of the sale. It is not unreasonable to expect that people in their twenties (before marriage and kids), fifties, and sixties (when the children have left home) have more disposable income.

Figure 2.1 Clustering (Sample Diagram).

(Adopted from http://www.ibm.com/developerworks/library/ba-data-mining-techniques/)
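The idea of assuming clusters at certain points and then checking them against the data can be sketched with a plain k-means loop. The code below is only an illustration: the (age, sale amount) pairs and the starting centroids are invented, the two features are left unscaled for simplicity, and k-means is one possible clustering method rather than the one used to draw Figure 2.1.

```python
from math import dist


def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move each centroid to its cluster mean."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical (age, sale amount) pairs, invented for this example.
sales = [(22, 310), (25, 290), (27, 330), (38, 90),
         (41, 110), (44, 100), (55, 280), (61, 320)]

# Starting guesses: younger high spenders, mid-age low spenders, older high spenders.
centroids, clusters = k_means(sales, centroids=[(25, 300), (40, 100), (60, 300)])
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

If the initial guesses were poor, points would migrate between clusters over the iterations, which is the "see if we are correct" step described above.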


2.1.6 Decision Trees (J48)

Related to most of the other techniques (primarily classification and prediction), the

decision tree can be used either as a part of the selection criteria, or to support the use

and selection of specific data within the overall structure. Within the decision tree, we

start with a simple question that has two (or sometimes more) answers. Each answer

leads to a further question to help classify or identify the data so that it can be

categorized, or so that a prediction can be made based on each answer.

Figure 2.2 shows an example where you can classify an incoming error condition.

Figure 2.2 Decision Tree (J48).

(Adopted from http://www.ibm.com/developerworks/library/ba-data-mining-techniques/)

Decision trees are often used with classification systems to attribute type information,

and with predictive systems, where different predictions might be based on past

historical experience that helps drive the structure of the decision tree and the output.
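The question-and-answer structure of a decision tree can be sketched as nested dictionaries that are walked from the root to a leaf. The questions and outcomes below are hypothetical and are not taken from Figure 2.2. Note that J48 (Weka's C4.5 implementation) learns such a tree from data, whereas this sketch only shows how a finished tree is used.

```python
# A hand-written tree: each internal node asks a question, each leaf is an outcome.
# The questions and outcomes are hypothetical, chosen only for illustration.
TREE = {
    "question": "is_repeat_customer",
    "yes": {
        "question": "spent_over_100",
        "yes": "offer loyalty upgrade",
        "no": "send coupon",
    },
    "no": "send welcome offer",
}


def decide(node, answers):
    """Walk from the root to a leaf, following the yes/no answer to each question."""
    while isinstance(node, dict):
        branch = "yes" if answers[node["question"]] else "no"
        node = node[branch]
    return node


print(decide(TREE, {"is_repeat_customer": True, "spent_over_100": False}))
```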


2.1.7 Decision Table

Decision tables, like decision trees, are classification models used for prediction. They

are induced by machine learning algorithms. A decision table consists of a

hierarchical table in which each entry in a higher level table gets broken down by the

values of a pair of additional attributes to form another table. The structure is similar

to dimensional stacking.
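A much-simplified view of a decision table is a lookup from a combination of attribute values to the class most often seen with that combination. The sketch below is a flat, single-level simplification: it does not build the hierarchical table described above and it is not Weka's DecisionTable implementation. The records and attribute names are invented for illustration.

```python
from collections import Counter, defaultdict


def build_decision_table(rows, attributes, target):
    """Map each combination of attribute values to the majority class seen for it."""
    votes = defaultdict(Counter)
    for row in rows:
        key = tuple(row[a] for a in attributes)
        votes[key][row[target]] += 1
    table = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
    # Overall majority class, used when a combination was never seen in training.
    default = Counter(row[target] for row in rows).most_common(1)[0][0]
    return table, default


def predict(table, default, attributes, row):
    """Look up the row's attribute combination; fall back to the overall majority class."""
    return table.get(tuple(row[a] for a in attributes), default)


# Hypothetical customer records, invented for this example.
records = [
    {"age_group": "20s", "member": "yes", "buys": "yes"},
    {"age_group": "20s", "member": "yes", "buys": "yes"},
    {"age_group": "20s", "member": "no", "buys": "no"},
    {"age_group": "40s", "member": "yes", "buys": "no"},
    {"age_group": "40s", "member": "no", "buys": "no"},
]
attrs = ["age_group", "member"]
table, default = build_decision_table(records, attrs, "buys")
print(predict(table, default, attrs, {"age_group": "20s", "member": "yes"}))
```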

2.2 Customer Relationship Management (CRM) [15], [16], [17], [18]

2.2.1 What is Customer Relationship Management (CRM)?

CRM (customer relationship management) is an information industry term for

methodologies, software, and usually Internet capabilities that help a

company manage customer relationships in an organized way. For example, a

company might build a database about its customers that described relationships in

sufficient detail so that management, salespeople, people providing service, and

perhaps the customer directly could access information, match customer needs with

product plans and offerings, remind customers of service requirements, and know

what other products a customer had purchased, and so on.

2.2.2 How CRM is Used Today

CRM solutions give a company the customer business data it needs to provide services or products that customers want, provide better customer service, cross-sell and up-sell more effectively, close deals, retain current customers and better understand the customer.

2.2.3 The CRM Strategy

Customer relationship management is often thought of as a business strategy that

enables businesses to improve in a number of areas. The CRM strategy allows a

company to do the following:

1) Understand the customer

2) Retain customers through better customer experience

3) Attract new customers

4) Win new clients and contracts

5) Increase profitability

6) Decrease customer management costs

2.2.4 The Impact of Technology on CRM

Technology and the Internet have changed the way companies approach customer relationship strategies. Advances in technology have changed consumer buying behaviour, and today there are many ways for companies to communicate with customers and to collect data about them. With each new advance in technology, especially the proliferation of self-service channels like the Web and smartphones, customer relationships are being managed electronically.

Many aspects of customer relationship management rely heavily on technology; however, the strategies and processes of a good CRM system will collect, manage and link information about the customer with the goal of letting you market and sell services effectively.

2.2.5 The Benefits of CRM

The biggest benefit most businesses realize when moving to a CRM system comes

directly from having all the business data stored and accessed from a single location.

Before CRM systems, customer data was spread out over office productivity suite

documents, email systems, mobile phone data and even paper note cards and Rolodex

entries. Storing all the data from all departments (e.g., sales, marketing, customer

service and HR) in a central location gives management and employees immediate

access to the most recent data when they need it. Departments can collaborate with

ease, and CRM systems help organizations to develop efficient automated processes to

improve business processes.

2.2.6 Data Mining and Customer Relationship Management [17]

Customer relationship management (CRM) is a process that manages the interactions

between a company and its customers. The primary users of CRM software

applications are database marketers who are looking to automate the process of

interacting with customers. 

To be successful, database marketers must first identify market segments containing

customers or prospects with high-profit potential. They then build and execute

campaigns that favourably impact the behaviour of these individuals. 

The first task, identifying market segments, requires significant data about prospective


customers and their buying behaviours. In theory, the more data the better. In practice,

however, massive data stores often impede marketers, who struggle to sift through the

minutiae to find the nuggets of valuable information. 

 Recently, marketers have added a new class of software to their targeting arsenal.

Data mining applications automate the process of searching the mountains of data to

find patterns that are good predictors of purchasing behaviours. 

 

After mining the data, marketers must feed the results into campaign management

software that, as the name implies, manages the campaign directed at the defined

market segments. 

 

In the past, the link between data mining and campaign management software was

mostly manual. In the worst cases, it involved "sneaker net," creating a physical file

on tape or disk, which someone then carried to another computer and loaded into the

marketing database. 

 

This separation of the data mining and campaign management software introduces

considerable inefficiency and opens the door for human errors. Tightly integrating the

two disciplines presents an opportunity for companies to gain competitive advantage. 

2.2.7 Review of Data Mining Tools in CRM [18]

Data mining uses a combination of an explicit knowledge base, sophisticated analytical skills, and domain knowledge to uncover hidden trends and patterns. These trends and patterns form the basis of predictive models that enable analysts to produce new observations from existing data. A number of data mining tools are available on the market that can give firms the cutting edge needed to achieve profitable CRM.

Data mining tools help CRM by providing a complete framework, which covers:

To analyze the business problem.

To prepare the data requirements.

To build the suitable model with respect to business problem.

To validate and evaluate the designed model.

Model building is the next phase of the Data mining tool, which builds the various

models according to the data given in the data preparation phase. The last phase is the

evaluation of the model, so that the proper results in the form of useful patterns can be

drawn from the models built by the tools.

Data mining tools for CRM should be able to detect the necessary information in the available data. To achieve this, they should have characteristics such as:

User friendly environment

Efficiency of the tool

Basic task should be accomplished

Low cost of implementation


2.2.8 Data Mining Tools Applications in CRM [18]

Virtually any process from pharmacology to customer service can be studied,

understood, and improved using data mining. The top three end uses of data mining

are, not surprisingly, in the marketing area. 

Figure 2.3 Data Mining Applications Useful For Companies.

(Adopted from http://www.informationweek.com/673/73iudat.htm)

Figure 2.3 shows that customer demographics is one of the most important applications for companies. Data mining tools are applied in:


Customer Profiling: In customer profiling, characteristics of good customers are identified with the goals of predicting who will become one and helping marketers target new prospects. Data mining can find patterns in a customer database that can be applied to a prospect database so that customer acquisition can be appropriately targeted. For example, by identifying good candidates for mail offers or catalogues, direct-mail marketers can reduce expenses and increase their sales.


Targeted Marketing: Targeting specific promotions to existing and potential customers offers similar benefits.

Market-basket analysis: Market-basket analysis helps retailers understand

which products are purchased together or by an individual over time. With

data mining, retailers can determine which products to stock in which stores,

and even how to place them within a store. Data mining can also help assess

the effectiveness of promotions and coupons.

Manage customer relationship: Another common use of data mining in many

organizations is to help manage customer relationships. By determining

characteristics of customers who are likely to leave for a competitor, a

company can take action to retain that customer because doing so is usually far

less expensive than acquiring a new customer.

Fraud detection: Fraud detection is of great interest to telecommunications firms, credit-card companies, insurance companies, stock exchanges, and government agencies. Government agencies also use data mining to identify and track individual terrorists, for example through travel and immigration records.

Anticipate and prevent customer attrition: Data mining tools can help find the customers who are not satisfied with the firm’s services. This helps the firm offer promotional services to the group of customers who are likely to leave.

Mine unstructured data, such as text: Text data is typically unstructured. Data mining tools can help organizations mine this unstructured data and get value out of it.
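The market-basket analysis item in the list above can be illustrated by counting how often pairs of products appear in the same basket and keeping only the pairs that reach a minimum count. This is a minimal sketch with invented baskets and a hypothetical `frequent_pairs` helper. It counts all pairs directly and is not the Apriori algorithm, which prunes candidates level by level.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_count):
    """Count how often each unordered pair of products shares a basket."""
    counts = Counter()
    for basket in baskets:
        # Sort so that ("bread", "butter") and ("butter", "bread") are one key.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Hypothetical baskets, invented for this example.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]

pairs = frequent_pairs(baskets, min_count=2)
print(pairs)
```

A retailer could use pairs like these as a starting point when deciding which products to stock or shelve together.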


2.3 Data Mining Applications

Data mining is a data analysis approach that has been quickly adopted in a large number of domains that were already using statistics. Here is a list of areas where data mining is widely used:

Banking/Finance

Retail Industry

Telecommunication Industry

Biological Data Analysis

Medical/Pharma

Insurance and Health Care

Other Scientific Applications

Intrusion Detection

2.3.1 BANKING/FINANCE (FINANCIAL DATA ANALYSIS)

Financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Here are a few typical cases:

Design and construction of data warehouses for multidimensional data analysis

and data mining.


Loan payment prediction and customer credit policy analysis.

Classification and clustering of customers for targeted marketing.

Detection of money laundering and other financial crimes.

Detection of fraudulent credit card usage patterns.

Risk management related to attribution of loans using scorecards.

Finding hidden correlations between different financial indicators.

Identification of stock trading rules from historical market data.

2.3.2 RETAIL/MARKETING INDUSTRY

Data mining has great application in the retail industry because the industry collects large amounts of data on sales, customer purchasing history, goods transportation, consumption and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability and popularity of the web.

Data mining in the retail industry helps in identifying customer buying patterns and trends. That leads to improved quality of customer service and good customer retention and satisfaction. Here is a list of examples of data mining in the retail industry:

Design and Construction of data warehouses based on benefits of data mining.

Multidimensional analysis of sales, customers, products, time and region.

Analysis of effectiveness of sales campaigns.


Customer Retention.

Product recommendation and cross-referencing of items.

Discovery of buying behaviour patterns

Detection of associations among customer characteristics.

Prediction of the probability that clients respond to mailings.


2.3.3 TELECOMMUNICATION INDUSTRY

Today the telecommunication industry is one of the fastest-emerging industries, providing various services such as fax, pager, cellular phone, Internet messenger, images, e-mail and web data transmission. Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. This is why data mining has become very important for helping to understand the business. Data mining in the telecommunication industry helps in identifying telecommunication patterns, catching fraudulent activities, making better use of resources and improving quality of service. Here is a list of examples in which data mining improves telecommunication services:

Multidimensional Analysis of Telecommunication data.

Fraudulent pattern analysis.

Identification of unusual patterns.

Multidimensional association and sequential patterns analysis.

Mobile Telecommunication services.

Use of visualization tools in telecommunication data analysis.


2.3.4 BIOLOGICAL DATA ANALYSIS

Nowadays there is vast growth in fields of biology such as genomics, proteomics, functional genomics and biomedical research. Biological data mining is a very important part of bioinformatics. The following are the aspects in which data mining contributes to biological data analysis:

Semantic integration of heterogeneous, distributed genomic and proteomic

databases.

Alignment, indexing, similarity search and comparative analysis of multiple nucleotide sequences.

Discovery of structural patterns and analysis of genetic networks and protein

pathways.

Association and path analysis.

Visualization tools in genetic data analysis.

2.3.5 MEDICAL/PHARMA

Data mining plays a very important part in the medical field. With data mining, the success rate of research into new cures for rare diseases is expected to be higher. Below are the aspects in which data mining contributes to the medical field:

Computer Assisted Diagnosis (expert systems learning)

Characterization/prediction of patient's response to product dosage

Identification of successful medical therapies (successful prescription

patterns).

Study of relations between dosage and potentially related adverse events

2.3.6 INSURANCE AND HEALTH CARE

The following shows how insurance companies manage their businesses and customers with the help of data mining:

Discovery of medical procedures that are claimed together through claims

analysis

Identification of customers that are potential buyers for new policies.

Detection of behaviour patterns capable of identifying risky customers.

Detection of fraudulent behaviour.

2.3.7 OTHER SCIENTIFIC APPLICATIONS

The applications discussed above tend to handle relatively small and homogeneous data sets for which statistical techniques are appropriate. Huge amounts of data have been collected from scientific domains such as geosciences and astronomy. Large data sets are also being generated by fast numerical simulations in fields such as climate and ecosystem modelling, chemical engineering and fluid dynamics. The following are applications of data mining in scientific fields:

Data Warehouses and data pre-processing.

Graph-based mining.

Visualization and domain specific knowledge.

2.3.8 INTRUSION DETECTION

Intrusion refers to any kind of action that threatens the integrity, confidentiality or availability of network resources. In this connected world, security has become a major issue. The increased usage of the Internet and the availability of tools and tricks for intruding on and attacking networks have prompted intrusion detection to become a critical component of network administration. Here is a list of areas in which data mining technology may be applied to intrusion detection:

Development of data mining algorithm for intrusion detection.

Association and correlation analysis, aggregation to help select and build

discriminating attributes.

Analysis of Stream data.

Distributed data mining.

Visualization and query tools.

2.4 Data Mining Systems [13]

There is a large variety of data mining systems available. A data mining system may integrate techniques from the following:

Spatial Data Analysis

Information Retrieval

Pattern Recognition

Image Analysis

Signal Processing

Computer Graphics

Web Technology

Business

Bioinformatics

2.4.1 Data Mining System Classification [12]

The data mining system can be classified according to the following criteria:

Database Technology

Statistics

Machine Learning

Information Science

Visualization

Other Disciplines

Figure 2.4 Data Mining System Classification.

(Adopted from http://www.tutorialspoint.com/data_mining/dm_systems.htm)

2.4.2 Data Mining System Products [13]

There are many data mining system products and domain-specific data mining applications available. New data mining systems and applications are being added to the previous systems. Efforts are also being made towards the standardization of data mining languages.

2.4.3 Choosing Data Mining System

Which data mining system to choose depends on the following features of the data mining system:

Data Types - The data mining system may handle formatted text, record-based data and relational data. The data could also be ASCII text, relational database data or data warehouse data. Therefore we should check exactly which formats the data mining system can handle.

System Issues - We must consider the compatibility of the data mining system with different operating systems. A data mining system may run on only one operating system or on several. There are also data mining systems that provide web-based user interfaces and allow XML data as input.

Data Sources - Data sources refers to the data formats on which the data mining system will operate. Some data mining systems may work only on ASCII text files, while others work on multiple relational sources. A data mining system should also support ODBC connections or OLE DB for ODBC connections.

Data Mining functions and methodologies - Some data mining systems provide only one data mining function, such as classification, while others provide multiple data mining functions such as concept description, discovery-driven OLAP analysis, association mining, linkage analysis, statistical analysis, classification, prediction, clustering, outlier analysis and similarity search.

Coupling data mining with databases or data warehouse systems - A data mining system needs to be coupled with a database or data warehouse system. The coupled components are integrated into a uniform information processing environment. The types of coupling are listed below:

o No coupling

o Loose Coupling

o Semi tight Coupling

o Tight Coupling

Scalability - There are two scalability issues in Data Mining as follows:

o Row (Database size) Scalability - A data mining system is considered row scalable if, when the number of rows is enlarged 10 times, it takes no more than 10 times as long to execute the query.

o Column (Dimension) Scalability - A data mining system is considered column scalable if the mining query execution time increases linearly with the number of columns.
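The row scalability criterion above can be written as a one-line check. The following is a minimal sketch; the function name and the timing values are hypothetical and only illustrate the "no more than 10 times" rule.

```python
def is_row_scalable(base_seconds: float, scaled_seconds: float, factor: float = 10.0) -> bool:
    """Return True if growing the rows by `factor` grew the query time by at most `factor`."""
    return scaled_seconds <= factor * base_seconds


# Hypothetical timings: 2 s on the original table, 18 s on a table with 10 times the rows.
row_scalable = is_row_scalable(2.0, 18.0)
```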

Visualization Tools - Visualization in Data mining can be categorized as follows:

o Data Visualization

o Mining Results Visualization

o Mining process visualization

o Visual data mining

Data Mining query language and graphical user interface - An easy-to-use graphical user interface is required to promote user-guided, interactive data mining. Unlike relational database systems, data mining systems do not share an underlying data mining query language.

2.4.4 Trends in Data Mining [25]

Here is a list of trends in data mining that reflect the pursuit of challenges such as the construction of integrated and interactive data mining environments and the design of data mining languages:

Application Exploration

Scalable and Interactive data mining methods

Integration of data mining with database systems, data warehouse systems and web

database systems.

Standardization of data mining query language

Visual Data Mining

New methods for mining complex types of data

Biological data mining

Data mining and software engineering

Web mining

Distributed Data mining

Real time data mining

Multi Database data mining

Privacy protection and Information Security in data mining

2.5 Data Mining Process Model [23]

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. The CRISP-DM methodology provides a structured approach to planning a data mining project. It is a robust and well-proven methodology.

2.5.1 Overview of Data Mining Life Cycle

Figure 2.5 Data Mining Process Model.

(Adopted from http://www.rithme.eu/?m=resources&p=dmmethod&lang=en)

Starting from the knowledge discovery processes used in early data mining projects, CRISP-DM defined and validated a data mining process that is applicable in any industry sector. This methodology should make large data mining projects faster, cheaper, more reliable and more manageable. However, even small-scale data mining investigations can benefit from using it.

This process model provides a simple overview of the life cycle of a data mining project. The phases of a data mining project are clearly identified through tasks and the relationships between these tasks. Even if the model does not indicate it, relationships may exist between all data mining tasks, depending mainly on the analysis goals and on the data to be analysed.

Six main phases can be distinguished in this process model:

Business understanding - concerns the definition of the data mining problem based

on the business objectives.

Data understanding - this phase aims at getting a precise idea about data available,

identifying possible data quality issues, etc.

Data preparation - covers all activities meant to build the dataset to analyse from the

initial raw data. This includes cleaning, feature selection, sampling, etc.

Modeling - is the phase where several data mining techniques are parameterized and tested with the objective of optimizing the obtained data model or knowledge.

Evaluation - aims at verifying that the obtained model properly answers the initially

formulated business objectives and contributes to deciding whether the model will be

deployed or, on the contrary, will be rebuilt.

Deployment - is the final step of the cyclic data mining process model. Its target is to take the obtained knowledge, put it in a convenient form and integrate it into the business decision process. Depending on the objectives, it can range from generating a report describing the obtained knowledge to creating a specific application that will use the obtained model to predict unknown values of a desired parameter.

Chapter 3

3.0 ANALYSIS

3.1 Data Mining for Shopping Centres

With the majority of large retailers offering a loyalty card scheme, the collection of

customer data is now routine commercial practice.  Whilst loyalty schemes were

originally introduced to reward loyal customers and to encourage them to increase

their overall spend, retailers have been finding more and more sophisticated ways to

use customer data to their advantage.

Due to high competition in the business field, it is essential to consider the customer relationship management of the shopping centre. Here we analyse the massive volume of customer data and classify the customers based on their behaviour and on predictions. Customer relationship management is mainly used in sales forecasting and banking areas. Data mining provides the technology to analyse large volumes of data and detect hidden patterns in order to convert raw data into valuable information.

This work analyses DM techniques in the Weka workbench and reports the simulation results of applying four DM techniques and classifiers in this open source workbench to Customer Relationship Management (CRM) for a shopping centre.

We propose that data mining techniques be used to aid the salespeople and management of the shopping centre in effective decision making. This approach was applied to 100 pre-processed records. Simulation results show that a large volume of historical customer data can play a value-added role in shopping centre development, because the mined data help the centre to study customer behaviour so that personalized services can be provided.

Our aim is to demonstrate the possibilities and draw attention to the possible implications of improving customer satisfaction. The objectives of this work could include increasing rental incomes and bringing new life back into the shopping centre.

3.1.1 Free Sample Data for Testing Purpose

Figure 3.1 Sample Data (CSV format).

Above is the sample data for testing purposes. The test set consists of 100 pre-processed customer records. Included fields are:

Sex

Age

Channel

Transportation

All files are provided as CSV (comma-delimited).

Sex is the customer's gender, and the age values are random. Channel is the way the customer makes payment, by credit card or cash. Transportation is how the customer travels to their destination.

Figure 3.2 Sample Data (Notepad format).

3.1.2 Related Work

Before performing data mining, processes such as data preparation and data cleaning need to be carried out. Incomplete data were found in some of the records, meaning that some records lack attribute values, so data preparation is needed. Noisy data contain errors, and inconsistent data contain discrepancies in codes or names. In data preparation, only the wanted fields are selected from each table in order to perform the data mining. Data pre-processing techniques such as data cleaning and data reduction were applied for conversion. The data cleaning procedure cleans the data by filling in missing values, smoothing noisy data, identifying or removing outliers and resolving inconsistencies. Additional data cleaning can be performed to detect and remove redundancies that still occur in the results obtained after data integration. Data reduction produces a reduced representation of the data set that is much smaller in volume and that should produce the same result.
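The cleaning operations described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are hypothetical values in the shape of the sample data (Sex, Age, Channel, Transportation), the age limits are arbitrary, and filling missing ages with the median is one possible choice rather than the procedure used in this project.

```python
from statistics import median

# Hypothetical records in the shape of the sample data.
records = [
    {"Sex": "female", "Age": 34, "Channel": "cash", "Transportation": "private"},
    {"Sex": "Male ", "Age": None, "Channel": "credit card", "Transportation": "public"},
    {"Sex": "male", "Age": 29, "Channel": "Cash", "Transportation": "private"},
    {"Sex": "female", "Age": 240, "Channel": "credit card", "Transportation": "public"},
]


def clean(rows, min_age=0, max_age=120):
    # Resolve inconsistencies: trim whitespace and lower-case the nominal codes.
    normalised = [
        {k: v.strip().lower() if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
    ]
    # Treat out-of-range ages as outliers and drop those records.
    kept = [r for r in normalised if r["Age"] is None or min_age <= r["Age"] <= max_age]
    # Fill missing ages with the median of the known ages.
    fill = median(r["Age"] for r in kept if r["Age"] is not None)
    for r in kept:
        if r["Age"] is None:
            r["Age"] = fill
    return kept


cleaned = clean(records)
```

In this sketch the record with age 240 is removed as an outlier, the inconsistent codes "Male " and "Cash" are normalised, and the missing age is filled in.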

Figure 3.3 Block Diagram.

The customer data may contain attributes that take much larger values than others. If such attributes are left unnormalized, they need to be normalized. Furthermore, it would be useful for analysis to obtain aggregate information. Data transformation operations, such as normalization and aggregation, are additional data pre-processing procedures that contribute toward the success of the mining process.
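As an illustration of the two transformation operations named above, the sketch below shows min-max normalization and a simple aggregation. The function names, the "Spend" field and the sample values are assumptions made for the example; they are not taken from the project dataset.

```python
from collections import defaultdict


def min_max(values):
    """Rescale numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def total_by(rows, key, value):
    """Aggregate: sum the `value` field for each distinct `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


ages = [20, 35, 50, 80]
normalised_ages = min_max(ages)

purchases = [
    {"Channel": "cash", "Spend": 10.0},
    {"Channel": "cash", "Spend": 5.0},
    {"Channel": "credit card", "Spend": 7.0},
]
spend_per_channel = total_by(purchases, "Channel", "Spend")
```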

Evaluation criteria: A rich set of evaluation criteria is available in Weka.

Only the following seven criteria are used:

Correctly Classified

Incorrectly Classified

Kappa Statistic

Mean Absolute Error

Root Mean Squared Error

Relative Absolute Error

Root Relative Squared Error

We will show the results of the above evaluation criteria applied to two scenarios

based on the customer data records maintained by the shopping centre.
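To make the Kappa statistic concrete, the sketch below computes accuracy and Cohen's kappa from a confusion matrix. The matrices are assumptions: Weka's actual confusion matrices are not reproduced in this report. The first one assumes a 60/40 class split with a classifier that always predicts the majority class, which gives 60 correct instances and a kappa of 0, the same pair of values reported for the Decision Table on the training set in Table 3.1.

```python
def accuracy_and_kappa(confusion):
    """confusion[i][j] = number of instances of actual class i predicted as class j."""
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    if expected == 1.0:
        return observed, 0.0  # degenerate case; convention chosen for this sketch
    return observed, (observed - expected) / (1 - expected)


# Assumed matrix: every instance is predicted as the majority class.
majority_only = [[60, 0], [40, 0]]
acc_majority, kappa_majority = accuracy_and_kappa(majority_only)

# Assumed matrix with the same accuracy but some agreement beyond chance.
mixed = [[40, 20], [20, 20]]
acc_mixed, kappa_mixed = accuracy_and_kappa(mixed)
```

Both matrices give 60% accuracy, but only the second has a positive kappa, which is why kappa is reported alongside the number of correctly classified instances.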

3.1.3 Methods

Four DM algorithms were tested, as follows:

Naïve Bayes Algorithm: Naïve Bayes is a well-known algorithm in machine learning. It is a simple and efficient learning method. The Naïve Bayes classifier is an approximation to an ideal Bayesian classifier, which would classify an example based on the probability of each class given the example’s feature variables. The main assumption is that the different features are independent of each other given the class of the example.

Decision Table: A decision table is based on logical relationships, just like a truth table. It is a tool that helps us to check combinations of conditions for both completeness and inconsistency.

Decision Tree (J48): J48 attempts to account for noise and missing data. It

also deals with numeric attributes by determining where thresholds for

decision splits should be placed. The main parameters that can be set for this

algorithm are the confidence threshold, the minimum number of instances per

leaf and the number of folds for reduced error pruning.

Association: This technique finds groups of items that tend to occur together in a transaction, that is, it searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis. We also identified and performed an association rule mining task. This involves:

(1) Finding rules, including appropriate parameter setting,

(2) Determining which of the resulting rules are interesting,

(3) Figuring out how the interesting rules could be useful.
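Whether a rule is interesting is usually judged by its support and confidence. The following minimal sketch shows how both measures are computed for a rule such as bread ==> milk. The transactions are hypothetical supermarket baskets invented for the example.

```python
# Hypothetical supermarket baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(set(itemset) <= t for t in transactions)


def support(itemset, transactions):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of the transactions containing `antecedent` that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)


rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
```

Here three of the five baskets contain both items, and three of the four baskets with bread also contain milk. Minimum support and minimum confidence are the kind of parameters referred to in step (1).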

3.1.4 Result and Discussion

This section provides the simulation results produced by Weka. As noted earlier, three types of classifiers are selected under the “Classification” technique, namely the Naïve Bayes algorithm, the Decision Table algorithm and the J48 algorithm (Decision Tree), together with the Associative Rules.

Naïve Bayes: Fig. 3.4 shows the output of the Naïve Bayes algorithm that is used to

analyze the data.

Figure 3.4 Results returned by the Naïve Bayes classifier.

Fig. 3.4 shows the result of the analysis for “transportation” based on Naïve Bayes. The result reveals that both male and female customers prefer to use private transport when travelling to the shopping centre.
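The independence assumption behind Naïve Bayes (Section 3.1.3) can be illustrated with a small categorical classifier. This is a sketch, not Weka's implementation: the six training rows are hypothetical, and Laplace smoothing is an assumption added so that unseen values do not produce zero probabilities.

```python
from collections import Counter, defaultdict

# Hypothetical training rows using the field names of the sample data.
rows = [
    {"Sex": "female", "Channel": "cash", "Transportation": "private"},
    {"Sex": "female", "Channel": "credit card", "Transportation": "private"},
    {"Sex": "male", "Channel": "credit card", "Transportation": "private"},
    {"Sex": "male", "Channel": "cash", "Transportation": "public"},
    {"Sex": "female", "Channel": "cash", "Transportation": "public"},
    {"Sex": "male", "Channel": "credit card", "Transportation": "private"},
]


def train(rows, target):
    class_counts = Counter(row[target] for row in rows)
    value_counts = defaultdict(Counter)  # (feature, class) -> counts of each value
    values_seen = defaultdict(set)       # feature -> distinct values
    for row in rows:
        for feature, value in row.items():
            if feature != target:
                value_counts[(feature, row[target])][value] += 1
                values_seen[feature].add(value)
    return class_counts, value_counts, values_seen


def predict(model, example):
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total  # prior P(class)
        for feature, value in example.items():
            # Laplace-smoothed P(value | class); features are treated as independent.
            seen = value_counts[(feature, cls)][value]
            score *= (seen + 1) / (n_cls + len(values_seen[feature]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}


model = train(rows, "Transportation")
probs = predict(model, {"Sex": "male", "Channel": "credit card"})
```

For these made-up rows, `probs` assigns a higher probability to "private" than to "public".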

Decision Table: Fig. 3.5 shows the output for the case study, which uses 100 training instances and produces 1 rule; non-matches are covered by the majority class.

Figure 3.5 The decision table of data analysis.

Decision Tree (J48): Fig. 3.6 shows the output produced by the J48 algorithm.

Figure 3.6 J48 pruned tree of “sex” analysis.

The software listed all the possible rules of the decision.

Below are some of the simulation results:

If Sex = female and transportation = private then cash

If Sex = female and transportation = public and age less than or equal to 66 then credit card

If Sex = female and transportation = public and age greater than 66 then cash

If Sex = male then credit card
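The four rules listed above can be written directly as a small function. The function name is hypothetical, and the sketch assumes that sex is either "male" or "female" and that transportation is either "private" or "public", as in the sample data.

```python
def predict_channel(sex: str, transportation: str, age: int) -> str:
    """Payment channel predicted by the four listed rules."""
    if sex == "male":
        return "credit card"
    # Remaining rules apply to female customers.
    if transportation == "private":
        return "cash"
    # Female customers using public transport: split on age 66.
    return "credit card" if age <= 66 else "cash"


example_channel = predict_channel("female", "public", 45)
```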

Association: Fig. 3.7 shows the results of selecting the “Apriori” algorithm under “Associate Rules”. The algorithm produces many rules, but only a few are useful for effective decision making. It cannot generate the best rules because the data are insufficient.

Figure 3.7 “Associate Rules”

To make sure the “Apriori” algorithm under “Associate Rules” works well, some new fields have been added to the sample data: relationship, region, brand and race. Age has been removed because the “Apriori” algorithm does not support numeric data.

Figure 3.8 Sample Data (CSV format).

Figure 3.9 Sample Data (Notepad format).

Figure 3.10 “Associate Rules”.

Below is the result of the simulation:

Channel=Cash Transportation=Private Race=Chinese ==> Sex=Female

3.1.5 Comparison between Naïve Bayes (NB), Decision Table (DT) and Decision

Tree (J48)

Table 3.1 shows the comparison results of Naïve Bayes (NB), Decision Table (DT) and J48. Overall, J48 gives slightly better results than DT and NB under cross-validation, since it misclassifies the fewest instances (41, against 42 for NB and 44 for DT).

Criterion                      Use Training Set   Cross Validation   Percentage Split

Naïve Bayes (NB)
Correctly Classified           60                 58                 19
Incorrectly Classified         40                 42                 15
Kappa Statistic                0.0909             0.0455             0.0449
Mean Absolute Error            0.4562             0.4671             0.4831
Root Mean Squared Error        0.4777             0.4897             0.5114
Relative Absolute Error        94.97%             97.23%             99.37%
Root Relative Squared Error    97.52%             99.97%             102.28%

Decision Table (DT)
Correctly Classified           60                 56                 19
Incorrectly Classified         40                 44                 15
Kappa Statistic                0                  -0.0577            0
Mean Absolute Error            0.4812             0.4855             0.4868
Root Mean Squared Error        0.4899             0.4963             0.4994
Relative Absolute Error        100.17%            101.05%            100.14%
Root Relative Squared Error    100.01%            101.31%            99.87%

Decision Tree (J48)
Correctly Classified           60                 59                 19
Incorrectly Classified         40                 41                 15
Kappa Statistic                0                  -0.0199            0
Mean Absolute Error            0.48               0.4827             0.4857
Root Mean Squared Error        0.4899             0.4954             0.5004
Relative Absolute Error        99.9184%           100.4763%          99.9137%
Root Relative Squared Error    99.9992%           101.1204%          100.0864%

TABLE 3.1 COMPARISON BETWEEN NB, DT, J48 BY DATA NUMERIC.

3.1.6 Comparison between classifiers with time taken to build a model

The results in Table 3.2 show that J48 has the highest number of correctly classified instances, followed by Naïve Bayes and lastly the Decision Table algorithm. The Decision Table takes the longest time to build its model (0.03 seconds), while Naïve Bayes and J48 both report 0 seconds.

Algorithm                         Naïve Bayes   J48   Decision Table
Correctly classified instances    58            59    56
Time taken to build (seconds)     0             0     0.03

TABLE 3.2 COMPARISON BETWEEN CLASSIFIERS WITH TIME TAKEN TO

BUILD A MODEL.

3.2 Association Rules Apriori Algorithm [24]

3.2.1 Apriori Algorithm

The Apriori algorithm mines for associations among items in a large database of sales transactions. It is an important database mining function. An example of such an association is that a customer who purchases a keyboard also tends to buy a mouse at the same time.
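The level-wise search that Apriori performs can be sketched as follows. This is a minimal textbook-style illustration, not the Weka source code modified in Chapter 5; the baskets and the minimum support of 0.6 are assumptions chosen for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def frequent(candidates):
        # One pass over the database to count each candidate.
        found = {}
        for c in candidates:
            s = sum(c <= t for t in transactions) / n
            if s >= min_support:
                found[c] = s
        return found

    items = {item for t in transactions for item in t}
    level = frequent({frozenset([i]) for i in items})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result


# Hypothetical baskets echoing the keyboard and mouse example.
baskets = [
    {"keyboard", "mouse"},
    {"keyboard", "mouse", "monitor"},
    {"keyboard", "mouse"},
    {"monitor"},
    {"keyboard", "monitor"},
]
frequent_itemsets = apriori(baskets, min_support=0.6)
```

With these baskets, the only frequent 2-itemset is {keyboard, mouse}, which matches the example in the text.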

3.2.2 Limitations of Apriori Algorithm

The Apriori algorithm is simple and easy to execute, but it has some limitations. The main limitation is the cost of handling a huge number of candidate sets when there are many frequent itemsets, a low minimum support or large itemsets. For example, if there are 10^4 frequent 1-itemsets, it needs to generate more than 10^7 candidate 2-itemsets and accumulate and test their occurrence frequencies. Moreover, to discover a frequent pattern of size 100, for example {v1, v2, v3, …, v100}, it must generate on the order of 2^100 candidate itemsets in total, so candidate generation is costly and time-consuming. In addition, it repeatedly scans the database and checks a large set of candidates by pattern matching. The Apriori algorithm is therefore very inefficient when memory capacity is limited and the number of transactions is large.
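The two figures quoted above can be checked with a short calculation. The number of candidate 2-itemsets built from 10^4 frequent 1-itemsets is the number of unordered pairs, and every non-empty subset of a frequent 100-itemset is itself frequent, so all of them must be generated. The variable names are only illustrative.

```python
from math import comb

# Unordered pairs of 10^4 frequent 1-itemsets: candidate 2-itemsets.
candidate_pairs = comb(10**4, 2)

# Non-empty subsets of a 100-item pattern {v1, ..., v100}.
candidate_subsets = 2**100 - 1

print(f"{candidate_pairs:,}")
print(f"{candidate_subsets:.3e}")
```

The first value is 49,995,000, which is indeed more than 10^7, and the second is about 1.27 × 10^30.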

Chapter 4

4.0 DESIGN

4.1 Data Mining Process

The data mining process has 8 steps.

1. Translate the business problem into a data mining problem.

2. Select appropriate data.

3. Analyze the data.

4. Create a model set

5. Fix problems with the data.

6. Transform data

7. Build models.

8. Deploy models

Figure 4.1 Data Mining is not a linear process.

As shown in Figure 4.1, the data mining process is best considered as a set of nested loops rather than a straight line. The steps do have an order, but it is not necessary to finish one step completely before moving on to the next. After the next step is done, it may be necessary to revisit the previous step.

4.1.1 Step One: Translate the business problem into a data mining problem

The first step is to explore the available data and make a list of candidate business problems. A well-defined business problem leads the data mining project to the proper destination and to a solution of the problem. The data mining goals for a particular project should be specific rather than broad and general. This makes it easier to monitor progress in achieving them. Examples of specific goals:

Identify customers who are unlikely to renew their subscriptions.

Forecast customer population in future months.

List products whose sales are at risk if we discontinue wine and beer sales.

4.1.2 Step Two: Select appropriate data

Data mining requires data. Ideally, the data would already be resident in a corporate data warehouse: cleansed, available, historically accurate and frequently updated. The data sources that are useful and valuable vary from problem to problem and from industry to industry. A few samples of useful data:

Point of sale data (coupons, discount)

Credit card charge records

Direct mail response records

4.1.3 Step Three: Analyze the data

It is good practice to examine the dataset and understand the data file when it comes from a new source. Data visualization is the best way to get to know the data.

Figure 4.2 Sample data in ARFF Viewer.

Figure 4.3 Data Visualize.

Figure 4.4 Visualization of data by age and sex.

4.1.4 Step Four: Create a Model Set for Prediction

Creating a model set for prediction requires assembling data from different sources.

When making a prediction, the predictive model uses data from the past, finding

patterns to make predictions about the future. Time can always be divided into three

periods: the past, present, and future.

Figure 4.5 Data from the past mimics data from the past, present, and future.

4.1.5 Step Five: Fix Problem with the Data

Figure 4.6 Sample data.

Variables such as address, post, telephone number and email are useful information, but not all data mining algorithms can handle them. So we have to fix the data by replacing them with other attributes.

4.1.6 Step Six: Transform Data to Bring Information to the Surface

Once all the steps above have been done, it is time to bring the information to the surface by adding derived fields, combining multiple variables, creating ratios and applying formulas such as logarithms. Different people spend different amounts of money on a product: some buy more and some buy less. So it is wiser to convert the money values to proportions of each customer's spending.
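The conversion from money values to proportions can be sketched as follows. The category names and amounts are hypothetical; the point is that a small spender and a large spender with the same buying pattern end up with the same derived values.

```python
def spending_proportions(spend):
    """Convert absolute spending per category into each category's share of the total."""
    total = sum(spend.values())
    if total == 0:
        return {category: 0.0 for category in spend}
    return {category: amount / total for category, amount in spend.items()}


# Hypothetical customers with the same pattern but different budgets.
small_spender = spending_proportions({"food": 50.0, "clothing": 150.0})
big_spender = spending_proportions({"food": 500.0, "clothing": 1500.0})
```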

4.1.7 Step Seven: Build Models

Below is a sample model based on the sample data used in Chapter 3.

Figure 4.7 Data Mining Model.

The diagram illustrates the flow of data when a mining structure is processed, and

when a mining model is processed.

4.1.8 Step Eight: Deploy Models

Deploying a model means moving it from the data mining environment to the scoring

environment. Once a model has been created, the model can then be used to make

predictions for new data. The model would be built by using historical customer data.

This process is illustrated below:

Figure 4.8 Data Mining Process Model.

The process of making predictions for data is called “scoring”. The process of using the model is different from the process that creates the model. After it is created, a model is used multiple times to score different databases. For example, it can be used to predict the probability that a customer will purchase an item during a sale.

Figure 4.9 Data Mining Scoring Process Model.

In the end, the model generates a prediction number between 0 and 1 as its output; this number is known as the “score”.

Figure 4.10 “Scoring” Prediction.
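One common way to obtain a score between 0 and 1 is to pass a weighted sum of the customer's attributes through the logistic function. The sketch below assumes such a model purely for illustration; the feature name, weight and bias are hypothetical, and the report does not state which scoring function is used.

```python
import math


def score(features, weights, bias=0.0):
    """Map a weighted sum of features to a score strictly between 0 and 1."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical model: more visits in the past month raise the purchase score.
customer_score = score({"visits": 3}, {"visits": 0.8}, bias=-1.0)
```

The same `score` function can be applied repeatedly to new customer records, which mirrors the idea that a model is built once and then used many times.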

Chapter 5

5.0 IMPLEMENTATION

5.1 Data Mining Process

The data mining process has 8 steps.

1. Translate the business problem into a data mining problem.

2. Select appropriate data.

3. Analyse the data.

4. Create a model set.

5. Fix problems with the data.

6. Transform data.

7. Build models.

8. Deploy models.

5.1.1 Translate the business problem into a data mining problem

Example Scenario

A shopping centre wants to know about its sales for the past 5 months so that it can forecast and achieve its target sales for future months. Below are the specific goals:

List products whose sales are at risk if we discontinue beer sales.

Identify which products should be promoted in future months.

5.1.2 Select appropriate data

Data Cleaning Process (Before)

Figure 5.1 Appropriate data.

Above is a CSV file that contains 1000 user/customer profiles for testing purposes. These data contain errors and inconsistencies, and some records lack attribute values. A data cleaning procedure is needed to clean the data before testing by filling in the missing values, smoothing noisy data, identifying or removing outliers and resolving inconsistencies in the data.

Included fields are:

userID

smoker

drink_level

dress_preference

ambience

transport

marital_status

hijos

birth_year

interest

personality

religion

activity

color

weight

budget

height

Upayment

Fcuisine

5.1.3 Analyse the data

This step is to examine the dataset and understand the data file from a new source by

using Weka ARFF Viewer and Weka Explorer Visualize.

Figure 5.2 Analyse data in ARFF Viewer.

Figure 5.3 Data Visualize.

Figure 5.4 Visualization of data by smoker and drink_level.

5.1.4 Create a model set

Drink         Month 1   Month 2   Month 3   Month 4   Month 5
alcohol       91        106       79        70        118
non_alcohol   82        43        69        69        44
juice         27        51        61        61        38

[Chart: “Amount of drinks sold for the past 5 months”; x-axis: Month; y-axis: Amount of drinks]

Figure 5.5 Prediction Model

A model set is created for predicting the amount of drinks sold, based on the data set for the past 5 months. When making a prediction, the predictive model uses data from the past, finding patterns to make predictions about the future. From the model set, we found that alcohol drinks had the highest sales during the 5-month period. Thus, we should not discontinue beer sales. We can run promotions for non_alcohol and juice during the 3rd and 4th months to boost their sales.
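The observation that alcohol has the highest sales can be verified directly from the monthly figures in the table for Figure 5.5. The sketch below uses those figures; only the variable names are introduced here.

```python
# Monthly amounts of drinks sold, months 1 to 5, from the table above.
sales = {
    "alcohol":     [91, 106, 79, 70, 118],
    "non_alcohol": [82, 43, 69, 69, 44],
    "juice":       [27, 51, 61, 61, 38],
}

# Total sold per drink type over the 5 months.
totals = {drink: sum(monthly) for drink, monthly in sales.items()}

# Best-selling drink type in each month.
best_per_month = [max(sales, key=lambda d: sales[d][m]) for m in range(5)]
```

The totals are 464 for alcohol, 307 for non_alcohol and 238 for juice, and alcohol is the best seller in every one of the five months.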

5.1.5 Fix problems with the data

Data Cleaning Process (After)

Figure 5.6 Fixed dataset.

The figure above shows the fixed dataset after the data cleaning process. Variables such as address, post, telephone number and email are useful information, but not all of them can be handled by the data mining algorithm of this project. So we have to choose certain attributes that can be used in Associate Rules and fix the data by replacing the rest with other attributes.

5.1.6 Transform data

Figure 5.7 Transformed data.

Comparing Figure 5.7 with the previous Figure 5.6, there are some changes to the “income” attribute. Associate Rules are unable to read numeric data, so we have to convert the numerical data into nominal data: the “income” attribute values are converted to low, medium or high instead of numbers.
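The conversion of a numeric “income” value into a nominal one can be sketched as a simple binning function. The two cut-off values below are assumptions made for the example; the thresholds actually used for the dataset are not stated in this report.

```python
def discretise_income(income: float, low_max: float = 2000, medium_max: float = 5000) -> str:
    """Map a numeric income to one of the nominal values low, medium or high."""
    if income <= low_max:
        return "low"
    if income <= medium_max:
        return "medium"
    return "high"


# Hypothetical incomes converted to nominal values.
nominal_incomes = [discretise_income(v) for v in (1500, 3200, 8000)]
```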

5.1.7 Build models

Figure 5.8 Data Mining Model.

The diagram illustrates the flow of data when a mining structure is processed, and when a mining model is processed. The model is filtered into 3 models. To build a model, we can use parameters to adjust the algorithm and apply filters to the dataset, creating different results. The mining model object contains summaries and patterns that can be used for prediction. Below are the figures of the 3 models:

Model 1

[Chart “Model 1”: axis drinks; categories alcohol, non_alcohol, juice; values 374, 155, 107; filter religion = non_muslim]

Figure 5.9 Model 1

Model 2

[Chart “Model 2”: axis shopping_cart; categories beef, chicken, pork; values 104, 96, 84; filter food_preference = non_halal]

Figure 5.10 Model 2

Model 3

[Chart “Model 3”: axis payment; categories cash, credit card, debit card; values 349, 618, 33; 1000 customers]

Figure 5.11 Model 3

5.1.8 Deploy model

The last step in the data mining process is to deploy the best-performing models to a production environment.

Use the models to create predictions, which you can then use to make business

decisions.

Update the models dynamically, as more data comes into the organization.

5.2 Apriori Algorithm Source Code

Figure 5.12 Original Code Part 1

Figure 5.13 Modified Code Part 1

Figure 5.14 Original Code Part 2

Figure 5.15 Modified Code Part 2

5.3 Import dataset into WEKA

The same dataset was imported into WEKA to compare the results of the original Apriori
algorithm and the modified Apriori algorithm. The figures below show the results:

Original

Figure 5.16 Result of original code.

The original code generated 10 association rules. The total time was 47 ms.

Modified

Figure 5.17 Result of modified code.

The modified code also generated 10 association rules. The total time was 44 ms, so the
runtime of the Apriori algorithm improved by 3 ms (about 6%) compared with the original code.
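
For readers who want to repeat this kind of comparison outside WEKA, the following Python sketch shows a minimal Apriori frequent-itemset search together with a small timing helper. It is neither WEKA's implementation nor the modified code from Figures 5.13 and 5.15; the function names, the sample baskets and the support threshold are all assumptions made for illustration. Timing several runs and taking the median gives a steadier figure than a single run when the difference is only a few milliseconds.

```python
import time
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


def median_runtime_ms(func, *args, repeats=5):
    """Run func(*args) several times and return the median time in ms."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    return timings[len(timings) // 2]


# Hypothetical baskets, invented for illustration only.
baskets = [
    {"beef", "alcohol", "cash"},
    {"beef", "alcohol"},
    {"chicken", "juice", "cash"},
    {"beef", "alcohol", "juice"},
]

frequent = apriori(baskets, min_support=2)
print(sorted(tuple(sorted(s)) for s in frequent))
print(f"median runtime: {median_runtime_ms(apriori, baskets, 2):.3f} ms")
```

The same helper can time two variants of an algorithm on the same data, which mirrors the original-versus-modified comparison reported above.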

Chapter 6

6.0 CONCLUSION

6.1 Progress and Outcome

The first phase of this project was completed successfully. First, the problem was
identified and a list of objectives to be achieved was set. Then the research stage
began. It involved a literature review as well as a review of data mining techniques.
Furthermore, research was carried out on Customer Relationship Management (CRM) in
data mining and on data mining application areas.

The analysis section began by analysing the current situation of shopping centres
and the way people spend their money in the past, present and future. The most
important part, however, was testing several data mining algorithms and comparing them
to determine which method is best. Finally, the design section began with the design
of a data mining process and a model set.

The section after design is implementation. The implementation began with the data
mining process (methodology), followed by building the model set for prediction. The
modified code was then compiled using Apache Ant so that it could be used by WEKA.
Lastly, the best rules were generated by importing the dataset into WEKA, and the run
times of the original code and the modified code were compared in separate WEKA
builds. The results show that the run time improved with the modified code.

6.2 Problems Encountered

Difficulty in using the data mining tool, WEKA.

Difficulty in obtaining datasets.

Lack of experience in designing a model.

Lack of Internet resources.

Difficulty in modifying the source code.

Time limitations, as several course requirements were due at the same time.

6.3 Future Planning

There is a newer data mining tool named SPMF. SPMF is an open-source data mining
library written in Java, specialized in pattern mining. It offers implementations of
86 data mining algorithms for sequential pattern mining, association rule mining,
itemset mining, sequential rule mining and clustering. In the future, I hope to research
this software and compare it with my current project.

APPENDIX

Project 1 – Gantt Chart

Semester 2

Weeks: 21-Nov, 28-Nov, 5-Dec, 12-Dec, 19-Dec, 26-Dec, 2-Jan, 9-Jan, 16-Jan, 23-Jan, 30-Jan

No. Activities
1 In-depth research into the Apriori algorithm
2 Select appropriate data
3 Analyse and prepare dataset for simulation
4 Modify Apriori algorithm
5 Validate model
6 Documentation

Project 2 – Gantt Chart