Improving Association Rules
Final Year Project 2
Development of Data Mining Algorithms for Analysing
Shopping Centre Dataset
by
Name: Khoo See Jun
ID: SN089817
Project Supervisor: Alicia Tang Yee Chong, Dr.
DECLARATION
I hereby declare that this report, submitted to University Tenaga Nasional as a
partial fulfilment of the requirements for the Bachelor of Computer Science (System
and Networking) has not been submitted as an exercise for a degree at any other
university. I also certify that the work described here is entirely my own except for
excerpts and summaries whose sources are appropriately cited in the references.
This report may be made available within the university library and may be
photocopied or loaned to other libraries for the purposes of consultation.
23 February 2015
KHOO SEE JUN
SN089817
APPROVAL SHEET
This thesis entitled:
“Development of Data Mining algorithms for analysing public domain databases”
Submitted by:
KHOO SEE JUN (SN089817)
In fulfilment of the requirements for the degree of Bachelor of Computer Science
(System and Networking), College of Information Technology, University Tenaga
Nasional, has been accepted.
Supervisor: Alicia Tang Yee Chong, Dr.
Signature: …………………………….
Date:
ABSTRACT
The objective of this project is to develop data mining algorithms for analysing
public domain databases. The public domain chosen for this project is the shopping
centre. My task is to identify and perform an association rule mining task, which
involves selecting an appropriate data set, preparing and preprocessing the data,
finding rules (including appropriate parameter settings), determining which of the
resulting rules are interesting and working out how the interesting rules could be
useful.
This work analyses well-known DM techniques in the Weka workbench and reports the
simulation results obtained on sample data by applying four selected DM techniques and
classifiers in the open source workbench to Customer Relationship Management
(CRM) in a shopping centre.
The design of the data mining process is presented in Chapter 4, which shows the
data mining workflow.
The next chapter covers implementation. The implementation began with the data mining
process (methodology), followed by building a model set for prediction. The modified
code was then compiled using Apache Ant so that it could be used by WEKA.
Lastly, the best rules were generated by importing the dataset into WEKA, and the
run times of the original and modified code were compared in different WEKA
installations. The results show that the run time improved with the modified code.
TABLE OF CONTENTS
DECLARATION
APPROVAL SHEET
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1: INTRODUCTION
1.0 INTRODUCTION
1.1 Project Introduction
1.2 Problem Statement
1.3 Objectives
1.4 Benefits
1.5 Project Scope
1.6 Expected Outcome
1.7 Gantt Chart
CHAPTER 2: RESEARCH AND LITERATURE REVIEW
2.0 LITERATURE REVIEW
2.1 Data Mining Techniques
2.1.1 Association
2.1.2 Classification
2.1.3 Prediction
2.1.4 Sequential Patterns (Long-term data)
2.1.5 Clustering
2.1.6 Decision Trees (J48)
2.1.7 Decision Table
2.2 Customer Relationship Management (CRM)
2.2.1 What is Customer Relationship Management (CRM)?
2.2.2 How CRM is Used Today
2.2.3 The CRM Strategy
2.2.4 The Impact of Technology on CRM
2.2.5 The Benefits of CRM
2.2.6 Data Mining and Customer Relationship Management
2.2.7 Review of Data Mining Tools in CRM
2.2.8 Data Mining Tools Applications in CRM
2.3 Data Mining Applications
2.3.1 Banking/Finance (Financial Data Analysis)
2.3.2 Retail/Marketing Industry
2.3.3 Telecommunication Industry
2.3.4 Biological Data Analysis
2.3.5 Medical/Pharma
2.3.6 Insurance and Health Care
2.3.7 Other Scientific Applications
2.3.8 Intrusion Detection
2.4 Data Mining Systems
2.4.1 Data Mining System Classification
2.4.2 Data Mining System Products
2.4.3 Choosing Data Mining System
2.4.4 Trends in Data Mining
2.5 Data Mining Process Model
2.5.1 Overview of Data Mining Life Cycle
CHAPTER 3: ANALYSIS
3.0 ANALYSIS
3.1 Data Mining for Shopping Centers
3.1.1 Free Sample Data for Testing Purpose
3.1.2 Related Work
3.1.3 Methods
3.1.4 Result and Discussion
3.1.5 Comparison between Naïve Bayes (NB), Decision Table (DT) and Decision Tree (J48)
3.1.6 Comparison between classifiers with time taken to build a model
3.2 Association Rules Apriori Algorithm
3.2.1 Apriori Algorithm
3.2.2 Limitations of Apriori Algorithm
CHAPTER 4: DESIGN
4.0 DESIGN
4.1 Data Mining Process
4.1.1 Step One: Translate the business into a data mining problem
4.1.2 Step Two: Select appropriate data
4.1.3 Step Three: Analyze the data
4.1.4 Step Four: Create a Model Set for Prediction
4.1.5 Step Five: Fix Problem with the Data
4.1.6 Step Six: Transform Data to Bring Information to the Surface
4.1.7 Step Seven: Build Models
4.1.8 Step Eight: Deploy Models
CHAPTER 5: IMPLEMENTATION
5.0 IMPLEMENTATION
5.1 Data Mining Process
5.1.1 Translate the business into a data mining problem
5.1.2 Select appropriate data
5.1.3 Analyze the data
5.1.4 Create a Model Set for Prediction
5.1.5 Fix Problem with the Data
5.1.6 Transform Data to Bring Information to the Surface
5.1.7 Build Models
5.1.8 Deploy Models
5.2 Apriori Algorithm Source Code
5.3 Import dataset into WEKA
CHAPTER 6: CONCLUSION
6.0 CONCLUSION
6.1 Progress and Outcome
6.2 Problems Encountered
6.3 Future Planning
REFERENCES
APPENDIX
LIST OF TABLES
TABLE 3.1
TABLE 3.2
LIST OF FIGURES
Figure 2.1 Clustering (Sample Diagram)
Figure 2.2 Decision Tree (J48)
Figure 2.3 Data Mining Applications Useful For Companies
Figure 2.4 Data Mining System Classification
Figure 2.5 Data Mining Process Model
Figure 3.1 Sample Data (CSV format)
Figure 3.2 Sample Data (Notepad format)
Figure 3.3 Block Diagram
Figure 3.4 Results returned by the Naïve Bayes classifier
Figure 3.5 The decision table of data analysis
Figure 3.6 J48 pruned tree of “sex” analysis
Figure 3.7 “Associate Rules”
Figure 3.8 Sample Data (CSV format)
Figure 3.9 Sample Data (Notepad format)
Figure 3.10 “Associate Rules”
Figure 4.1 Data Mining is not a linear process
Figure 4.2 Sample data in ARFF Viewer
Figure 4.3 Data Visualize
Figure 4.4 Visualization of data by age and sex
Figure 4.5 Data from the past mimics data from the past, present, and future
Figure 4.6 Sample data
Figure 4.7 Data Mining Model
Figure 4.8 Data Mining Process Model
Figure 4.9 Data Mining Scoring Process Model
Figure 4.10 “Scoring” Prediction
Figure 5.1 Appropriate data
Figure 5.2 Analyse data in ARFF Viewer
Figure 5.3 Data Visualize
Figure 5.4 Visualization of data by smoker and drink_level
Figure 5.5 Prediction Model
Figure 5.6 Fixed dataset
Figure 5.7 Transformed data
Figure 5.8 Data Mining Model
Figure 5.9 Model 1
Figure 5.10 Model 2
Figure 5.11 Model 3
Figure 5.12 Original Code Part 1
Figure 5.13 Modified Code Part 1
Figure 5.14 Original Code Part 2
Figure 5.15 Modified Code Part 2
Figure 5.16 Result of original code
Figure 5.17 Result of modified code
CHAPTER 1
1.0 INTRODUCTION
1.1 Project Introduction
Far too many companies sit on loads of good customer data and do nothing with it.
They do not realise that this data is a gold mine of insight that can increase
customer loyalty, unlock hidden profitability and reduce client churn. Data mining
(knowledge discovery) is the process companies use to turn raw data into useful
information. Computer-assisted software digs through and analyses enormous sets of
data to extract hidden predictive information from large databases. It is a powerful
new technology with great potential to help companies focus on the most important
information in their data warehouses. Businesses can learn more about their customers
and develop more effective marketing strategies, as well as increase sales and
decrease costs.
Grocery stores are well-known users of data mining techniques. Many supermarkets
offer free loyalty cards to customers that give them access to reduced prices not
available to non-members. The cards make it easy for stores to track who is buying
what, when they are buying it, and at what price. The stores can then use this data,
after analyzing it, for multiple purposes, such as offering customers coupons that are
targeted to their buying habits and deciding when to put items on sale and when to sell
them at full price. Data mining tools predict behaviours and future trends, allowing
businesses to make proactive, knowledge-driven decisions. Data mining tools can
answer business questions that traditionally were too time consuming to resolve. They
scour databases for hidden patterns, finding predictive information that experts may
miss because it lies outside their expectations.
1.2 Problem Statement
Most companies waste tons of useful customer data by doing nothing with it. They do
not know exactly what their customers need or what they are missing. By
engaging in data mining, we can gain greater insight into external conditions, internal
processes, a company’s market and its customers. We also gain predictive capabilities
that can be used both in strategic planning and in daily interactions. These insights and
predictive capabilities take a company’s business results to the next level by
improving its marketing campaign management, up-sell and cross-sell activities,
customer retention, risk analysis and fraud detection efforts. This project aims to
help a company create powerful strategies, make fast and feasible decisions and
achieve competitive advantage in the future.
1.3 Objectives
The objectives of this project are:
1. To identify parameters of the algorithm
2. To design new data mining algorithms
3. To develop new data mining algorithms
1.4 Benefits
The benefits of data mining in businesses are:
1. More Money. Money is always a good thing in business. When mined data
unearths the kinds of projects past donors contributed to or the types of
products customers have purchased in the past, or lets a not-for-profit put a
number on statistics for a grant proposal, it can result in serious cash. Once a
business knows who the top donors are or what its customers want, it can
customize approaches and outreach.
2. Improve Branding and Marketing. Data can reveal a number of things like
what direction the marketing department should take. For example, there might
have been a recent customer survey asking about what services or products
consumers want to see. That kind of information is gold, and a marketing
department can do wonders with it. If a survey or any feedback is being
collected, put it to use.
3. Streamline Outreach. Whether a business depends on e-mail blasts, print ads
or social media, knowing how customers want to be approached is important.
Data that includes relevant e-mail addresses, mailing addresses or social media
pages can help streamline any mailers or outreach. It also saves money,
whether it's in postage or time, by keeping consumer information updated.
4. Tap into New Markets. There are some databases available that businesses
can purchase, or the databases might be available to the public free of charge.
Business owners can use the databases of others to find out more information
about potential consumers and identify any holes in the current tactics.
However, when handling outside databases, it's especially important to
practice caution. Privacy is a big legal issue, and sometimes it's easy to overstep
boundaries.
5. Share and Share Alike. Sharing information is largely restricted by law, but it
all depends on what the customer has signed. For example, some coalitions may
share information on consumers in order to provide better services. This can be
dangerous ground, but if it's legally acceptable, some business owners can
access the data of other partner organizations, too. This greatly expands the
availability of information and can provide more data (and likely, in turn, more
accurate data) to improve the bottom line, services and research.
6. Learn from the Past. Data mining past information and comparing it to the
current situation can reveal a lot. Graphs can easily show any troubling sales
years, spikes or other trends that should be taken into consideration. Seeing the
ebb and flow of a business via data can provide insight that otherwise might be
overlooked. For example, a business that knows there's a history of high sales
in July can work on maximizing that month, while giving extra attention to
periods where sales slack.
1.5 Project Scope
Given databases of sufficient size and quality, data mining technology can generate
new business opportunities. The scope of this project covers:
1. Selecting an appropriate data set
2. Preparing and pre-processing the data
3. Finding rules and identifying parameters for the algorithm
1.6 Expected Outcome
The expected outcomes of the project are:
1. A critical review of data mining techniques
2. A dataset from a company
3. A design for a new data mining algorithm
1.7 Gantt Chart
CHAPTER 2
2.0 LITERATURE REVIEW
2.1 Data Mining Techniques [6], [10], [11]
2.1.1 Association
Association (or relation) is probably the best-known, most familiar and most
straightforward data mining technique. It makes a simple correlation between two or
more items, often of the same type, to identify patterns. For example, when tracking
people's buying habits, we might identify that a customer always buys cream when they
buy strawberries, and therefore suggest that the next time they buy strawberries they
might also want to buy cream.
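The strength of an association such as "strawberries, therefore cream" is commonly
measured by support (the share of baskets containing both items) and confidence (the
share of strawberry baskets that also contain cream). The following is a minimal
sketch only: the baskets are made-up data and the function names are hypothetical,
not part of this project's code.

```python
from typing import FrozenSet, List

# Hypothetical shopping baskets, invented purely for illustration.
BASKETS: List[FrozenSet[str]] = [
    frozenset({"strawberries", "cream", "bread"}),
    frozenset({"strawberries", "cream"}),
    frozenset({"strawberries", "milk"}),
    frozenset({"bread", "milk"}),
]


def support(itemset: FrozenSet[str], baskets: List[FrozenSet[str]]) -> float:
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
    baskets: List[FrozenSet[str]],
) -> float:
    """Support of (antecedent and consequent) divided by support of antecedent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support(frozenset({"strawberries", "cream"}), BASKETS)
rule_confidence = confidence(
    frozenset({"strawberries"}), frozenset({"cream"}), BASKETS
)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule is usually kept only if both values exceed thresholds chosen by the analyst.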
2.1.2 Classification
We can use classification to build up an idea of the type of customer, item, or object
by describing multiple attributes to identify a particular class. For example, we can
easily classify cars into different types (sedan, 4x4, convertible) by identifying
different attributes (number of seats, car shape, driven wheels). Given a new car, we
might assign it to a particular class by comparing its attributes with our known
definition. We can apply the same principles to customers, for example by classifying
them by age and social group.
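The car example can be sketched as a tiny rule-based classifier. This is illustrative
only: the attribute values and the rules are assumptions invented for this sketch,
whereas a real classifier would learn its rules from labelled examples.

```python
from dataclasses import dataclass


@dataclass
class Car:
    seats: int
    body_shape: str     # hypothetical values: "saloon", "boxy", "open-top"
    driven_wheels: int  # 2 or 4


def classify_car(car: Car) -> str:
    """Assign a car to a class using hand-written, illustrative rules."""
    if car.driven_wheels == 4:
        return "4x4"
    if car.body_shape == "open-top":
        return "convertible"
    return "sedan"


new_car = Car(seats=2, body_shape="open-top", driven_wheels=2)
print(classify_car(new_car))
```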
2.1.4 Prediction
Prediction is a wide topic and runs from predicting the failure of components or
machinery, to identifying fraud and even the prediction of company profits. Used in
combination with the other data mining techniques, prediction involves analyzing
trends, classification, pattern matching, and relation. By analyzing past events or
instances, we can make a prediction about an event. For example, in credit card
authorization, we can combine decision tree analysis and classification of an
individual's past transactions to identify whether a new transaction is fraudulent by
matching it against the individual's historical pattern.
2.1.5 Sequential patterns (Long-term data)
Sequential patterns are a useful method for identifying trends, or regular occurrences
of similar events. For example, with customer data we can identify that customers buy
a particular collection of products together at different times of the year. On an
online shopping website, we can use this information to automatically suggest that
certain items be added to a customer's shopping cart based on their frequency and past
purchasing history.
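One simple way to look for such patterns is to count how often one product is bought
before another across customers. The sketch below makes that idea concrete under
stated assumptions: the purchase histories are invented, and `count_sequences` is a
hypothetical helper rather than an established sequential-pattern algorithm.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List

# Hypothetical purchase histories: each customer's purchases in time order.
HISTORIES: Dict[str, List[str]] = {
    "alice": ["tent", "sleeping bag", "stove"],
    "bob": ["tent", "stove"],
    "carol": ["sleeping bag", "tent"],
}


def count_sequences(histories: Dict[str, List[str]]) -> Counter:
    """Count ordered pairs (earlier item, later item), once per customer."""
    counts: Counter = Counter()
    for purchases in histories.values():
        # combinations() keeps input order, so the first item was bought earlier.
        counts.update(set(combinations(purchases, 2)))
    return counts


pair_counts = count_sequences(HISTORIES)
print(pair_counts.most_common(1))
```

A shop could then suggest the later item of a frequent pair to customers who have
just bought the earlier one.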
2.1.6 Clustering
By examining one or more attributes or classes, we can group individual pieces of
data together to form a structured view. Clustering uses one or more attributes as the
basis for identifying a cluster of correlating results. Clustering is useful for
identifying different information because it correlates with other examples, so we can
see where the similarities and ranges agree.
Clustering can work both ways. We can assume that there is a cluster at a certain point
and then use our identification criteria to see if we are correct. The graph in Figure
2.1 shows a good example. In this example, a sample of sales data compares the age
of the customer to the size of the sale. It is not unreasonable to expect that people in
their twenties (before marriage and kids), fifties, and sixties (when the children have
left home), have more disposable income.
Figure 2.1 Clustering (Sample Diagram).
(Adopted from http://www.ibm.com/developerworks/library/ba-data-mining-techniques/)
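The age-versus-sale-size example can be sketched with k-means, one common clustering
method. Note the assumptions: the text does not name a specific clustering algorithm,
the (age, sale amount) pairs below are made up, and the starting centroids are picked
by hand so that the run is repeatable.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (customer age, sale amount)


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points: List[Point], start: List[Point], max_iter: int = 100):
    """Plain k-means: assign each point to its nearest centroid, then re-centre."""
    centroids = list(start)
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # leave an empty cluster's centroid where it is
                centroids[k] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids


# Made-up (age, sale amount) pairs, invented for illustration only.
SALES: List[Point] = [(22, 180), (25, 200), (27, 190), (38, 60), (41, 70), (44, 55)]
labels, centroids = kmeans(SALES, start=[SALES[0], SALES[3]])
print(labels)
```

Each label says which cluster a customer falls into, and each centroid summarises the
typical age and sale size of that cluster.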
2.1.7 Decision trees (J48)
Related to most of the other techniques (primarily classification and prediction), the
decision tree can be used either as a part of the selection criteria, or to support the use
and selection of specific data within the overall structure. Within the decision tree, we
start with a simple question that has two (or sometimes more) answers. Each answer
leads to a further question to help classify or identify the data so that it can be
categorized, or so that a prediction can be made based on each answer.
Figure 2.2 shows an example where you can classify an incoming error condition.
Figure 2.2 Decision Tree (J48).
(Adopted from http://www.ibm.com/developerworks/library/ba-data-mining-techniques/)
Decision trees are often used with classification systems to attribute type information,
and with predictive systems, where different predictions might be based on past
historical experience that helps drive the structure of the decision tree and the output.
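The question-and-answer structure described above can be sketched as a small tree that
is walked from the root to a leaf. This is a hand-built illustration: the questions,
answers and outcomes are hypothetical, and J48 (Weka's implementation of C4.5) would
instead learn the tree from training data.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Node:
    question: str                              # attribute to test at this node
    branches: Dict[str, Union["Node", str]]    # answer -> subtree or leaf label


def classify(tree: Union[Node, str], record: Dict[str, str]) -> str:
    """Follow the answers in `record` down the tree until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.branches[record[node.question]]
    return node


# Hypothetical tree for deciding how to approach a customer.
TREE = Node(
    question="has_loyalty_card",
    branches={
        "yes": Node(
            question="visits_per_month",
            branches={"high": "send premium coupon", "low": "send reminder"},
        ),
        "no": "offer loyalty card",
    },
)

print(classify(TREE, {"has_loyalty_card": "yes", "visits_per_month": "low"}))
```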
2.1.8 Decision Table
Decision tables, like decision trees, are classification models used for prediction. They
are induced by machine learning algorithms. A decision table consists of a
hierarchical table in which each entry in a higher level table gets broken down by the
values of a pair of additional attributes to form another table. The structure is similar
to dimensional stacking.
2.2 Customer Relationship Management (CRM) [15], [16], [17], [18]
2.2.1 What is Customer Relationship Management (CRM)?
CRM (customer relationship management) is an information industry term for
methodologies, software, and usually Internet capabilities that help a
company manage customer relationships in an organized way. For example, a
company might build a database about its customers that described relationships in
sufficient detail so that management, salespeople, people providing service, and
perhaps the customer directly could access information, match customer needs with
product plans and offerings, remind customers of service requirements, and know
what other products a customer had purchased, and so on.
2.2.2 How CRM is Used Today
CRM solutions provide a company with the customer business data needed to provide
services or products that customers want, provide better customer service, cross-sell
and up-sell more effectively, close deals, retain current customers and better
understand the customer.
2.2.3 The CRM Strategy
Customer relationship management is often thought of as a business strategy that
enables businesses to improve in a number of areas. The CRM strategy allows a
company to do the following:
1) Understand the customer
2) Retain customers through better customer experience
3) Attract new customers
4) Win new clients and contracts
5) Increase profitability
6) Decrease customer management costs
2.2.4 The Impact of Technology on CRM
Technology and the Internet have changed the way companies approach customer
relationship strategies. Advances in technology have changed consumer buying
behaviour, and today there are many ways for companies to communicate with
customers and to collect data about them. With each new advance in technology,
especially the proliferation of self-service channels like the Web and smartphones,
customer relationships are being managed electronically.
Many aspects of customer relationship management rely heavily on technology;
however, the strategies and processes of a good CRM system will collect, manage and
link information about the customer with the goal of letting you market and sell
services effectively.
2.2.5 The Benefits of CRM
The biggest benefit most businesses realize when moving to a CRM system comes
directly from having all the business data stored and accessed from a single location.
Before CRM systems, customer data was spread out over office productivity suite
documents, email systems, mobile phone data and even paper note cards and Rolodex
entries. Storing all the data from all departments (e.g., sales, marketing, customer
service and HR) in a central location gives management and employees immediate
access to the most recent data when they need it. Departments can collaborate with
ease, and CRM systems help organizations develop efficient automated processes to
improve business processes.
2.2.6 Data Mining and Customer Relationship Management [17]
Customer relationship management (CRM) is a process that manages the interactions
between a company and its customers. The primary users of CRM software
applications are database marketers who are looking to automate the process of
interacting with customers.
To be successful, database marketers must first identify market segments containing
customers or prospects with high-profit potential. They then build and execute
campaigns that favourably impact the behaviour of these individuals.
The first task, identifying market segments, requires significant data about prospective
customers and their buying behaviours. In theory, the more data the better. In practice,
however, massive data stores often impede marketers, who struggle to sift through the
minutiae to find the nuggets of valuable information.
Recently, marketers have added a new class of software to their targeting arsenal.
Data mining applications automate the process of searching the mountains of data to
find patterns that are good predictors of purchasing behaviours.
After mining the data, marketers must feed the results into campaign management
software that, as the name implies, manages the campaign directed at the defined
market segments.
In the past, the link between data mining and campaign management software was
mostly manual. In the worst cases, it involved "sneaker net," creating a physical file
on tape or disk, which someone then carried to another computer and loaded into the
marketing database.
This separation of the data mining and campaign management software introduces
considerable inefficiency and opens the door for human errors. Tightly integrating the
two disciplines presents an opportunity for companies to gain competitive advantage.
2.2.7 Review of Data Mining Tools in CRM [18]
Data mining uses a combination of an explicit knowledge base, sophisticated
analytical skills, and domain knowledge to uncover hidden trends and patterns. These
trends and patterns form the basis of predictive models that enable analysts to produce
new observations from existing data. There are a number of data mining tools available
in the marketplace that can give firms the cutting edge needed to achieve
profitable CRM.
Data mining tools help CRM by providing a complete framework, which covers:
To analyze the business problem.
To prepare the data requirements.
To build the suitable model with respect to business problem.
To validate and evaluate the designed model.
Model building is the next phase of the Data mining tool, which builds the various
models according to the data given in the data preparation phase. The last phase is the
evaluation of the model, so that the proper results in the form of useful patterns can be
drawn from the models built by the tools.
Data mining tools for CRM should be able to detect the necessary information
from the available data. To achieve this, data mining tools should have
characteristics such as:
User friendly environment
Efficiency of the tool
Basic task should be accomplished
Low cost of implementation
2.2.8 Data Mining Tools Applications in CRM [18]
Virtually any process from pharmacology to customer service can be studied,
understood, and improved using data mining. The top three end uses of data mining
are, not surprisingly, in the marketing area.
Figure 2.3 Data Mining Applications Useful For Companies.
(Adopted from http://www.informationweek.com/673/73iudat.htm)
Figure 2.3 shows that customer demographics are one of the most important
applications for companies. The applications of data mining tools are in:
Customer Profiling: In customer profiling, characteristics of good customers
are identified with the goals of predicting who will become one and helping
marketers target new prospects. Data mining can find patterns in a customer
database that can be applied to a prospective database so that customer
acquisition can be appropriately targeted. For example, by identifying good
candidates for mail offers or catalogues, direct-mail marketers can reduce
expenses and increase their sales.
Targeted Marketing: Targeting specific promotions to existing and potential
customers offers similar benefits.
Market-basket analysis: Market-basket analysis helps retailers understand
which products are purchased together or by an individual over time. With
data mining, retailers can determine which products to stock in which stores,
and even how to place them within a store. Data mining can also help assess
the effectiveness of promotions and coupons.
Manage customer relationship: Another common use of data mining in many
organizations is to help manage customer relationships. By determining
characteristics of customers who are likely to leave for a competitor, a
company can take action to retain that customer because doing so is usually far
less expensive than acquiring a new customer.
Fraud detection: Fraud detection is of great interest to telecommunications
firms, credit-card companies, insurance companies, stock exchanges, and
government agencies. Data mining is also used to identify and track individual
terrorists, such as through travel and immigration records.
Anticipate and prevent customer attrition: Data mining tools can help to
find the customers who are not satisfied with the firm’s services. This helps
firms to offer promotional services to groups of customers who are likely to
leave.
Mine unstructured data, such as text: Text data is typically unstructured, so
data mining tools can help to mine it and help organizations get value out of
the data.
2.3 Data Mining Applications
Data mining is a data analysis approach that has been quickly adapted and used in a
large number of domains that were already using statistics. Here is the list of areas
where data mining is widely used:
Banking/Finance
Retail Industry
Telecommunication Industry
Biological Data Analysis
Medical/Pharma
Insurance and Health Care
Other Scientific Applications
Intrusion Detection
2.3.1 BANKING/FINANCE (FINANCIAL DATA ANALYSIS)
The financial data in the banking and financial industry is generally reliable and of
high quality, which facilitates systematic data analysis and data mining. Here are a
few typical cases:
Design and construction of data warehouses for multidimensional data analysis
and data mining.
Loan payment prediction and customer credit policy analysis.
Classification and clustering of customers for targeted marketing.
Detection of money laundering and other financial crimes.
Detection of fraudulent credit card usage patterns.
Risk management related to attribution of loans using scorecards.
Find hidden correlations between different financial indicators.
Identification of stock trading rules from historical market data.
2.3.2 RETAIL/MARKETING INDUSTRY
Data mining has great application in the retail industry because the industry
collects large amounts of data on sales, customer purchasing history, goods
transportation, consumption and services. It is natural that the quantity of data
collected will continue to expand rapidly because of the increasing ease,
availability and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and
trends. That leads to improved quality of customer service and good customer
retention and satisfaction. Here is a list of examples of data mining in the retail
industry:
Design and Construction of data warehouses based on benefits of data mining.
Multidimensional analysis of sales, customers, products, time and region.
Analysis of effectiveness of sales campaigns.
Customer Retention.
Product recommendation and cross-referencing of items.
Discovery of buying behaviour patterns
Detection of associations among customer characteristics.
Prediction of the probability that clients respond to mailings.
2.3.3 TELECOMMUNICATION INDUSTRY
Today the telecommunication industry is one of the fastest-emerging industries,
providing various services such as fax, pager, cellular phone, Internet messenger,
images, e-mail and web data transmission. Due to the development of new computer
and communication technologies, the telecommunication industry is expanding
rapidly. This is why data mining has become very important in helping to
understand the business. Data mining in the telecommunication industry helps in
identifying telecommunication patterns, catching fraudulent activities, making better
use of resources and improving quality of service. Here is a list of examples in which
data mining improves telecommunication services:
Multidimensional Analysis of Telecommunication data.
Fraudulent pattern analysis.
Identification of unusual patterns.
Multidimensional association and sequential patterns analysis.
Mobile Telecommunication services.
Use of visualization tools in telecommunication data analysis.
2.3.4 BIOLOGICAL DATA ANALYSIS
Nowadays there is vast growth in fields of biology such as genomics,
proteomics, functional genomics and biomedical research. Biological data mining is
a very important part of bioinformatics. The following are the aspects in which data
mining contributes to biological data analysis:
Semantic integration of heterogeneous, distributed genomic and proteomic
databases.
Alignment, indexing, similarity search and comparative analysis of multiple
nucleotide sequences.
Discovery of structural patterns and analysis of genetic networks and protein
pathways.
Association and path analysis.
Visualization tools in genetic data analysis.
2.3.5 MEDICAL/PHARMA
Data mining plays a very important part in the medical field. With the help of data
mining, the success rate of research into new cures for rare diseases can be higher.
Below are the aspects in which data mining contributes to the medical field:
Computer Assisted Diagnosis (expert systems learning)
Characterization/prediction of patient's response to product dosage
Identification of successful medical therapies (successful prescription
patterns).
Study of relations between dosage and potentially related adverse events
2.3.6 INSURANCE AND HEALTH CARE
The following are ways in which insurance companies manage their businesses and
customers with the help of data mining:
Discovery of medical procedures that are claimed together through claims
analysis
Identification of customers that are potential buyers for new policies.
Detection of behaviour patterns capable of identifying risky customers.
Detection of fraudulent behaviour.
2.3.7 OTHER SCIENTIFIC APPLICATIONS
The applications discussed above tend to handle relatively small and homogeneous
data sets for which statistical techniques are appropriate. Huge amounts of data
have been collected from scientific domains such as geosciences and astronomy.
Large data sets are also being generated by fast numerical
simulations in various fields such as climate and ecosystem modelling, chemical
engineering and fluid dynamics. The following are applications of data mining in the
field of scientific applications:
Data Warehouses and data pre-processing.
Graph-based mining.
Visualization and domain specific knowledge.
2.3.8 INTRUSION DETECTION
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or
availability of network resources. In this world of connectivity, security has become
a major issue. The increased usage of the internet and the availability of tools and tricks
for intruding into and attacking networks have prompted intrusion detection to become a
critical component of network administration. Here is the list of areas in which data mining
technology may be applied for intrusion detection:
Development of data mining algorithm for intrusion detection.
Association and correlation analysis, aggregation to help select and build
discriminating attributes.
Analysis of Stream data.
Distributed data mining.
Visualization and query tools.
2.4 Data Mining Systems [13]
There is a large variety of data mining systems available. A data mining system may
integrate techniques from the following:
Spatial Data Analysis
Information Retrieval
Pattern Recognition
Image Analysis
Signal Processing
Computer Graphics
Web Technology
Business
Bioinformatics
2.4.1 Data Mining System Classification [12]
The data mining system can be classified according to the following criteria:
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines
Figure 2.4 Data Mining System Classification.
(Adopted from http://www.tutorialspoint.com/data_mining/dm_systems.htm)
2.4.2 Data Mining System Products [13]
There are many data mining system products and domain-specific data mining
applications available. New data mining systems and applications are being
added to the previous systems, and efforts are being made towards the
standardization of data mining languages.
2.4.3 Choosing Data Mining System
The choice of data mining system depends on the following features of the data
mining system:
Data Types - The data mining system may handle formatted text, record-based data
and relational data. The data could also be in ASCII text, relational database data or
data warehouse data. Therefore we should check exactly which formats the data mining
system can handle.
System Issues - We must consider the compatibility of the data mining system with
different operating systems. A data mining system may run on only one
operating system or on several. There are also data mining systems that provide
web-based user interfaces and allow XML data as input.
Data Sources - Data sources refers to the data formats on which the data mining system
will operate. Some data mining systems may work only on ASCII text files, while others
work on multiple relational sources. The data mining system should also support ODBC
connections or OLE DB for ODBC connections.
Data Mining functions and methodologies - Some data mining systems provide only
one data mining function, such as classification, while others provide
multiple data mining functions such as concept description, discovery-driven OLAP
analysis, association mining, linkage analysis, statistical analysis, classification,
prediction, clustering, outlier analysis, similarity search etc.
Coupling data mining with databases or data warehouse systems - A data mining
system needs to be coupled with a database or data warehouse system. The coupled
components are integrated into a uniform information processing environment. The
types of coupling are listed below:
o No coupling
o Loose Coupling
o Semi tight Coupling
o Tight Coupling
Scalability - There are two scalability issues in data mining, as follows:
o Row (Database size) Scalability - A data mining system is considered row scalable
when, if the number of rows is enlarged 10 times, it takes no more than 10 times as long
to execute the query.
o Column (Dimension) Scalability - A data mining system is considered column
scalable if the mining query execution time increases linearly with the number of
columns.
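The two scalability definitions can be checked against measured query times. The following is a minimal sketch in Python; the function names, the tolerance and the timing figures are hypothetical and are not taken from any data mining product.

```python
def is_row_scalable(base_time, scaled_time, factor=10):
    """Row scalable: enlarging the rows `factor` times costs at most `factor` times the runtime."""
    return scaled_time <= factor * base_time


def is_column_scalable(column_counts, times, tolerance=0.1):
    """Column scalable: runtime grows roughly linearly with the number of columns.

    Fits a least-squares line through the (columns, time) points and accepts the
    timings if every point lies within `tolerance` (relative) of the fitted line.
    """
    n = len(column_counts)
    mean_x = sum(column_counts) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in column_counts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(column_counts, times))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return all(
        abs(y - (slope * x + intercept)) <= tolerance * y
        for x, y in zip(column_counts, times)
    )


# Hypothetical timings in seconds.
print(is_row_scalable(base_time=2.0, scaled_time=18.0))        # 10x rows, 9x time
print(is_column_scalable([10, 20, 40], [1.0, 2.0, 4.0]))       # linear growth
print(is_column_scalable([10, 20, 40], [1.0, 4.0, 16.0]))      # quadratic growth
```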
Visualization Tools - Visualization in Data mining can be categorized as follows:
o Data Visualization
o Mining Results Visualization
o Mining process visualization
o Visual data mining
Data Mining query language and graphical user interface - A graphical user
interface that is easy to use is required to promote user-guided, interactive data
mining. Unlike relational database systems, data mining systems do not share
an underlying data mining query language.
2.4.4 Trends in Data Mining [25]
Here is the list of trends in data mining that reflect the pursuit of challenges such as
the construction of integrated and interactive data mining environments and the design
of data mining languages:
Application Exploration
Scalable and Interactive data mining methods
Integration of data mining with database systems, data warehouse systems and web
database systems.
Standardization of data mining query language
Visual Data Mining
New methods for mining complex types of data
Biological data mining
Data mining and software engineering
Web mining
Distributed Data mining
Real time data mining
Multi Database data mining
Privacy protection and Information Security in data mining
2.5 Data Mining Process Model [23]
CRISP-DM stands for Cross-Industry Standard Process for Data Mining. The CRISP-DM
methodology provides a structured approach to planning a data mining project. It is a
robust and well-proven methodology.
2.5.1 Overview of Data Mining Life Cycle
Figure 2.5 Data Mining Process Model.
(Adopted from http://www.rithme.eu/?m=resources&p=dmmethod&lang=en)
Starting from the knowledge discovery processes used in early data mining projects,
CRISP-DM defined and validated a data mining process that is applicable in
any industry sector. This methodology should make large data mining projects faster,
cheaper, more reliable and more manageable. However, even small-scale data mining
investigations can benefit from using it.
This process model provides a simple overview of the life cycle of a data mining
project. The phases of a data mining project are clearly identified through
tasks and the relationships between these tasks. Even where the model does not
show them, relationships may exist between any of the data mining tasks, depending
mainly on the analysis goals and on the data to be analysed.
Six main phases can be distinguished in this process model:
Business understanding - concerns the definition of the data mining problem based
on the business objectives.
Data understanding - this phase aims at getting a precise idea about data available,
identifying possible data quality issues, etc.
Data preparation - covers all activities meant to build the dataset to analyse from the
initial raw data. This includes cleaning, feature selection, sampling, etc.
Modeling - is the phase where several data mining techniques are parameterized and
tested with the objective of optimizing the obtained data model or knowledge.
Evaluation - aims at verifying that the obtained model properly answers the initially
formulated business objectives and contributes to deciding whether the model will be
deployed or, on the contrary, will be rebuilt.
Deployment - is the final step of the cyclic data mining process model. Its target is to
take the obtained knowledge, put it in a convenient form and integrate it in the
business decision process. Depending on the objectives, it can range from generating a
report describing the obtained knowledge to creating a specific application that will use
the obtained model to predict unknown values of a desired parameter.
Chapter 3
3.0 ANALYSIS
3.1 Data Mining for Shopping Centres
With the majority of large retailers offering a loyalty card scheme, the collection of
customer data is now routine commercial practice. Whilst loyalty schemes were
originally introduced to reward loyal customers and to encourage them to increase
their overall spend, retailers have been finding more and more sophisticated ways to
use customer data to their advantage.
Due to high competition in the business field, it is essential to consider the customer
relationship management of the shopping centre. Here we analyse the massive volume of
customer data and classify customers based on their behaviour and on prediction.
Customer relationship management is mainly used in sales forecasting and banking
areas. Data mining provides the technology to analyse large volumes of data and detect
hidden patterns, converting raw data into valuable information.
This work analyses DM techniques in Weka workbench, and reports the simulation
results of applying four DM techniques and classifiers in the open source workbench
to the Customer Relationship Management (CRM) for a shopping centre.
We propose that data mining techniques be used to aid the salespeople and
management of the shopping centre in effective decision making.
This approach was applied to 100 pre-processed records. Simulation results show that
the large volume of historical customer data can play a value-added role in
shopping centre development, in that the mined data helps the centre to study
customer behaviour so that personalized services can be provided.
Our aim is to demonstrate the possibilities and draw attention to the possible
implications of improving customer satisfaction. The objectives of this work could
include increasing rental incomes and bringing new life back into the shopping centre.
3.1.1 Free Sample Data for Testing Purpose
Figure 3.1 Sample Data (CSV format).
Above is the sample data for testing purposes. The sample consists of 100 pre-processed
customer records. Included fields are:
Sex
Age
Channel
Transportation
All files are provided as CSV (comma-delimited).
Sex is the customer's gender, and the Age values are random. Channel is the way the
customer makes payment, by credit card or cash. Transportation is how the customer
travels to their destination.
Figure 3.2 Sample Data (Notepad format).
3.1.2 Related Work
Before performing data mining, processes such as data preparation and data cleaning
need to be carried out. Incomplete data were found in some of the records, meaning that
some records lack attribute values, so data preparation is needed. Noisy
data contains errors, and inconsistent data contains discrepancies in codes or names.
Data preparation also needs to select only the wanted fields from each table in order to
perform the data mining. Data pre-processing techniques such as data cleaning and data
reduction were applied for conversion. The data cleaning procedure cleans the
data by filling in missing values, smoothing noisy data, identifying or removing
outliers and resolving inconsistencies. Additional data cleaning can be performed to
detect and remove redundancies that still occur in the results obtained after data
integration. Data reduction produces a reduced representation of the data set that is
much smaller in volume yet should produce the same result.
Figure 3.3 Block Diagram.
The customer data may contain certain attributes that take larger values than others.
If such attributes are left unnormalized, we need to normalize them. Furthermore, it
would be useful for analysis to obtain aggregate information. Data transformation
operations, such as normalization and aggregation, are additional data pre-processing
procedures that contribute toward the success of the mining process.
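As an illustration of the normalization step, the sketch below rescales a numeric attribute to the range 0 to 1. Min-max scaling is assumed here as one common choice; the text does not say which normalization was used, and the age values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread, so map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical ages; after scaling they are comparable with other 0..1 attributes.
ages = [18, 25, 40, 66, 72]
normalized_ages = min_max_normalize(ages)
print(normalized_ages)
```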
Evaluation criteria: A rich set is available in Weka.
Only the following seven criteria are used:
Correctly Classified
Incorrectly Classified
Kappa Statistic
Mean Absolute Error
Root Mean Squared Error
Relative Absolute Error
Root Relative Squared Error
We will show the results of the above evaluation criteria applied to two scenarios
based on the customer data records maintained by the shopping centre.
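To make the criteria concrete, the sketch below computes four of them from first principles: the fraction correctly classified, the kappa statistic, the mean absolute error and the root mean squared error. It is a simplified illustration under the textbook definitions, not Weka's own implementation; the two relative errors are left out, and all labels and probability estimates are hypothetical.

```python
import math
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of instances whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def kappa(actual, predicted):
    """Cohen's kappa: agreement beyond what the class frequencies give by chance."""
    n = len(actual)
    observed = accuracy(actual, predicted)
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(actual_counts[c] * predicted_counts[c] for c in actual_counts) / (n * n)
    if expected == 1:
        return 0.0
    return (observed - expected) / (1 - expected)


def mean_absolute_error(targets, estimates):
    return sum(abs(t - e) for t, e in zip(targets, estimates)) / len(targets)


def root_mean_squared_error(targets, estimates):
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(targets, estimates)) / len(targets))


# A classifier that always predicts the majority class of a 60/40 split
# is right 60% of the time, yet its kappa is 0.
actual = ["credit"] * 60 + ["cash"] * 40
majority = ["credit"] * 100
print(accuracy(actual, majority), kappa(actual, majority))

# Error measures compare 0/1 targets with hypothetical probability estimates.
print(mean_absolute_error([1, 0, 1], [0.8, 0.3, 0.6]))
print(root_mean_squared_error([1, 0, 1], [0.8, 0.3, 0.6]))
```

The majority-class case shows why kappa is reported next to accuracy: a high share of correct classifications can still mean no agreement beyond chance.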
3.1.3 Methods
Four DM algorithms were tested, as follows:
Naïve Bayes Algorithm: Naive Bayes is a well-known method in machine learning. It
is a simple and efficient learning method. The Naive Bayes classifier is an
approximation to an ideal Bayesian classifier which would classify an example
based on the probability of each class given the example’s feature variables.
The main assumption is that the different features are independent of each
other given the class of the example.
Decision Table: A decision table is based on logical relationships, just like a
truth table. It is a tool that helps us to look at combinations of conditions and
to check them for completeness and inconsistency.
Decision Tree (J48): J48 attempts to account for noise and missing data. It
also deals with numeric attributes by determining where thresholds for
decision splits should be placed. The main parameters that can be set for this
algorithm are the confidence threshold, the minimum number of instances per
leaf and the number of folds for reduced error pruning.
Association: This technique finds groups of items that tend to occur together
in a transaction by searching for relationships between variables. For example, a
supermarket might gather data on customer purchasing habits. Using
association rule learning, the supermarket can determine which products are
frequently bought together and use this information for marketing purposes.
This is sometimes referred to as market basket analysis. We also identified and
performed an association rule mining task. This involves:
(1) Finding rules, including appropriate parameter setting,
(2) Determining which of the resulting rules are interesting,
(3) Figuring out how the interesting rules could be useful.
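Whether a rule is interesting is usually judged with measures such as support, confidence and lift. The sketch below computes them for a single rule; it assumes these three standard measures, and the baskets and item names are hypothetical rather than taken from the project data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / antecedent_support


def lift(transactions, antecedent, consequent):
    """Confidence relative to how often the consequent occurs anyway; above 1 suggests a positive association."""
    consequent_support = support(transactions, consequent)
    if consequent_support == 0:
        return 0.0
    return confidence(transactions, antecedent, consequent) / consequent_support


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread"},
    {"milk"},
    {"butter"},
]
print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
print(lift(baskets, {"bread"}, {"milk"}))
```

Minimum support and minimum confidence are the parameters typically tuned in step (1); rules that pass both thresholds are then screened in step (2).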
3.1.4 Result and Discussion
This section provides the simulation results produced by Weka. As noted earlier, three
types of classifiers are selected under the “Classification” technique, namely the Naïve
Bayes algorithm, the Decision Table algorithm and the J48 algorithm (Decision Tree),
together with Association Rules.
Naïve Bayes: Fig. 3.4 shows the output of the Naïve Bayes algorithm that is used to
analyze the data.
Figure 3.4 Results returned by the Naïve Bayes classifier.
Fig. 3.4 shows the result of the analysis for “transportation” based on Naïve
Bayes. The result reveals that both male and female customers prefer to use
private transport when travelling to the shopping centre.
Decision Table: Fig. 3.5 shows the output for the case study, which uses 100 training
instances and produces 1 rule; non-matches are covered by the majority class.
Figure 3.5 The decision table of data analysis.
Decision Tree (J48): Fig. 3.6 shows the output produced by the J48 algorithm.
Figure 3.6 J48 pruned tree of “sex” analysis.
The software listed all the possible rules of the decision.
Below are some of the simulation results:
If Sex = female and transportation = private then cash
If Sex = female and transportation = public and age less than or equal to 66 then credit
card
If Sex = female and transportation = public and age greater than 66 then cash
If Sex = male then credit card
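The four rules listed above can be written as a small function that returns the predicted payment channel. This is only a restatement of the listed rules; the function name and the lower-case string values are assumptions made for illustration.

```python
def predict_channel(sex, transportation, age):
    """Apply the listed J48 rules and return the predicted payment channel."""
    if sex == "male":
        return "credit card"
    # Remaining cases are female customers.
    if transportation == "private":
        return "cash"
    # Public transport: the split point reported by the tree is age 66.
    return "credit card" if age <= 66 else "cash"


print(predict_channel("female", "private", 30))
print(predict_channel("female", "public", 66))
print(predict_channel("female", "public", 67))
print(predict_channel("male", "public", 80))
```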
Association: Fig. 3.7 shows the results of selecting the “Apriori” algorithm under
“Associate Rules”. The algorithm provides many rules, but only a few are useful for
effective decision making. It cannot generate the best rules because the data is insufficient.
Figure 3.7 “Associate Rules”
To make sure the “Apriori” algorithm of “Associate Rules” works well, some
new fields have been added to the sample data: relationship, region, brand and race.
Age has been removed because the “Apriori” algorithm does not support numeric data.
Figure 3.8 Sample Data (CSV format).
Figure 3.9 Sample Data (Notepad format).
Figure 3.10 “Associate Rules”.
Below is the result of the simulation:
Channel=Cash Transportation=Private Race=Chinese ==> Sex=Female
3.1.5 Comparison between Naïve Bayes (NB), Decision Table (DT) and Decision
Tree (J48)
Table 3.1 shows the comparison results of Naïve Bayes (NB), Decision Table (DT) and
J48. Overall, J48 gives slightly better results than DT and NB under cross-validation,
since it misclassifies the fewest instances (41, against 42 for NB and 44 for DT).
                               Use Training Set   Cross Validation   Percentage Split
Naïve Bayes (NB)
Correctly Classified           60                 58                 19
Incorrectly Classified         40                 42                 15
Kappa Statistic                0.0909             0.0455             0.0449
Mean Absolute Error            0.4562             0.4671             0.4831
Root Mean Squared Error        0.4777             0.4897             0.5114
Relative Absolute Error        94.97%             97.23%             99.37%
Root Relative Squared Error    97.52%             99.97%             102.28%
Decision Table (DT)
Correctly Classified           60                 56                 19
Incorrectly Classified         40                 44                 15
Kappa Statistic                0                  -0.0577            0
Mean Absolute Error            0.4812             0.4855             0.4868
Root Mean Squared Error        0.4899             0.4963             0.4994
Relative Absolute Error        100.17%            101.05%            100.14%
Root Relative Squared Error    100.01%            101.31%            99.87%
Decision Tree (J48)
Correctly Classified           60                 59                 19
Incorrectly Classified         40                 41                 15
Kappa Statistic                0                  -0.0199            0
Mean Absolute Error            0.48               0.4827             0.4857
Root Mean Squared Error        0.4899             0.4954             0.5004
Relative Absolute Error        99.9184%           100.4763%          99.9137%
Root Relative Squared Error    99.9992%           101.1204%          100.0864%
TABLE 3.1 COMPARISON BETWEEN NB, DT, J48 BY DATA NUMERIC.
3.1.6 Comparison between classifiers with time taken to build a model
The results in Table 3.2 show that J48 has the highest number of correctly classified
instances, followed by Naïve Bayes and lastly the Decision Table algorithm. The Decision
Table takes the longest time to build a model (0.03 seconds), while Naïve Bayes and J48
both report 0 seconds.
Algorithm                         Naïve Bayes   J48   Decision Table
Correctly classified instances    58            59    56
Time taken to build (seconds)     0             0     0.03
TABLE 3.2 COMPARISON BETWEEN CLASSIFIERS WITH TIME TAKEN TO
BUILD A MODEL.
3.2 Association Rules Apriori Algorithm [24]
3.2.1 Apriori Algorithm
The Apriori algorithm mines for associations among items in a large database of sales
transactions. It is an important database mining function. For example, it can reveal
that a customer who purchases a keyboard also tends to buy a mouse at the same time.
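The level-wise search that Apriori performs can be sketched in a few lines. The code below is a simplified illustration of frequent itemset mining only (it does not generate rules), is not the Weka implementation, and uses hypothetical baskets.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset meeting `min_support`."""
    transactions = [frozenset(t) for t in transactions]
    min_count = min_support * len(transactions)

    def count(candidates):
        # One pass over the database per level to count each candidate.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    items = {frozenset([item]) for t in transactions for item in t}
    frequent = count(items)
    result = dict(frequent)
    k = 2
    while frequent:
        previous = set(frequent)
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        }
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result


# Hypothetical sales transactions.
sales = [
    {"keyboard", "mouse"},
    {"keyboard", "mouse", "monitor"},
    {"keyboard", "monitor"},
    {"mouse"},
    {"keyboard", "mouse"},
]
frequent_itemsets = apriori(sales, min_support=0.4)
for itemset, n in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The prune step relies on the property that gives the algorithm its name: an itemset can only be frequent if all of its subsets are frequent.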
3.2.2 Limitations of Apriori Algorithm
The Apriori algorithm is simple and easy to execute, but it has some limitations. The main
limitation is the cost of handling a huge number of candidate sets when there are many
frequent itemsets, a low minimum support or large itemsets. For example, if there are 10^4
frequent 1-itemsets, it needs to generate more than 10^7 candidate 2-itemsets and
accumulate and test their occurrence frequencies. Moreover, to discover a frequent
pattern of size 100, for example {v1, v2, v3, ..., v100}, it must generate about 2^100
candidate itemsets in total, which makes candidate generation costly and time-consuming.
It also has to scan the database repeatedly and check a large set of candidates by pattern
matching. The Apriori algorithm is therefore very inefficient when memory capacity is
limited and the number of transactions is large.
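The two figures quoted above follow from simple counting, as this short calculation shows. It only evaluates the standard combinatorial formulas: pairs drawn from 10^4 items, and the non-empty subsets of a 100-item pattern.

```python
from math import comb

# Candidate 2-itemsets are all unordered pairs of the frequent 1-itemsets.
frequent_1_itemsets = 10 ** 4
candidate_2_itemsets = comb(frequent_1_itemsets, 2)
print(candidate_2_itemsets)            # 49995000, which is more than 10^7

# Every non-empty subset of a frequent 100-item pattern is itself frequent,
# so each of them has to be generated as a candidate at some level.
pattern_length = 100
candidates_for_pattern = 2 ** pattern_length - 1
print(f"{candidates_for_pattern:.3e}")
```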
Chapter 4
4.0 DESIGN
4.1 Data Mining Process
The data mining process has 8 steps.
1. Translate the business problem into a data mining problem.
2. Select appropriate data.
3. Analyze the data.
4. Create a model set.
5. Fix problems with the data.
6. Transform data.
7. Build models.
8. Deploy models.
Figure 4.1 Data Mining is not a linear process.
As shown in Figure 4.1, the data mining process is best considered as a set of
nested loops instead of a straight line. The steps do have their order, but it is
not necessary to completely finish one step before moving on to the following
step. After a later step is done, the process may revisit a previous step.
4.1.1 Step One: Translate the business problem into a data mining problem
The first step is to explore the available data and make a list of candidate business
problems. A well-defined business problem gives the data mining project a proper
destination and leads to a solution. The data mining goals for a particular project should
be specific rather than broad and general, which makes it easier to monitor
progress in achieving them. Examples of specific goals:
Identify customers who are unlikely to renew their subscriptions.
Forecast customer population in future months.
List products whose sales are at risk if we discontinue wine and beer sales.
4.1.2 Step Two: Select appropriate data
Data mining requires data. Ideally the data already resides in a
corporate data warehouse, where it is cleansed, available, historically accurate, and
frequently updated. The data sources that are useful and valuable vary from problem to
problem and industry to industry. A few samples of useful data:
Point of sale data (coupons, discount)
Credit card charge records
Direct mail response records
4.1.3 Step Three: Analyze the data
A good first step is to examine the dataset and understand the data file from a new source.
Data visualization is the best way to get to know the data.
Figure 4.2 Sample data in ARFF Viewer.
Figure 4.3 Data Visualize.
Figure 4.4 Visualization of data by age and sex.
4.1.4 Step Four: Create a Model Set for Prediction
Creating a model set for prediction requires assembling data from different sources.
When making a prediction, the predictive model uses data from the past, finding
patterns to make predictions about the future. Time can always be divided into three
periods: the past, present, and future.
Figure 4.5 Data from the past mimics data from the past, present, and future.
4.1.5 Step Five: Fix Problem with the Data
Figure 4.6 Sample data.
Variables such as address, post, telephone number and email are useful information, but
not all data mining algorithms can handle them. So we have to fix the data by replacing
them with other attributes.
4.1.6 Step Six: Transform Data to Bring Information to the Surface
Once all the steps above have been done, it is time to bring the information to the
surface by adding derived fields, combining multiple variables, and creating ratios and
logarithms. Different people spend different amounts of money on a product: some of
them buy more and some buy less. So it is wiser to convert the
money values to proportions of their spending.
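A short sketch of this transformation is given below. It converts each customer's spending per category into a share of that customer's total, so that a small spender and a big spender with the same habits look alike; the category names and amounts are hypothetical.

```python
def spending_proportions(spend_by_category):
    """Convert absolute spending per category into shares of the customer's total."""
    total = sum(spend_by_category.values())
    if total == 0:
        return {category: 0.0 for category in spend_by_category}
    return {category: amount / total for category, amount in spend_by_category.items()}


# Hypothetical customers: B spends ten times more than A, in the same pattern.
customer_a = {"groceries": 50.0, "clothing": 30.0, "electronics": 20.0}
customer_b = {"groceries": 500.0, "clothing": 300.0, "electronics": 200.0}

print(spending_proportions(customer_a))
print(spending_proportions(customer_b))
```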
4.1.7 Step Seven: Build Models
Below is a sample model based on the sample data used in Chapter 3.
Figure 4.7 Data Mining Model.
The diagram illustrates the flow of data when a mining structure is processed, and
when a mining model is processed.
4.1.8 Step Eight: Deploy Models
Deploying a model means moving it from the data mining environment to the scoring
environment. Once a model has been created, the model can then be used to make
predictions for new data. The model would be built by using historical customer data.
This process is illustrated below:
Figure 4.8 Data Mining Process Model.
The process of making predictions for data is called “scoring”. The process of using the
model is different from the process that creates the model. A model is used multiple times
after it is created to score different databases. For example, it can be used to predict
the probability that a customer will purchase an item during the wholesale.
Figure 4.9 Data Mining Scoring Process Model.
In the end, it generates a prediction number between 0 and 1 as the output, also
known as the “score”.
Figure 4.10 “Scoring” Prediction.
Chapter 5
5.0 IMPLEMENTATION
5.1 Data Mining Process
The data mining process has 8 steps.
1. Translate the business problem into a data mining problem.
2. Select appropriate data.
3. Analyse the data.
4. Create a model set.
5. Fix problems with the data.
6. Transform data.
7. Build models.
8. Deploy models.
5.1.1 Translate the business problem into a data mining problem
Example Scenario
A shopping centre wants to know about its sales for the past 5 months, so that it can
forecast and achieve its target sales for future months. Below are the specific goals:
List products whose sales are at risk if we discontinue beer sales.
Identify which products should be promoted in future months.
5.1.2 Select appropriate data
Data Cleaning Process (Before)
Figure 5.1 Appropriate data.
Above is a CSV file that contains 1000 user/customer profiles for testing purposes.
The data contain errors and inconsistencies, and some records lack attribute
values. A data cleaning procedure is needed to clean the data before testing by filling
in the missing values, smoothing noisy data, identifying or removing outliers and
resolving inconsistencies in the data.
Included fields are:
userID
smoker
drink_level
dress_preference
ambience
transport
marital_status
hijos
birth_year
interest
personality
religion
activity
color
weight
budget
height
Upayment
Fcuisine
5.1.3 Analyse the data
This step examines the dataset to understand the data file from a new source, using
the Weka ARFF Viewer and the Weka Explorer Visualize tab.
Figure 5.2 Analyse data in ARFF Viewer.
Figure 5.3 Data Visualize.
Figure 5.4 Visualization of data by smoker and drink_level.
5.1.4 Create a model set
Amount of drinks sold for the past 5 months (x-axis: Month; y-axis: Amount of drinks)
Month          1     2     3     4     5
alcohol        91    106   79    70    118
non_alcohol    82    43    69    69    44
juice          27    51    61    61    38
Figure 5.5 Prediction Model
A model set was created for prediction from the amount of drinks sold over the past 5
months in the data set. When making a prediction, the predictive model uses
data from the past, finding patterns to make predictions about the future. From the
model set, we found that alcoholic drinks had the highest sales throughout the 5-month
period. Thus, we should not discontinue beer sales. We can run promotions for
non_alcohol and juice during the 3rd and 4th months to boost their sales.
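The observation about alcohol sales can be reproduced from the monthly figures in Figure 5.5. The sketch below totals each category and finds the best seller per month; the variable names are chosen for illustration only.

```python
# Monthly units sold, months 1 to 5, as given in Figure 5.5.
drinks_sold = {
    "alcohol": [91, 106, 79, 70, 118],
    "non_alcohol": [82, 43, 69, 69, 44],
    "juice": [27, 51, 61, 61, 38],
}

# Total over the five months for each category.
totals = {name: sum(monthly) for name, monthly in drinks_sold.items()}
best_seller = max(totals, key=totals.get)

# Category with the highest sales in each individual month.
monthly_leader = [
    max(drinks_sold, key=lambda name: drinks_sold[name][month])
    for month in range(5)
]

print(totals)          # {'alcohol': 464, 'non_alcohol': 307, 'juice': 238}
print(best_seller)     # alcohol
print(monthly_leader)  # alcohol leads in every month
```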
5.1.5 Fix problems with the data
Data Cleaning Process (After)
Figure 5.6 Fixed dataset.
The figure above shows the fixed dataset after the data cleaning process. Variables such as
address, post, telephone number and email are useful information, but not all of them
can be handled by the data mining algorithms of this project. So we have to choose the
attributes that can be used in Associate Rules and fix the data by replacing the rest with
other attributes.
5.1.6 Transform data
Figure 5.7 Transformed data.
Comparing Figure 5.7 with the previous Figure 5.6, there are some changes to the
“income” attribute. Associate Rules are unable to read numeric data, so we have to
convert the numeric data into nominal data, using low, medium or high
instead of numbers as the “income” attribute values.
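The conversion from numeric to nominal values can be sketched as a simple binning function. The cut-off values below are hypothetical, since the text does not state where the boundaries between low, medium and high were drawn.

```python
def discretize_income(income, low_cutoff=2000, high_cutoff=5000):
    """Map a numeric income to a nominal label (cut-offs are illustrative assumptions)."""
    if income < low_cutoff:
        return "low"
    if income < high_cutoff:
        return "medium"
    return "high"


incomes = [1500, 2000, 4999, 5000, 12000]
income_labels = [discretize_income(value) for value in incomes]
print(income_labels)
```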
5.1.7 Build models
Figure 5.8 Data Mining Model.
The diagram illustrates the flow of data when a mining structure is processed, and
when a mining model is processed. The data is filtered into three models.
To build a model, we can use parameters to adjust the algorithm and apply filters to the
dataset, creating different results. The mining model object contains summaries and
patterns that can be used for prediction. Below are the figures for the three models:
Model 1 (chart; axis labels: drinks, religion = non_muslim)
alcohol 374, non_alcohol 155, juice 107
Figure 5.9 Model 1
Model 2 (chart; axis labels: shopping_cart, food_preference = non_halal)
beef 104, chicken 96, pork 84
Figure 5.10 Model 2
Model 3 (chart; axis labels: payment, 1000 customers)
cash 349, credit card 618, debit card 33
Figure 5.11 Model 3
5.1.8 Deploy model
The last step in the data mining process is to deploy the models that performed the
best to a production environment.
Use the models to create predictions, which you can then use to make business
decisions.
Update the models dynamically, as more data comes into the organization.
5.2 Apriori Algorithm Source Code
Figure 5.12 Original Code Part 1
Figure 5.13 Modified Code Part 1
Figure 5.14 Original Code Part 2
Figure 5.15 Modified Code Part 2
5.3 Import dataset into WEKA
The same dataset was imported into Weka to compare the results of the original Apriori
algorithm and the modified Apriori algorithm. The figures below show the results:
Original
Figure 5.16 Result of original code.
The number of association rules generated is 10. The total time is 47 ms.
Modified
Figure 5.17 Result of modified code.
The number of association rules generated is 10. The total time is 44 ms. The runtime of
the Apriori algorithm has been improved.
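The size of the improvement can be worked out from the two reported timings. The calculation below uses only the 47 ms and 44 ms figures given above.

```python
original_ms = 47   # total time reported for the original code
modified_ms = 44   # total time reported for the modified code

saved_ms = original_ms - modified_ms
improvement_pct = 100 * saved_ms / original_ms

print(f"Saved {saved_ms} ms, a reduction of about {improvement_pct:.1f}%")
```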
Chapter 6
6.0 CONCLUSION
6.1 Progress and Outcome
The first phase of this project has been completed successfully. First, the
problem was identified and a list of objectives to be achieved was set up. Then the
research stage started. This stage involved conducting a literature review
as well as a review of data mining techniques. Furthermore, research on Customer
Relationship Management (CRM) in data mining and on data mining application areas
was performed.
The analysis section began by analysing the current situation of the shopping
centre and the way people spend their money in the past, present and future. The
most important part, however, was testing several data mining algorithms and
comparing them to find which method is the best. Finally, the design section started by
designing a data mining process and a model set.
The next section after design is implementation. The implementation
began with the data mining process (methodology). After that, model sets were built
for prediction. Then the modified code was compiled using Apache Ant so that it could
be used by WEKA. Lastly, the best rules were generated by importing the dataset into the
WEKA software, and the run times of the original code and the modified code were
compared in different WEKA builds. The results show that the run time improved with
the modified code.
6.2 Problems Encountered
Difficulty with using data mining tool, WEKA.
Difficulty with obtaining datasets.
Lack of experience in designing a model.
Lack of Internet resources.
Difficulty in modifying source code.
Time limitation as several courses requirements were due at the same time.
6.3 Future Planning
There is a new data mining software package named SPMF. SPMF is an open-source data
mining library written in Java, specialized in pattern mining. It offers implementations
of 86 data mining algorithms for sequential pattern mining, association rule mining, itemset
mining, sequential rule mining and clustering. I hope to do some research on this software
and compare it with my current project in the future.
REFERENCES
Online Research
1) Data Mining: What is Data Mining
http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/
palace/datamining.htm Date extracted: 24/6/2014
2) Definition of Data Mining
http://www.laits.utexas.edu/~anorman/BUS.FOR/course.mat/Alex/ Date
extracted: 24/6/2014
3) Investopedia explains “Data Mining”
http://www.investopedia.com/terms/d/datamining.asp Date extracted:
24/6/2014
4) Oracle Data Mining Concepts
http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/process.htm#CHD
FGCIJ Date extracted: 30/6/2014
5) Resources of Data Mining http://www.rithme.eu/?
m=resources&p=resources&lang=en Date extracted: 30/6/2014
6) Data Mining Techniques
http://www.ibm.com/developerworks/opensource/library/ba-data-mining-
techniques/index.html Date extracted: 6/7/2014
7) Carry Out Data Mining and Machine Learning with Weka
http://www.opensourceforu.com/2014/03/carry-data-mining-machine-learning-
weka/ Date extracted: 10/7/2014
8) An Introduction to Data Mining
http://www.thearling.com/text/dmwhite/dmwhite.htm Date extracted:
27/6/2014
9) How Business Can Benefit from Data Mining
http://www.tmcnet.com/topics/articles/2013/03/21/331429-how-businesses-
benefit-from-data-mining.htm Date extracted: 27/6/2014
10) An Overview of Data Mining Techniques
http://www.thearling.com/text/dmtechniques/dmtechniques.htm Date
extracted:
11) Data Mining Techniques
http://www.uta.edu/faculty/sawasthi/Statistics/stdatmin.html#index Date
extracted: 7/7/2014
12) Data Mining Classification
http://www.tutorialspoint.com/data_mining/dm_classification_prediction.htm
Date extracted: 17/7/2014
13) Data Mining System
http://www.tutorialspoint.com/data_mining/dm_systems.htm Date extracted:
14) Data Mining Process Model http://www.rithme.eu/?
m=resources&p=dmmethod&lang=en Date extracted:
15) CRM – Customer Relationship Management
http://www.webopedia.com/TERM/C/CRM.html Date extracted: 1/8/2014
16) What is CRM? http://searchcrm.techtarget.com/definition/CRM, posted by
Margaret Rouse. Date extracted: 11/8/2014
17) Data Mining and Customer Relationships
http://www.thearling.com/text/whexcerpt/whexcerpt.htm, by Kurt Thearling.
Date extracted: 11/8/2014
18) A Review of Data Mining Tools in Customer Relationship Management
http://www.tlainc.com/articl149.htm, Journal of Knowledge Management
Practice, Vol. 9, No. 1, March 2008 - Jayanthi Ranjan, Institute of
Management Technology, Ghaziabad, Vishal Bhatnagar, Indraprastha
University, Delhi. Date extracted: 19/8/2014
19) Data Mining for Shopping Centres – Customer Knowledge Management
Framework
http://bura.brunel.ac.uk/bitstream/2438/1471/1/KMSCBasedOnChapshortV5.pdf
Date extracted: 30/8/2014
20) Customer Classification And Prediction Based On Data Mining Technique
http://www.ijetae.com/files/Volume2Issue12/IJETAE_1212_58.pdf
Date extracted: 14/8/2014
21) Data Mining Techniques: For Marketing, Sales, and Customer Relationship
Management
http://books.google.com.my/books?id=AyQfVTDJypUC&pg=PA162&lpg=PA162&dq=Membership+Supermarket%27s+Customer+in+data+mining&source=bl&ots=KWFyqsQYyK&sig=UyhkDWZ2kHDBx-XVtW9nx5SnTIo&hl=en&sa=X&ei=cZ_8U5_2KoWE8gW9_4CADA&redir_esc=y#v=onepage&q=Membership%20Supermarket's%20Customer%20in%20data%20mining&f=false
Date extracted: 13/8/2014
22) How Do Supermarkets Use Your Data?
http://www.select-statistics.co.uk/article/blog-post/how-do-supermarkets-use-your-data
Date extracted: 29/8/2014
23) What is the CRISP-DM methodology?
http://www.sv-europe.com/crisp-dm-methodology/ Date extracted:21/8/2014
24) Association Rules Apriori Algorithm
https://fenix.tecnico.ulisboa.pt/downloadFile/3779571250083/licao_9.pdf
Date extracted: 29/9/2014
25) Data Mining – Applications & Trends
http://www.tutorialspoint.com/data_mining/dm_applications_trends.htm Date
extracted: 10/7/2014
26) GitHub
https://github.com/jashmenn/apriori
Date extracted: 12/1/2015
27) Association Mining with Weka
http://facweb.cs.depaul.edu/mobasher/classes/ect584/weka/associate.html
Date extracted: 12/1/2015
28) Association Mining with Weka
http://facweb.cs.depaul.edu/mobasher/classes/ect584/weka/associate.html
Date extracted: 20/1/2015
29) Apriori Itemset Generation
http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/itemset_apriori.html
Date extracted: 20/1/2015
30) Pentaho Data Mining
http://wiki.pentaho.com/display/DATAMINING/Apriori
Date extracted: 20/1/2015
31) SPMF
http://www.philippe-fournier-viger.com/spmf/index.php?link=download.php
Date extracted: 21/1/2015
32) CODE PROJECT
http://www.codeproject.com/Articles/70371/Apriori-Algorithm
Date extracted: 20/1/2015
33) All My Brain
http://allmybrain.com/2007/11/12/implementing-the-apriori-data-mining-algorithm-with-javascript/
Date extracted: 12/1/2015
34) CODE PROJECT
http://www.codeproject.com/Articles/70371/Apriori-Algorithm
Date extracted: 18/1/2015
35) stackoverflow
http://stackoverflow.com/questions/17125742/creating-k-itemsets-from-2-itemsets
Date extracted: 16/1/2015
36) compilr
https://compilr.com/soniaj/apriori/Project.java
Date extracted: 22/1/2015
37) Apache Ant - Tutorial
http://www.vogella.com/tutorials/ApacheAnt/article.html
Date extracted: 23/1/2015
38) Uregina
http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/Apriori.java
Date extracted: 22/1/2015
Reference Book
1) Data Mining: Practical Machine Learning Tools and Techniques, Second Edition,
by Ian H. Witten, Department of Computer Science, University of Waikato,
and Eibe Frank, Department of Computer Science, University of Waikato.
APPENDIX
Project 1 – Gantt Chart
Semester 2
Timeline columns: 21-Nov, 28-Nov, 5-Dec, 12-Dec, 19-Dec, 26-Dec, 2-Jan, 9-Jan, 16-Jan, 23-Jan, 30-Jan
No. Activities
1 In-depth research on the Apriori algorithm
2 Select appropriate data
3 Analyse and prepare dataset for simulation
4 Modify Apriori algorithm
5 Validate model
6 Documentation
Project 2 – Gantt Chart