COLLEGE OF MANAGEMENT IN TRENČÍN

Data Mining: A Tool for Knowledge Discovery

USING DATA MINING AS A TOOL FOR DISCOVERING

IMPORTANT KNOWLEDGE FOR COMPANIES

2010 Tomáš Vanek


Bachelor Thesis

Study program: Knowledge Management

Workplace: College of Management, Bratislava

Thesis advisor: Tomáš Vanek

Consultant: Martina Česalová, M.S.C.S

Trenčín 2010 Tomáš Vanek


Contents

1. Introduction and Problem Statement
2. Review of Literature
3. Description of the Methodology
4. Data mining overview
4.1. Principles of Data Mining
4.1.1. Definitions of Data Mining
4.1.2. History of Data Mining
4.1.3. The Evolution and the Future of Data Mining
4.1.4. Disadvantages of Data Mining
4.2. Data Warehousing
4.3. Knowledge Discovery Process and Data Mining
4.3.2. CRISP-DM model
4.3.2.1. Business understanding
4.3.2.2. Data understanding
4.3.2.3. Data preparation
4.3.2.4. Modeling
4.3.2.5. Evaluation
4.3.2.6. Deployment
5. Practical Project – Data Mining in Banking Domain
5.1. Business Understanding
5.2. Data Understanding
5.3. Data Preparation
5.4. Modeling
5.5. Evaluation
5.6. Deployment
5.7. Project Conclusion

Thesis Conclusion
List of Pictures
List of Figures
Literature


List of abbreviations

CRISP-DM - Cross-Industry Standard Process for Data Mining


Acknowledgements

I would like to thank Martina Česalová, M.S.C.S., for her patience and advice during the writing of this thesis.


1. Introduction and Problem Statement

Information is a very important factor in today's business. The spread of information technology means that large amounts of data from many different areas are collected and stored. Data are stored every time a person accesses a web page, purchases a product, or makes a phone call. These data contain hidden information that can be very valuable.

Data mining is a tool for analyzing these data and extracting useful, previously unknown and interesting information. It is used mostly by companies that collect and store large amounts of data. Mining the data allows them to gain essential knowledge and use it to their benefit. Data mining thus represents a relatively new technology that can provide numerous advantages.

The objective of the thesis is to offer general information about the topic. The thesis consists of a theoretical and a practical part. The theoretical part informs the reader about the basic principles of data mining. It starts by explaining how the revolution in information technology forced, and still forces, scientists to develop data mining technology. The thesis then gives some general examples of why data mining can be considered a gold mine for some companies. To provide a balanced view of the technology, it covers its disadvantages as well as its advantages. The thesis also discusses the history of data mining and examines predictions about its future. The end of the theoretical part focuses on the six steps of CRISP-DM, a standardized model used for data mining projects.

The practical part of the thesis proposes a data mining project in the financial sector. The goal of the project is to help a bank segment its customers by using data mining. The project is divided into the six steps of the CRISP-DM model: it covers the business opportunity, describes the data used, introduces the model, and suggests how the proposed solution could be deployed. As a final result, the bank can use the segmentation to improve its decision making and to introduce new services.


2. Review of Literature

Various sources were used in writing the thesis, and an effort was made to use different types of sources. The collected information comes mainly from printed and internet sources.

The first, theoretical part of the thesis draws primarily on two books. Most of the information is taken from “Introduction to Data Mining and its Applications” by Dr. S. Sumathi and Dr. S.N. Sivanandam. Both are professors at College of Technology in India and therefore experts in their field. At the beginning, the book provides a very clear and general introduction to the field. Information in the book is presented at great length, so it has often been summarized. The second book used in the theoretical part is “Data Mining - A Knowledge Discovery Approach” by four authors: Krzysztof J. Cios, Witold Pedrycz, Roman W. Swiniarski, and Lukasz A. Kurgan. All four authors work at different universities across the USA and Canada. As its name says, the book focuses mainly on knowledge discovery through data mining and is therefore very suitable for the thesis. A few internet sources were used as well. For example, to describe the business opportunities that data mining offers, a YouTube video by Dr. S. Srinath from the Indian Institute of Technology was used.

In the second, practical part of the thesis, an internet source was used to describe the credit scoring method. Information about credit scoring was taken from the website myFICO. The site has been on the market since 2001 and deals primarily with credit risk scoring for the finance segment, so it can be considered a relevant source. To complete the practical part of the project, the book “Dobývání znalostí z databází”, which can be translated as “Gaining the Knowledge from Databases”, was used. The book is written by Doc. Ing. Petr Berka, who works at the University of Economics in Prague. The practical part of the thesis could not have been done without this book, because it covers not only theory but also practical demonstrations of data mining methods.


3. Description of the Methodology

The thesis uses the evaluation method. Many sources were collected and analyzed to gain knowledge about data mining. The most important points were then researched further and presented in the thesis. Examples were used to highlight the importance of data mining and knowledge discovery in today’s competitive market environment. The theoretical knowledge gained was then applied in the practical part of the thesis, which shows how data mining can be used in a banking environment.


4. Data mining overview

4.1. Principles of Data Mining

The enormous amount of data that is created, used and stored every day has created demand for a new tool that can help analyze it, that is, a tool that turns stored data into useful knowledge that people can easily understand. Traditional techniques for analyzing data were very useful and solved many problems. They relied mostly on statistics, however, and could therefore extract only certain characteristics of the data. This limitation, and the need for a new tool for data analysis, led scientists to start collecting ideas for a machine learning tool. This effort led to a new research area called data mining, and later to a research area called data mining and knowledge discovery. None of this would have been possible without the computer revolution. (Sumathi & Sivanandam, 2006)

People have experienced a revolution in the availability of information, especially during the last decade, when the Internet and network-based systems enabled the global exchange of information. E-commerce businesses have experienced great growth, and companies started to collect more and more electronic information. More importantly, technology and market opportunities led companies to collect and use the right data: they started to analyze the collected data rather than collect it without further use. Soon many companies realized that “tracking, accounting for, and archiving the activities of an organization, this data can sometimes be a gold mine for strategic planning, which recent research and new businesses have only started to tap” (Sumathi & Sivanandam, 2006). With support from scientists and demand from commercial domains, data mining had ideal conditions to grow and develop. (Sumathi & Sivanandam, 2006)

The concept of data mining could not have grown so fast without database technology, which was used widely and with great success in the business environment. Organizations started to create very large databases with capacities in the terabytes. These databases hold business data such as “consumer data, transaction histories, sales records, etc.” (Sumathi & Sivanandam, 2006) that very likely contain a great deal of important and valuable information. This information is hidden in the data and needs to be extracted, which can be done by using a proper mining method. (Sumathi & Sivanandam, 2006)

Data mining is a promising tool that can be described as “the process of discovering meaningful new correlation, patterns, and trends by digging into (mining) large amounts of data stored in warehouse, using statistical, machine learning, artificial intelligence (AI), and data visualization techniques” (Sumathi & Sivanandam, 2006). Many industries already use data mining, for example the aerospace, medical and chemical industries, and because the technology is still quite new, their number is still increasing, not only because of its impact on science, but also because of its business value. (Sumathi & Sivanandam, 2006)

In terms of business value, data mining can be a figurative gold mine. From a business point of view, it can represent a beneficial and unique asset. Data mining can bring many benefits to a company or to business in general. Let’s look at a few concrete examples that may motivate managers or business owners to invest in this technology. Data mining can:

Influence decision making

Grow wealth

Help to analyze

Improve security

Decision making is an important process in running a company. Data mining can reveal patterns in historical data and can therefore lead to new knowledge. For example, analyzing a company’s data can reveal hidden patterns that repeat. With this knowledge from the past, the company can learn something new and act accordingly. We can therefore say that data mining can influence decision making. This is very important, because making strategic decisions is necessary for every company that wants to stay in today’s competitive market. (Srinath, 2008)

Making good decisions is also connected with growing wealth. If data mining can help a company make the right strategic decisions, it can also positively influence the company’s financial situation. Moreover, mining data grows the wealth of information that the company has. The information can be used in many different ways, for example in product development, marketing, or investment. We can therefore say that by using data mining a company gains important knowledge. This knowledge can later be transformed into strategic decisions that improve the company’s financial position and therefore grow its wealth. (Srinath, 2008)

As mentioned, data mining can reveal patterns in historical data and therefore help to analyze trends. Trend analysis can be used, for example, on the stock market: by mining data, stock exchange companies can analyze the historical price of a stock and predict its future price. Risk analysis can also be very interesting for companies. Exploring and analyzing data helps companies in the financial sector to evaluate customers. As will be proposed in the practical project, a bank can mine its data and separate good customers from bad ones, and so analyze the risks before offering any service to a particular customer. Overall, mined data can offer different kinds of information that can be used for analysis. (Sumathi & Sivanandam, 2006)

Lastly, data mining has recently started to be used for maintaining security. This is quite a new field that involves mining data to discover activity that may be illegal. (Srinath, 2008) In 2008, data mining was used successfully to help uncover the biggest scandal in online gambling history. In short, a few poker players were accused of cheating on a poker site that was part of the Ultimate Bet network. Online poker players who had become victims of the cheaters used data mining to analyze the situation. They came to the conclusion that it was statistically almost impossible to win so much money in such a short time, and contacted the company. It turned out that the cheaters had somehow bypassed the security systems and were able to see their opponents’ cards, which is a tremendous advantage in poker. In this case, known as the Ultimate Bet scandal, data mining helped to detect fraud and maintain security. (Brunker, 2008)
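
The reasoning that a result is "statistically almost impossible" can be sketched in Python as a simple outlier check. This is only an illustration under stated assumptions: the win rates below are invented, the function name `z_score` is hypothetical, and the cut-off of three standard deviations is a common rule of thumb rather than the method the poker players actually used.

```python
from statistics import mean, stdev

# Hypothetical win rates of ordinary players (invented illustration data).
typical_win_rates = [-4.0, -1.5, 0.5, 2.0, 3.0, -2.5, 1.0, 4.5, -3.0, 0.0]

# Hypothetical win rate of a suspicious account (also invented).
suspect_win_rate = 60.0


def z_score(value, sample):
    """Distance of `value` from the sample mean, in sample standard deviations."""
    return (value - mean(sample)) / stdev(sample)


score = z_score(suspect_win_rate, typical_win_rates)

# Assumption: anything more than 3 standard deviations above the mean is flagged.
is_outlier = score > 3
print(round(score, 1), is_outlier)
```

A flagged account is not proof of fraud; it only marks a case that deserves a closer look, as happened when the players contacted the company.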

Obviously, these are just a few of the ways in which data mining can benefit business owners and companies. There are other important ones as well. Other common uses of data mining that were not discussed are listed below with a short description:

Market segmentation: Finding characteristics that are common to customers who purchased the same or similar products.

Customer churn: Identifying customers who are likely to leave the current company for a different one.

Direct marketing: Identifying specific groups of customers and sending them mail to achieve a high response rate.

Interactive marketing: Determining which information or product a customer was interested in when browsing a web page.

Market basket analysis: Identifying products and services that are likely to be purchased together. (Sumathi & Sivanandam, 2006)
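
The market basket idea from the list above can be illustrated with a minimal Python sketch. The baskets, the product names and the function names `pair_support` and `confidence` are hypothetical and chosen only for illustration. Support is the share of baskets that contain both items; confidence is the share of baskets containing the first item that also contain the second.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (invented illustration data).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "milk"},
    {"bread", "butter", "cereal"},
]


def pair_support(transactions):
    """Return, for each pair of items, the share of transactions containing both."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives every pair a single canonical order, e.g. ("bread", "butter").
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: n / total for pair, n in counts.items()}


def confidence(transactions, antecedent, consequent):
    """Share of transactions with `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in transactions if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


support = pair_support(baskets)
best_pair = max(support, key=support.get)
print(best_pair, support[best_pair])
print(confidence(baskets, "butter", "bread"))
```

Real projects usually rely on dedicated association rule algorithms for large data; the sketch only shows what "likely to be purchased together" means in numbers.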

4.1.1. Definitions of Data Mining

Data mining can be defined in many ways. The following definitions have been picked from different sources:

“Data mining is the efficient discovery of valuable, nonobvious information from a large collection of data.” (Sumathi & Sivanandam, 2006)

“The aim of data mining is to make sense of large amounts of mostly unsupervised data, in some domain.” (Cios, Pedrycz, Swiniarski, & Kurgan, 2007)

“…is the process of analyzing data from different perspectives and summarizing it into useful information” (Palace, 1996)

“It is the process of extracting previously unknown, valid, and actionable information from large databases and then using the information to make crucial business decisions.” (Sumathi & Sivanandam, 2006)

4.1.2. History of Data Mining

As already mentioned, data mining is quite a young and groundbreaking tool that does not have a very long history. It has recently been the subject of many business and software magazines. Even though its importance is now widely recognized, a few years ago not many people were familiar with the term data mining. The term itself was first introduced in the 1990s. Data mining can be traced back to three family roots. (Data Mining Software, n.d.)

The most important root is statistics. Classical statistical concepts such as “regression analysis, standard distribution, standard deviation, standard variance, discriminant analysis, cluster analysis, and confidence intervals” (Data Mining Software, n.d.) are used in data mining to study data and the relationships within it. Even though today’s data mining uses more advanced analysis, its core is still built on basic statistical tools and techniques. Without statistics, data mining would certainly not exist. (Data Mining Software, n.d.)


The second root of data mining is artificial intelligence. Artificial intelligence attempts to apply human-like reasoning to statistical problems. This of course requires computer processing power, so it could not be used until the early 1980s. In the early 1980s computers became widely accessible and people could buy processing power at quite reasonable prices. Later, as computers became faster and cheaper, data mining grew even faster. Supercomputers, with their great processing power, also made it possible to study and analyze large amounts of data. Overall, the biggest advantage of artificial intelligence was that it made it possible to process data faster and more precisely than humans could. (Data Mining Software, n.d.)

The last root is the combination of statistics and artificial intelligence, known as machine learning. Because computers became cheaper and faster in the 1980s and 1990s, machine learning evolved. More applications were released because it was more accessible than artificial intelligence had been. Machine learning is considered an advancement of artificial intelligence. Its main advance is the ability to make computer programs learn about the data they study. This allows programs to make decisions based on the knowledge gained from the data, using statistics and advanced algorithms to achieve their goals. (Data Mining Software, n.d.)

In one sentence, the short history of data mining can be described “as the union of historical and recent developments in statistics, AI, and machine learning” (Data Mining Software, n.d.).

4.1.3. The Evolution and the Future of Data Mining

According to Dr. Sumathi and Dr. Sivanandam, the evolution of data mining was a natural process caused by the increased use of information technology. As a matter of fact, the growth of information technology went hand in hand with growth in the amount of data used. Logically, larger amounts of data had to be stored and analyzed. Traditional methods, such as creating queries and reports, could not handle large amounts of data, so data mining started to be developed and widely used. It soon came to be considered a tool with great future potential. (Sumathi & Sivanandam, 2006)

The future of data mining can be described as very bright. As already mentioned, the full potential of data mining is not yet being used and the concept is still being developed. In the near future, data mining will penetrate more businesses and will become a very profitable and valuable tool in many areas. Many markets could be heavily influenced by data mining, but the most significantly affected will probably be advertising. Data mining will allow advertisers to explore unique niches, which would attract a wide range of new customers. Moreover, data mining will become available to the general public and easier to use. Not only experts in the field will be able to benefit from data mining; with user-friendly applications and tools, the technology would be as easy to use as e-mail. The general public might be able to find the lost phone numbers of classmates, or the best loan in the area, within a short period of time. (Sumathi & Sivanandam, 2006)

In the long term, data mining can do a lot for us. The changes and challenges are exciting and groundbreaking. For example, by applying data mining in medicine, we might be able to discover new treatments and practices for illnesses that we cannot yet cure. (Sumathi & Sivanandam, 2006)

4.1.4. Disadvantages of Data Mining

It should now be clear that data mining is a very valuable tool that can offer unique benefits to companies operating in different businesses. Even though the technology itself cannot literally harm anyone, the purpose of this part is to identify possible drawbacks. At the moment, data mining does not have any major disadvantage that would raise concerns among companies willing to invest in the technology. Some scientists and experts, however, have raised questions about disadvantages that may occur. In the future, the main disadvantages likely to be connected with data mining are privacy and security. (Chhay, 2005)

The technology boom has made privacy a major concern. Technology allows people to do everyday tasks more easily, quickly and comfortably, but the same technology makes people more sensitive about their privacy, because most technological tools are able to track and store a person’s private information. Whenever somebody makes a phone call, pays with a credit card, visits a web page, or books a flight ticket, data are collected. This kind of data is already stored in the databases of many companies. But what if all this information were collected together? Collecting all the data from different sources represents the real concern: analyzing these data would reveal a lot about individuals. Even though each country has different privacy laws, it is generally illegal to sell or exchange customers’ private information between organizations. However, this kind of transaction is hard to control. As Heng Chhay wrote, “…in 1998, CVS had sold their patient’s prescription purchases to a different company” (2005). Selling information about customers without their knowledge is definitely a violation of privacy. (Chhay, 2005)

Security is another major issue that has already occurred and will remain a disadvantage in the future. Companies collect information about customers, but many of them do not have appropriate security measures. There have therefore been many cases in which data were accessed and misused. For example, a company called Ford Motor Credit had to apologize to 13,000 of its customers because “their personal information including Social Security number, address, account number and payment history were accessed by hackers who broke into a database” (Chhay, 2005). As a result, the company lost its reputation. Companies should therefore always think about the safety of their data, because underestimating security measures can lead to disaster. (Chhay, 2005)

4.2. Data Warehousing

Even though data warehousing may not seem to play an important role in data mining, the opposite is true. It is important to understand data warehousing concepts because data warehousing is closely connected with data mining. Data warehousing can be defined as “a process of centralized data management and retrieval” (Sumathi & Sivanandam, 2006). Like data mining, data warehousing is quite a new concept. A data warehouse is neither software nor hardware; it is better described as an environment that allows companies to store their data in relational database systems. These systems are designed to deliver high performance and support large databases. Data warehousing and data mining work very well together, because data warehousing provides the memory and data mining the intelligence. (Sumathi & Sivanandam, 2006)

Any organization that creates and stores a lot of data faces the problem of turning these data into valuable information. This information is usually unknown, but present in the data already stored. To extract information from the data, and so turn data into knowledge, certain steps need to be taken. For example, the data need to be stored in a certain form and organized so that mining can be applied. (Sumathi & Sivanandam, 2006)

The primary purpose of data warehousing is to allow end users to search for information that would support, for example, their strategic decision making. End users can access and interact with data warehouses through front-end tools. These access tools can be divided into five main groups:

“1. Data query and reporting tools

2. Application development tools

3. Executive information system (EIS) tools

4. Online analytical preprocessing tools and

5. Data mining tools” (Sumathi & Sivanandam, 2006)

4.3. Knowledge Discovery Process and Data Mining

To understand how valuable information is extracted from data stored in databases, the knowledge discovery process needs to be briefly explained. It can be described as “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Cios, Pedrycz, Swiniarski, & Kurgan, 2007). So what is the basic difference between data mining and the knowledge discovery process? Data mining is just one of the many steps that the knowledge discovery process covers. The basic knowledge discovery process is shown in Figure 1 below.

Figure 1: Knowledge discovery process model

Source: Cios, Pedrycz, Swiniarski, & Kurgan, 2007

As can be seen, the model has an input, which represents data, and an output, which represents knowledge. The input is the data that are going to be analyzed. The type of data differs from project to project, but the input can typically include “numerical and nominal data stored in databases or flat files; images; video; semi-structured data, such as XML or HTML” (Cios, Pedrycz, Swiniarski, & Kurgan, 2007). The collected data then go through a number of steps that are interconnected by feedback loops. The result, as can be seen in Figure 1, is the final knowledge. All in all, the knowledge discovery process can be defined as a process that helps to turn data into useful knowledge by applying patterns and algorithms. (Cios, Pedrycz, Swiniarski, & Kurgan, 2007)

4.3.2. CRISP-DM model

A lot of effort has gone into creating a model that defines the process and phases of data mining projects. One example is the model of Cabena et al. (Cios, Pedrycz, Swiniarski, & Kurgan, 2007), which consists of five steps and is supported by IBM. Another is the CRISP-DM model, which consists of six steps and has become the most popular and leading model. This model will therefore be explained in detail.

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It was introduced in the 1990s by a consortium of companies, with support from the European Commission, as a free-to-use data mining model. (Hunter, 2009) CRISP-DM was developed to create a standard process for data mining projects, because data mining was quite new and nobody followed any particular process or guide when developing such projects. The process is very flexible and can be used in a variety of industries and with a variety of data mining software. The CRISP-DM process is valuable because it makes data mining projects faster, cheaper, more efficient and more reliable. The model consists of six steps, or phases, as can be seen in Figure 2 below. (Cios, Pedrycz, Swiniarski, & Kurgan, 2007)


Figure 2: Cross-Industry Standard Process model

Source: Crisp-dm, n.d.
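
The six phases and their order can also be written down as a small Python data structure. This is a simplified sketch, not a reproduction of Figure 2: the names `Phase`, `BACKWARD_LINKS` and `next_phases` are hypothetical, and only the two backward links that the text of this chapter describes (data understanding back to business understanding, and modeling back to data preparation) are modelled.

```python
from enum import Enum


class Phase(Enum):
    """The six CRISP-DM phases in their usual order."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Backward steps described in the text; other arrows of the figure are left out.
BACKWARD_LINKS = {
    Phase.DATA_UNDERSTANDING: Phase.BUSINESS_UNDERSTANDING,
    Phase.MODELING: Phase.DATA_PREPARATION,
}


def next_phases(phase):
    """Return the phases that may follow `phase` in this simplified sketch."""
    options = []
    if phase.value < len(Phase):
        # The normal forward step to the following phase.
        options.append(Phase(phase.value + 1))
    if phase in BACKWARD_LINKS:
        # An optional step back to the previous phase.
        options.append(BACKWARD_LINKS[phase])
    return options


print([p.name for p in next_phases(Phase.MODELING)])
```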

4.3.2.1. Business understanding

As can be seen in Figure 2, the Cross-Industry Standard Process model starts with business understanding. This first phase is very important because it defines the primary goals, that is, the main purpose of the whole data mining project. It must be specified what we want to know or learn from the available data that we are going to explore. It is also important to set out which questions the project should answer and what business value it holds. In the business understanding phase, the project goal must be set, together with specific, measurable criteria for project success. This initial phase gives the whole project its direction. Without clearly defined objectives, the project can lose its direction and therefore its initial purpose, and fail. (Hunter, 2009)

4.3.2.2. Data understanding

The second phase starts with collecting the already existing data. Data understanding can also be described as familiarization with the data. This phase requires a domain expert, who explores interesting data and detects possible data problems. The data are explored according to the business needs specified in the previous phase, either through graphic visualization or through a statistical approach. In this step the domain expert also starts to look at basic relationships within the available data. As can be seen in Figure 2, business understanding and data understanding are interconnected. This interconnection exists because finding relationships in the data can feed back into business understanding. For example, we may find out that the data do not contain enough information to satisfy the primary goal set in business understanding. The goal then needs to be changed or replaced, because we would not be able to achieve it. In other words, during the first and second phases, the hypotheses and goals of the project are given their final form. (Hunter, 2009)

4.3.2.3. Data preparation

The data preparation phase is usually the most time-consuming. In some cases it can
take more than 80% of the project’s schedule. The time needed depends mainly on the
quality of the available data. If the raw data are messy, it can take a lot of time to
sort them out. For example, some attributes and variables can be incorrect or missing.
During data preparation, the final dataset that is going to be used is created by selecting
the needed data. Moreover, during this phase the data need to be cleaned and brought
into a form that is suitable for the purpose of the project. (Hunter, 2009)

4.3.2.4. Modeling

In this phase, modeling techniques are selected from a wide range and applied.
Several models are applied to the same data mining problem and are later tuned for
optimal output. As can be seen in Figure 2, the data preparation and modeling phases are
interconnected. The interconnection exists because some models require the data in a
specific form, so a step back into the previous phase is often necessary. After the data are
cleaned and modified, the algorithms can be run again. (Hunter, 2009)

The modeling stage is divided into four parts:

“selection of modeling technique(s)

generation of test design

creation of models, and

assessment of generated models” (Cios, Pedrycz, Swiniarski, & Kurgan, 2007)


4.3.2.5. Evaluation

After the model or models have been created, they need to be reviewed and the best
model is chosen. The chosen model needs to be evaluated against the project’s business
objective: it has to satisfy the set goals in a useful way. It is essential to find out whether
all the business goals have been considered. Additionally, in this step it is important to
decide how to use the model or collection of models. (Hunter, 2009)

4.3.2.6. Deployment

In this final phase, the chosen model that suits the data mining project must already
be known. The deployment phase uses the chosen model to score the data. However, this
phase need not be the last one. Sometimes it is better to step back to the third phase and
add more data to achieve better results. (Hunter, 2009)

The deployment phase is divided into:

“plan deployment

plan monitoring and maintenance

generation of final report

review of the process substeps” (Cios, Pedrycz, Swiniarski, & Kurgan, 2007)


5. Practical Project – Data Mining in Banking Domain

After covering the theory, it should be clear what data mining is and what the
pros and cons of the tool are. However, to understand the issue better it is always good to
apply theoretical knowledge in practice. Therefore, in this part the gained knowledge is
going to be exercised in a practical project. The goal of the practical project is to apply
data mining concepts and possibly solve some problem or fill some need. As was previously
mentioned, data mining can be used in many different areas. As a student of a business
school, I have decided to apply data mining to the finance sector. To be specific, the
practical project will deal with the banking domain.

The primary goal of the project is to help a bank improve its services.
Specifically, the bank wants to use data mining to evaluate its customers. By using data
mining, the bank’s existing data can be analyzed and used for the evaluation. The
evaluation will help the bank divide its customers into categories. This
segmentation can be very beneficial for the bank because it can, for example, separate bad
customers from good ones.

The data mining project will follow the CRISP-DM methodology, which can be seen in
Figure 2 and was described in the theoretical part of the thesis. This standardized model is
ideal for the project, so the project will be divided into six main parts. In the business
understanding part, some of the data mining techniques used in the financial sector
will be mentioned and briefly explained, and the one that suits the purpose of the project
will be chosen. Most importantly, the business understanding part will state the clear goal
of the project and its business value. In the second step, the data needed for the project
will be identified and defined. In the data preparation phase, the attributes that are
essential for the project will be listed. In the fourth, modeling phase, the model will be
introduced. In the evaluation phase, the model will be evaluated, and the deployment phase
will finally explain the options the bank has for using the data mining solution. Lastly, the
project will be summarized in the conclusion.

In a real-life environment, the project would need three experts. The domain expert
would be responsible for the business purpose, that is, for the business understanding and
deployment parts. The data expert would take care of the data understanding and data
preparation parts. The last expert would be the data mining expert, who would be
responsible for creating the models and evaluating them.

5.1. Business Understanding

In the business understanding phase, the direction of the project needs to be set. But
first, it is crucial to explain a few data mining methods that are used in the financial sector.
The one that is most appropriate for the project will then be chosen and the
project goal can be set.

The first data mining method that the bank could use is the Customer Relationship
Model. This model is used to measure customers’ response to a service or product. By
scoring customers, the bank learns how successful the product or service is. The
information can also be used to predict customers’ behavior. For example, if the bank
introduces a new service and the data show that the service is poorly used, the bank can
assume that the service is not needed. Even though the bank considered the service
important, the customers proved the exact opposite, and the bank can therefore predict that
introducing similar services or products is not necessary. The implementation of the
solution differs and depends on how the bank communicates with its customers, because the
data can be gathered in different ways. For example, the bank can contact customers, or
customers can contact the bank. This solution can be used to improve customer care and
therefore to increase the competitive advantage of the bank. (Dass, 2006)

The second method used in the financial sector is called Risk Analysis. This method is
mainly used to forecast factors that can influence the company. In our case, the
bank can use historical data to make the right decisions. This method can lead to
cost-effective running of the bank, and the predictions can help the bank stay competitive
in the marketplace. This method does not primarily deal with service improvement and
customer care and therefore is not going to be used for the project. (Dass, 2006)

Stock analysis and prediction is the third method that could be used. The method is
mainly used in the stock market, but can also be used by banks that specialize in making
investments. This method is not focused on the customer. The main idea is to make
predictions about the future based on already existing, historical data. Stock analysis
focuses on finding historical events that are likely to repeat. Such predictions are very
valuable when predicting market trends and when deciding whether and when to
buy a stock. However, many factors can influence the final prediction,
such as a financial crisis or natural disasters, so a forecast cannot be considered 100 percent
correct. The third method, stock analysis and prediction, does not fulfill the project
criteria and therefore will not be used. (Dass, 2006)

The last method that could be suitable for the project is the credit scoring method. This
method has already been used in the financial sector for a few years. It is primarily used by
banks to evaluate customers. Through this evaluation, the bank can predict possible risks
when lending money. The bank, as the lender, uses credit formulas to analyze the
borrower’s data. Because the system is not used only in the banking sector, credit formulas
may differ. However, the information that is important for the bank can be seen in Figure 3.
The pie chart shows the data that the bank has about the customer. The data are divided into
five main categories: amounts owed, payment history, types of credit used, new credit, and
length of credit history. (myfico, 2009)

Figure 3: FICO Scores chart

Source: myfico, 2009

The percentages in Figure 3 reflect the importance of each category of
information. For example, the history of payments is more important than
amounts owed. Each of the five categories is explained below to show why
it is important.

As can be seen in Figure 3, the most important factor that the bank considers is
“payment history”, which accounts for 35 percent of the score. Payment history includes
information about payments on accounts: whether the customer has or had a mortgage,
credit cards, loans etc. The second most important attribute, “amounts owed”, represents
30 percent of the pie chart. It includes the amount the customer owes on accounts, the
number of accounts, credit limits on accounts etc. “Length of credit history” accounts for
15 percent and includes, for example, the dates when the account(s) were opened and how
often the account is used. Finally, the last two categories are each worth 10 percent.
“New credit” holds information about recently opened accounts, including when they were
opened, their credit limits and credit history. “Types of credit used” covers the types of
accounts the customer uses, for example credit card accounts or loan accounts.
(myfico, 2009)

The purpose of the practical project is to evaluate the customers of the bank. Based on
the evaluation, the customers can be segmented and the result used to improve services.
The credit scoring method is ideal for this purpose and was therefore chosen from all the
mentioned solutions. The method can evaluate customers according to the bank’s criteria.
By using the right data and a proper model, the bank would be able to decide whether a
customer is worth lending money to. So, the project goal is to create an accurate model
that is able to do such a segmentation. Creating the model would also lead to service
improvement, which is the main business goal of the practical data mining project.

5.2. Data Understanding

In the data understanding phase, the main goal is to get familiar with the data that
are available for the project and then to decide which data are interesting and suitable for
the goal of the project. So in the data understanding part, the domain expert would collect
the data from the bank and analyze them. The collected data contain very sensitive
information about customers and the bank. Because the real data contain such information,
they can be considered part of the bank’s know-how. Moreover, revealing private
information about the bank’s customers would violate their privacy rights and damage the
bank’s reputation. Therefore the data are not available to the general public. Because data
from a real bank are not available for the project, common sense and assumptions will be
used instead.

The bank stores a large amount of data, including information about employees,
transactions, customers etc. For the purpose of the project, the main focus is on the
information about customers; data that are not related to customers can be considered
irrelevant. Information about a customer is collected when the customer asks for an
account. This information includes name, address, phone number, the services that he/she
needs etc. So the data are collected when the account is created and are stored in the
bank’s database.

The model of the bank’s database consists of many classes. The class diagram can
be seen in Figure 4. The lines represent the relationships between classes and the symbols
represent the type of each relationship. The relationships are as follows: one customer can
have one or many accounts, one account can have one or many transactions, and one
customer can have one or many services and one or many loans.

Figure 4: Class diagram of bank’s database
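The one-to-many relationships of the class diagram can also be sketched in code. This is only an illustration under stated assumptions: the class names follow Figure 4, but every attribute (`name`, `balance`, `amount`) and the example values are hypothetical, and the empty default lists are a simplification of the "one or many" cardinality.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Transaction:
    amount: float  # hypothetical attribute


@dataclass
class Account:
    balance: float  # hypothetical attribute
    # One account can have many transactions.
    transactions: List[Transaction] = field(default_factory=list)


@dataclass
class Service:
    name: str  # hypothetical attribute


@dataclass
class Loan:
    amount: float  # amount of money the customer wants to borrow


@dataclass
class Customer:
    name: str
    # One customer can have many accounts, services and loans.
    accounts: List[Account] = field(default_factory=list)
    services: List[Service] = field(default_factory=list)
    loans: List[Loan] = field(default_factory=list)


# Hypothetical example: one customer with one account, one transaction, one loan.
customer = Customer(name="Example Customer")
account = Account(balance=300.0)
account.transactions.append(Transaction(amount=-20.0))
customer.accounts.append(account)
customer.loans.append(Loan(amount=1000.0))
```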

After the familiarization, the domain expert concludes that the bank’s database is
suitable for the project goal, meaning that it contains enough data to produce a valid
result. So, the data understanding part was successful and data preparation can begin.

5.3. Data Preparation

The data preparation part covers the concrete attributes that are important for the
project. The data were collected from four tables. From the customers table, personal
information is needed. The most crucial attributes from this table
are age, income and employment. Some attributes are also collected from the services
table and from the account table. For the purpose of the project, the most important
attribute in the account table is the account balance. Lastly, the loan table contains an
attribute called amount, which is also very important and specifies the amount of money
the customer wants to borrow.

In the next step, all the important attributes in the mentioned tables
need to be modified so that the decision tree can process them. The list of
attributes, grouped into categories by boundaries, can be seen below:



Personal information:
    Gender: male/female
    Marital status: divorced/separated/married/single/widowed
    Age:
        young: 0 – 25
        middle aged: 25 – 50
        old: 50 – 67
        retired: > 67
    Annual income:
        low: 0 – 499
        middle: 500 – 799
        high: > 800
    Employed: yes/no
    Job position: employed/unemployed
    Accommodation: own/rent/for free
    Number of residents in the household: (number)
    Number of children: (number)

Service information:
    Number of credit cards: (number)
    Insurance: yes/no
    Internet banking: yes/no

Account information:
    Monthly account balance:
        low: 0 – 249
        middle: 250 – 999
        high: > 1000
    Credit history: credit never taken/all credit paid on time/delay in payments
    Number of loans: (number)
    Number of permanent transactions: (number)
    Number of transactions: (number)

Loan information:
    Type of the loan: house/student/combined/others
    Purpose of the loan: house/car/equipment/investment/business/others
    Amount: (number)
    Monthly payments: (number)
    Debtors: none/co-applicants/guarantor
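
The grouping of numeric attributes by boundaries listed above could be implemented along the following lines. This is a sketch under assumptions, not a definitive implementation: the function names are hypothetical, and because the listed ranges leave some boundary values open (for example exactly 800 for income, exactly 1000 for balance, or the shared age limits 25, 50 and 67), the code assumes that 800 and 1000 count as "high" and that a shared age limit belongs to the younger group.

```python
def bin_annual_income(income):
    """Map annual income to 'low', 'middle' or 'high'."""
    if income < 500:
        return "low"
    if income < 800:
        return "middle"
    return "high"  # assumption: exactly 800 is treated as high


def bin_monthly_balance(balance):
    """Map monthly account balance to 'low', 'middle' or 'high'."""
    if balance < 250:
        return "low"
    if balance < 1000:
        return "middle"
    return "high"  # assumption: exactly 1000 is treated as high


def bin_age(age):
    """Map age in years to an age group."""
    if age <= 25:
        return "young"  # assumption: shared limits go to the younger group
    if age <= 50:
        return "middle aged"
    if age <= 67:
        return "old"
    return "retired"


print(bin_annual_income(650), bin_monthly_balance(1200), bin_age(30))
```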

5.4. Modeling

After the modification of the attributes, the modeling process can begin. In the
modeling phase, the decision model is created. For the purpose of the project, a simple
decision tree is proposed to demonstrate the possible criteria that the bank can require.
The data mining expert designed the decision tree shown in Figure 5 around
three basic attributes: annual income, monthly account balance and employed.

Figure 5: Decision tree model

Source: Berka, 2003

Annual income
    High: Yes
    Middle/Low: Monthly account balance
        High: Yes
        Low: No
        Middle: Employed
            Yes: Yes
            No: No


According to the proposed model, a customer asking for a loan would first be
assessed by his or her annual income. As can be seen in the data preparation part, this
attribute has been grouped by boundaries into three categories: high, middle and
low. As the decision tree shows, the customer will get the loan if he or she has a high
income (more than €800). If not, the customer is assessed by the second attribute, which,
as can be seen in Figure 5, is the monthly account balance. The same
principle is applied here. The customer is given the loan if his or her balance is high
(more than €1000). If the customer’s balance is low (up to €249), he or
she is not suitable for the loan. If the balance is in the middle range (€250 – €999), the
customer is assessed by the third and last attribute. This attribute is simple:
the customer gets the loan if he or she is employed and is refused otherwise.
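
The decision path described above can be written as a short function. This is a minimal sketch of the tree in Figure 5; the function name `approve_loan` and the string labels for the categories are hypothetical, and the inputs are assumed to be already grouped into "low", "middle" and "high".

```python
VALID_GROUPS = {"low", "middle", "high"}


def approve_loan(income_group, balance_group, employed):
    """Return True if the loan is granted according to the tree in Figure 5.

    income_group and balance_group are 'low', 'middle' or 'high';
    employed is a boolean.
    """
    if income_group not in VALID_GROUPS or balance_group not in VALID_GROUPS:
        raise ValueError("groups must be 'low', 'middle' or 'high'")
    # First attribute: a high annual income is sufficient on its own.
    if income_group == "high":
        return True
    # Second attribute: monthly account balance.
    if balance_group == "high":
        return True
    if balance_group == "low":
        return False
    # Third attribute: with a middle balance, employment decides.
    return bool(employed)


# Hypothetical applicant: middle income, middle balance, employed.
print(approve_loan("middle", "middle", True))
```

Keeping the thresholds outside this function means the bank can change the category boundaries without touching the decision logic.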

The proposed decision tree is very simple and easy to understand. Of course, the
bank can easily change the requirements for the loan. For example, the bank can lower
or raise the required annual income. The decision tree itself can also be modified easily:
additional attributes that help to evaluate customers can be added. It all depends on
the requirements set by the bank.

5.5. Evaluation

In the evaluation phase, the created model needs to be evaluated and tested. The data
mining model used in the practical project is based on the decision tree described in the
modeling phase. The created model meets all the business objectives and goals of the
project and can therefore be evaluated as suitable. To ensure that the model works properly,
it needs to be tested. The data mining expert decided to test the model on a sample of
5000 customers. All customers in the sample will be evaluated according to the model and
the results will be reviewed. If, for example, an error occurred for only 5 customers, the
model’s accuracy on the sample would be 99.9%.
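
The accuracy figure in this example follows from simple arithmetic: 4995 correct evaluations out of 5000. A minimal sketch of the calculation, with the hypothetical helper name `accuracy`:

```python
def accuracy(n_evaluated, n_errors):
    """Share of correctly evaluated customers in the test sample."""
    if n_evaluated <= 0:
        raise ValueError("sample must contain at least one customer")
    if not 0 <= n_errors <= n_evaluated:
        raise ValueError("errors must be between 0 and the sample size")
    return (n_evaluated - n_errors) / n_evaluated


# Sample of 5000 customers with 5 wrongly evaluated ones.
print(f"{accuracy(5000, 5):.1%}")
```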

5.6. Deployment

The proposed model based on the decision tree is expected to give the bank
very high accuracy. So the next step, and the purpose of the deployment phase, is to apply
the solution in the bank and to propose the changes that can be made. As a result, the
domain expert suggests applying the platform in two basic ways:


1. Changing the process of decision making

2. Introducing and improving services

The first and most crucial improvement will help clerks in the bank decide whether
a loan should be given or not. Changing the decision process will help determine whether
a person applying for a loan is worth lending money to. Clerks will use a platform that
evaluates the borrower according to the data and identifies
him/her as a “yes” or “no” customer. Each customer will need to be evaluated before a
loan is given. So, the decision making process will be much easier and the possible
mistakes made by clerks will be minimized. This will of course make the work
of the bank’s employees much easier. However, if the platform marks a customer as not
suitable for the loan, the clerk will always need to check whether the data are correct. If
the customer does not pass, it is the clerk’s responsibility to find the reasons and explain
them to the customer. For example, the clerk can advise the customer to increase the
account balance or decrease the amount of money borrowed. Moreover, if the customer,
for instance, does not have any account balance or is unemployed, he/she fails completely.
In this case, the clerk needs to explain that he/she does not fulfill the bank’s criteria and
therefore the loan cannot be approved.

Secondly, the created model will support the introduction of new services and the
improvement of services that the bank already offers. The proposed decision tree will be
part of two applications that will be created for the bank. The first one is an internal
application that will be used only by employees. The second application will be external.
It will be part of an online platform that will be used by customers.

The internal application is very important because it will allow employees to use
the model without any further knowledge of decision trees or data mining. The
application needs to be written in a commonly used programming language.
It is also important that the program is compatible with the operating
systems used in the bank. The program needs to be secured because it contains
sensitive information and also includes the bank’s “know-how”. The primary users of
the program will be clerks. We can assume that most of them have only basic
computer skills, so the program should be user friendly and intuitive. That would allow
clerks to work with the program without going through long and complicated training. The
program would be used while dealing with a customer and would allow the
bank to simplify the decision making process. Moreover, with such a sophisticated
program, the clerks do not have to be very skilled or educated in the banking sector. So, with
the help of the software, the bank can introduce new services. For example, the bank can
start a new customer line that would be available 24/7. It would need one operator
with good people skills who would be trained to use the software. The
customer line would serve customers who do not have time to go to the bank. They can
simply call, give the information to the operator and find out whether they can get a loan
or not. By introducing this service, the bank can attract a broader range of customers and
therefore increase its revenue stream.

The external application would also attract more customers. It would simply be an
online platform on the bank’s corporate website. The main purpose of the platform is to
serve clients who do not want to, or cannot, go to the bank. The platform would allow
customers to enter the required data
into an online form similar to the one that can be seen in Picture 1.


Picture 1: Online form

Source: TrueCredit

The online form in Picture 1 is taken from the TrueCredit web site and is used only to
demonstrate what kind of data such a form may include. For example, the data include
the purpose of the loan that the customer applies for, as well as personal and contact
information. As soon as the customer submits this form, he/she is automatically redirected
to the next form, which asks for more detailed information about his/her credibility, for
example the amount of money needed, annual income, number of children and whether the
person is employed. The data will then be analyzed and the customer receives the result for
the loan he/she applied for. With the growing popularity of the internet, this solution can
attract more customers and therefore lead to financial benefits.


5.7. Project Conclusion

To highlight the importance of the project, it is essential to mention that offering
loans to the general public is still considered a core business of commercial banks.
Loans can therefore be considered one of the primary sources of revenue for
commercial banks. Logically, this fact forces banks to develop the lending process and
thereby improve their services. The process itself is quite simple. However, the
decision making is primarily influenced by human judgment. Traditionally, the
borrower is evaluated by a clerk in the bank. Even if the clerk is highly trained, this can
lead to mistakes. The problem can be solved by using a data mining approach with credit
scoring formulas, as was proposed in the project. By summarizing all the information,
the credit scoring method can evaluate the borrower and decide whether he/she deserves to
get a loan. Finally, the proposed solution will improve the decision making process and
therefore help the bank decrease the risk of losing lent money. Moreover, the
solution can positively influence the economic situation of the bank by introducing new
services to customers.


Thesis Conclusion

The main goal of the thesis is to inform the reader about data mining technology
and highlight its importance, and furthermore to propose a project in which the technology
is applied and draw attention to its outcome.

The theoretical part covers general information about data mining. It starts by
explaining the basic principles on which the technology works. It then describes how data
mining was developed and continues by specifying four reasons why companies should
consider investing in data mining technology. In the next part, the short history of the
technology is highlighted. The thesis then mentions the evolution of data mining and
makes some predictions about the near future. Because data mining technology is strongly
connected with databases, some data warehousing concepts are covered. The thesis continues
by explaining the difference between data mining and the knowledge discovery process.
Finally, the theoretical part of the thesis describes the six steps of the CRISP-DM model
that was used in the practical part.

The practical part is dedicated to a data mining project focused on the financial sector.
The goal of the project is to help a bank divide its customers into two groups according to
their financial credibility. The first group represents customers who are suitable for
a loan and the second group represents customers who are not. The
project follows the CRISP-DM model and therefore consists of six main steps. To achieve
the project’s goal, the credit scoring method was used. The outcome of the project is
presented in the last, deployment part, which proposes two main ways in which the bank
should use the created model. Firstly, the model should be used to improve and simplify the
decision making process when the bank lends money. Secondly, the model should be used to
improve the bank’s current services as well as to introduce new ones. To conclude, the
practical part deals with a data mining project whose outcome can help the bank
decrease risks and increase revenues.

Last but not least, the popularity of data mining is driven by the increasing amount of
data being stored. This is primarily caused by the advancement of information technology,
which allows data to be stored faster and more cheaply. Globalization and the wide spread
of telecommunication technologies are among the reasons why data created by
people around the world can be obtained quite easily. These are some of the many reasons
why demand naturally arose for a tool or technology that could translate
these valuable data into helpful knowledge that can be easily understood.


As the thesis mentions, by using data mining companies are able to gain knowledge
and therefore make better decisions, gain competitive advantage, or grow wealth. Data
mining and knowledge discovery can seem like a complicated tool today. However,
further development will probably lead to it being used more, not just by
businesses but also by governments and ordinary people. To conclude, data mining gives
businesses a unique opportunity to extract information from data they already have but in
a form that cannot be understood. Therefore, this opportunity should not be underrated. It
should be considered a good investment, especially in today’s competitive market, where
making the right decisions is the key to success.


USING DATA MINING AS A TOOL FOR DISCOVERING IMPORTANT

KNOWLEDGE FOR COMPANIES

I, Tomas Vanek, do hereby irrevocably consent to and authorize the library of Vysoká

škola manažmentu v Trenčíne to file the attached project and/or bachelor thesis USING

DATA MINING AS A TOOL FOR DISCOVERING IMPORTANT KNOWLEDGE FOR

COMPANIES and make such paper available for in-library use in all site locations.

For public access to digital form of the project/bachelor thesis on internet

I give my permission

I do not give my permission

I state at this time that the contents of this paper are my own work and all resources used

are indicated.

_______________________________________________________________ (Signature)

___________________28.3.2010________________________________________ (Date)


List of Pictures

Picture 1: Online form


List of Figures

Figure 1: Knowledge discovery process model

Figure 2: Cross-Industry Standard Process model

Figure 3: FICO Scores chart

Figure 4: Class diagram of bank’s database

Figure 5: Decision tree model


Literature

Berka, P. (2003). Dobývání znalostí z databází [Gaining Knowledge from Databases].
    Prague, Czech Republic: Academia.

Brunker, M. (2008). Poker site cheating plot a high-stakes whodunit. Retrieved November
    5, 2009, from http://www.msnbc.msn.com/id/26563848/

Cios, K., Pedrycz, W., Swiniarski, R., & Kurgan, L. (2007). Data Mining: A Knowledge
    Discovery Approach. New York: Springer.

Chhay, H. (2005). Data mining. Retrieved November 5, 2009, from
    http://cseserv.engr.scu.edu/StudentWebPages/hchhay/hchhay_FinalPaper.htm#DISADVANTAGES

Crisp-dm. (n.d.). Process Model. Retrieved November 5, 2009, from
    http://www.crisp-dm.org/Process/index.htm

Dass, R. (2006). Data Mining in Banking and Finance: A Note for Bankers.
    Retrieved November 5, 2009, from
    http://www.iimahd.ernet.in/publications/data/Note%20on%20Data%20Mining%20&%20BI%20in%20Banking%20Sector.pdf

Data Mining Software. (n.d.). A Brief History of Data Mining. Retrieved November 5,
    2009, from http://www.data-mining-software.com/data_mining_history.htm

Hunter, J. (2009). Data Mining Process using CRISP. Retrieved November 5, 2009, from
    http://www.youtube.com/watch?v=dJcmOe3_P0E

myfico. (2009). What’s in your FICO® score. Retrieved November 5, 2009, from
    http://www.myfico.com/CreditEducation/WhatsInYourScore.aspx

Palace, B. (1996). Data Mining: What is Data Mining?. Retrieved November 5, 2009,
    from http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm

Sumathi, S., & Sivanandam, S. (2006). Introduction to Data Mining and its Applications.
    New York: Springer.

Srinath, S. (2008). Data Mining and Knowledge Discovery. Retrieved November 5, 2009,
    from http://www.youtube.com/watch?v=m5c27rQtD2E

ABSTRAKT


Topic: Using Data Mining in Companies as a Tool for Discovering
Information

Key words: data mining, databases, cross industry standard process, credit scoring.

Student: Tomáš Vanek

Thesis advisor: Martina Česalová, M.S.C.S.

The bachelor thesis deals with the basic principle of data mining as a
tool for obtaining new information. The thesis consists of two main parts. The first
part is theoretical and explains how data mining works and what information can be
obtained with it. It also describes the possible advantages and disadvantages of
the tool. The practical part builds on the theoretical part and applies data mining
to the banking sector. It consists of a project for a fictitious bank that needs
customer segmentation when granting loans. The project consists of the six phases of
the CRISP-DM model, with the main emphasis on the business substance of the
proposed solution.


ABSTRACT

Topic: Using Data Mining as a Tool for Discovering Important Knowledge for

Companies

Key words: Data Mining, Databases, Cross Industry Standard Process, Credit

Scoring.

Student: Tomáš Vanek

Advisor: Martina Česalová, M.S.C.S.

The bachelor thesis covers the fundamental principles of data mining as a tool for knowledge
discovery. The thesis consists of two main parts. The first part is theoretical and
explains how data mining works and what kind of information it can reveal. Additionally, the
first part of the thesis mentions the advantages and disadvantages of the tool. The second,
practical part concentrates on applying data mining in the banking domain. A data
mining project was therefore created that deals with customer segmentation, which helps a
bank estimate a customer’s financial credibility. The project follows the six steps of the
CRISP-DM model, but the main focus is on the business aspect of the solution.
