Final Report Data Mining


  • 30/05/2014

    Applying Data Mining Techniques for an Insurance Company

    Group Project: Abidar Hamza

    Jad Al Adas

    Sandra Culman

    Lionel Kouchou

    Prepared for:

    Professor: Kilian Stoffel

    Assistant: Dong Han


    Acknowledgment

We would like to express our deepest appreciation to Professor Kilian Stoffel, as well as to Mr. Dong Han, who gave us the opportunity to carry out this interesting project on the topic Applying Data Mining Techniques for an Insurance Company. The project also required a great deal of research, which helped us accumulate new information.


Table of Contents

Introduction
Chapter 1: Business Background and Data Presentation
1. What is Insurance?
1.1 Area of study
2. Data source
3. From a big relational database to a normalized dataset
3.1 Creating a view for the policies
3.2 Querying the database
3.3 Description of the initial data
3.4 Key Attributes for analysis
Chapter 2: Data Preparation and Visualization
1. Introduction
2. Missing Values
3. Discretization
3.1 Discretization from Numerical to Nominal
4. Conversion
4.1 Nominal to numeric
5. Visualization
Chapter 3: Method processes and results interpretation
1. Business questions
2. Predictive methods and evaluation
2.1 One Rule
2.2 Naïve Bayes
2.3 Decision Tree
2.4 Logistic Regression
3. Descriptive methods and evaluation
3.1 Association Rules
3.2 Clustering
Conclusion
Glossary
Webography


Figure List

Figure 1: Policies Diagram
Figure 2: Claims Diagram
Figure 3: Policy view script
Figure 4: Policy view relation
Figure 5: Final query
Figure 6: Results in SQL
Figure 7: Attributes
Figure 8: Missing values table
Figure 9: Table with no missing values
Figure 10: Discretization process in Rapidminer
Figure 11: Visualization of Body after discretization
Figure 12: Discretization of attribute Body in two intervals
Figure 13: Weka pre-process main window
Figure 14: Weka visualization
Figure 15: Visualization of all attributes after correcting the discretized values
Figure 16: Histogram visualization of Dateoccurence
Figure 17: Discretization of Horsepower
Figure 18: Visualization after correcting split point of Horsepower
Figure 19: Scatter plot visualization of Horsepower
Figure 20: Weka visualization plot of Region
Figure 21: Single Rule process
Figure 22: Screenshot of Single Rule process
Figure 23: Single Rule confusion matrix
Figure 24: Car characteristics result using Single Rule
Figure 25: Car characteristics confusion matrix
Figure 26: Car characteristics confusion matrix
Figure 27: Confusion Matrix Single Rule
Figure 28: Naïve Bayes process
Figure 29: Naive Bayes distribution table 1
Figure 30: Naive Bayes confusion matrix
Figure 31: Lift chart of driver profiles prediction
Figure 32: Confusion matrix of car characteristics using Naive Bayes
Figure 33: Distribution table of car characteristics
Figure 34: Naive Bayes distribution table 2
Figure 35: Lift chart of car characteristics
Figure 36: Screenshot Weka decision tree
Figure 37: Text View Weka Decision Tree
Figure 38: Confusion matrix using decision tree
Figure 39: Screenshot of Weka tree
Figure 40: Confusion Matrix
Figure 41: Screenshot of Weka decision tree using driver profile attributes
Figure 42: Confusion matrix Driver attributes
Figure 43: Screenshot of decision tree in Rapidminer and Confusion Matrix (car attributes)
Figure 44: Screenshot of decision tree in Rapidminer and Confusion Matrix (driver attributes)
Figure 45: Logistic Regression table result
Figure 46: Association rules process
Figure 47: Association rules process 1
Figure 48: Association rules process 2
Figure 49: Clustering process with Rapidminer
Figure 50: Number of clusters
Figure 51: Centroid table
Figure 52: Screenshot of centroid plot view
Figure 53: Screenshot of cluster 0 folder view
Figure 54: Centroid table of second iteration
Figure 55: Screenshot of centroid plot - second iteration
Figure 56: Screenshot of cluster 3 - folder view
Figure 57: Performance Vector Davies-Bouldin
Figure 58: Davies-Bouldin table
Figure 59: Davies-Bouldin graph


    Introduction

    The accumulation of vast and growing amounts of data in different formats and different datasets

can be considered one of the biggest problems we are facing. The amount of information stored in

    insurance databases is rapidly increasing because of the rapid progress of information technology.

    The data that is gathered is useless without analyzing it. The patterns, associations, or relationships

    among this data can provide important information that helps companies improve their activities.

    The wealth of data can be considered a potential goldmine of business information. Finding the

    valuable information hidden in those databases and identifying appropriate models is a difficult

    task.

The above-mentioned problem can be solved with the help of Data Mining, a process of analyzing

    data from different perspectives and summarizing it into useful information. A typical data mining

    process includes data acquisition, data integration, data exploration, model building, and model

    validation.

Nowadays, insurance has become a compulsory need in people's lives, since people can no longer afford to bear the expenses of a loss or an accident. This need has fueled insurance companies to expand and grow; consequently, profits have increased, as has market share. Nevertheless, corporations are still exposed to great risk and some losses are inevitable, which is why they seek new approaches to better manage their risk.

    The paper is organized as follows. Chapter 1 provides an overview of the insurance area and the data

    source. Chapter 2 explains our whole process of data preparation. By analyzing the data that we obtain

    from this field, in chapter 3 we try to find usable information that can help the insurance company

better price its premiums.


    Chapter 1

    Business Background and Data

    Presentation

    What is insurance?

    Data source

Steps from a big relational database to a normalized dataset

    Description of the initial data


    1. What is Insurance?

Insurance is the fair transfer of the risk of a loss from one entity to another in exchange for

    payment. In other words, insurance equals peace of mind. An insurer, or insurance carrier, is a

    company selling the insurance. The insured, or policy holder, is the person or entity buying the

    insurance policy. The amount of money to be charged for a certain amount of insurance coverage

    is called the premium.

    The main aspects of insurance are:

    Underwriting (Policies) is when a customer buys coverage or a policy from the insurance

    company (Revenues to the company).

    A claim is when a customer undergoes a certain loss and declares it to the insurance

    company in order to receive the compensation agreed upon (Losses to the company).

    There are several types of insurance, for example Motor, Health, Fire, Allied Perils, Natural

    disasters, Marine, Personal Accident, Life, Property, Liability, Travel and many more.

    A key part of insurance is charging each customer the appropriate price for the risk they

    represent. Risk varies widely from customer to customer, and a deep understanding of different

    risk factors helps predict the likelihood and cost of insurance claims.

    1.1 Area of study

Insurance is quite a broad and rich topic; it offers a lot of potential for applying data mining methods. Yet, we will concentrate our research on one line of business: Motor (Automobile).

    Some interesting facts about motor accidents:

    There are more than 12 million motor vehicle accidents annually;

The typical driver will have a near-accident one or two times per month;

The typical driver will be in a collision of some type on average once every six years;

    Crashes are the leading cause of death for ages 3-33

It is also worth knowing that even a minor accident can result in thousands of dollars in damages. Accordingly, a question arises: what is the likelihood of a car accident occurring?

Many studies have been conducted over the years on this topic; specialists have narrowed down some important factors linked to an increased risk of accident, on which they base the pricing of their premiums. Some of these aspects are:


    Age and gender of the driver;

    Driving record;

    Type of vehicle;

    Geographical record;

    Period of the year;

Two factors have been chosen from the above to be interpreted in our project: the type of the vehicle and the driver's profile.

Therefore our goals will be:

To better predict motor insurance claim occurrence based on the characteristics of:

the driver's vehicle;

the driver.

To discover hidden patterns that may be useful for the insurance company.

    2. Data source

One of our team members used to work as a software developer for a software vendor specializing in insurance and reinsurance ERP systems. As a result, we had permission to acquire a database from one of the main insurance companies. For confidentiality purposes we prefer to keep the names undisclosed unless requested.

    The software vendor used a relational database built on the Microsoft SQL Server database

    management system. Parts of the diagrams are shown in the figures below:


    Policies Diagram:

    Figure 1: Policies Diagram


    Claims Diagram:

    Figure 2 : Claims Diagram

    The initial unfiltered database contains roughly 600 000 rows of policies and 90 000 rows of claims.

    Hence, we had to go through several steps in order to come up with one final dataset that would be

    useful for our project.


    3. From a big relational database to a normalized dataset

    3.1 Creating a view for the policies

First of all, creating a view for the policies was fundamental, as it facilitates writing queries against the database. The following figures show the view's script and its design, illustrating the tables used:

    Figure 3 : Policy view script

    Figure 4 : Policy view relation


    3.2 Querying the database

As a second step, we used the above view in a query to select the needed data. The query selected the attributes related to the characteristics of the car, the driver's profile, the date of occurrence of the claim, and a conditional attribute indicating whether a policy has a claim (depending on whether a Policy ID has one or more rows in the claims table). Furthermore, we filtered the policies by the issue year 2012 and the line of business Motor using the WHERE clause. The result was exported to a CSV file as our preliminary dataset.

    The following figure illustrates the final query in SQL:

    Figure 5 : Final query

    Result in SQL:

    Figure 6 : Results in SQL
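For readers without SQL Server access, the same labeling and filtering logic can be sketched in Python with pandas. This is a minimal, hypothetical reconstruction: the file names and column names (PolicyID, IssueYear, LineOfBusiness) are assumptions, not the vendor's actual schema.

```python
import pandas as pd

# Hypothetical exports of the policy view and the claims table.
policies = pd.read_csv("policies_view.csv")
claims = pd.read_csv("claims.csv")

# A policy has a claim if its Policy ID appears at least once in the
# claims table, mirroring the conditional attribute in the SQL query.
policies["HasClaim"] = policies["PolicyID"].isin(claims["PolicyID"])

# Mirror the WHERE clause: issue year 2012 and line of business "Motor".
subset = policies[(policies["IssueYear"] == 2012) &
                  (policies["LineOfBusiness"] == "Motor")]

subset.to_csv("preliminary_dataset.csv", index=False)
```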


    3.3 Description of the initial data

The transition from SQL to the CSV file resulted in a final total of 4731 instances and 16 attributes.

    Figure 7 : Attributes

The table was extracted from Rapidminer. It represents the metadata of the insurance data, which includes the number of examples (instances); the number, description and type of the attributes; some statistics about the attribute values; and missing values, if any.

    3.4 Key Attributes for analysis

HasClaim is the target attribute, or the Label, in our data. The type of this attribute is binominal, with 2 values: True and False. It describes whether the policy has had a claim in the past. Given this attribute, we can establish a relationship with the rest of the attributes for classification. The rest of the attributes are divided into two main groups, which will support our supervised learning hypotheses:

Driver's Profile

    Age: age of the driver, Type integer

    Gender: Male or Female, Type binominal

    Marital Status: Married or Single, Type binominal

    Has Children: True, False, Type binominal

Region: Urban, Town, Suburban, Type polynominal


The driver's profile is seen as important to analyze by insurance companies, as it portrays the degree of responsibility: the presence of children, the manner of driving, abidance by the rules (youngsters tend to break them more often and go over the speed limit), and the residential area with its high or low traffic.

    Vehicle Specifications

    Make: manufacturer (BMW, Fiat), Type polynominal

    Model: subcategory (BMW X5, 320), Type polynominal

    Year Built: year of manufacturing, Type integer

    Category: type of usage (Taxi, private, rent a car) Type polynominal

Body: the size, Type polynominal

    Horsepower: speed or power, Type integer

Another aspect to look at when pricing premiums is obviously the car itself. Some makes are considered safer on the roads and more robust; new cars tend to be more stable than old cars; high-speed cars are more prone to accidents.

After the description of the initial data, we come to the step of further cleaning and preparation, so that the data is all set to be used in data mining techniques with Rapidminer.


    Chapter 2

Data Preparation and Visualization

    Introduction

    Missing values

    Discretization

    Conversion

    Visualization


    1. Introduction

In order to use data mining techniques or make predictions, data preparation is required; business professionals generally agree that it is one of the most important parts of any such project, and one of the most time-consuming and difficult.

We already covered the integration, transformation and reduction parts in the first chapter (Data source) by normalizing data from a relational database into one CSV dataset. We will carry out the rest of the steps in Rapidminer.

    2. Missing Values

    A missing value can signify a number of different things. Perhaps the field was not applicable, the

    event did not happen, or the data was not available. It could be that the person who entered the data

    did not know the right value, or did not care if a field was not filled in.

    However, there are many data mining scenarios in which missing values provide important

    information. The meaning of the missing values depends largely on context. Since our data comes

    from an insurance company, the quality for them is not an option but its a vital thing in their

    everyday operations and they perform data cleaning very often. Therefore, we didnt find so many

    inconsistent or missing values. Yet, we could discover some attributes as shown in the metadata of

    Rapidminer:

    Figure 8 : Missing values table


    The Metadata indicated that we have 4 attributes with missing values:

    Model: 14 missing values

    Yearbuilt: 3 missing values

    Horsepower: 3 missing values

Body: 1 missing value

To deal with these missing values, Rapidminer provides an operator called Replace Missing Values. We will use it and tune its parameters for each attribute to get the best replacement.

    Model and Body

    The best method for a polynominal attribute is to assign the maximum value (most frequent value)

    to the missing values.

    Yearbuilt and Horsepower

Since they are integers, replacing by the average is the best option; in any case, 3 values will not make much of a difference. After fixing the missing values, the metadata looks as follows:

    Figure 9 : Table with no missing values
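The same replacement strategy can be sketched in a few lines of pandas; this is a minimal sketch, assuming the CSV export described earlier and the attribute names from the metadata table:

```python
import pandas as pd

df = pd.read_csv("preliminary_dataset.csv")

# Polynominal attributes: fill missing values with the most frequent value
# (the mode), as Replace Missing Values does in "maximum" mode.
for col in ["Model", "Body"]:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# Integer attributes: fill missing values with the average.
for col in ["Yearbuilt", "Horsepower"]:
    df[col] = df[col].fillna(df[col].mean())
```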


    3. Discretization

Discretization involves partitioning numerical values into intervals by placing breakpoints; placing them carelessly can result in unbalanced data. In our project we worked to visualize the discretization results, making sure to keep the intervals as balanced as possible. In this section we describe the steps of discretization and the visualization results using Rapidminer and Weka.

    3.1 Discretization from Numerical To Nominal

The insurance data has many attributes which need discretization in order to provide a better understanding and a good interpretation when using different data mining methods. Before starting discretization, we checked whether our data was well balanced and then corrected the imbalance of some attribute values. Figure 10 describes the discretization process in Rapidminer.

Figure 10 : Discretization process in Rapidminer

The Discretize by User Specification operator has limited capabilities and supports only numerical attributes. The sensitive point in the discretization process is the cut point, called the breakpoint. After sorting each attribute in ascending order, we chose the cut points and then split and merged. After these three important steps we evaluated the values and made some modifications where the intervals were unbalanced.


    The numerical attributes selected for discretization are:

    Sum insured:

    o High-risk

    o Medium Risk

    o Low Risk

    Total Premium:

    o High

    o Medium

    o Low

    Year Built:

    o Very old Cars

    o Old Cars

    o Recent Cars

    o New Cars

    Horsepower:

    o Fast

    o Medium

    o Slow

    Age:

    o Young

    o Adult

    o Old

There are also other attributes which are important to use even though they are not numerical. For example, the attribute Body, which describes the car size, is nominal and contains different types of car sizes. We decided to discretize this attribute using nominal-to-numerical conversion followed by discretization by user specification.

Based on the metadata table, we set the range values for each attribute and corrected these values until we obtained the right balance.

    Body:

    o Small cars

    o Medium Cars

    o Big Cars

    It is important to set the right cut-point for a better discretization!
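As an illustration of how such breakpoints are placed and checked, here is a small pandas sketch for the Age attribute. The bin edges are hypothetical, not the exact cut points used in the project; the point is to inspect the interval counts and adjust the edges until they are balanced:

```python
import pandas as pd

df = pd.read_csv("preliminary_dataset.csv")

# Hypothetical breakpoints for Age; adjust until the intervals are balanced.
df["AgeGroup"] = pd.cut(df["Age"], bins=[0, 25, 55, 120],
                        labels=["Young", "Adult", "Old"])

# Check the balance between the intervals before accepting the cut points.
print(df["AgeGroup"].value_counts())
```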

    Figure 11 : Visualization of Body after discretization

Figure 11 shows that our discretization needs to be better balanced between the three groups. Small cars have the largest count, and while using this attribute for learning models we will always have a bigger probability of obtaining small cars than medium or big cars. To remedy this problem we discretized again and corrected the breakpoints between the groups. Figure 12 shows the corrected discretization of the attribute Body.


Figure 12 : Discretization of attribute Body in two intervals

We reduced the number of groups to two: standard car and big car. Because the counts for medium car and big car were close to each other, we chose to regroup them into one significant group with a large number of values. After finishing the discretization we added the operator Write CSV to generate a new file with the discretized data. This lets us use the file in Weka; in fact it reduces the number of operators used in the main process when we apply the predictive and descriptive methods (Figure 10).


    Preprocessing in WEKA

First, we ran Weka and launched the Explorer window, then selected the Preprocess tab in order to see the attribute names, the percentage of missing values and the balance of the values (Figure 13).

    Figure 13 : Weka pre-process main window

Figure 14 shows all attributes after the first discretization; some values cannot be displayed due to the large number of distinct values (Policy No, Make, Date Occurrence, etc.). These attributes were not discretized into intervals, so that we could still tell which car model implies the occurrence of a claim.


    Figure 14 : Weka visualization

Figure 15 shows the improvement we brought to the following attributes: Body, Horsepower, Marital status and Gender, with the intention of balancing the intervals as much as possible.

    Figure 15 : Visualization of all attributes after correcting the discretized values


    4. Conversion

Converting data from one format to another through different conversion operators solves the problem of operators that do not support certain attribute types. In our project we used conversion not only to solve this kind of problem but also to facilitate discretization.

    4.1 Nominal to numeric

Our insurance data has many numerical attributes, so it was not difficult to use discretization and convert numerical attributes to nominal ones. But for some attributes like Body, which is polynominal and has many similar values, we needed to split the values into groups that can be differentiated. To be more efficient, we sorted the Body values, converted them into numbers using Nominal to Numerical (assigning each value a unique integer), and finally discretized them into intervals (standard car, big car).
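A minimal sketch of this two-step conversion, with a hypothetical split point (the real mapping depends on the sorted Body values in the data):

```python
import pandas as pd

df = pd.read_csv("preliminary_dataset.csv")

# Step 1: assign each sorted Body value a unique integer,
# as RapidMiner's Nominal to Numerical operator does.
codes = {v: i for i, v in enumerate(sorted(df["Body"].dropna().unique()))}
df["BodyCode"] = df["Body"].map(codes)

# Step 2: discretize the integer codes into the two final intervals.
# The split point is hypothetical and is tuned by inspecting the distribution.
split = len(codes) // 2
df["BodyGroup"] = df["BodyCode"].apply(
    lambda c: "standard car" if c < split else "big car")
```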

    5. Visualization

    Data visualization is the process by which textual or numerical data are converted into meaningful

images. The reason data visualization can help in data mining is that the human brain is very effective at recognizing patterns in graphical representations. This approach allowed us, at each step of our project, to understand graphically what our data looks like. In this chapter we describe and analyze the attributes selected for visualization and their impact on the insurance business. Among the great variety of visualization techniques offered by Rapidminer, we list below the ones we used in our project:

    Histogram

We used the histogram to show the Dateoccurence values. The value Dateoccurence = Null is the most frequent one for the company. It is important for us to learn and understand when claims occurred.


    Figure 16 : Histogram visualization of Dateoccurence

    Pie

Visualizing the attribute Horsepower after discretization, we see that the data is unbalanced between fast, medium and slow; this means the cut points between the three intervals are not correct.

    Figure 17 : Discretization of Horsepower


After correcting the cut point, we obtained the following visualization:

    Figure 18 : Visualization after correcting split point of Horsepower

    Scatter plot

It is useful to visualize some attributes with regard to our label. Figure 19 shows Horsepower and Yearbuilt (Y-axis) against HasClaim (X-axis). Horsepower is represented by different colors. Analyzing this scatter plot, we see that old cars with medium horsepower (a top speed not greater than 120 km/h) have more claims, in contrast to other types of cars, such as new cars, which have few claims.

    Figure 19 : Scatter plot visualization of Horsepower


We also worked with Weka because of the good visualizations it provides. In our report we did not try to cover all the visualization types; we wanted to find the visualizations that helped us analyze the relationships between the attributes. Figure 20 illustrates the attribute Region together with the label attribute. We can deduce that the sensitive region is Town.

    Figure 20 : Weka visualization plot of Region


    Chapter 3

    Method processes and results

    interpretation

    Business questions

Predictive methods and evaluation

Descriptive methods and evaluation


    1. Business questions

The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Thinking about the components of the input, besides the instances and the attributes, the thing that we can learn is the concept. Throughout this chapter we put into practice some of the main data mining techniques and algorithms and their potential applications in the insurance industry:

Which driver profiles are most likely to have an accident?

Which car characteristics have a high impact on HasClaim?

How can we segment automobile drivers?

By answering these business questions we can help the insurance firm make crucial business decisions and turn the newfound knowledge into actionable results.

    2. Predictive methods and evaluation

Learning a method for predicting an instance's class from pre-labeled instances is called classification. This is a supervised learning method and is used to analyze how important the attributes are in determining the value of the target attribute. We decided to use four classifiers to predict the value of our label: One Rule, Naïve Bayes, Decision Tree and Logistic Regression. For a better understanding of our results we proceeded using both Rapidminer and Weka, programs that contain a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to their functionality.

    2.1 One Rule

The Single Rule Induction (One Rule) operator is one of the simplest classification methods. It tests each attribute in turn and finds the one whose single-attribute rule makes the fewest errors. The result can be interpreted as the attribute that has the most influence on the target class.
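The idea behind One Rule is simple enough to sketch in a few lines. This is a toy re-implementation for illustration, not RapidMiner's operator: for every attribute it predicts the majority class per attribute value, counts the errors, and keeps the attribute with the fewest.

```python
import pandas as pd

def one_rule(df: pd.DataFrame, label: str):
    """Return (best attribute, its rules, its error count)."""
    best_attr, best_rules, best_errors = None, None, len(df) + 1
    for attr in df.columns.drop(label):
        # For each value of the attribute, predict the majority class.
        rules = df.groupby(attr)[label].agg(lambda s: s.mode().iloc[0])
        errors = (df[attr].map(rules) != df[label]).sum()
        if errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules.to_dict(), errors
    return best_attr, best_rules, best_errors
```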


    Figure 21 : Single Rule process

The first attributes that we tested with the One Rule operator were the car characteristics: Body, Category, Horsepower, MAKE, Model and Yearbuilt. Among these attributes, Model is the one that gives us the smallest error rate (3442 out of 4731 correct instances). The insurance company must take into account the models that are most predisposed to accidents. For example, a pick-up will have a bigger probability of having an accident than an Accord. Knowing the model automatically implies the brand, while the reverse does not necessarily hold.

    Figure 22 : Screenshot of Single Rule process

The confusion matrix displays how many predicted values matched the actual values when cross-validation tests were performed. Among the records predicted as FALSE, 464 predictions were correct and 318 were incorrect. Among the records predicted as TRUE, 356 were correct and 281 were incorrect. The confusion matrix shows that for the prediction of FALSE the model accuracy is 59.34%, and for the prediction of TRUE it is 55.89%. The overall model accuracy is simply the percentage of good predictions among all predictions, that is:

accuracy = (T/T + F/F) / (T/T + T/F + F/T + F/F)

where T/T is the number of situations where the model predicts TRUE and the result is TRUE, T/F is the number of situations where the model predicts TRUE and the result is FALSE, and so on. Here this gives (356 + 464) / (464 + 318 + 356 + 281) = 820 / 1419; after performing all computations in Rapidminer, the total accuracy of the model is 57.79%.

    Figure 23 : Single Rule confusion matrix

The prediction of HasClaim that matters most is the TRUE value, which means this value plays an important role in analyzing the performance of the operator. The poor level of accuracy provided by this operator will not help us improve the prediction of HasClaim with regard to the car characteristics.

Obtaining the model of the car as the first rule of the algorithm made us wonder whether this is helpful or not. Because of the high diversity of the attributes Model and Make, we decided to test the operator without them. The results below give us a better understanding of which attribute an insurance company should take into consideration when making an insurance offer. In conclusion, YearBuilt provides us with 4 simple rules.

    Figure 24 : Car characteristics result using Single Rule


After we made the changes, the accuracy of this operator with the new set of attributes still has not changed.

    Figure 25 : Car characteristics confusion matrix

Next, we tested One Rule on the driver characteristics: Age, Gender, HasChildren, Marital Status and Region. The model selected HasChildren as the attribute with the lowest number of errors (2476 out of 4731 correct instances). The program defined the two rules below: if HasChildren = False then False, and if HasChildren = True then True, meaning that clients with children are prone to having an accident, while clients without children are not.

    Figure 26: Car characteristics confusion matrix

Evaluating this operator with the new attributes gives a low accuracy of 53.14%. Such a low accuracy means that this algorithm is not reliable enough for the company to decide whether the attribute HasChildren is really a primary component in making a decision.

    Figure 27: Confusion Matrix Single Rule


2.2 Naïve Bayes

    Driver Profiles prediction

Based on Bayes' theorem, this classifier applies a simple probabilistic assumption: the attributes are assumed to be independent of each other, while all of them are related to our label attribute HasClaim.
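Formally, under this conditional-independence assumption the classifier scores each class c of HasClaim as the prior times the product of the per-attribute likelihoods, and predicts the class with the highest score:

```latex
P(\mathrm{HasClaim}=c \mid x_1,\dots,x_n) \;\propto\; P(c)\,\prod_{i=1}^{n} P(x_i \mid c)
```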

Figure 28: Naïve Bayes process

Figure 28 illustrates the Naïve Bayes process. We selected the attributes related to the driver profile in order to calculate the likelihood of each one for predicting HasClaim = True. As a result of this process we obtained a distribution table that marks out the probability of each attribute value; we then focused on the target class TRUE and compared the values in order to single out the attributes with a high probability. According to the table, the attributes selected are:

Age = Adult and Gender = Male and Marital status = Married and HasChildren = TRUE


    Figure 29: Naive Bayes distribution table 1

To evaluate the performance of the Naive Bayes classifier, the confusion matrix allows a more detailed analysis than the mere proportion of correct guesses. It displays how many predicted values matched the actual values when cross-validation tests were performed (by the cross-validation operator). For example, among the records predicted with the HasClaim class True, 249 predictions were correct and 194 were incorrect. The confusion matrix shows that for the prediction of HasClaim = TRUE, the model accuracy is 55.11%.

    Figure 30: Naive Bayes confusion matrix

    Lift chart

A lift chart graphically represents the improvement that a mining model provides compared against a random guess, and measures the change in terms of a lift score. By comparing the lift scores for various portions of our data set and for different models, we can determine which model is best, and which percentage of the cases in the data set would benefit from applying the model's predictions.

    Figure 31: Lift chart of driver profiles prediction

The lift chart helps quantify the relationship between the confidence (using a threshold) and the True prediction of HasClaim by showing the increase in the number of driver claims. As we can see in more detail by analyzing this chart, with a confidence greater than 0.5 the number of drivers predicted to have a claim decreases; for instance, if we take a confidence range of [0.49-0.51], 1944 drivers must be targeted in order to find 1025 drivers among them who have a claim. The good thing about our prediction results is that they give values of more than 90% when the confidence and the driver target are specified.

    Car characteristics prediction

We tried to predict the same target class using different factors: first we chose the driver profile, and now we use the car characteristics. Our main goal is to find rules with high accuracy. After selecting the car characteristic attributes and training and testing the model, we obtained the confusion matrix in Figure 32.

    Figure 32: Confusion matrix of car characteristics using Naive Bayes

The accuracy of our model using the car attributes is 61.54%. This result is acceptable and allows us to focus on which types of cars have a high impact on HasClaim. To extract more details from this model, we use the distribution table for True in order to select the attributes with a high probability:

    Figure 33: Distribution table of car characteristics

From the table in Figure 33, the car characteristic attributes which have a high probability and can be selected for prediction in this model are:

o Body = Standard cars and Yearbuilt = very old cars

We split off the table presenting the car brand and model because of the large number of values; we sorted these values and selected the car brands with a high probability:


    Figure 34: Naive Bayes distribution table 2

Mercedes, Nissan and Hyundai are the three main brands with a high impact on HasClaim.

    Figure 35: Lift chart of car characteristics

According to the lift chart used to quantify the prediction of the cars with the highest claims, there are no prediction results with perfect information; the confidence measure and the number of targeted drivers both contribute to the prediction. If we take, for instance, a confidence value between [0.4-0.5], 529 drivers targeted out of 1092 give a prediction of 98% of True HasClaim; but if we compare this to a confidence of 1, with 205 out of 211, we get only 40% of true claims. From these results we confirmed that our model for predicting True HasClaim is acceptable, and the results can contribute strongly to future predictions and help the insurance company make new decisions based on them.

    2.3 Decision Tree

This operator generates a decision tree for classification of both nominal and numerical data. This classification model is easy to interpret and predicts the value of our label HasClaim based on the attributes we have chosen. In this section we analyze both groups of attributes (car / driver) with two operators: Decision Tree in Rapidminer and J48 in Weka.
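For readers who want to reproduce this train-and-evaluate cycle outside the GUI tools, here is a rough scikit-learn sketch. Note the caveats: scikit-learn's tree is CART rather than Weka's C4.5 (J48), and the file name and encoding choices are assumptions based on our dataset description, not the exact setup we used.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("discretized_dataset.csv")  # hypothetical file name

car_attrs = ["Body", "Category", "Horsepower", "MAKE", "Model", "Yearbuilt"]
X = OrdinalEncoder().fit_transform(df[car_attrs].astype(str))
y = df["HasClaim"]

# CART, not C4.5/J48, but the 10-fold cross-validation cycle is the same.
tree = DecisionTreeClassifier(min_samples_leaf=20)
print(cross_val_score(tree, X, y, cv=10).mean())
```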

    Weka

The decision tree below resulted from running the program with the car attributes. Unfortunately, the tree is very big, since the attributes have many possible outcomes: it has a size of 835 with 801 leaves. It seems difficult to interpret, but with the help of the text view we managed to provide an interpretation.

    Figure 36: Screenshot Weka decision tree

J48 considers Category the root attribute. Prive cars are analyzed depending on YearBuilt, MAKE and Horsepower. If the car is very old, the algorithm checks the MAKE of the car before deciding between true and false. If the car is prive and categorized as a new car, it is automatically assigned a TRUE HasClaim prediction. If the car is categorized as an old car, the model first goes through the MAKE of the car and, depending on this, proceeds to analyze the Horsepower.

The generated result shows that the cars most likely to be involved in an accident are: prive, cargo, rent and taxis. The remaining categories are well spread and own only a small part of the total instances; accordingly, they have a remote probability of being involved in an accident. The most exposed makes are: Mercedes, Nissan, BMW, Audi, Toyota and DAIHATSU. We question whether these results really provide a good overview of car insurance. The insurance company should analyze and focus on the four big car groups.


    Figure 37: Text View Weka Decision Tree

Evaluating the Weka J48 tree, we obtained 67.09% accuracy. The prediction of the TRUE class, which is important for our analysis, has an even higher accuracy of 71.23%.

    Figure 38: Confusion matrix using decision tree

Because of the complexity of the Weka tree, we decided to take the attributes with a large number of distinct values out of the model (Model, Make, and Category). We obtained a tree that is easier to interpret but provides a lower accuracy.

    Figure 39: Screenshot of Weka tree


    Figure 40: Confusion Matrix

Using the driver attributes we obtained the tree below: a tree of size 16 with 10 leaves. The algorithm sets HasChildren as the root attribute and analyzes the data with regard to it. For clients who don't have children and are married, the outcome depends on their age: adult - true, old - true, young - false. If the clients don't have children and are single, the model decides false. On the opposite side, a client who has children and is single has a bigger probability of an accident occurring.

    Figure 41: Screenshot of Weka decision tree using driver profile attributes

The evaluation of this operator using the driver attributes gives a low accuracy of 56%, meaning that the interpreted results are not too certain. We tried changing the parameters of the operator, but we could not obtain better results.

    Figure 42: Confusion matrix Driver attributes

    Decision Tree Rapidminer

Running the second operator, this time in Rapidminer, we obtained the results below. The algorithm produces a pruned tree of medium complexity. The root attribute in the first model is YearBuilt and in the second model HasChildren. As a small conclusion after examining the results, the attributes YearBuilt and HasChildren have the following interpretation: new car - true, old car - false, very old car - false, recent car - true. If the client has children, whether he has a claim depends on the age and the region.

Figure 43: Screenshot of decision tree in Rapidminer and Confusion Matrix (car attributes)


We evaluated both algorithms and obtained accuracies of 57% and 53%. Compared to Weka, this operator produces a less accurate result. In other words, the insurance company can get a better prediction of the label class using the J48 tree.

Figure 44: Screenshot of decision tree in Rapidminer and Confusion Matrix (driver attributes)

    2.4 Logistic Regression [2]

We start the modeling process by learning the relationship between claim frequency and the underlying risk factors, including age, gender, marital status, region, and HasChildren. Based on these attributes, which describe the driver profile, we use logistic regression to quantify the claim frequency and the effect of each risk factor, and also to estimate the probability of a claim.
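For reference, the logistic model expresses the claim probability as a function of the risk factors x_1, ..., x_k, so each coefficient quantifies the effect of one factor on the odds of a claim:

```latex
P(\mathrm{HasClaim}=\mathrm{True} \mid x) \;=\; \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
```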

    Figure 45: Logistic Regression table result


We computed the logistic regression using the same attributes selected from the decision tree. Based on the results in the table, we can deduce that age has a high impact on HasClaim; by segmenting drivers by age, the insurance company can make more profit by increasing the premium amount as drivers get older. This is one of many reasons why we applied a predictive model: to learn the best rules for the future.

    3. Descriptive methods and evaluation

We proceeded to find the hidden structure in the available data (a process called unsupervised learning) by using two methods: clustering and association rules.

    3.1 Association Rules

Association rules differ from classification rules in that they can predict any attribute, not just the target one, and consequently any combination of attributes. Different association rules express different regularities that underlie the dataset, and they generally predict different things. In order to implement association rules for our dataset, the following five operators are needed:

    Figure 46: Association rules process

Read CSV, with which we import the dataset (no label should be used);

Select Attributes, where we select the attributes for the process. We decided to take out two attributes that have no impact on our analysis: Policy Number and Date Occurrence;

Nominal to Binominal, which changes the type of the selected nominal attributes to binominal. It also maps all values of these attributes to binominal values;


FP-Growth - the FP stands for Frequent Pattern. Frequent pattern analysis is used for many kinds of data mining, and it is a necessary component of association rule mining. Without the frequencies of attribute combinations, we cannot determine whether any of the patterns in the data occur often enough to be considered rules. One important parameter of this operator is Min Support: the occurrence rate of a rule (the number of times the rule occurs divided by the number of observations in the data set);

Create Association Rules - this operator uses the frequent pattern data and seeks any patterns that occur frequently enough to be considered rules. It generates both a set of rules (through the rul port) and a set of associated items (through the ite port). In this model we are interested only in generating rules, so we simply connect the rul port to the res port of the process window. One influential parameter of this operator is Min Confidence. Confidence is a measure of the likelihood that an attribute and its associated attribute are both flagged as true; it is computed as the ratio between the number of times a certain rule occurs and the number of times it could have occurred. Both measures are written out after this list.
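For a rule A => B over N observations, the two measures driving the process are:

```latex
\mathrm{support}(A \Rightarrow B) = \frac{\mathrm{count}(A \wedge B)}{N},
\qquad
\mathrm{confidence}(A \Rightarrow B) = \frac{\mathrm{count}(A \wedge B)}{\mathrm{count}(A)}
```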

First, we decided to test the process using a confidence of 0.8 and a Min Support of 0.1. Surprisingly, the 352 generated rules had a very low support. In our opinion these rules are not reliable because of the low frequency of their attribute combinations.

    Figure 47: Association rules process 1

We considered it important to increase the Min Support in order to obtain rules with a high frequency of attribute combinations. The second test used the same confidence but a minimum support of 0.6. Five rules were generated, with a support between 0.359 and 0.435 and a confidence between 0.821 and 0.951.


    Figure 48: Association rules process 2

Analyzing these rules, we can conclude the following:

A car with a medium-risk SumInsured is most probably a prive car;

A prive car with a medium-risk SumInsured has a high probability of being a standard car;

Most of the time a standard car will be a prive car;

If the car is a standard one and the driver is a male, we can conclude that the car is a prive one;

If we have a standard car with a medium-risk SumInsured, the car is most probably a prive car.

    3.2 Clustering

Clustering is one of the unsupervised techniques we deploy on the data in order to partition it, reveal sub-classes and discover their natural grouping. The k-means algorithm is one of the operators Rapidminer offers to divide the data. We preferred this particular algorithm as it is easy to understand and simple to interpret. Still, a challenge we faced was picking the number of clusters in advance with the aim of acquiring a robust set of clusters. The performance of a clustering algorithm may be affected by the chosen value of K; therefore, instead of using a single predefined K, a set of values is adopted in order to find a satisfactory clustering result. The validity of the outcome is assessed using the Cluster Distance Performance evaluation provided by Rapidminer. This operator relies on two main criteria to evaluate the performance:

    - avg._within_centroid_distance: the average within-cluster distance, calculated by
      averaging the distance between the centroid and all examples of a cluster.
    - davies_bouldin: algorithms that produce clusters with low intra-cluster distances
      (high intra-cluster similarity) and high inter-cluster distances (low inter-cluster
      similarity) will have a low Davies-Bouldin index; the clustering that yields the
      collection of clusters with the smallest Davies-Bouldin index is considered the
      best according to this criterion.
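
    As a sketch of how both criteria could be computed outside RapidMiner (hypothetical
    data; note that RapidMiner reports the Davies-Bouldin index negated, because its
    performance values are always maximized):

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.metrics import davies_bouldin_score

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 5))    # stand-in for the encoded policy data

        km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

        # avg._within_centroid_distance: mean distance of every example
        # to the centroid of its own cluster
        within = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1).mean()

        # davies_bouldin: lower is better (scikit-learn reports it positive)
        db = davies_bouldin_score(X, km.labels_)

        print(f"avg. within-centroid distance: {within:.3f}, Davies-Bouldin: {db:.3f}")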

    Since we don't have a target attribute in clustering, most of the attributes will be
    used in the process, except:

    - PolicyNo, DateOccurence: irrelevant for grouping;
    - SumInsured, TotalPremium: the outcome of this study is to help price the premium
      and the sum insured, so it would be illogical to include them in the clusters;
    - Make, Model, Category: they contain too many distinct values and would require a
      very large number of clusters, so we omit them to obtain more realistic and useful
      clusters.

    In this way we will be able to learn the correlations and similarities between the
    driver's profile, the car he or she drives and the occurrence of claims. Furthermore,
    all polynominal and binominal attributes will be converted to numerical ones, since
    k-means works with distances and accepts only numerical attributes.
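
    A minimal sketch of this conversion with pandas (the column names are hypothetical
    stand-ins for our attributes):

        import pandas as pd

        # Hypothetical slice of the prepared data set
        policies = pd.DataFrame({
            "Gender":        ["male", "female", "male"],
            "MaritalStatus": ["single", "married", "single"],
            "HasClaim":      [False, True, False],
        })

        # One-hot encode every polynominal/binominal attribute so that k-means
        # can measure Euclidean distances on purely numeric columns
        numeric = pd.get_dummies(policies.astype(str), dtype=float)
        print(numeric.head())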

    The process is illustrated in the figure below:

    Figure 49: Clustering process with RapidMiner

    First Iteration

    K = 3, Max Runs = 10

    This first step will serve as the reference point for the outcome of the evaluation
    and the consistency of the clusters. The results are illustrated in the figures below:


    Text View

    Figure 50: Number of clusters

    Centroid Table

    Figure 51: Centroid table


    Centroid Plot View

    Figure 52: Screenshot of the centroid plot view

    Each colored line represents a cluster, with peaks on the attributes that have a
    strong relationship to the cluster.

    Analysis of Results

    Attributes are considered a strong match within a cluster if they have a high
    average. For example, if we look at cluster 0 in the centroid table and sort the
    averages from highest to lowest, the strongest element is Marital Status = single =
    1. The strong, correlated elements are: HasChildren = false = 0.747, Body = standard
    car = 0.634, Gender = male = 0.626 and HasClaim = false = 0.512. The remaining
    elements are considered weak or uncorrelated, such as Marital Status = 0. The folder
    view of cluster 0 gives a clearer picture:


    Figure 53: Screenshot of cluster 0 folder view

    Cluster 0 includes the items in the red square with a value of 1. From the insurance
    business perspective, we can conclude that an adult single female with no children,
    living in an urban area and driving a big, very old, fast car has no claims.
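
    The sorting of centroid averages described above can also be done programmatically;
    a sketch assuming the centroid table has been exported, with the attribute names and
    values taken from cluster 0:

        import pandas as pd

        # Centroid of cluster 0: attribute -> average value within the cluster
        centroid_0 = pd.Series({
            "MaritalStatus_single": 1.000,
            "HasChildren_false":    0.747,
            "Body_standard":        0.634,
            "Gender_male":          0.626,
            "HasClaim_false":       0.512,
        })

        # Sorting from highest to lowest surfaces the attributes that
        # characterize the cluster most strongly
        print(centroid_0.sort_values(ascending=False))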

    Analysis of Performance

    Performance Vector

    As a first iteration, the performance vector does not tell us much about the quality
    of the clusters; however, it is taken as a reference point for the following
    iterations to validate the relationship between the number of clusters and the
    Davies-Bouldin average.


    Second Iteration

    K = 6, Max Runs = 10

    Here we double the value of K, observe the resulting clusters and check the
    performance. The results are illustrated in the figures below:

    Text View

    Centroid Table

    Figure 54: Centroid table of second iteration


    Centroid Plot

    Figure 55: Screenshot of centroid plot - second iteration

    Analysis of Results

    Now that we have increased the number of clusters, fewer attributes belong to each
    one and the conclusions become clearer. We will consider cluster 3 for this example
    and again sort its averages from highest to lowest. As a first observation, the
    number of very strong attributes belonging to the cluster has increased dramatically:
    Body = standard car, HasChildren = true and HasClaim = false all equal 1. The strong
    attributes have higher averages as well, for example Marital Status = married = 0.883
    and Gender = male = 0.650.

    Again, we will have a look at the folder view with the aim of better understanding
    the cluster.


    Figure 56: Screenshot of cluster 3 - folder view

    Cluster 3 includes the items in the red square with a value of 1. From the insurance
    business perspective, we can conclude that an adult married male with children,
    living in a suburban area and driving a standard, fast car has no claims.

    Analysis of Performance

    Figure 57: Performance vector - Davies-Bouldin


    In the first iteration the Davies-Bouldin value was -2.711. After increasing K to 6
    it improved to -2.657. This indicates that K = 6 suits this data better than K = 3.
    However, a question arises: does increasing K always produce better clusters?
    Technically the index may keep improving, but the relation between increasing K and
    improving Davies-Bouldin is not a direct one. Therefore, we stop increasing K once we
    notice that the improvement is becoming almost flat.

    Davies-Bouldin chart and determining the ideal number of clusters

    A number of iterations were performed with different values of K to monitor the variation of

    performance.

    Figure 58: Davies-Bouldin values per K

    K     Davies-Bouldin
    3     -2.711
    6     -2.657
    12    -2.461
    24    -2.313
    48    -2.304
    96    -2.297

    Figure 59: Davies-Bouldin graph (the values above plotted against K)

    We notice that the curve becomes flat starting at K = 24. Thus, we can conclude that
    an optimal number of clusters lies in the range between 12 and 24.
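
    This sweep can be sketched outside RapidMiner as well, reusing the encoded data from
    the sketches above (again with scikit-learn's positive sign convention for the
    index):

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.metrics import davies_bouldin_score

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 5))    # stand-in for the encoded policy data

        # Same K values as in our iterations; n_init plays the role of Max Runs
        for k in (3, 6, 12, 24, 48, 96):
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
            print(k, round(davies_bouldin_score(X, labels), 3))

        # We keep the K where the curve flattens out (between 12 and 24 in
        # our data) rather than the largest K tested.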


    Conclusion

    This project was a typical application of data mining techniques to a real-life
    situation. We went through the different phases, from data preparation to
    interpretation and drawing conclusions. We also noticed that not all techniques fit
    the data perfectly; a thorough knowledge of the business is therefore essential in
    order to plan which techniques to use.

    It is worth mentioning that result evaluation is sometimes subjective: the company in
    need of this information would be the only judge of the accuracy of certain outcomes.
    For instance, we discovered several correlations between the driver, the car and the
    occurrence of an accident. It is now up to the firm to decide whether to increase or
    decrease premiums for certain groups of people and cars.


    Glossary

    A

    Accuracy: A measure of a predictive model that reflects the proportionate number of times that

    the model is correct when applied to data.

    B

    Binning: The process of breaking up continuous values into bins, usually done as a
    preprocessing step for some data mining algorithms; for example, breaking age up into
    bins of ten years each.

    C

    Claims: Claims and loss handling is the materialized utility of insurance; it is the
    actual "product" paid for. Claims may be filed by the insured directly with the
    insurer or through brokers or agents. The insurer may require that the claim be filed
    on its own proprietary forms, or may accept claims on a standard industry form.

    Cross Validation (and Test Set Validation): The process of holding aside some
    training data which is not used to build a predictive model, and later using that
    data to estimate the accuracy of the model on unseen data, simulating the real-world
    deployment of the model.

    Comma-separated values (CSV): A common text-based format for data where the divisions
    between attributes (columns of data) are indicated by commas.

    Confidence level: A value, usually 5% or 0.05, used to test for statistical
    significance in some data mining methods. If statistical significance is found, a
    data miner can say that there is a 95% likelihood that a calculated or predicted
    value is not a false positive.

    D

    Decision Tree: A class of data mining and statistical methods that form tree-like
    predictive models.

    Data analysis: The process of examining data in a repeatable and structured way in
    order to extract meaningful patterns or messages from a set of data.


    L

    Label: In RapidMiner, the role that must be set in order to use an attribute as the
    dependent, or target, attribute in a predictive model.

    M

    Missing Data: Instances in an observation where one or more attributes do not have a
    value. This is not the same as zero, because zero is a value.

    P

    Prediction: The target, label or dependent attribute that is generated by a
    predictive model, usually for a scoring data set.

    T

    Training Data: In a predictive model, the data set that already has the label, or
    dependent variable, defined, so that it can be used to create a model which is then
    applied to a scoring data set in order to generate predictions for the latter.

