Final Report Data Mining


  • 30/05/2014

    Applying Data Mining Techniques for an Insurance Company

    Group Project: Abidar Hamza

    Jad Al Adas

    Sandra Culman

    Lionel Kouchou

    Prepared for:

    Professor: Kilian Stoffel

    Assistant: Dong Han


    Acknowledgment

We would like to express our deepest appreciation to Professor Kilian Stoffel, as well as to Mr. Dong Han, who gave us the opportunity to carry out this interesting project on the topic Applying Data Mining Techniques for an Insurance Company. The project also required a great deal of research, which helped us accumulate new information.


Table of Contents

Introduction
Chapter 1: Business Background and Data Presentation
1. What is Insurance?
1.1 Area of study
2. Data source
3. From a big relational database to a normalized dataset
3.1 Creating a view for the policies
3.2 Querying the database
3.3 Description of the initial data
3.4 Key Attributes for analysis
Chapter 2: Data Preparation and Visualization
1. Introduction
2. Missing Values
3. Discretization
3.1 Discretization from Numerical to Nominal
4. Conversion
4.1 Nominal to numeric
5. Visualization
Chapter 3: Method processes and results interpretation
1. Business questions
2. Predictive methods and evaluation
2.1 One Rule
2.2 Naïve Bayes
2.3 Decision Tree
2.4 Logistic Regression
3. Descriptive methods and evaluation
3.1 Association Rules
3.2 Clustering
Conclusion
Glossary
Webography


Figure List

Figure 1: Policies Diagram
Figure 2: Claims Diagram
Figure 3: Policy view script
Figure 4: Policy view relation
Figure 5: Final query
Figure 6: Results in SQL
Figure 7: Attributes
Figure 8: Missing values table
Figure 9: Table with no missing values
Figure 10: Discretization process in Rapidminer
Figure 11: Visualization of Body after discretization
Figure 12: Discretization of attribute Body in two intervals
Figure 13: Weka pre-process main window
Figure 14: Weka visualization
Figure 15: Visualization of all attributes after correcting the discretized values
Figure 16: Histogram visualization of Dateoccurence
Figure 17: Discretization of Horsepower
Figure 18: Visualization after correcting split point of Horsepower
Figure 19: Scatter plot visualization of Horsepower
Figure 20: Weka visualization plot of Region
Figure 21: Single Rule process
Figure 22: Screenshot of Single Rule process
Figure 23: Single Rule confusion matrix
Figure 24: Car characteristics result using Single Rule
Figure 25: Car characteristics confusion matrix
Figure 26: Car characteristics confusion matrix
Figure 27: Confusion Matrix Single Rule
Figure 28: Naïve Bayes process
Figure 29: Naive Bayes distribution table 1
Figure 30: Naive Bayes confusion matrix
Figure 31: Lift chart of driver profiles prediction
Figure 32: Confusion matrix of car characteristics using Naive Bayes
Figure 33: Distribution table of car characteristics
Figure 34: Naive Bayes distribution table 2
Figure 35: Lift chart of car characteristics
Figure 36: Screenshot Weka decision tree
Figure 37: Text View Weka Decision Tree
Figure 38: Confusion matrix using decision tree
Figure 39: Screenshot of Weka tree
Figure 40: Confusion Matrix
Figure 41: Screenshot of Weka decision tree using driver profile attributes
Figure 42: Confusion matrix Driver attributes
Figure 43: Screenshot of decision tree in Rapidminer and Confusion Matrix (car attributes)
Figure 44: Screenshot of decision tree in Rapidminer and Confusion Matrix (driver attributes)
Figure 45: Logistic Regression table result
Figure 46: Association rules process
Figure 47: Association rules process 1
Figure 48: Association rules process 2
Figure 49: Clustering process with Rapidminer
Figure 50: Number of clusters
Figure 51: Centroid table
Figure 52: Screenshot of centroid plot view
Figure 53: Screenshot of cluster 0 folder view
Figure 54: Centroid table of second iteration
Figure 55: Screenshot of centroid plot - second iteration
Figure 56: Screenshot of cluster 3 - folder view
Figure 57: Performance Vector Davies-Bouldin
Figure 58: Davies-Bouldin table
Figure 59: Davies-Bouldin graph


    Introduction

    The accumulation of vast and growing amounts of data in different formats and different datasets

can be considered one of the biggest problems we are facing. The amount of information stored in

    insurance databases is rapidly increasing because of the rapid progress of information technology.

    The data that is gathered is useless without analyzing it. The patterns, associations, or relationships

    among this data can provide important information that helps companies improve their activities.

    The wealth of data can be considered a potential goldmine of business information. Finding the

    valuable information hidden in those databases and identifying appropriate models is a difficult

    task.

The above-mentioned problem can be solved with the help of Data Mining, a process of analyzing

    data from different perspectives and summarizing it into useful information. A typical data mining

    process includes data acquisition, data integration, data exploration, model building, and model

    validation.

Nowadays, insurance has become a compulsory need in people's lives, since people can no longer afford to bear the expenses of a loss or an accident. This need has fueled insurance companies to expand and grow; consequently, profits have increased, as has market share. Nevertheless, corporations are still exposed to great risk and some losses are inevitable, which is why they seek new approaches to better manage their risk.

    The paper is organized as follows. Chapter 1 provides an overview of the insurance area and the data

    source. Chapter 2 explains our whole process of data preparation. By analyzing the data that we obtain

    from this field, in chapter 3 we try to find usable information that can help the insurance company

better price its premiums.


    Chapter 1

    Business Background and Data

    Presentation

    What is insurance?

    Data source

Steps from a big relational database to a normalized dataset

    Description of the initial data


    1. What is Insurance?

Insurance is the fair transfer of the risk of a loss from one entity to another in exchange for

    payment. In other words, insurance equals peace of mind. An insurer, or insurance carrier, is a

    company selling the insurance. The insured, or policy holder, is the person or entity buying the

    insurance policy. The amount of money to be charged for a certain amount of insurance coverage

    is called the premium.

    The main aspects of insurance are:

    Underwriting (Policies) is when a customer buys coverage or a policy from the insurance

    company (Revenues to the company).

    A claim is when a customer undergoes a certain loss and declares it to the insurance

    company in order to receive the compensation agreed upon (Losses to the company).

    There are several types of insurance, for example Motor, Health, Fire, Allied Perils, Natural

    disasters, Marine, Personal Accident, Life, Property, Liability, Travel and many more.

    A key part of insurance is charging each customer the appropriate price for the risk they

    represent. Risk varies widely from customer to customer, and a deep understanding of different

    risk factors helps predict the likelihood and cost of insurance claims.

    1.1 Area of study

Insurance is quite a broad and rich topic; it offers a lot of potential for applying data mining methods. Yet, we will concentrate our research on one line of business: Motor (Automobile).

    Some interesting facts about motor accidents:

    There are more than 12 million motor vehicle accidents annually;

The typical driver will have a near-accident one or two times per month;

The typical driver will be in a collision of some type on average once every six years;

    Crashes are the leading cause of death for ages 3-33

It is also worth knowing that even a minor accident can result in thousands of dollars in damages. Accordingly, a question arises: what is the likelihood of a car accident occurring?

Many studies have been conducted over the years on this topic; specialists have narrowed down some important factors linked to an increased risk of accident, on which they base the pricing of their premiums. Some of these aspects are:


    Age and gender of the driver;

    Driving record;

    Type of vehicle;

    Geographical record;

    Period of the year;

Two factors have been chosen from the above to be interpreted in our project: the type of the vehicle and the driver's profile.

Therefore our goals will be:

To better predict motor insurance claim occurrence based on the characteristics of:

the driver's vehicle;

the driver.

To discover hidden patterns that may be useful for the insurance company.

    2. Data source

One of our team members used to work as a software developer for a software vendor specializing in insurance and reinsurance ERP systems. As a result, we had permission to acquire a database from one of the main insurance companies. For confidentiality purposes we prefer to keep the names undisclosed unless requested.

    The software vendor used a relational database built on the Microsoft SQL Server database

    management system. Parts of the diagrams are shown in the figures below:


    Policies Diagram:

    Figure 1: Policies Diagram


    Claims Diagram:

    Figure 2 : Claims Diagram

    The initial unfiltered database contains roughly 600 000 rows of policies and 90 000 rows of claims.

    Hence, we had to go through several steps in order to come up with one final dataset that would be

    useful for our project.


    3. From a big relational database to a normalized dataset

    3.1 Creating a view for the policies

First of all, creating a view for the policies was fundamental, as it facilitates writing queries against the database. The following figures show the view's script and its design, illustrating the tables used:

    Figure 3 : Policy view script

    Figure 4 : Policy view relation


    3.2 Querying the database

As a second step, we used the above view in a query to select the needed data. The query selected the attributes related to the characteristics of the car, the driver's profile, the date of occurrence of the claim, and a conditional attribute indicating whether a policy has a claim (depending on whether a Policy ID has one or more rows in the claims table). Furthermore, we filtered the policies by the issue year 2012 and the line of business Motor using the WHERE clause. The result was exported to a CSV file as our preliminary dataset.

    The following figure illustrates the final query in SQL:

    Figure 5 : Final query

    Result in SQL:

    Figure 6 : Results in SQL
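For readers without SQL Server access, the same labeling and filtering logic can be sketched in Python with pandas. This is a minimal, hypothetical reconstruction: the file names and column names (PolicyID, IssueYear, LineOfBusiness) are assumptions, not the vendor's actual schema.

```python
import pandas as pd

# Hypothetical exports of the policy view and the claims table.
policies = pd.read_csv("policies_view.csv")
claims = pd.read_csv("claims.csv")

# A policy has a claim if its Policy ID appears at least once in the
# claims table, mirroring the conditional attribute in the SQL query.
policies["HasClaim"] = policies["PolicyID"].isin(claims["PolicyID"])

# Mirror the WHERE clause: issue year 2012 and line of business "Motor".
subset = policies[(policies["IssueYear"] == 2012) &
                  (policies["LineOfBusiness"] == "Motor")]

subset.to_csv("preliminary_dataset.csv", index=False)
```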


    3.3 Description of the initial data

The transition from SQL to the CSV file resulted in a final total of 4731 instances and 16 attributes.

    Figure 7 : Attributes

The table was extracted from Rapidminer. It represents the metadata of the insurance data, which includes the number of examples (instances); the number, description and type of the attributes; some statistics about the attribute values; and missing values, if any.

    3.4 Key Attributes for analysis

HasClaim is the target attribute, or the Label, in our data. The type of this attribute is binominal, with 2 values: True and False. It describes whether the policy has had a claim in the past. Given this attribute, we can establish a relationship with the rest of the attributes for classification. The rest of the attributes are divided into two main groups, which will support our supervised learning hypotheses:

Driver's Profile

    Age: age of the driver, Type integer

    Gender: Male or Female, Type binominal

    Marital Status: Married or Single, Type binominal

    Has Children: True, False, Type binominal

Region: Urban, Town, Suburban, Type polynominal


The driver's profile is seen as important to analyze by insurance companies, as it portrays the degree of responsibility: the presence of children, the manner of driving, abidance by the rules (youngsters tend to break them more often and go over the speed limit), and the residential area with its high or low traffic.

    Vehicle Specifications

    Make: manufacturer (BMW, Fiat), Type polynominal

    Model: subcategory (BMW X5, 320), Type polynominal

    Year Built: year of manufacturing, Type integer

    Category: type of usage (Taxi, private, rent a car) Type polynominal

Body: the size, Type polynominal

    Horsepower: speed or power, Type integer

Another aspect to look at when pricing premiums is obviously the car itself. Some makes are considered safer on the roads and more robust; new cars tend to be more stable than old cars; high-speed cars are more prone to accidents.

After the description of the initial data, we come to the step of further cleaning and preparation, so that the data is all set to be used in data mining techniques with Rapidminer.


    Chapter 2

Data Preparation and Visualization

    Introduction

    Missing values

    Discretization

    Conversion

    Visualization


    1. Introduction

In order to use data mining techniques or make predictions, data preparation is required; business professionals generally agree that it is one of the most important parts of any such project, and one of the most time-consuming and difficult.

We already covered the integration, transformation and reduction parts in the first chapter (Data source) by normalizing data from a relational database into one CSV dataset. We will carry out the rest of the steps in Rapidminer.

    2. Missing Values

    A missing value can signify a number of different things. Perhaps the field was not applicable, the

    event did not happen, or the data was not available. It could be that the person who entered the data

    did not know the right value, or did not care if a field was not filled in.

    However, there are many data mining scenarios in which missing values provide important

    information. The meaning of the missing values depends largely on context. Since our data comes

    from an insurance company, the quality for them is not an option but its a vital thing in their

    everyday operations and they perform data cleaning very often. Therefore, we didnt find so many

    inconsistent or missing values. Yet, we could discover some attributes as shown in the metadata of

    Rapidminer:

    Figure 8 : Missing values table


    The Metadata indicated that we have 4 attributes with missing values:

    Model: 14 missing values

    Yearbuilt: 3 missing values

    Horsepower: 3 missing values

Body: 1 missing value

To deal with these missing values, Rapidminer provides an operator called Replace Missing Values. We will use it and tune its parameters for each attribute to get the best replacement.

    Model and Body

    The best method for a polynominal attribute is to assign the maximum value (most frequent value)

    to the missing values.

    Yearbuilt and Horsepower

Since they are integers, replacing by the average is the best option; in any case, 3 values will not make much of a difference. After fixing the missing values, the metadata looks as follows:

    Figure 9 : Table with no missing values
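The same replacement strategy can be sketched in a few lines of pandas; this is a minimal sketch, assuming the CSV export described earlier and the attribute names from the metadata table:

```python
import pandas as pd

df = pd.read_csv("preliminary_dataset.csv")

# Polynominal attributes: fill missing values with the most frequent value
# (the mode), as Replace Missing Values does in "maximum" mode.
for col in ["Model", "Body"]:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# Integer attributes: fill missing values with the average.
for col in ["Yearbuilt", "Horsepower"]:
    df[col] = df[col].fillna(df[col].mean())
```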


    3. Discretization

Discretization involves partitioning numerical values into intervals by placing breakpoints; placing them carelessly can result in unbalanced data. In our project we worked to visualize the discretization results, making sure to keep the intervals as balanced as possible. In this section we describe the steps of discretization and the visualization results using Rapidminer and Weka.

    3.1 Discretization from Numerical To Nominal

The insurance data has many attributes which need discretization in order to provide a better understanding and a good interpretation when using different data mining methods. Before starting discretization, we checked whether our data was well balanced and then corrected the imbalance of some attribute values. Figure 10 describes the discretization process in Rapidminer.

Figure 10 : Discretization process in Rapidminer

The Discretize by User Specification operator has limited capabilities and supports only numerical attributes. The sensitive point in the discretization process is the cut point, called the breakpoint. After sorting each attribute in ascending order, we chose the cut points and then split and merged. After these three important steps we evaluated the values and made some modifications where the intervals were unbalanced.


    The numerical attributes selected for discretization are:

    Sum insured:

    o High-risk

    o Medium Risk

    o Low Risk

    Total Premium:

    o High

    o Medium

    o Low

    Year Built:

    o Very old Cars

    o Old Cars

    o Recent Cars

    o New Cars

    Horsepower:

    o Fast

    o Medium

    o Slow

    Age:

    o Young

    o Adult

    o Old

There are also other attributes which are important to use even though they are not numerical. For example, the attribute Body, which describes the car size, is nominal and contains different types of car sizes. We decided to discretize this attribute using nominal-to-numerical conversion followed by discretization by user specification.

Based on the metadata table, we set the range values for each attribute and corrected these values until we obtained the right balance.

    Body:

    o Small cars

    o Medium Cars

    o Big Cars

    It is important to set the right cut-point for a better discretization!
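As an illustration of how such breakpoints are placed and checked, here is a small pandas sketch for the Age attribute. The bin edges are hypothetical, not the exact cut points used in the project; the point is to inspect the interval counts and adjust the edges until they are balanced:

```python
import pandas as pd

df = pd.read_csv("preliminary_dataset.csv")

# Hypothetical breakpoints for Age; adjust until the intervals are balanced.
df["AgeGroup"] = pd.cut(df["Age"], bins=[0, 25, 55, 120],
                        labels=["Young", "Adult", "Old"])

# Check the balance between the intervals before accepting the cut points.
print(df["AgeGroup"].value_counts())
```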

    Figure 11 : Visualization of Body after discretization

Figure 11 shows that our discretization needs to be better balanced between the three groups. Small cars have the largest count, and while using this attribute for learning models we will always have a bigger probability of obtaining small cars than medium or big cars. To remedy this problem we discretized again and corrected the breakpoints between the groups. Figure 12 shows the corrected discretization of the attribute Body.


Figure 12 : Discretization of attribute Body in two intervals

We reduced the number of groups to two: standard car and big car. Because the counts for medium car and big car were close to each other, we chose to regroup them into one significant group with a large number of values. After finishing the discretization we added the operator Write CSV to generate a new file with the discretized data. This lets us use the file in Weka; in fact it reduces the number of operators used in the main process when we apply the predictive and descriptive methods (Figure 10).


    Preprocessing in WEKA

First, we ran Weka and launched the Explorer window, then selected the Preprocess tab in order to see the attribute names, the percentage of missing values and the balance of the values (Figure 13).

    Figure 13 : Weka pre-process main window

Figure 14 shows all attributes after the first discretization; some values cannot be displayed due to the large number of distinct values (Policy No, Make, Date Occurrence, etc.). These attributes were not discretized into intervals, so that we could still tell which car model implies the occurrence of a claim.


    Figure 14 : Weka visualization

Figure 15 shows the improvement we brought to the following attributes: Body, Horsepower, Marital status and Gender, with the intention of balancing the intervals as much as possible.

    Figure 15 : Visualization of all attributes after correcting the discretized values


    4. Conversion

Converting data from one format to another through different conversion operators solves the problem of operators that do not support certain attribute types. In our project we used conversion not only to solve this kind of problem but also to facilitate discretization.

    4.1 Nominal to numeric

Our insurance data has many numerical attributes, so it was not difficult to use discretization and convert numerical attributes to nominal ones. But for some attributes like Body, which is polynominal and has many similar values, we needed to split the values into groups that can be differentiated. To be more efficient, we sorted the Body values, converted them into numbers using Nominal to Numerical (assigning each value a unique integer), and finally discretized them into intervals (standard car, big car).
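A minimal sketch of this two-step conversion, with a hypothetical split point (the real mapping depends on the sorted Body values in the data):

```python
import pandas as pd

df = pd.read_csv("preliminary_dataset.csv")

# Step 1: assign each sorted Body value a unique integer,
# as RapidMiner's Nominal to Numerical operator does.
codes = {v: i for i, v in enumerate(sorted(df["Body"].dropna().unique()))}
df["BodyCode"] = df["Body"].map(codes)

# Step 2: discretize the integer codes into the two final intervals.
# The split point is hypothetical and is tuned by inspecting the distribution.
split = len(codes) // 2
df["BodyGroup"] = df["BodyCode"].apply(
    lambda c: "standard car" if c < split else "big car")
```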

    5. Visualization

    Data visualization is the process by which textual or numerical data are converted into meaningful

images. The reason data visualization can help in data mining is that the human brain is very effective at recognizing patterns in graphical representations. This approach allowed us, at each step of our project, to understand graphically what our data looks like. In this chapter we describe and analyze the attributes selected for visualization and their impact on the insurance business. Among the great variety of visualization techniques offered by Rapidminer, we list below the ones we used in our project:

    Histogram

We used the histogram to show the Dateoccurence values. The value Dateoccurence = Null is the most frequent one for the company. It is important for us to learn and understand when claims occurred.


    Figure 16 : Histogram visualization of Dateoccurence

    Pie

Visualizing the attribute Horsepower after discretization, we see that the data is unbalanced between fast, medium and slow; this means the cut points between the three intervals are not correct.

    Figure 17 : Discretization of Horsepower


After correcting the cut point, we obtained the following visualization:

    Figure 18 : Visualization after correcting split point of Horsepower

    Scatter plot

It is useful to visualize some attributes with regard to our label. Figure 19 shows Horsepower and Yearbuilt (Y-axis) against HasClaim (X-axis). Horsepower is represented by different colors. Analyzing this scatter plot, we see that old cars with medium horsepower (a top speed not greater than 120 km/h) have more claims, in contrast to other types of cars, such as new cars, which have few claims.

    Figure 19 : Scatter plot visualization of Horsepower


We also worked with Weka because of the good visualizations it provides. In our report we did not try to cover all the visualization types; we wanted to find the visualizations that helped us analyze the relationships between the attributes. Figure 20 illustrates the attribute Region together with the label attribute. We can deduce that the sensitive region is Town.

    Figure 20 : Weka visualization plot of Region


    Chapter 3

    Method processes and results

    interpretation

    Business questions

Predictive methods and evaluation

Descriptive methods and evaluation


    1. Business questions

The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Thinking about the components of the input, besides the instances and the attributes, the thing that we can learn is the concept. Throughout this chapter we put into practice some of the main data mining techniques and algorithms and their potential applications in the insurance industry:

Which driver profiles are most likely to have an accident?

Which car characteristics have a high impact on HasClaim?

How can we segment automobile drivers?

By answering these business questions we can help the insurance firm make crucial business decisions and turn the newfound knowledge into actionable results.

    2. Predictive methods and evaluation

Learning a method for predicting an instance's class from pre-labeled instances is called classification. This is a supervised learning method and is used to analyze how important the attributes are in determining the value of the target attribute. We decided to use four classifiers to predict the value of our label: One Rule, Naïve Bayes, Decision Tree and Logistic Regression. For a better understanding of our results we proceeded using both Rapidminer and Weka, programs that contain a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to their functionality.

    2.1 One Rule

The Single Rule Induction (One Rule) operator is one of the simplest classification methods. It tests each attribute in turn and finds the one whose single-attribute rule makes the fewest errors. The result can be interpreted as the attribute that has the most influence on the target class.
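The idea behind One Rule is simple enough to sketch in a few lines. This is a toy re-implementation for illustration, not RapidMiner's operator: for every attribute it predicts the majority class per attribute value, counts the errors, and keeps the attribute with the fewest.

```python
import pandas as pd

def one_rule(df: pd.DataFrame, label: str):
    """Return (best attribute, its rules, its error count)."""
    best_attr, best_rules, best_errors = None, None, len(df) + 1
    for attr in df.columns.drop(label):
        # For each value of the attribute, predict the majority class.
        rules = df.groupby(attr)[label].agg(lambda s: s.mode().iloc[0])
        errors = (df[attr].map(rules) != df[label]).sum()
        if errors < best_errors:
            best_attr, best_rules, best_errors = attr, rules.to_dict(), errors
    return best_attr, best_rules, best_errors
```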


    Figure 21 : Single Rule process

The first attributes that we tested with the One Rule operator were the car characteristics: Body, Category, Horsepower, MAKE, Model and Yearbuilt. Among these attributes, Model is the one that gives us the smallest error rate (3442 out of 4731 correct instances). The insurance company must take into account the models that are most predisposed to accidents. For example, a pick-up will have a bigger probability of having an accident than an Accord. Knowing the model automatically implies the brand, while the reverse does not necessarily hold.

    Figure 22 : Screenshot of Single Rule process

The confusion matrix displays how many predicted values matched the actual values when cross-validation tests were performed. Among the records predicted as FALSE, 464 predictions were correct and 318 were incorrect. Among the records predicted as TRUE, 356 were correct and 281 were incorrect. The confusion matrix shows that for the prediction of FALSE the model accuracy is 59.34%, and for the prediction of TRUE it is 55.89%. The overall model accuracy is simply the percentage of good predictions among all predictions, that is:

accuracy = (T/T + F/F) / (T/T + T/F + F/T + F/F)

where T/T is the number of situations where the model predicts TRUE and the result is TRUE, T/F is the number of situations where the model predicts TRUE and the result is FALSE, and so on. Here this gives (356 + 464) / (464 + 318 + 356 + 281) = 820 / 1419; after performing all computations in Rapidminer, the total accuracy of the model is 57.79%.

    Figure 23 : Single Rule confusion matrix

The prediction of HasClaim that matters most is the TRUE value, which means this value plays an important role in analyzing the performance of the operator. The poor level of accuracy provided by this operator will not help us improve the prediction of HasClaim with regard to the car characteristics.

Obtaining the model of the car as the first rule of the algorithm made us wonder whether this is helpful or not. Because of the high diversity of the attributes Model and Make, we decided to test the operator without them. The results below give us a better understanding of which attribute an insurance company should take into consideration when making an insurance offer. In conclusion, YearBuilt provides us with 4 simple rules.

    Figure 24 : Car characteristics result using Single Rule


After we made the changes, the accuracy of this operator with the new set of attributes still has not changed.

    Figure 25 : Car characteristics confusion matrix

Next, we tested One Rule on the driver characteristics: Age, Gender, HasChildren, Marital Status and Region. The model selected HasChildren as the attribute with the lowest number of errors (2476 out of 4731 correct instances). The program defined the two rules below: if HasChildren = False then False, and if HasChildren = True then True, meaning that clients with children are prone to having an accident, while clients without children are not.

    Figure 26: Car characteristics confusion matrix

Evaluating this operator with the new attributes gives a low accuracy of 53.14%. Such a low accuracy means that this algorithm is not reliable enough for the company to decide whether the attribute HasChildren is really a primary component in making a decision.

    Figure 27: Confusion Matrix Single Rule


2.2 Naïve Bayes

    Driver Profiles prediction

Based on Bayes' theorem, this classifier applies a simple probabilistic assumption: the attributes are assumed to be independent of each other, while all of them are related to our label attribute HasClaim.
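Formally, under this conditional-independence assumption the classifier scores each class c of HasClaim as the prior times the product of the per-attribute likelihoods, and predicts the class with the highest score:

```latex
P(\mathrm{HasClaim}=c \mid x_1,\dots,x_n) \;\propto\; P(c)\,\prod_{i=1}^{n} P(x_i \mid c)
```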

Figure 28: Naïve Bayes process

Figure 28 illustrates the Naïve Bayes process. We selected the attributes related to the driver profile in order to calculate the likelihood of each one for predicting HasClaim = True. As a result of this process we obtained a distribution table that marks out the probability of each attribute value; we then focused on the target class TRUE and compared the values in order to single out the attributes with a high probability. According to the table, the attributes selected are:

Age = Adult and Gender = Male and Marital status = Married and HasChildren = TRUE


    Figure 29: Naive Bayes distribution table 1

To evaluate the performance of the Naive Bayes classifier, the confusion matrix allows a more detailed analysis than the mere proportion of correct guesses. It displays how many predicted values matched the actual values when cross-validation tests were performed (by the cross-validation operator). For example, among the records predicted with the HasClaim class True, 249 predictions were correct and 194 were incorrect. The confusion matrix shows that for the prediction of HasClaim = TRUE, the model accuracy is 55.11%.

    Figure 30: Naive Bayes confusion matrix

    Lift chart

A lift chart graphically represents the improvement that a mining model provides compared against a random guess, and measures the change in terms of a lift score. By comparing the lift scores for various portions of our data set and for different models, we can determine which model is best, and which percentage of the cases in the data set would benefit from applying the model's predictions.

    Figure 31: Lift chart of driver profiles prediction

The lift chart helps quantify the relationship between the confidence (using a threshold) and the True prediction of HasClaim by showing the increase in the number of driver claims. As we can see in more detail by analyzing this chart, with a confidence greater than 0.5 the number of drivers predicted to have a claim decreases; for instance, if we take a confidence range of [0.49-0.51], 1944 drivers must be targeted in order to find 1025 drivers among them who have a claim. The good thing about our prediction results is that they give values of more than 90% when the confidence and the driver target are specified.

    Car characteristics prediction

We tried to predict the same target class using different factors: first we chose the driver profile, and now we use the car characteristics. Our main goal is to find rules with high accuracy. After selecting the car characteristic attributes and training and testing the model, we obtained the confusion matrix in Figure 32.

    Figure 32: Confusion matrix of car characteristics using Naive Bayes

The accuracy of our model using the car attributes is 61.54%. This result is acceptable and allows us to focus on which types of cars have a high impact on HasClaim. To extract more details from this model, we use the distribution table for True in order to select the attributes with a high probability:

    Figure 33: Distribution table of car characteristics

From the table in Figure 33, the car characteristic attributes which have a high probability and can be selected for prediction in this model are:

o Body = Standard cars and Yearbuilt = very old cars

We split off the table presenting the car brand and model because of the large number of values; we sorted these values and selected the car brands with a high probability:


    Figure 34: Naive Bayes distribution table 2

Mercedes, Nissan and Hyundai are the three main brands with a high impact on HasClaim.

    Figure 35: Lift chart of car characteristics

According to the lift chart used to quantify the prediction of the cars with the highest claims, there are no prediction results with perfect information; the confidence measure and the number of targeted drivers both contribute to the prediction. If we take, for instance, a confidence value between [0.4-0.5], 529 drivers targeted out of 1092 give a prediction of 98% of True HasClaim; but if we compare this to a confidence of 1, with 205 out of 211, we get only 40% of true claims. From these results we confirmed that our model for predicting True HasClaim is acceptable, and the results can contribute strongly to future predictions and help the insurance company make new decisions based on them.

    2.3 Decision Tree

This operator generates a decision tree for classification of both nominal and numerical data. This classification model is easy to interpret and predicts the value of our label HasClaim based on the attributes we have chosen. In this section we analyze both groups of attributes (car / driver) with two operators: Decision Tree in Rapidminer and J48 in Weka.
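For readers who want to reproduce this train-and-evaluate cycle outside the GUI tools, here is a rough scikit-learn sketch. Note the caveats: scikit-learn's tree is CART rather than Weka's C4.5 (J48), and the file name and encoding choices are assumptions based on our dataset description, not the exact setup we used.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("discretized_dataset.csv")  # hypothetical file name

car_attrs = ["Body", "Category", "Horsepower", "MAKE", "Model", "Yearbuilt"]
X = OrdinalEncoder().fit_transform(df[car_attrs].astype(str))
y = df["HasClaim"]

# CART, not C4.5/J48, but the 10-fold cross-validation cycle is the same.
tree = DecisionTreeClassifier(min_samples_leaf=20)
print(cross_val_score(tree, X, y, cv=10).mean())
```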

    Weka

The decision tree below resulted from running the program with the car attributes. Unfortunately, the tree is very big, since the attributes have many possible outcomes: it has a size of 835 with 801 leaves. It seems difficult to interpret, but with the help of the text view we managed to provide an interpretation.

    Figure 36: Screenshot Weka decision tree

J48 considers Category the root attribute. Prive cars are analyzed depending on YearBuilt, MAKE and Horsepower. If the car is very old, the algorithm checks the MAKE of the car before deciding between true and false. If the car is prive and categorized as a new car, it is automatically assigned a TRUE HasClaim prediction. If the car is categorized as an old car, the model first goes through the MAKE of the car and, depending on this, proceeds to analyze the Horsepower.

The generated result shows that the cars most likely to be involved in an accident are: prive, cargo, rent and taxis. The remaining categories are well spread and own only a small part of the total instances; accordingly, they have a remote probability of being involved in an accident. The most exposed makes are: Mercedes, Nissan, BMW, Audi, Toyota and DAIHATSU. We question whether these results really provide a good overview of car insurance. The insurance company should analyze and focus on the four big car groups.


    Figure 37: Text View Weka Decision Tree

Evaluating the Weka J48 tree, we obtained 67.09% accuracy. The prediction of the TRUE class, which is important for our analysis, has an even higher accuracy of 71.23%.

    Figure 38: Confusion matrix using decision tree

Because of the complexity of the Weka tree, we decided to take the attributes with a large number of distinct values out of the model (Model, Make, and Category). We obtained a tree that is easier to interpret but provides a lower accuracy.

    Figure 39: Screenshot of Weka tree


    Figure 40: Confusion Matrix

Using the driver attributes we obtained the tree below: a tree of size 16 with 10 leaves. The algorithm sets HasChildren as the root attribute and analyzes the data with regard to it. For clients who don't have children and are married, the outcome depends on their age: adult - true, old - true, young - false. If the clients don't have children and are single, the model decides false. On the opposite side, a client who has children and is single has a bigger probability of an accident occurring.

    Figure 41: Screenshot of Weka decision tree using driver profile attributes

The evaluation of this operator using the driver attributes gives a low accuracy of 56%, meaning that the interpreted results are not too certain. We tried changing the parameters of the operator, but we could not obtain better results.

    Figure 42: Confusion matrix Driver attributes

    Decision Tree Rapidminer

Running the second operator, this time in Rapidminer, we obtained the results below. The algorithm produces a pruned tree of medium complexity. The root attribute in the first model is YearBuilt and in the second model HasChildren. As a small conclusion after examining the results, the attributes YearBuilt and HasChildren have the following interpretation: new car - true, old car - false, very old car - false, recent car - true. If the client has children, whether he has a claim depends on the age and the region.

Figure 43: Screenshot of decision tree in Rapidminer and Confusion Matrix (car attributes)


We evaluated both algorithms and obtained accuracies of 57% and 53%. Compared to Weka, this operator produces a less accurate result. In other words, the insurance company can get a better prediction of the label class using the J48 tree.

Figure 44: Screenshot of decision tree in Rapidminer and Confusion Matrix (driver attributes)

    2.4 Logistic Regression [2]

We start the modeling process by learning the relationship between claim frequency and the underlying risk factors, including age, gender, marital status, region, and HasChildren. Based on these attributes, which describe the driver profile, we use logistic regression to quantify the claim frequency and the effect of each risk factor, and also to estimate the probability of a claim.
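For reference, the logistic model expresses the claim probability as a function of the risk factors x_1, ..., x_k, so each coefficient quantifies the effect of one factor on the odds of a claim:

```latex
P(\mathrm{HasClaim}=\mathrm{True} \mid x) \;=\; \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
```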

    Figure 45: Logistic Regression table result


We computed the logistic regression using the same attributes selected from the decision tree. Based on the results in the table, we can deduce that age has a high impact on HasClaim; by segmenting drivers by age, the insurance company can make more profit by increasing the premium amount as drivers get older. This is one of many reasons why we applied a predictive model: to learn the best rules for the future.

    3. Descriptive methods and evaluation

We proceeded to find the hidden structure in the available data (a process called unsupervised learning) by using two methods: clustering and association rules.

    3.1 Association Rules

Association rules differ from classification rules in that they can predict any attribute, not just the target one, and consequently any combination of attributes. Different association rules express different regularities that underlie the dataset, and they generally predict different things. In order to implement association rules for our dataset, the following five operators are needed:

    Figure 46: Association rules process

Read CSV, with which we import the dataset (no label should be used);

Select Attributes, where we select the attributes for the process. We decided to take out two attributes that have no impact on our analysis: Policy Number and Date Occurrence;

Nominal to Binominal, which changes the type of the selected nominal attributes to binominal. It also maps all values of these attributes to binominal values;


FP-Growth - the FP stands for Frequent Pattern. Frequent pattern analysis is used for many kinds of data mining, and it is a necessary component of association rule mining. Without the frequencies of attribute combinations, we cannot determine whether any of the patterns in the data occur often enough to be considered rules. One important parameter of this operator is Min Support: the occurrence rate of a rule (the number of times the rule occurs divided by the number of observations in the data set);

Create Association Rules - this operator uses the frequent pattern data and seeks any patterns that occur frequently enough to be considered rules. It generates both a set of rules (through the rul port) and a set of associated items (through the ite port). In this model we are interested only in generating rules, so we simply connect the rul port to the res port of the process window. One influential parameter of this operator is Min Confidence. Confidence is a measure of the likelihood that an attribute and its associated attribute are both flagged as true; it is computed as the ratio between the number of times a certain rule occurs and the number of times it could have occurred. Both measures are written out after this list.
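For a rule A => B over N observations, the two measures driving the process are:

```latex
\mathrm{support}(A \Rightarrow B) = \frac{\mathrm{count}(A \wedge B)}{N},
\qquad
\mathrm{confidence}(A \Rightarrow B) = \frac{\mathrm{count}(A \wedge B)}{\mathrm{count}(A)}
```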

First, we decided to test the process using a confidence of 0.8 and a Min Support of 0.1. Surprisingly, the 352 generated rules had a very low support. In our opinion these rules are not reliable because of the low frequency of their attribute combinations.

    Figure 47: Association rules process 1

We considered it important to increase the Min Support in order to obtain rules with a high frequency of attribute combinations. The second test used the same confidence but a minimum support of 0.6. Five rules were generated, with a support between 0.359 and 0.435 and a confidence between 0.821 and 0.951.


    Figure 48: Association rules process 2

Analyzing these rules, we can conclude the following:

A car with a medium-risk SumInsured is most probably a prive car;

A prive car with a medium-risk SumInsured has a high probability of being a standard car;

Most of the time a standard car will be a prive car;

If the car is a standard one and the driver is a male, we can conclude that the car is a prive one;

If we have a standard car with a medium-risk SumInsured, the car is most probably a prive car.

    3.2 Clustering

Clustering is one of the unsupervised techniques we deploy on the data in order to partition it, reveal sub-classes and discover their natural grouping. The k-means algorithm is one of the operators Rapidminer offers to divide the data. We preferred this particular algorithm as it is easy to understand and simple to interpret. Still, a challenge we faced was picking the number of clusters in advance with the aim of acquiring a robust set of clusters. The performance of a clustering algorithm may be affected by the chosen value of K; therefore, instead of using a single predefined K, a set of values is adopted in order to find a satisfactory clustering result. The validity of the outcome is assessed using the Cluster Distance Performance evaluation provided by Rapidminer. This operator relies on two main criteria to evaluate the performance:

    - avg._within_centroid_distance: the average within-cluster distance, calculated by
      averaging the distance between the centroid and all examples of a cluster.
    - davies_bouldin: algorithms that produce clusters with low intra-cluster distances
      (high intra-cluster similarity) and high inter-cluster distances (low inter-cluster
      similarity) will have a low Davies-Bouldin index; the clustering that yields the
      collection of clusters with the smallest Davies-Bouldin index is considered the
      best according to this criterion.
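
    As a sketch of how both criteria could be computed outside RapidMiner (hypothetical
    data; note that RapidMiner reports the Davies-Bouldin index negated, because its
    performance values are always maximized):

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.metrics import davies_bouldin_score

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 5))    # stand-in for the encoded policy data

        km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

        # avg._within_centroid_distance: mean distance of every example
        # to the centroid of its own cluster
        within = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1).mean()

        # davies_bouldin: lower is better (scikit-learn reports it positive)
        db = davies_bouldin_score(X, km.labels_)

        print(f"avg. within-centroid distance: {within:.3f}, Davies-Bouldin: {db:.3f}")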

    Since we don't have a target attribute in clustering, most of the attributes will be
    used in the process, except:

    - PolicyNo, DateOccurence: irrelevant for grouping;
    - SumInsured, TotalPremium: the outcome of this study is to help price the premium
      and the sum insured, so it would be illogical to include them in the clusters;
    - Make, Model, Category: they contain too many distinct values and would require a
      very large number of clusters, so we omit them to obtain more realistic and useful
      clusters.

    In this way we will be able to learn the correlations and similarities between the
    driver's profile, the car he or she drives and the occurrence of claims. Furthermore,
    all polynominal and binominal attributes will be converted to numerical ones, since
    k-means works with distances and accepts only numerical attributes.
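
    A minimal sketch of this conversion with pandas (the column names are hypothetical
    stand-ins for our attributes):

        import pandas as pd

        # Hypothetical slice of the prepared data set
        policies = pd.DataFrame({
            "Gender":        ["male", "female", "male"],
            "MaritalStatus": ["single", "married", "single"],
            "HasClaim":      [False, True, False],
        })

        # One-hot encode every polynominal/binominal attribute so that k-means
        # can measure Euclidean distances on purely numeric columns
        numeric = pd.get_dummies(policies.astype(str), dtype=float)
        print(numeric.head())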

    The process is illustrated in the figure below:

    Figure 49: Clustering process with RapidMiner

    First Iteration

    K = 3, Max Runs = 10

    This first step will serve as the reference point for the outcome of the evaluation
    and the consistency of the clusters. The results are illustrated in the figures below:


    Text View

    Figure 50: Number of clusters

    Centroid Table

    Figure 51: Centroid table


    Centroid Plot View

    Figure 52: Screenshot of the centroid plot view

    Each colored line represents a cluster, with peaks on the attributes that have a
    strong relationship to the cluster.

    Analysis of Results

    Attributes are considered a strong match within a cluster if they have a high
    average. For example, if we look at cluster 0 in the centroid table and sort the
    averages from highest to lowest, the strongest element is Marital Status = single =
    1. The strong, correlated elements are: HasChildren = false = 0.747, Body = standard
    car = 0.634, Gender = male = 0.626 and HasClaim = false = 0.512. The remaining
    elements are considered weak or uncorrelated, such as Marital Status = 0. The folder
    view of cluster 0 gives a clearer picture:


    Figure 53: Screenshot of cluster 0 folder view

    Cluster 0 includes the items in the red square with a value of 1. From the insurance
    business perspective, we can conclude that an adult single female with no children,
    living in an urban area and driving a big, very old, fast car has no claims.
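
    The sorting of centroid averages described above can also be done programmatically;
    a sketch assuming the centroid table has been exported, with the attribute names and
    values taken from cluster 0:

        import pandas as pd

        # Centroid of cluster 0: attribute -> average value within the cluster
        centroid_0 = pd.Series({
            "MaritalStatus_single": 1.000,
            "HasChildren_false":    0.747,
            "Body_standard":        0.634,
            "Gender_male":          0.626,
            "HasClaim_false":       0.512,
        })

        # Sorting from highest to lowest surfaces the attributes that
        # characterize the cluster most strongly
        print(centroid_0.sort_values(ascending=False))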

    Analysis of Performance

    Performance Vector

    As a first iteration, the performance vector does not tell us much about the quality
    of the clusters; however, it is taken as a reference point for the following
    iterations to validate the relationship between the number of clusters and the
    Davies-Bouldin average.


    Second Iteration

    K = 6, Max Runs = 10

    Here we double the value of K, observe the resulting clusters and check the
    performance. The results are illustrated in the figures below:

    Text View

    Centroid Table

    Figure 54: Centroid table of second iteration


    Centroid Plot

    Figure 55: Screenshot of centroid plot - second iteration

    Analysis of Results

    Now that we have increased the number of clusters, fewer attributes belong to each
    one and the conclusions become clearer. We will consider cluster 3 for this example
    and again sort its averages from highest to lowest. As a first observation, the
    number of very strong attributes belonging to the cluster has increased dramatically:
    Body = standard car, HasChildren = true and HasClaim = false all equal 1. The strong
    attributes have higher averages as well, for example Marital Status = married = 0.883
    and Gender = male = 0.650.

    Again, we will have a look at the folder view with the aim of better understanding
    the cluster.


    Figure 56: Screenshot of cluster 3 - folder view

    Cluster 3 includes the items in the red square with a value of 1. From the insurance
    business perspective, we can conclude that an adult married male with children,
    living in a suburban area and driving a standard, fast car has no claims.

    Analysis of Performance

    Figure 57: Performance vector - Davies-Bouldin


    In the first iteration the Davies-Bouldin value was -2.711. After increasing K to 6
    it improved to -2.657. This indicates that K = 6 suits this data better than K = 3.
    However, a question arises: does increasing K always produce better clusters?
    Technically the index may keep improving, but the relation between increasing K and
    improving Davies-Bouldin is not a direct one. Therefore, we stop increasing K once we
    notice that the improvement is becoming almost flat.

    Davies-Bouldin chart and determining the ideal number of clusters

    A number of iterations were performed with different values of K to monitor the variation of

    performance.

    Figure 58: Davies-Bouldin values per K

    K     Davies-Bouldin
    3     -2.711
    6     -2.657
    12    -2.461
    24    -2.313
    48    -2.304
    96    -2.297

    Figure 59: Davies-Bouldin graph (the values above plotted against K)

    We notice that the curve becomes flat starting at K = 24. Thus, we can conclude that
    an optimal number of clusters lies in the range between 12 and 24.
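
    This sweep can be sketched outside RapidMiner as well, reusing the encoded data from
    the sketches above (again with scikit-learn's positive sign convention for the
    index):

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.metrics import davies_bouldin_score

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 5))    # stand-in for the encoded policy data

        # Same K values as in our iterations; n_init plays the role of Max Runs
        for k in (3, 6, 12, 24, 48, 96):
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
            print(k, round(davies_bouldin_score(X, labels), 3))

        # We keep the K where the curve flattens out (between 12 and 24 in
        # our data) rather than the largest K tested.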


    Conclusion

    This project was a typical application of data mining techniques to a real-life
    situation. We went through the different phases, from data preparation to
    interpretation and drawing conclusions. We also noticed that not all techniques fit
    the data perfectly; a thorough knowledge of the business is therefore essential in
    order to plan which techniques to use.

    It is worth mentioning that result evaluation is sometimes subjective: the company in
    need of this information would be the only judge of the accuracy of certain outcomes.
    For instance, we discovered several correlations between the driver, the car and the
    occurrence of an accident. It is now up to the firm to decide whether to increase or
    decrease premiums for certain groups of people and cars.


    Glossary

    A

    Accuracy: A measure of a predictive model that reflects the proportionate number of times that

    the model is correct when applied to data.

    B

    Binning: The process of breaking up continuous values into bins, usually done as a
    preprocessing step for some data mining algorithms; for example, breaking age up into
    bins of ten years each.

    C

    Claims: Claims and loss handling is the materialized utility of insurance; it is the
    actual "product" paid for. Claims may be filed by the insured directly with the
    insurer or through brokers or agents. The insurer may require that the claim be filed
    on its own proprietary forms, or may accept claims on a standard industry form.

    Cross Validation (and Test Set Validation): The process of holding aside some
    training data which is not used to build a predictive model, and later using that
    data to estimate the accuracy of the model on unseen data, simulating the real-world
    deployment of the model.

    Comma-separated values (CSV): A common text-based format for data where the divisions
    between attributes (columns of data) are indicated by commas.

    Confidence level: A value, usually 5% or 0.05, used to test for statistical
    significance in some data mining methods. If statistical significance is found, a
    data miner can say that there is a 95% likelihood that a calculated or predicted
    value is not a false positive.

    D

    Decision Tree: A class of data mining and statistical methods that form tree-like
    predictive models.

    Data analysis: The process of examining data in a repeatable and structured way in
    order to extract meaningful patterns or messages from a set of data.


    L

    Label: In RapidMiner, the role that must be set in order to use an attribute as the
    dependent, or target, attribute in a predictive model.

    M

    Missing Data: Instances in an observation where one or more attributes do not have a
    value. This is not the same as zero, because zero is a value.

    P

    Prediction: The target, label or dependent attribute that is generated by a
    predictive model, usually for a scoring data set.

    T

    Training Data: In a predictive model, the data set that already has the label, or
    dependent variable, defined, so that it can be used to create a model which is then
    applied to a scoring data set in order to generate predictions for the latter.

