Computer Engineer Master Project

Development of an intelligent system predictor of delinquency profiles

Transcript of Computer Engineer Master Project

Page 1: Computer Engineer Master Project

Development of an intelligent system predictor of delinquency profiles

Page 2: Computer Engineer Master Project

What we're going to see

● Motivation and goals

● Review on Case-Based Reasoning

● A few learning techniques

● Most relevant error estimators

● Software implementation

● Implied technologies

● Testing the software

● Delinquency detection

● Planning of the project

● Future of the project

● Conclusions

Page 3: Computer Engineer Master Project

Motivation

● Release a valuable project by taking advantage of the knowledge recently learned.

Goals

● Development of software in Ruby, based on CBR, capable of predicting customer profiles involving fraud.

● Testing the software

● Attempt to predict with real cases provided by Maderas Gomez S.A.

Page 4: Computer Engineer Master Project

What is Case-Based Reasoning?

Case-Based Reasoning (CBR) is a name given to a reasoning method that uses specific past experiences rather than a corpus of general knowledge.

It is a form of problem solving by analogy in which a new problem is solved by recognizing its similarity to a specific known problem, then transferring the solution of the known problem to the new one.

CBR systems consult their memory of previous episodes to help address their current task, which could be:

● planning of a meal,

● classifying the disease of a patient,

● designing a circuit, etc.

Page 5: Computer Engineer Master Project

Case-Based Reasoning Features

Possibly the simplest form of machine learning

Training cases are simply stored

Each case is composed of a set of attributes, one of which is the assigned classification

Previously solved experiences are used to resolve current cases

May entail storing newly solved problems into the case base

Page 6: Computer Engineer Master Project

Case-Based Reasoning Cycle

● At the highest level of generality, a general CBR cycle may be described by the following four processes:

1. RETRIEVE the most similar case or cases

2. REUSE the information and knowledge in that case to solve the problem

3. REVISE the proposed solution

4. RETAIN the parts of this experience likely to be useful for future solving

● A new problem is solved by retrieving one or more previously experienced cases, reusing the case in one way or another, revising the solution based on reusing a previous case, and retaining the new experience by incorporating it into the existing knowledge-base (case-base).
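The four-step cycle above can be sketched in Ruby, the project's implementation language. This is an illustrative sketch only: the `Case` structure and the Euclidean similarity measure are assumptions, not the project's actual code.

```ruby
# Illustrative sketch of the CBR cycle; Case and the distance
# measure are assumptions, not the project's actual classes.
Case = Struct.new(:features, :solution)

class CaseBase
  def initialize(cases)
    @cases = cases
  end

  # 1. RETRIEVE the most similar stored case.
  def retrieve(features)
    @cases.min_by { |c| distance(c.features, features) }
  end

  # 2. REUSE its solution, 3. REVISE it (a no-op in this sketch),
  # 4. RETAIN the newly solved problem in the case base.
  def solve(features)
    proposed = retrieve(features).solution   # reuse
    revised  = proposed                      # revise
    @cases << Case.new(features, revised)    # retain
    revised
  end

  private

  # Euclidean distance between two numeric feature vectors.
  def distance(a, b)
    Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
  end
end
```

Note how retaining each solved problem grows the case base, so later queries can match experiences the system was never explicitly trained on.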

Page 7: Computer Engineer Master Project

CBR Common Applications

● Help-desk

● Diagnosis

Page 8: Computer Engineer Master Project

Learning Techniques

Decision Tree

● Method for approximating discrete-valued target functions, represented as disjunctions (classification)

● One of the most widely used methods for inductive inference

● Can be represented as if-then rules

Nearest Neighbor

● All instances correspond to points in an n-dimensional Euclidean space

● Classification is done by comparing the feature vectors of the different points

● The target function may be discrete or real-valued

Page 9: Computer Engineer Master Project

Decision Tree Example

Each internal node corresponds to a test

Each branch corresponds to a result of the test

Each leaf node assigns a classification
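The node/branch/leaf structure above maps directly onto nested Ruby hashes. Below is a hedged sketch: internal nodes test an attribute, branches hold the test outcomes, and leaves assign a classification. The weather attributes are textbook placeholders, not the project's data.

```ruby
# Tiny decision tree as nested hashes; OUTLOOK/HUMIDITY are
# illustrative textbook attributes, not the project's data.
TREE = {
  attribute: :outlook,
  branches: {
    'Sunny'    => { attribute: :humidity,
                    branches: { 'High' => 'No', 'Normal' => 'Yes' } },
    'Overcast' => 'Yes',
    'Rain'     => 'No'
  }
}.freeze

# Follow the branch matching each test result until a leaf is reached.
def classify(node, instance)
  return node if node.is_a?(String)      # leaf node => classification
  outcome = instance[node[:attribute]]   # result of the node's test
  classify(node[:branches][outcome], instance)
end

classify(TREE, { outlook: 'Sunny', humidity: 'Normal' })  # => "Yes"
```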

Page 10: Computer Engineer Master Project

1-Nearest Neighbor Example

Page 11: Computer Engineer Master Project

3-Nearest Neighbor Example

Page 12: Computer Engineer Master Project

Error estimators

● There are many ways of estimating error. The following are three of them:

– Hold-out

– K-fold cross-validation

– Leave one out

Page 13: Computer Engineer Master Project

Hold-out Method

● The hold-out method splits the data into training data and test data (usually 2/3 for train, 1/3 for test). Then we build a classifier using the train data and test it using the test data.

● Used when a large number of instances is available

● Needs plenty of information from each class
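The 2/3 train, 1/3 test split described above can be sketched in a few lines of Ruby; the `seed` parameter is an assumption added here so the split is reproducible.

```ruby
# Hold-out sketch: shuffle the instances, then split them into
# 2/3 training data and 1/3 test data.
def hold_out(instances, train_fraction: 2.0 / 3, seed: 42)
  shuffled = instances.shuffle(random: Random.new(seed))
  cut = (shuffled.size * train_fraction).round
  [shuffled[0...cut], shuffled[cut..]]
end

train, test = hold_out((1..9).to_a)
# train holds 6 instances, test holds the remaining 3
```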

Page 14: Computer Engineer Master Project

K-Fold Cross-Validation Method

● k-fold cross-validation avoids overlapping test sets:

– Step 1: the data is split into k subsets of equal size

– Step 2: each subset in turn is used for testing and the remainder for training

● The subsets are stratified before the cross-validation

● The estimates are averaged to yield an overall estimate
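The two steps above can be sketched as follows; this minimal version deals instances into folds round-robin and omits the stratification step for brevity.

```ruby
# k-fold sketch: deal instances into k near-equal folds; each fold is
# used once for testing while the remaining folds train.
def k_fold_splits(instances, k)
  folds = Array.new(k) { [] }
  instances.each_with_index { |inst, i| folds[i % k] << inst }
  folds.each_index.map do |i|
    train = (folds[0...i] + folds[(i + 1)..]).flatten(1)
    [train, folds[i]]
  end
end
```

In a full run, a classifier is built on each `train` set, evaluated on the matching `test` fold, and the k accuracies are averaged.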

Page 15: Computer Engineer Master Project

Leave One Out Method

● Leave-One-Out is a particular form of cross-validation:

– The number of folds is set to the number of training instances

– e.g., for n training cases, build a classifier n times

● Makes the best use of the data

● Very computationally expensive

Page 16: Computer Engineer Master Project

Software Development

● Two different algorithms have been implemented:

• C4.5, which is an extension of Quinlan's ID3 algorithm and generates a decision tree capable of classification.

• K-Nearest Neighbor, which classifies instances based on the closest training examples in the feature space.

Page 17: Computer Engineer Master Project

C4.5 implementation

● Entropy: Entropy(S) = −Σᵢ pᵢ · log₂(pᵢ)

● Information gain: Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sᵥ| / |S|) · Entropy(Sᵥ)

● Data structures:

– Training cases → vector of classes (filled iteratively); each instance is a class

– Decision tree → vector of classes (filled recursively); each node is a class
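The entropy and information gain measures used by ID3/C4.5 can be sketched in Ruby as below; representing each instance as a hash with a `:class` key is an assumption about the data layout, not the project's actual structure.

```ruby
# Entropy of a set of instances; :class holds the classification.
# Entropy(S) = -sum of p_i * log2(p_i) over the classes.
def entropy(instances)
  total = instances.size.to_f
  instances.group_by { |i| i[:class] }.values.sum do |group|
    p = group.size / total
    -p * Math.log2(p)
  end
end

# Gain(S, A): the entropy reduction obtained by splitting S on A.
def information_gain(instances, attribute)
  total = instances.size.to_f
  remainder = instances.group_by { |i| i[attribute] }.values.sum do |subset|
    (subset.size / total) * entropy(subset)
  end
  entropy(instances) - remainder
end
```

C4.5 grows the tree by choosing, at each node, the attribute with the highest gain over the instances that reach that node.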

Page 18: Computer Engineer Master Project

C4.5 implementation (II)

● Pruning technique:

pre-pruning: stop building a branch because of unreliable information.

post-pruning: discard inefficient branches once the decision tree has been completed.

The next formula estimates the error by taking into account the pruning:

The next formula estimates the error without pruning:

So the condition is as follows:

if E(S) < BackUpError(S) then prune the node

Page 19: Computer Engineer Master Project

C4.5 implementation (III)

● Continuous attributes:

• Each continuous value is discretized into a nominal value, taking into account the maximum and minimum of its attribute.

• Moreover, three different discretization ranges are possible and configurable:

Two levels: [High, Low]
Three levels: [High, Middle, Low]
Four levels: [Very High, High, Low, Very Low]

• Thus the range of distinct tests is wider.
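A minimal sketch of this discretization step, assuming equal-width bins between the attribute's observed minimum and maximum (the project's exact binning rule is not stated on the slide):

```ruby
# Map a continuous value onto 2, 3 or 4 nominal levels using its
# attribute's observed min and max (equal-width bins assumed).
LEVELS = {
  2 => ['Low', 'High'],
  3 => ['Low', 'Middle', 'High'],
  4 => ['Very Low', 'Low', 'High', 'Very High']
}.freeze

def discretize(value, min, max, levels)
  labels = LEVELS.fetch(levels)
  return labels.first if max == min
  # Scale the value's position within [min, max] to a bin index.
  bin = ((value - min).to_f / (max - min) * levels).floor.clamp(0, levels - 1)
  labels[bin]
end
```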

Page 20: Computer Engineer Master Project

K-NN implementation

● Norm:

– Each continuous attribute of each instance is standardized as follows:

● Sum of all the different distances:

● Distance functions: Minkowsky, Sokal-Michener, Overlap
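Hedged sketches of the distance functions named on this slide; the min-max standardization shown is an assumption about the norm the slide refers to, and the Sokal-Michener measure is given in its usual form as a fraction of matching attributes.

```ruby
# Minkowsky distance of order r between two numeric vectors
# (r = 2 gives the Euclidean distance).
def minkowski(a, b, r)
  a.zip(b).sum { |x, y| (x - y).abs**r }**(1.0 / r)
end

# Overlap: number of mismatching nominal attributes.
def overlap(a, b)
  a.zip(b).count { |x, y| x != y }
end

# Sokal-Michener: fraction of matching attributes (a similarity).
def sokal_michener(a, b)
  a.zip(b).count { |x, y| x == y } / a.size.to_f
end

# Min-max standardization of one continuous attribute value.
def standardize(value, min, max)
  (value - min).to_f / (max - min)
end
```

Standardizing continuous attributes first keeps one wide-ranged attribute from dominating the summed distance.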

Page 21: Computer Engineer Master Project

K-NN implementation (II)

● Number of neighbors (k):

– This parameter is configurable

– The most common values of k are 5, 7, 11 and 21; nonetheless, it depends on the problem domain

– It must be odd to avoid possible ties between classifications

● Data structures:

– Training cases → vector of classes (filled iteratively); each instance is a class

– Distances → vector of floats (filled iteratively)
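The configurable k and the majority vote described above can be sketched as follows; representing training cases as `[feature_vector, classification]` pairs is an assumption for the example.

```ruby
# k-NN sketch: take the k nearest training cases by Euclidean distance
# and classify by majority vote; an odd k avoids two-class ties.
def knn_classify(training, query, k: 5)
  neighbors = training.min_by(k) do |features, _|
    Math.sqrt(features.zip(query).sum { |x, y| (x - y)**2 })
  end
  # Majority vote among the k nearest neighbors' classifications.
  neighbors.map(&:last).tally.max_by { |_, count| count }.first
end
```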

Page 22: Computer Engineer Master Project

System Schema

0;value0
1;value1
2;value2
3;value3
...
N;valueN

Stats File

Page 23: Computer Engineer Master Project

Project File Structure

K-Nearest Neighbor Subsystem

Decision Tree Subsystem

Input Files

Project description

File from which the system starts up

Page 24: Computer Engineer Master Project

Subsystem KNN Class Diagram

Page 25: Computer Engineer Master Project

Subsystem DT Class Diagram

Page 26: Computer Engineer Master Project

Technologies

● Ruby

– Dynamic

– Reflective

– Imperative

– General-purpose

– Object-oriented

– Inspired by Perl and Smalltalk

● Redcar

– Full feature set for Ruby

– Still in development

● Ubuntu 12.04

– Best O.S. to deploy Ruby's virtual machine

– Fast

– Easy-to-use

Page 27: Computer Engineer Master Project

Experiments

● The predictions are rated by calculating the accuracy as follows: accuracy = correct predictions / total number of predictions

● The software is tested with:

– case bases extracted from UCI Machine Learning Repository.

– the error estimator K-Fold Cross-Validation, of which Leave-One-Out is a particular case; the case bases are partitioned into 10 portions: K = 10

– 1,000 executions.

Page 28: Computer Engineer Master Project

Hepatitis Detection Experiment

● Features of the case base:

Source: Doctor Bojan Cestnik of the Jozef Stefan Institute

Motive: Classify whether a patient suffers from hepatitis

Number of attributes: 19

Type of attributes: Categorical, integer and real

Number of instances: 155

Missing values?: Yes

Number of classes: 2

Algorithm: C4.5

Levels of discretization: 4

Official accuracy: ≈ 80%

Page 29: Computer Engineer Master Project

Hepatitis Detection Experiment (II)

● Accuracy of the 1,000 executions:

➔ Average accuracy ≈ 78%, close to the official ≈ 80%

➔ Pretty good precision

Page 30: Computer Engineer Master Project

Hepatitis Detection Experiment (III)

● Some important rules pulled out of the decision trees:

1. (ALBUMIN = Very High or Low) and (PROTIME = Very Low) and (HISTOLOGY = No) → LIVE

2. (HISTOLOGY = No) and (PROTIME = Very High) → LIVE

3. (HISTOLOGY = Yes) and (PROTIME = High) and (ALBUMIN = Low) → LIVE

4. (ALBUMIN = High) and (SGOT = Low) and (PROTIME = Very Low) and (HISTOLOGY = No) → DIE

5. (ALBUMIN = Very Low) and (SGOT = Low) and (HISTOLOGY = Yes) → DIE

Page 31: Computer Engineer Master Project

Vehicle Shape Experiment

● Features of the case base:

Source: Pete Mowforth and Barry Shepherd of the Turing Institute

Motive: Classify a vehicle silhouette into one of four kinds according to several characteristics

Number of attributes: 18

Type of attributes: Integer

Number of instances: 946

Missing values?: No

Number of classes: 4

Algorithm: K-Nearest Neighbor

K: 7

Official accuracy: None

Page 32: Computer Engineer Master Project

Vehicle Shape Experiment (II)

● Accuracy of the 1,000 executions using the Euclidean distance:

➔ Average accuracy ≈ 69-70% → not bad

➔ With k = 21 the maxima are higher, but the average accuracy remains the same

Page 33: Computer Engineer Master Project

Delinquency Detection

● The rating is done similarly to the previous experiments:

● Dataset provided by a Catalan SME called Maderas Gomez S.A.

● Error estimator: Hold-out

– 70% of dataset → Training

– 30% of dataset → Test

● Variable amount of executions

Page 34: Computer Engineer Master Project

Delinquency Detection (II)

● Features of the case base:

Source: Maderas Gomez, S.A.

Motive: Label customer profiles as payment delinquents or non-delinquents.

Number of attributes: 5

Type of attributes: Integer and float

Number of instances: 770

Missing values?: Yes

Number of classes: 2

Algorithm: C4.5 and K-Nearest Neighbor

K: 5, 11

Levels of discretization: 2, 4

Official accuracy: None

➔ Unfortunately, all attributes are continuous

Page 35: Computer Engineer Master Project

Delinquency Detection (III)

● Accuracy of:

• 50 executions

• C4.5 algorithm

• 2 levels of discretization

➔ Average accuracy ≈ 95-96% → pretty high

Page 36: Computer Engineer Master Project

Delinquency Detection (IV)

● Accuracy of:

• 100 executions

• C4.5 algorithm

• 4 levels of discretization

➔ Average accuracy ≈ 94-96%

Page 37: Computer Engineer Master Project

Delinquency Detection (V)

● Accuracy of:

• 50 executions

• 5-Nearest Neighbor

• Euclidean distance function

➔ Average accuracy ≈ 94-95% → pretty high

Page 38: Computer Engineer Master Project

Delinquency Detection (VI)

● Accuracy of:

• 50 executions

• 11-Nearest Neighbor

• Euclidean distance function

➔ Average accuracy ≈ 94% → a little worse than with k = 5

Page 39: Computer Engineer Master Project

Delinquency Detection (VII)

● As for the rules pulled out of the decision tree:

1. (DIFERENCIA = Very High) and (FORMA DE PLAZO = Very Low) and (F.P REAL = Very Low) → DELINQUENT

2. (CONSUMIDO = Very Low) and (CONCEDIDO = Very High) and (DIFERENCIA = Very Low) and (FORMA DE PLAZO = Very High) and (FP. REAL = Very Low) → DELINQUENT

Page 40: Computer Engineer Master Project

Planning of the project

Research - 100 h

Designing - 80 h

Implementation - 300 h

Experiments - 75 h

Report - 75 h

Page 41: Computer Engineer Master Project

Future Of The Project

● Implementation of a functionality capable of drawing a Voronoi diagram for the k-Nearest Neighbor algorithm.

● Embed the system core (KNN and Decision Tree subsystems) into a Web environment.

● Obtain new and better information related to customers of the same business and see if we get more reliable results.

● Apply the software to other sorts of fields.

Page 42: Computer Engineer Master Project

Conclusions

● Although the software works well, accuracies as high as those shown in the delinquency prediction slides may be suspicious. I suspect the attributes do not provide the most suitable information.

● If Maderas Gomez S.A. wants to predict possible delinquency more accurately, it must start gathering as much client-related information as possible.

● Ruby is a very powerful programming language that can be applied to many fields that computation touches, and in the coming years it will be one of the most important.

● On a personal note, Machine Learning has drawn my attention to the point of considering devoting my professional career to this field.