EE660 Project

Walmart Recruiting: Trip Type

Classification

Shanglin Yang, [email protected]

Yi Zheng, [email protected]

December 8th 2015

Instructor: Professor B. Keith Jenkins

Table of Contents

Abstract
1. Project Homepage
2. Problem statement and goals
3. Literature Review
4. Prior and Related Work
5. Project Formulation and Setup
6. Methodology
7. Implementation
7.1 Feature Space
7.2 Pre-processing and Feature Extraction
7.3 Training Process
7.3.1 Naïve Bayes classifier
7.3.2 K-nearest neighbors (KNN) classifier
7.3.3 Scikit-based using SVM, Random Forest and Adaboost
7.4 Testing, Validation and Model Selection
8. Final Results
9. Interpretation
10. Summary and conclusions

Abstract

The project aims to help Walmart classify customer trip types using only the dataset of the items customers have purchased. We apply machine learning methods to the purchase history data and the customer trip types provided by Walmart. The main challenges are transforming the raw data into features that represent trip type well and learning a predictive model based on these features. We first look for the best representation of the sample data as features and then solve the trip type classification problem with five machine learning methods: naïve Bayes, K-nearest neighbors (KNN), Support Vector Machine (SVM), Random Forest and adaptive boosting. The random forest performs best, and we use this model as the final classification system to predict the trip types of unseen customer purchase data. We obtain a reasonable predictive score, but there is still considerable room for improvement. We find that features are essential in applied machine learning, and we will next explore the process of feature selection and extraction from the raw data in more depth to improve the accuracy of our classification system.

1. Project Homepage

https://github.com/ee660finalproject/EE660_Group_pro

2. Problem statement and goals

Walmart improves customers' shopping experiences by segmenting their store visits into different

trip types. Whether they're on a last minute run for new puppy supplies or leisurely making their

way through a weekly grocery list, classifying trip types enables Walmart to create the best

shopping experience for every customer. Currently, Walmart's trip types are created from a

combination of existing customer insights and purchase history data. In this problem, we will focus

on the purchase history data and classify customer trips using only a transactional dataset of the

items they've purchased. The goal is to help Walmart refine their segmentation process by

improving the data behind trip type classification. Walmart has categorized the trips contained in

this data into 38 distinct types with 647054 training samples and 653646 test samples.

Data Fields:

TripType - a categorical id representing the type of shopping trip the customer made.

TripType_999 is an "other" category.

VisitNumber - an id corresponding to a single trip by a single customer

Weekday - the weekday of the trip

Upc - the UPC number of the product purchased

ScanCount - the number of the given item that was purchased. A negative value indicates a

product return.

DepartmentDescription - a high-level description of the item's department

FinelineNumber - a more refined category for each of the products, created by Walmart.

This is an interesting and challenging problem. Since we are not provided with more information

than what is given in the data (e.g. what the TripTypes represent or more product information), we

need to mine the useful information behind the purchase history data by ourselves to predict the

trip types. Using both art (customer insights) and science (purchase history data) will help Walmart

make progress on the core mission of better understanding and serving their customers. The

challenge is to recreate this categorization/clustering with a more limited set of features. This could

provide new and more robust ways to categorize trips. It requires significant amounts of

preprocessing. There is some missing data. Each customer has only one trip type but may purchase more than one commodity. Grouping the samples by VisitNumber, we find that there are 94247 customers. The feature space has a high dimensionality of 102984 dimensions in total, which yields a sparse matrix, so we need to perform feature selection and feature extraction to reduce the dimensions. A good selection of features leads to an excellent classification, but it is hard to select them from such a huge number of features. Since the project involves massive data, the training procedure is time consuming.

3. Literature Review

Reviewed paper: Largeron, Christine, Christophe Moulin, and Mathias Géry. "Entropy based feature selection for text categorization." Proceedings of the 2011 ACM Symposium on Applied Computing. ACM, 2011.

This paper reviewed several feature selection methods, including document frequency (DF), information gain (IG), mutual information (MI), χ2, odds ratio and GSS, and proposed a feature selection criterion called Entropy based Category Coverage Difference (ECCD). From the paper, we get the basic ideas of information-theoretic feature selection and the implementation of information gain (IG). We then implement the algorithm in our feature selection.

The basic idea of feature selection is to build a function that tries to capture the intuition that the best terms for c_i are the ones distributed most differently in the sets of positive and negative examples of c_i. We choose IG for its easy implementation as well as its high performance.

Given a term t_j and a category c_k, IG(t_j, c_k) can be computed from a contingency table. Let A be the number of documents in the category containing t_j; B, the number of documents in the other categories containing t_j; C, the number of documents of c_k which do not contain t_j; and D, the number of documents in the other categories which do not contain t_j (with N = A + B + C + D).

Fig.1 The ECCD Matrix

In our problem, we use a 97714x38 matrix IGM. The rows stand for the terms (Upc numbers) and the columns stand for the categories (the trip types). Each element of the matrix is the number of occurrences of that Upc number in that trip type, i.e.

A_{jk} = \mathrm{IGM}_{jk}, \quad
B_{jk} = \sum_{i=1}^{38} \mathrm{IGM}_{ji} - \mathrm{IGM}_{jk}, \quad
C_{jk} = \sum_{i=1}^{97714} \mathrm{IGM}_{ik} - \mathrm{IGM}_{jk}, \quad
D_{jk} = \sum_{m=1}^{97714}\sum_{i=1}^{38} \mathrm{IGM}_{mi} - A_{jk} - B_{jk} - C_{jk}.

Using the contingency table, the Information Gain can be estimated by:

IG(t_j, c_k) \approx -\frac{A+C}{N}\log\frac{A+C}{N} + \frac{A}{N}\log\frac{A}{A+B} + \frac{C}{N}\log\frac{C}{C+D}    (1)

Then, we can use the IG value as a criterion for choosing features; for example, we keep the features with large IG (greater than a threshold).
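As a concrete sketch, the contingency quantities and equation (1) can be computed for all (Upc, trip type) pairs at once with NumPy. This is only an illustration of the method, not the exact code we ran; the IGM matrix and the number of selected features follow the description above.

import numpy as np

def information_gain(IGM):
    # IGM[j, k] = number of trips of type k containing UPC j (97714 x 38 in our data)
    N = IGM.sum()
    A = IGM                                     # trips of type k containing UPC j
    B = IGM.sum(axis=1, keepdims=True) - IGM    # trips of other types containing UPC j
    C = IGM.sum(axis=0, keepdims=True) - IGM    # trips of type k not containing UPC j
    D = N - A - B - C                           # trips of other types not containing UPC j
    eps = 1e-12                                 # guard against log(0) and 0/0
    return (-(A + C) / N * np.log((A + C) / N + eps)
            + A / N * np.log(A / (A + B + eps) + eps)
            + C / N * np.log(C / (C + D + eps) + eps))

# score each UPC by its best IG over the 38 classes and keep the top 5000
# IGM = ...  (built from the training data)
# top_upc = np.argsort(information_gain(IGM).max(axis=1))[::-1][:5000]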

4. Prior and Related Work

There is no prior or related work of ours on this problem.

5. Project Formulation and Setup

After analyzing the problem description and goal, we decided to implement a standard machine learning training and testing process to handle this problem: use the known sample features and labels to train a multi-class classification model, which can then assign a class to a new sample with the same features. In our case, we decided to implement five algorithms to do the classification:

5.1 Naïve Bayes classifier

Naïve Bayes is a simple kind of generative classifier, which is a model of the form:

p(y, \vec{x} \mid \theta) = p(y \mid \pi) \prod_{j=1}^{D} p(x_j \mid y, \theta_j)    (2)

It is fit by MAP estimation with a vague Dirichlet prior (add-one smoothing). Typically, the results are not too sensitive to the setting of this prior (unlike discriminative models). In this problem, we use MAP estimation. The model has two fields, theta(c, j) and classPrior(c), and it serves as our baseline classifier.

Table 1 Parameters within the Naïve Bayes classifier

model.theta(c, j)     The probability that feature j turns on in trip type c
model.classPrior(c)   The prior probability of trip type c

5.2 K-nearest neighbors (KNN) classifier

KNN is a generative classifier where the class-conditional density is a non-parametric kernel density estimator. The function is only approximated locally from the samples, and all computation is deferred until classification. KNN can also weight the contributions of the neighbors, so that nearer neighbors contribute more to the vote than more distant ones. In this problem, the KNN model finds the k closest training samples to each test sample and assigns the majority label among them to the test sample.

• Parameters:

— similarity function: K: X × X → R

— number of nearest neighbors to consider: k = 38

• Prediction rule for a test sample x':

— knn(x'): the k training samples x_i with the smallest Euclidean distance d(x', x_i)

— \hat{y}(x') = \arg\max_{y \in Y} \sum_{i \in \mathrm{knn}(x')} \mathbb{1}[y_i = y]
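The prediction rule can also be written as a short NumPy sketch; this is a hypothetical illustration, while the actual experiments use the pmtk3 implementation described in Section 7.3.2.

import numpy as np

def knn_predict_one(X_train, y_train, x_query, k=38):
    # Euclidean distance from the query to every training sample
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nn = np.argsort(dists)[:k]                    # indices of the k nearest neighbors
    labels, counts = np.unique(y_train[nn], return_counts=True)
    return labels[np.argmax(counts)]              # majority vote over knn(x')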

5.3 SVM (Multi-class)

Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of the categories, an SVM training algorithm builds a model that assigns new examples to different sides of a decision boundary based on the kernel function. The basic idea behind the SVM, shown in Fig.2(a), is to maximize the margin and minimize the training error simultaneously.

The SVM can fit high-dimensional datasets effectively, and it uses only a subset of the training points in the decision function (the support vectors), so it is also memory efficient. There are also several modifications we can apply to fit our problem:

Fig.2 The SVM principle (a) and non-linear model (b)

As shown in Fig.2(b), we try to fit a non-linear classifier. To do that, we use a kernel function to map the non-linear problem. We choose the Radial Basis Function (RBF) kernel because it requires few parameters to optimize and measures distances appropriately in high dimensions. The mathematical formulation is given below:

Training: \text{maximize } D(\vec{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(\vec{x}_i, \vec{x}_j)    (3)

\text{s.t. } \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \qquad K(\vec{x}_i, \vec{x}_j) = \exp\left(-\|\vec{x}_i - \vec{x}_j\|^2 / \sigma^2\right)

Classification: for a new example \vec{x}, \; h(\vec{x}) = \mathrm{sign}\Big(\sum_{\vec{x}_i \in SV} \alpha_i y_i K(\vec{x}_i, \vec{x}) + b\Big)    (4)

Since the SVM is in general a binary classifier, we can use the one-vs-one (ovo) or one-vs-rest (ovr) method to make the model fit the multi-class problem; we also use cross validation to choose the best parameters.

The parameters include the penalty parameter C (int), the kernel function ('linear', 'rbf', 'poly'), the degree of the polynomial kernel function (int), gamma (float), and decision_function_shape ('ovo', 'ovr').
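As a sketch of how this looks with scikit-learn (the parameter values here are only illustrative placeholders, since the tuned values come from the cross-validation in Section 7.4; X_train, y_train and X_test denote the preprocessed arrays from Section 7.2):

from sklearn.svm import SVC

# RBF-kernel SVM; scikit-learn extends it to the multi-class case internally
clf = SVC(C=100, kernel='rbf', gamma=0.01, decision_function_shape='ovr')
clf.fit(X_train, y_train)        # X_train: samples x features, y_train: trip types
y_pred = clf.predict(X_test)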

5.4 Random forest Classification

Random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting (Fig.3).

Basically, it generates good results given sufficiently many trees, and it also uses a random subset of features for each tree, which further improves the final result.

The parameters that can be modified include the number of trees/estimators (int) and max_depth (int). There are other parameters we could tune, but in general we let them be set automatically.
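A minimal scikit-learn sketch of this model (the parameter values are placeholders that are later chosen by cross-validation; X_train, y_train, X_test are the preprocessed arrays):

from sklearn.ensemble import RandomForestClassifier

# n_estimators and max_depth are the two parameters we tune; the rest keep their defaults
rf = RandomForestClassifier(n_estimators=800, max_depth=24, n_jobs=-1)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)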

5.5 Adaboost Classification

An AdaBoost classifier is another meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, but with the weights of incorrectly classified instances adjusted so that subsequent classifiers focus more on difficult cases. It is based on weak estimators that use subsets of the features as well as of the samples, and it boosts the results iteratively.

Adaboost should work well for our problem: on the one hand it handles the high dimensionality, and on the other hand it continually penalizes the misclassified data, which improves performance. It can also significantly reduce overfitting.

The algorithm consists of a training process plus a weighting and boosting process. The weak classifier we use in our implementation is a basic decision-tree classifier with small depth.

Fig.3 The framework of Random Forest

Each time the base estimator gives intermediate results, the misclassified samples are assigned larger weights so that a better classification can be made in the next round.

The parameters that we can adjust include the number of trees/estimators (int) and the learning rate (float); we let the system choose the other parameters automatically.
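A minimal scikit-learn sketch of this setup (the depth of the weak learner shown here is an assumption on our part; the other values are placeholders later tuned by cross-validation):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# shallow decision tree as the weak learner, boosted iteratively with re-weighted samples
ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2),
                         n_estimators=800, learning_rate=1.0)
ada.fit(X_train, y_train)
y_pred = ada.predict(X_test)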

6. Methodology

The framework is shown in Fig.4. The whole work includes preprocessing, training and evaluation.

6.1 Preprocessing

Too much information can reduce the effectiveness of classifier learning. It may actually

detract from the quality and accuracy of the model. Thus, the representation and quality of data is

first and foremost before running a classifier. In this section, we need to determine the actual

features for training our systems. Since we can hardly find meaningful information to the

classification of the trip types from the original purchase history data, we assembled some data

attributes and find the relationships between features to form new training set. Moreover, the new

training data set with many attributes may contain groups of attributes that are correlated. These

attributes may actually be measuring the same underlying feature. The redundant attributes simply

add noise to the data and affect model accuracy. Noise increases the complexity of the model and

the time and system resources needed for model building and scoring. The higher the

dimensionality of the processing space, the higher the computation cost involved in algorithmic

processing. To minimize the effects of noise, correlation, and high dimensionality, some form of

dimension reduction is sometimes a desirable preprocessing step. Feature selection and extraction

are approaches to dimension reduction. The product of data preprocessing and feature extraction

is the final training data.

6.2 Training Process

Once we get the final training set after preprocessing and feature extraction, we segment the final training data into two parts: a training set and a test set. To avoid data snooping, we set the test set aside and never look at it during the training process. We apply cross validation to the training process, i.e. the training set is divided into five equal-size subsets; each time, four of them are used for training the classifier and the remaining one is used for testing the performance of the specific model and parameters. In our project, we use five machine learning methods in total: naïve Bayes, KNN, SVM, Random Forest and Adaptive Boosting. By cross validation, we find the best classifier for each training method.

In particular, we need to consider the hypothesis sets of the different algorithms, since they are related to the feasibility and performance of learning.

Naïve Bayes: The naïve Bayes probability model is an independent-feature model. The naïve Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable; this is known as the maximum a posteriori (MAP) decision rule. The corresponding classifier, a Bayes classifier, is the function that assigns a class label \hat{y} = C_k as follows:

\hat{y} = \arg\max_{k \in \{1,\dots,K\}} p(C_k) \prod_{j=1}^{D} p(x_j \mid C_k)    (5)

K-nearest Neighbors: Since KNN is an instance-based learning algorithm, in a k-NN model, a

hypothesis is built from the training data directly at the time a query is made to the system. The

prediction is based on the K training instances closest to the case being scored. Therefore, all

training cases have to be stored, which may be problematic when the amount of data is large.

SVM: We use the RBF kernel and ovo/ovr to solve the multi-class problem. Recalling formulations (3) and (4), our target parameter is the vector \vec{\alpha} \in \mathbb{R}^n, where n is the number of training samples. So basically, our hypothesis set is the subset of the n-dimensional space \mathbb{R}^n satisfying the constraints \sum_{i=1}^{n} \alpha_i y_i = 0 and 0 \le \alpha_i \le C.

Fig.4 The Framework of the Whole System

Random Forest and Adaboost decision trees: In both cases, the basic idea is to use a decision tree as the weak classifier. Each tree i has a unit hypothesis set h^{(i)}, whose size is determined by the required depth and the halting condition; in general, for depth d and n nodes, h^{(i)} has on the order of D^n configurations, where D is the feature dimension. Therefore, for the whole system, the hypothesis set is H = \bigcup_{i=1}^{N} h^{(i)}, where N is the number of decision trees.

6.3 Evaluation

After the training process, we obtain the optimal model for each classification method. In this evaluation section, we use the test set to evaluate the performance of each classifier and select the classifier with the best performance as the final classification system. Then we complete the Walmart competition: predict the trip type from the customer purchase data and submit the predicted trip types to Kaggle.com to get a score for our classification system.

(https://www.kaggle.com/c/walmart-recruiting-trip-type-classification/submissions/attach)

7. Implementation

7.1 Feature Space

In the original dataset, each sample represents one commodity, and one customer usually purchases more than one commodity. Since the goal is to predict the trip type of the customer, we need to know when and what each customer buys on his/her trip. We therefore first assemble the samples belonging to each customer, so that each sample in our new dataset represents one customer. Each sample has features such as the purchasing weekday, and the DepartmentDescription, Upc and FinelineNumber of all purchased items with the corresponding quantities.

7.2 Pre-processing and Feature Extraction

STEP1:

              Original data size   Missing data size   New data size
Training set  647054               4129                642925
Test set      653646               3986                649660

The original data sets contain some missing data, but the amount is relatively small, so we simply discard the affected rows.

STEP2:

Depending on the VisitNumber, we can identify each customer visit and then merge each customer's items into one sample to form the new data set. There are 95674 customers in the training set and 95674 customers in the test set. After discarding the missing data, the actual number of customers is 94247 in the training set and 94288 in the test set. For training, it is fine to use the 94247 samples; for testing, we use our classification system to predict the trip types of the 94288 customers and assign trip type 999 ("other") to the 1386 customers with missing data.
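A sketch of this merging step with pandas (file and column names are those listed in Section 2; this is an illustration, not our exact preprocessing script):

import pandas as pd

train = pd.read_csv('train.csv').dropna()     # STEP1: discard rows with missing data

# STEP2: one row per customer visit, keeping every purchased item of that visit as a list
visits = train.groupby('VisitNumber').agg({
    'TripType': 'first',                       # one trip type per visit
    'Weekday': 'first',                        # one weekday per visit
    'DepartmentDescription': list,
    'FinelineNumber': list,
    'Upc': list,
    'ScanCount': list,
}).reset_index()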

STEP3:

Weekday is a categorical feature giving the purchasing weekday of each customer. We use a 7-digit binary number, setting the corresponding bit to "1" and the others to "0" to denote the weekday. The bit order is shown below:

7-digit: 'Friday' 'Monday' 'Saturday' 'Sunday' 'Thursday' 'Tuesday' 'Wednesday'

DepartmentDescription is a categorical feature describing the item's department. It shows the properties and functions of the items bought by the customers, so we think it carries a large amount of information for classifying the trip type. There are 68 distinct descriptions in total for all commodities, and we use all of them in a 68-digit feature. For each customer, the entry corresponding to a description is set to the ScanCount of the items he/she purchased from that department.

FinelineNumber is a more refined category of each product created by Walmart, and this feature gives more information for classification. There are 5195 FinelineNumbers in the training set, and we use a 5195-digit binary number to represent this feature. For each customer, the corresponding bit is set to 1 if the customer has purchased an item with that FinelineNumber. If a customer in the test set doesn't buy any of the 5195 FinelineNumber items, all 5195 digits are set to 0.

Upc is the UPC number of each product. There are 97714 UPCs in the training data, and it is not feasible to use a 97714-bit feature. Our idea is to choose the 5000 most representative UPC numbers out of the 97714, i.e. the ones most useful for separating the training samples into the given classes. To determine these 5000 UPC numbers, we apply the Information Gain method of Section 3.

The IG matrix provides a good information-theoretic feature selection: we pick the 5000 UPC numbers with the highest Information Gain. For each customer, the corresponding bit is set to "1" if the customer has purchased an item with one of these 5000 UPC numbers. If a customer doesn't buy any item with one of the 5000 UPC numbers, all 5000 digits are set to "0".
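For illustration, the Weekday and DepartmentDescription encodings described above can be built roughly as follows on top of the train and visits frames from the STEP2 sketch (the FinelineNumber and Upc indicator blocks are built the same way); the variable names are our own:

import numpy as np

weekdays = ['Friday', 'Monday', 'Saturday', 'Sunday',
            'Thursday', 'Tuesday', 'Wednesday']
departments = sorted(train['DepartmentDescription'].unique())   # 68 in our data

# 7-digit weekday indicator: one bit per weekday
W = np.zeros((len(visits), len(weekdays)))
for i, day in enumerate(visits['Weekday']):
    W[i, weekdays.index(day)] = 1

# 68-digit department feature: the corresponding entry holds the summed ScanCount
Dep = np.zeros((len(visits), len(departments)))
for i, (descs, counts) in enumerate(zip(visits['DepartmentDescription'],
                                        visits['ScanCount'])):
    for d, c in zip(descs, counts):
        Dep[i, departments.index(d)] += c

X = np.hstack([W, Dep])   # FinelineNumber and Upc indicator blocks are appended next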

STEP4:

We assemble all the new features obtained in STEP3 for each customer in this order:

Weekday | DepartmentDescription | FinelineNumber | Upc

After the preprocessing and feature extraction, the new training set has size 94247x10270 and the test set has size 94288x10270. The rows stand for the samples and the columns stand for the features.

STEP5:

Feature reduction: The new data set obtained after preprocessing and feature selection has 10270 dimensions, which is still a huge number of features. Many irrelevant features simply add noise to the data and hurt classifier accuracy, and some features are highly correlated, which reduces the effectiveness of the classifier. So we apply feature extraction to reduce the number of features.

1. 7-digit Weekday feature: we keep the 7 digits unchanged. (→ 7 digits)

2. 68-digit DepartmentDescription feature: we divide this feature into two parts:

(1) Apply LDA to the 68-dimensional feature space and transform it to a lower 37-dimensional space. (→ 37 digits)

(2) By mining the data, we find that among the 68 department descriptions there are 20 pairs of descriptions with high correlation to the trip type. So we use 20 binary digits to denote whether each pair occurs in the customer's trip. (→ 20 digits)

3. 5195-digit FinelineNumber feature: we use randomized PCA to reduce the dimension from 5195 to 68. (→ 68 digits)

4. 5000-digit Upc feature: we use randomized PCA to reduce the dimension from 5000 to 136. (→ 136 digits)

By feature extraction, we reduce the feature dimension from 10270 to 268. Feature reduction reduces the time and storage space required, the removal of collinearity improves the performance of the machine learning model, and it also reduces overfitting.
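A sketch of this reduction with scikit-learn (we show the current class names LinearDiscriminantAnalysis and PCA with svd_solver='randomized'; the 2015 release exposed the same methods under slightly different names, so treat the exact imports as an assumption). The column ranges follow the STEP4 ordering, and y denotes the trip-type labels:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA

# X has 10270 columns ordered as: 7 weekday | 68 department | 5195 fineline | 5000 upc
week, dept = X[:, :7], X[:, 7:75]
fine, upc = X[:, 75:5270], X[:, 5270:]

dept_lda = LinearDiscriminantAnalysis(n_components=37).fit_transform(dept, y)  # 38 classes -> 37 dims
fine_pca = PCA(n_components=68, svd_solver='randomized').fit_transform(fine)
upc_pca = PCA(n_components=136, svd_solver='randomized').fit_transform(upc)

# the 20 department-pair indicator bits are built separately by hand
X_reduced = np.hstack([week, dept_lda, fine_pca, upc_pca])   # 7+37+68+136 = 248 (+20 pair bits = 268)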

7.3 Training Process

7.3.1 Naïve Bayes classifier: We use naiveBayesFit from pmtk3 to generate a naïve Bayes model by MAP estimation and predict with the model using naiveBayesPredict. Since the features are binary, p(x_j \mid y = c, \theta) = \mathrm{Ber}(x_j \mid \theta_{jc}). The model is fit by MAP estimation with a vague Dirichlet prior (add-one smoothing). From the input training data, we compute the frequency of each trip type among the labels and use it as the prior for each class. By counting the total number of 1's and 0's of each bit within each class, we get the likelihood that each bit turns on in each class. The likelihood and prior are the model parameters.
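In NumPy, the fitting and prediction steps amount roughly to the following (this is our sketch of what naiveBayesFit and naiveBayesPredict compute for binary features, not the pmtk3 source):

import numpy as np

def naive_bayes_fit(X, y, classes):
    # X: n_samples x n_features binary matrix (NumPy array), y: trip-type labels
    class_prior = np.array([(y == c).mean() for c in classes])
    # add-one (Laplace) smoothing, i.e. MAP estimation with a vague prior
    theta = np.array([(X[y == c].sum(axis=0) + 1.0) / ((y == c).sum() + 2.0)
                      for c in classes])
    return theta, class_prior        # theta[c, j] = P(x_j = 1 | trip type c)

def naive_bayes_predict(theta, class_prior, X, classes):
    # log-posterior: log P(c) + sum_j [x_j log theta + (1 - x_j) log(1 - theta)]
    log_post = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T + np.log(class_prior)
    return np.asarray(classes)[np.argmax(log_post, axis=1)]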

7.3.2 K-nearest neighbors (KNN) classifier: We fit the model using knnFit from pmtk3. The KNN model is generated directly from the input training data, since it records each training sample and the corresponding class label. Once we fit the KNN model, we use knnPredict to find the predicted trip types. For each test sample, the classifier finds the nearest training samples in the model and assigns the majority class label among them to the test sample. The input arguments include the training samples (X), the training labels (y) and the number of neighbors (k).

% model = knnFit (Xtrain, ytrain, k)

% label = knnPredict (model, Xtest)


7.3.3 Scikit-based using SVM, Random Forest and Adaboost

We implement the multi-class SVM, Random Forest and Adaboost using the scikit-learn library. The flowchart of the whole process is shown below.

Fig.5 The Training Process of the Scikit-Based Training

The key concepts involved in the whole process are dimension reduction, parameter search and model building.

scikit-learn provides the functions needed for training. We use the functions below for training and testing, and we choose some of the basic parameters to examine the performance. Details are given in the next section.

Table 2 Functions within the Scikit-Based Training

Training (Model):
  SVC:           sklearn.svm.SVC
  Random Forest: sklearn.ensemble.RandomForestClassifier
  Adaboost:      sklearn.ensemble.AdaBoostClassifier
Predicting:      model.predict (for all three)

7.4 Testing, Validation and Model Selection

Since our work involves numerous models and parameters, we need to use cross validation to choose the parameters.

7.4.1 Flowchart of cross validation

We randomly choose 60000 samples from the 94247 preprocessed training samples and let these 60000 samples be X_Train. The rest of the samples are taken as X_Test, which is set aside until the final test.

Then we separate X_Train into 5 subsets of 12000 samples each. Each time, the model uses four subsets as the training set and the remaining one as the validation set. We record the results, change the parameters, and repeat the process.
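A sketch of this split using scikit-learn utilities (the modern sklearn.model_selection import is shown; array names follow the text above and are otherwise our own):

import numpy as np
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
idx = rng.permutation(len(X))                  # X, y: the 94247 preprocessed training samples
train_idx, test_idx = idx[:60000], idx[60000:]
X_Train, y_Train = X[train_idx], y[train_idx]
X_Test, y_Test = X[test_idx], y[test_idx]      # set aside until the final test

# 5 folds of 12000 samples: four folds for training, one for validation, rotated each time
for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(X_Train):
    pass  # fit the candidate model on X_Train[tr] and score it on X_Train[va]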

7.4.2 Model Target and Parameter candidate sets.

We have 5 basic models/algorithms, but since Naïve Bayes and KNN are baseline models, we care more about how the performance of the other three models changes with their parameters. We use the grid search algorithm to do the parameter search and optimization, using the tool sklearn.grid_search.GridSearchCV.

We use this method because we have a large number of samples as well as a large number of classes to train on, so we need to shrink the candidate sets to keep the computation tractable.

For each model we set different search parameters and their candidate sets:

SVC              C                    Kernel    Gamma                   decision_function_shape
Candidate sets   [1, 10, 100, 1000]   ['rbf']   [0.01, 0.001, 0.0001]   ['ovo', 'ovr']
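As a sketch, the SVC search above can be expressed with GridSearchCV (shown with the modern sklearn.model_selection import; in the 2015 release the same class lives in sklearn.grid_search):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [1, 10, 100, 1000],
              'gamma': [0.01, 0.001, 0.0001],
              'kernel': ['rbf'],
              'decision_function_shape': ['ovo', 'ovr']}

search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
search.fit(X_Train, y_Train)        # the 60000-sample training portion
print(search.best_params_)          # e.g. {'C': 100, 'gamma': 0.01, ...}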

The cross-validation results are shown in Fig.6.

Fig.6 The SVC cross-validation results (error rate vs C and error rate vs gamma, for 'ovo' and 'ovr')

We can see clearly that the decision_function_shape setting does not make much difference. Based on the results we choose the best parameters: {'C': 100, 'Gamma': 0.01}.

Randomforest     N_estimator                           Max_depth
Candidate sets   [50, 100, 200, 400, 600, 800, 1000]   [2, 4, 8, 12, 16, 20, 24, 28, 30]

The cross-validation results are shown in Fig.7.

Fig.7 The Random Forest cross-validation results (error rate vs N_estimator and error rate vs Max_depth)

Based on these results, we choose the best parameters for the random forest: {'N_estimator': 800, 'Max_depth': 24}.

Adaboost         n_estimator                     Learning rate
Candidate sets   [50, 100, 200, 400, 600, 800]   [0.6, 0.8, 1, 1.2]

The cross-validation results are shown in Fig.8.

Fig.8 The Adaboost cross-validation results (error rate vs N_estimator and error rate vs Learning Rate)

Based on these results, we choose the best parameters for Adaboost: {'N_estimator': 800, 'Learning Rate': 1}.

7.4.3 Final Test

We test our results in two ways. First, we use X_Test, which has not been looked at during the whole training process, to evaluate the different models with their best parameters and obtain predictions as well as error rates. We then decide on our final model based on these error rates as well as the computation time. Finally, using the test.csv downloaded from Kaggle and the whole train.csv data for training, we generate the labels for the test file, upload the submission, and get the result.
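A sketch of this final step (the layout with one probability column per trip type reflects our understanding of the Kaggle submission format; the variable names for the full training data and the Kaggle test matrix are placeholders):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# retrain the selected model on the full preprocessed training data
final_model = RandomForestClassifier(n_estimators=800, max_depth=24, n_jobs=-1)
final_model.fit(X_full_train, y_full_train)

proba = final_model.predict_proba(X_kaggle_test)          # one row per test visit
submission = pd.DataFrame(proba, columns=['TripType_%d' % c for c in final_model.classes_])
submission.insert(0, 'VisitNumber', test_visit_numbers)
submission.to_csv('submission.csv', index=False)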


8. Final Results

Table 3 Final Results

Algorithm                   E_in     E_out    Best Parameters                              Computation Time (s, 20000 samples)
Naïve Bayes                 0.345    0.415    -                                            600
K-nearest neighbors (KNN)   0.0799   0.441    -                                            900
SVM                         0.005    0.281    {'C': 100, 'Gamma': 0.01}                    800
Random Forest               0.007    0.228    {'N_estimator': 800, 'Max_depth': 24}        600
Adaboost                    0.080    0.255    {'N_estimator': 800, 'Learning Rate': 1}     1800

Based on the results above, the Random Forest algorithm performs best, with the lowest E_out. We use the Random Forest model as the final classifier of our classification system. We then apply the system to the test samples provided by Walmart to predict their trip types; the result is shown below:

Online competition submission performance:

Team name in Kaggle.com: Guoshiwushuang

Public score: 9.33204

According to the leaderboard, the best public score is 0.50519 and the worst is 34.53878 (lower scores are better for this competition's metric). The score is calculated on approximately 30% of the test data. We have submitted 6 times and improved our score from about 20 to 9, which shows that our method does work.

Meanwhile, from the competition forum topics, we find that some teams with better scores also apply the random forest method. We think the reason our system cannot reach their scores is that the features we select and extract are still not efficient and accurate enough for the classification. We need to find better relationships between the features of the products and the trip types and transform the original features more effectively.

9. Interpretation

Our final results are not very satisfying compared to other entries on the leaderboard. Meanwhile, from the table we can clearly see that there is overfitting for all algorithms, especially for the random forest, even though it achieves the best performance. However, we still learned a lot from the process and the results.

9.1 Feature engineering and machine learning

The key problem in our work is not the implementation of the learning algorithms but the use of features. There are two parts to the challenge. First, the features are highly sparse, so an individual feature provides little valuable information for the classification. Second, both the FinelineNumber and the Upc contain very high-dimensional information (almost 100000 dimensions), which is also hard to handle.

We implemented several methods for feature selection and extraction. However, the results are not satisfactory regardless of the algorithm used. The direct reason is that our features are still not sufficiently powerful.

In our case, for example, one cannot decide the trip type based on only one or two objects a customer bought, because the objects people buy vary greatly. From this we can see that not every original feature enhances performance; some features may even be confusing and harmful to the classification.

Our method combines manual search and selection based on the IG value, which improved the performance compared to using only the original features. But the feature engineering is still far from solved, and the process is still hard to trace. The reason could be that we are still unable to describe the relations between different features, so our new features may still not be discriminative enough for classification. Another reason is the usual dilemma between precision and recall: when we bring in new features that help us find more samples of one class, there is also a larger chance of misclassifying samples of other classes.

9.2 Limitation of solving large data

Besides feature selection, the limitations of computing on large data also hurt our final performance. To speed up the computation, we had to use a small subset of the samples, which may have prevented us from extracting more information from the dataset. Spending a lot of time on the data also makes it hard to track individual samples or features; in particular, during the parameter search the long computation time prevented us from getting a sufficiently complete picture of the performance.

To solve this problem, we need more advanced algorithms. Since our data is sparse, randomized PCA is helpful; we in fact need more algorithms of this kind to solve our problem.

10. Summary and conclusions

In our project, by comparing our score with others' on the leaderboard, we find that even when applying the same machine learning method, different features result in different performance. Transforming the raw data into features that better represent the underlying problem influences the accuracy of the model on unseen data. Features play an important role in the success of applied machine learning. Algorithms are very important, and we usually invest our main effort in them; however, good features are key to making the algorithms work well and to obtaining a reliable predictive model. Better features mean flexibility, simpler models and better performance. Next, we will try our best to find better features from the purchase history data before training our model.