
University of California, Berkeley

Industrial Engineering & Operations Research

IEOR 290: Data Analytics and IoT: Machine Learning for Operations with Human Data Sources (Spring 2018)

Midterm Project: Analysis of the Bay Area Bike Share Data

Authors:

Aishwarya Venketeswaran

Brenton Hsu

Sucheta Banerjee

Wiseley Wu

March 25th, 2018


1 Introduction

With increasing traffic congestion in the Bay Area, the development of transportation infrastructure is vital. A strong transportation infrastructure connects employees to companies, travelers to tourist attractions, students to schools, and much more. In particular, the Ford GoBike program offers a bike-based transportation infrastructure that connects the people of the Bay Area to numerous locations in San Francisco, the East Bay, and San Jose. The Ford GoBike system lets people rent a bike at any station and drop it off at any station. As a result, consumers never have to pay for an actual bike, deal with bike maintenance, or worry about a bike being stolen.

Currently, Ford GoBike offers annual memberships, day passes, and single-ride plans. Because annual memberships provide the most consistent revenue and demand, encouraging customers to subscribe will heavily determine the growth of the biking infrastructure. It is therefore important for Ford GoBike to be able to predict and analyze its subscribers and non-subscribers in order to create strategies that grow its subscriber base.

To address this, statistical models were developed to predict subscribers and non-subscribers (customers) based on weather, temporal, and rider-attribute variables. The models also support statistical inference, indicating which features have a significant impact on whether a bike trip is completed by a subscriber. With these models, Ford GoBike gains a data-driven understanding of its subscriber base and can use the results to grow subscriptions and ultimately revenue. For example, the results can inform whether marketing strategies should focus on local users or tourists, weekday fares or weekend fares, and so on.

To determine the current demand for the Ford GoBike Share program, a forecast was developed to predict the number of bike trips on a monthly basis. The purpose of the forecast was to give an understanding of the volatility and trend of bike share usage.


2 Forecasting

The two forecasting methods attempted on the data are moving average and single exponential smoothing. To optimize the parameters of each forecasting model, the mean absolute error (MAE) was minimized, with MAE defined by the following equation:

MAE = (1/n) ∑_{i=1}^{n} |predicted_i − actual_i|

For the moving average, the objective was to minimize the MAE over the number of historical data points considered when forecasting. After tuning, considering 4 historical months (n = 4) to predict the next month provided the lowest MAE. The equation for the moving average forecast (MAF) is:

MAF = (1/n) ∑_{i=1}^{n} trip count_i
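The tuning loop described above can be sketched in Python. The monthly counts below are invented for illustration, and the function names are ours, not the report's:

```python
def mae(predicted, actual):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def moving_average_forecast(counts, n):
    """Forecast each month as the mean of the previous n months."""
    return [sum(counts[i - n:i]) / n for i in range(n, len(counts))]

# Hypothetical monthly trip counts (not the report's actual data)
counts = [21000, 24000, 19500, 22000, 26000, 25500, 27000, 26500]

# Tune n by minimizing MAE over the months that can be forecast
best_n = min(
    range(1, 5),
    key=lambda n: mae(moving_average_forecast(counts, n), counts[n:]),
)
```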

In single exponential smoothing (EWMA), the parameter tuned to minimize MAE is α, a value between 0 and 1: a value of 1 gives all the weight to the most recent month, while a value near 0 spreads weight across all historical months. Ultimately, an α of 0.93 minimized the MAE of the exponential smoothing forecast for bike share demand. The equation for EWMA is below:

EWMA = α · (current period trip count) + (1 − α) · (previous forecast)
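A minimal sketch of single exponential smoothing, following the recurrence above (the function name and calling convention are ours):

```python
def ewma_forecast(counts, alpha, initial_forecast):
    """Single exponential smoothing: each new forecast blends the most
    recent observation with the previous forecast."""
    forecasts = [initial_forecast]
    for count in counts:
        forecasts.append(alpha * count + (1 - alpha) * forecasts[-1])
    return forecasts[1:]
```

With α = 0.93, as fitted in the report, each forecast is dominated by the most recent month.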

Comparing the two methods, the moving average forecast gave the minimum MAE at 2567, versus 3749 for single exponential smoothing. The moving average forecast therefore gave the best prediction, with an average error of +/- 2567 trips for each month in the forecast. From the fitted n and α parameters, it can be seen that the monthly trip count had high variability. Specifically, the moving average looks at only the past 4 months to minimize MAE, which indicates that demand for bike share trips is continuously dropping or growing inconsistently. Similarly, an α of 0.93 indicates that the forecast should rely almost entirely on the most recent month. Because minimizing MAE requires weighting recent data heavily, the forecasts tend to overreact to spikes in trip demand. Figure 1 below compares each forecast method to the actual trip count.


Figure 1: Comparing the forecast models to the actual trip count

Overall, the parameters that minimize MAE indicate that demand for the bike share program is highly volatile. In addition, since subscribers make up 86% of bike trips, they are the largest source of demand, and it is important for the company to determine the features that define subscriber status so it can supply the correct number of bikes. Furthermore, distinguishing customers from subscribers using only trip metadata adds tremendous value to the company's marketing platform, as this would be key to expanding the user base. Features that define a subscriber could reveal riding habits, which the company could use to offer targeted sign-up bonuses to non-subscribers and further increase adoption. Conversely, features that define a customer could reveal what keeps them from subscribing to the service; the company could then use that information to offer targeted promotions to convert those customers into subscribers.


3 Data

The dataset used in this report is bike share data from Ford GoBike, formerly known as "Bay Area Bike Share". The company publishes system and real-time data based on the "General Bikeshare Feed Specification", which allows users to query data via its maintained real-time API. The data itself was collected between August 2013 and August 2015, transformed, and posted to Kaggle.com by user Ben Hamner (henceforth referred to as the data author). The dataset has four separate CSV files (detailed data features are listed in the appendix):

• station.csv - contains information of the bikeshare stations themselves

• status.csv - contains number of bikes and docks available for a given station and minute

• trip.csv - contains data of a single bike trip by the consumer

• weather.csv - contains daily weather information

The data was collected by the data author with the goal of correlating bike trips with weather and identifying bike trip patterns that vary by time of day and day of the week. Because the data uses different recording formats and is stored in different files, it was essential to pre-process, transform, join, and engineer new features before fitting any models.

As the goal of this report is to distinguish subscribers from customers based on an individual bike trip, the file trip.csv, containing a total of 669959 unique trips, was used as the base dataframe for merging and feature engineering. The data from station.csv was first merged into the base dataframe by station id (typographical errors in the station names of the base dataframe made them unsuitable as a join index) in order to obtain latitude, longitude, city, and dock count information for the start and end station of each trip. The start/end station locations were then fed into the Google Distance Matrix API to obtain the distance (in meters) and time (in seconds) needed, per Google Maps bike directions, to travel between the two stations. This information was used to calculate the feature duration difference: the percentage difference between the duration of the bike rental and the duration if the user had biked directly between the start and end stations. This can be used to infer the nature of the trip: a high percentage implies the user was taking their time (a sign of a traveler), while a low percentage implies the user was in a hurry (a sign of a commuter).

% duration difference = (duration − gmap duration) / gmap duration
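The duration difference feature can be computed directly from the two durations; this sketch assumes both are in seconds, and the function name is ours:

```python
def duration_difference(rental_seconds, gmap_seconds):
    """Percentage difference between the actual rental duration and the
    Google-estimated direct bike time between the two stations."""
    return (rental_seconds - gmap_seconds) / gmap_seconds

# A rental twice the direct ride time gives 1.0 (i.e. +100%), hinting
# at a leisurely traveler; a value near 0 hints at a commuter.
```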


Additional time-related features were engineered from the trip start time. To determine whether a holiday affects a rider's subscription type, a US Federal holiday dataset from data.world covering 2011 - 2020 was used to create a new holiday feature, where 1 indicates the trip took place on a holiday and 0 otherwise. The time of day and day of week of a trip could also be relevant to whether a user is a subscriber, since a subscriber is more likely to rely on the bikeshare service as a commute option and to use it on weekdays (Monday to Friday) during commute hours (6 am - 10 am, 3 pm - 7 pm). Two new features were therefore created based on these criteria.
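A sketch of the time-based feature engineering, assuming a trip's start time is a Python datetime; the holiday set here is a small stand-in for the data.world dataset, and the commute window follows the description above:

```python
from datetime import datetime

# Hypothetical holiday lookup; the report used a data.world dataset of
# US Federal holidays covering 2011-2020.
HOLIDAYS = {"2014-07-04", "2014-12-25"}

def time_features(start):
    """Derive the holiday, weekday, and commute-hours flags from a
    trip's start time, per the criteria described above."""
    holiday = 1 if start.strftime("%Y-%m-%d") in HOLIDAYS else 0
    weekday = 1 if start.weekday() < 5 else 0          # Mon=0 .. Fri=4
    commute = 1 if 6 <= start.hour < 10 or 15 <= start.hour < 19 else 0
    return holiday, weekday, commute
```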

Weather data was also compiled to understand how weather affects ridership. The original intent was to join the weather data to the base dataframe by zip code, but closer inspection revealed many non-standard zip codes (more or fewer than 5 digits) and non-numeric entries. Additional research into the data source showed that these zip codes were collected during Ford GoBike account registration, so they were linked to the trip user rather than the station. The station zip code was therefore derived from its city location. Since the cities in scope are located close together, the weather would have been very similar and lacked diversity; instead, only adverse weather conditions such as thunderstorm, rain, and fog were used, where 1 indicates such weather occurred at the start station and 0 otherwise. The user-linked zip codes described above were repurposed to add an additional binary feature called local user. Since the zip code was provided during account registration as part of the billing information, one can infer the location of the user: a user whose zip code begins with 94 or 95 is local to the bay area and more likely to be a subscriber than a person with a non-bay-area zip code. Therefore, local user takes the value 1 when the zip code begins with 94 or 95, and 0 otherwise.
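Given the malformed registration zip codes noted above, the local user flag has to be derived defensively; a sketch (function name ours):

```python
def is_local_user(zip_code):
    """Return 1 for a well-formed 5-digit zip code starting with 94 or
    95 (bay area), 0 otherwise -- including the malformed registration
    entries described above."""
    z = str(zip_code).strip()
    return 1 if len(z) == 5 and z.isdigit() and z[:2] in ("94", "95") else 0
```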

The most labor-intensive part of data preparation was joining the base dataframe with information from status.csv, which was necessary to obtain the bike and dock availability of the start and end station for each trip. A careful inspection of both files showed that the first recorded trip predated the first recorded status of all bikeshare stations. Therefore, trips that took place before or after the recorded status time range were removed before any table join, which reduced the total unique trips to 669874.

The record times of the trip and status data were then used as the join index; nevertheless, 4225 rows were missing data after the join despite some preparatory clean-up. Another close inspection of the status file revealed the issue: there were several time gaps where status data was either not published to the real-time API or not recorded by the data author. An assumption was therefore made that the bike and dock availability at a station should not change drastically within 2 minutes. With this assumption in mind, an algorithm was developed to shift the trip time forward and backward by 2 minutes in order to fill in the missing data. This method reduced the number of trips with missing data by 2219 rows; only 2006 rows had to be removed because no status data was found within the 2-minute window.
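The 2-minute shift can be sketched as a dictionary lookup keyed by minute; the exact scan order (the trip's minute first, then ±1, then ±2) is our assumption about the implementation:

```python
from datetime import datetime, timedelta

def find_status(status_by_minute, trip_time):
    """Look up station status at trip_time; if that minute is missing,
    scan up to 2 minutes forward and backward, assuming availability
    changes little within a 2-minute window. Returns None if no record
    is found, in which case the trip is dropped."""
    for offset in (0, 1, -1, 2, -2):
        record = status_by_minute.get(trip_time + timedelta(minutes=offset))
        if record is not None:
            return record
    return None
```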

Two additional features developed from the combined dataframe were start station utilization and end station utilization. The purpose of these metrics was to remove multicollinearity between variables that are necessarily related. Originally, the available features were the number of docks available and the number of bikes available at a station; these are obviously correlated, because as one increases the other decreases. The utilization feature was therefore engineered to capture the bike availability information in a single number. Start station and end station utilization are both calculated as shown below:

utilization = # of docks available / (# of bikes available + # of docks available)
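In code, the utilization metric is a one-liner (function name ours):

```python
def dock_utilization(bikes_available, docks_available):
    """Station utilization per the equation above: the fraction of the
    station's capacity whose bikes are currently checked out."""
    return docks_available / (bikes_available + docks_available)
```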

With the bulk of the dataframe defined, some preliminary data exploration was done to discover anything out of the ordinary in the feature distributions. A quick calculation showed that the longest trip duration was 4797 hours, or nearly 200 days. This was clearly an outlier, as 99 percent of consumers spent less than 3.6 hours using the bikeshare service. Furthermore, the Ford GoBike website states that a consumer is liable for a lost-bike charge if the bike is not docked within 24 hours. Therefore, any trip with a duration longer than 24 hours was removed; the new distribution of duration can be seen in figure 2.

Figure 2: Normalized distribution of bike rental duration after removing samples longer than 24 hours (in hours)


Additionally, there were 23854 trips that started and ended at the same bikeshare station. This is not out of the ordinary, as consumers are likely to return a bike to the same station if sightseeing is the goal. However, it complicated the calculation of duration difference, since a trip that starts and ends at the same station has a Google Maps trip duration of 0. Removing these trips would be the easiest solution, but care must be taken, as they could well describe a special type of consumer (e.g. a tourist). One way to ensure these trips were not significantly different from the main population was to compare the duration distributions of the two populations (Figure 3).

Figure 3: Normalized distribution of bike rental duration of trips that started and ended at the same station (in hours)

As seen in these figures, the difference between the two distributions is very minor, so the trips that started and ended at the same station could be safely removed from the total population.

Finally, the distribution of gmap distance (distance between two stations based on Google Maps bike directions) showed that the longest travel distance was 92476 meters (57 miles), while 99% of the population traveled at most 4203 meters (2.6 miles). Since the bikes in the bikeshare service are not built for long-distance cycling, an arbitrary cutoff of 8 km (roughly 5 miles) was imposed, removing any trips longer than that (503 rows removed).

With data processing complete, the dataframe was randomly split into a 70% training and 30% testing set. This was done in a stratified manner to ensure the fractions of subscribers and customers were consistent across the two sets.
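A stratified split along these lines can be sketched in pure Python; the report likely used a library routine, so this is illustrative only:

```python
import random

def stratified_split(rows, label_of, test_frac=0.3, seed=0):
    """Random 70/30 split that preserves the subscriber/customer ratio
    by sampling the test fraction within each label group separately."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)
    train, test = [], []
    for group in groups.values():
        rng.shuffle(group)
        cut = int(len(group) * test_frac)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test
```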


4 Model

The goal of this study was to predict whether a consumer of the Ford GoBike bike-sharing service was a subscriber (paying a monthly subscription) or a customer (pay-as-you-go) based solely on trip and station metadata such as rental duration, travel distance, weather, and station utilization. Formally, this is a classification problem: assign 1 (subscriber) or -1 (customer) based on the trip data. There are several ways to frame the algorithm's goal, depending on the evaluation metric. On one end, gauging the model by accuracy means the algorithm should assign as many correct labels as possible, regardless of class. On the other hand, evaluating the model by recall means the algorithm should prioritize correct classification of positive samples over negative ones (or vice versa), so the optimization becomes a loss-minimization problem for misclassification of positive samples. In this study, logistic regression and a support vector machine (SVM) were used to learn from the trip data, produce a model, and predict whether a given trip was conducted by a subscriber or a customer. To judge the best model, accuracy and the ability to do statistical inference were the main indicators.

Some assumptions were made about the dataset prior to modeling. First, the temporal component of the data was ignored during model formulation. This matters because the data was collected sequentially, making each observation related to the previous one; however, instead of treating the data as a time series, the time information was removed once the station utilization and weather advisory features had been derived, and each trip was treated as an independent event. Second, all samples were assumed to be independent and identically distributed (i.i.d.) random variables. This is a key assumption of both logistic regression and SVM, so the same assumption had to hold for the data. It is a reasonable one because, with the temporal component ignored, each trip is conducted by an independent agent: trips take place at different stations and each agent has a different destination in mind, so it is not far-fetched to treat the trips as i.i.d. Third, no multicollinearity between variables was assumed. As mentioned in the previous section, new features were engineered and the final features were chosen deliberately to avoid multicollinearity, which is important because neither model is robust when multicollinearity is present. The chosen variables fall into the following categories:

• Weather (rain, fog, thunderstorm)

• Dock Utilization (start dock utilization, end dock utilization)

• User Type (local user)

• Day/Time Type (commute hours, weekday, holiday)

• Trip Duration (Duration difference, gmap distance, duration)


These variables are independent in the sense that they characterize different pieces of information about the trip, so their multicollinearity should be low.

There were a total of 39 features in the dataframe, many derived from other columns. Some columns, notably raw datetime information, would be difficult to use directly in a prediction model. It was therefore important to remove features to reduce the dimensionality of the problem and the multicollinearity between variables. In the end, only the features in Table 1 remained, as they were judged the most concise and to carry the most information from the variables that were removed.


Column Name             Data Type  Description
subscription type       Boolean    1 represents a trip conducted by a subscriber, 0 a customer
duration                Float      Duration of the bike rental in seconds (from checking out at the start station to checking in at the end station)
holiday                 Boolean    1 represents a trip on a US Federal holiday, 0 otherwise
fog                     Boolean    1 represents a trip on a day with a fog advisory, 0 otherwise
rain                    Boolean    1 represents a trip on a day with a rain advisory, 0 otherwise
thunderstorm            Boolean    1 represents a trip on a day with a thunderstorm advisory, 0 otherwise
gmap distance           Float      Google Maps distance for bike directions between start and end station (in meters)
weekday                 Boolean    1 represents a trip on a weekday, 0 otherwise
start dock utilization  Float      Bike utilization fraction of the start station at the beginning of the trip
end dock utilization    Float      Bike utilization fraction of the end station at the end of the trip
local user              Boolean    1 represents a trip by a user with a bay area zip code on their account, 0 otherwise
duration difference     Float      % difference between the actual rental duration and the Google Maps bike travel time between start and end station
commute hours           Boolean    1 represents a trip during commute hours (6 am - 9 am, 3 pm - 7 pm), 0 otherwise

Table 1: Description of features used in the models

Logistic Regression

The logistic regression model was implemented to determine the estimated probability that a person is a subscriber based on a variety of features. Logistic regression uses maximum likelihood estimation to obtain the optimal feature coefficients for predicting a subscriber. An advantage of logistic regression is that the output probability allows decision makers to modify the probability threshold for being a subscriber to optimize their business. Furthermore, the logistic regression coefficients allow inference into which features increase the probability of being a subscriber. The logistic regression formula for the probability of a subscriber is generalized below:

P(Y = Subscriber | X) = 1 / (1 + e^(−(a + bX)))
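As a minimal sketch (the function name and calling convention are ours, not the report's actual code), the probability above can be evaluated directly:

```python
import math

def subscriber_probability(features, coefficients, intercept):
    """Logistic model: P(subscriber | x) = 1 / (1 + exp(-(a + b.x)))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))
```

A large positive coefficient on an active feature pushes the probability toward 1, which is how the coefficient table is read later in the report.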

Support Vector Machine

The SVM model was also applied to determine whether a trip was made by a subscriber or a customer. The classification problem can be reformulated as an optimization problem whose goal is to find the hyperplane that maximizes the separation between the two classes (subscriber and customer); specifically, the hyperplane that maximizes the distance to the nearest points (the support vectors). SVM, a supervised machine learning algorithm, was chosen because it handles high-dimensional spaces in a memory-efficient manner and is versatile in its choice of kernels. Once the SVM was trained on the training dataset, the model was used to predict the class of the test data.

In this report, linear and RBF kernels were applied to the problem. The linear kernel creates a linear hyperplane to separate the two classes; it is desirable since it scales to large datasets and provides flexibility in choosing penalties and loss functions. For comparison, the Radial Basis Function (RBF) kernel was also applied, for which the resulting decision boundary is non-linear. SVC solves the following problem:

min_{w,b,ζ}  (1/2) wᵀw + C ∑_{i=1}^{n} ζᵢ

subject to  yᵢ(wᵀφ(xᵢ) + b) ≥ 1 − ζᵢ,
            ζᵢ ≥ 0,  i = 1, …, n

Grid search was conducted to find the combination of gamma and C that gives the highest accuracy with the RBF-kernel SVM. The gamma parameter defines how far the influence of a single training example reaches: a low value means the influence is far-reaching, while a high value means the influence is close. The C parameter trades off misclassification of training data against simplicity of the decision surface. A small C makes the decision surface smooth (a generalization prone to high bias), while a large C flexes the decision surface to classify all training data correctly (a specialization prone to high variance). A balanced trade-off between C and gamma is key to creating a strong SVM model that neither overfits the training data nor behaves poorly on test data.
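The gamma behaviour described above is visible directly in the RBF kernel formula, exp(−gamma · ‖x − y‖²); a small illustration (function name ours):

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel value between two points. A larger gamma makes a
    training point's influence decay faster with distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

A grid search then simply trains and scores one model per (C, gamma) pair and keeps the best-scoring combination.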


Since the underlying SVM algorithm is not scale invariant, more weight might be given to a feature with large numerical values than to one with smaller values if their scales have not been normalized. Additionally, the geometric interpretation of the hyperplane with unscaled features can be hard to understand due to unit and scaling differences. It is therefore imperative to scale the dataset so that each feature has a consistent range. For boolean features, the positive class kept the value 1, while the negative class was converted from 0 to -1. For numerical (continuous) features, the data was rescaled and bounded using the following equation:

x' = (2x − max(x) − min(x)) / (max(x) − min(x))

which bounds the data between -1 and 1.
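The rescaling equation translates directly into code (function name ours):

```python
def rescale(values):
    """Min-max rescale a continuous feature to [-1, 1], per the
    equation above."""
    lo, hi = min(values), max(values)
    return [(2 * v - hi - lo) / (hi - lo) for v in values]
```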


5 Code

Moving average and exponential smoothing forecasting models were used to capture bikesharing demand (as shown in the forecasting section). In the main analysis, both logistic regression and a Support Vector Machine were implemented. The pseudocode below explains the model formulation, evaluation, validation, and testing for each model type at a high level; the actual Python code for data processing and modeling is attached separately from this report.

Pseudocode for Forecasting

count number of bike trips for each month
define tuning parameter
define mean absolute error (MAE) for forecast model evaluation
for each forecasting model:
    set up the forecast model to predict trip count
    for each month:
        calculate the forecast
        calculate the MAE
    calculate the average MAE
choose the forecasting model with the lowest average MAE
fit the chosen model to the data

Pseudocode for Logistic regression

define C (regularization parameter) as tuning parameter
define accuracy as scoring metric
for each C:
    train logistic regression model
    calculate accuracy
choose model with highest accuracy
evaluate model with test data

Pseudocode for SVM

define tuning parameters for SVM
define scoring metrics for model evaluation
for each scoring metric:
    for each randomized data split:
        set up SVM model with selected parameters and scoring metric
        fit the selected split data with the SVM model
        evaluate the fitted model with validation data
        calculate the scoring metric on validation data
choose model with highest score on either scoring metric
evaluate the fitted model with test data


6 Results and analysis

To assess the performance of the Logistic Regression and SVM models on the test set, a baseline model that always predicts a person is a subscriber was used for comparison. The accuracy of the baseline model on the test set is 86%, meaning the test set contains 86% subscribers and 14% customers. The final model should, at the very least, beat the baseline by exceeding 86% accuracy.

Logistic Regression

To answer whether weather, time, location, and other features affect the probability that a Ford GoBike rider is a subscriber, the logistic model was implemented. An inverse regularization strength of 1000 was used to decrease bias and keep the model general. Through statistical inference, the features were analyzed to determine their effect on the probability of a person being a subscriber; Ford GoBike can use this information to decide where to focus attention in order to increase the subscriber rate.

The coefficients of the features' effect on subscription type are shown in Table 2. Most strikingly, being a local user greatly increases the likelihood that a consumer is a subscriber, as seen in the odds ratios of Figure 4 below: a ride by a local user is roughly 204 times more likely to be by a subscriber than a customer. This is important because it signifies that the subscribers who make up the majority of trips are local and live near the stations. Ford can therefore develop new bikes, pricing, stations, and more with local users as the main users/subscribers for each station.

Weekday and commute time also greatly influence the likelihood that a trip belongs to a subscriber. A bike ride on a weekday is 4.5 times more likely to be by a subscriber than a customer, and a trip during commute hours is 2.2 times more likely to be from a subscriber. With this information, Ford can assume that demand during weekdays and commute hours will be at a consistent level, since subscribers tend to be the more frequent users. Ford can also schedule bike maintenance to avoid commute times and weekdays, because more subscribers ride during those periods.

Features                  Coefficients
local user                     5.32040
weekday                        1.50822
commute hours                  0.81035
rain                           0.42993
fog                            0.14856
duration                      -0.00019
gmap distance                 -0.00022
thunderstorm                  -0.00474
duration distance             -0.02067
start dock utilization        -0.04181
end dock utilization          -0.35107
holiday                       -0.62693

Table 2: Coefficient of features for logistic regression

Figure 4: Odds ratios of the features for logistic regression
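The odds ratios in Figure 4 follow directly from exponentiating the coefficients in Table 2; three examples from the table reproduce the multipliers quoted in the text:

```python
# Odds ratios are exp(coefficient); values below are copied from Table 2.
import math

coefficients = {"local user": 5.32040, "weekday": 1.50822, "commute hours": 0.81035}
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

print(round(odds_ratios["local user"]))        # 204
print(round(odds_ratios["weekday"], 1))        # 4.5
print(round(odds_ratios["commute hours"], 1))  # 2.2
```

This is how "204 times more likely" for local users and "4.5 times" for weekday trips are obtained.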

To assess the logistic regression model's performance on the test set, the baseline accuracy of 86% is compared with the model's 95% accuracy. The logistic regression model therefore improves prediction of subscription type by roughly 9 percentage points. Furthermore, the model identifies true subscribers with a true positive rate of 99%, while incurring only a 34% false positive rate of labeling a customer as a subscriber. The high accuracy of the logistic regression model is likely due to the local user feature being a very strong predictor of a subscriber. Overall, Ford GoBike can be confident in distinguishing between subscription types with the logistic regression model.

Below in Figure 5 is a confusion matrix on the test set, created to determine the false positive and true positive rates of the logistic model when predicting a subscriber.

Figure 5: Confusion matrix for logistic regression on the test set
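The rates quoted above are read directly off a confusion matrix. The counts below are illustrative stand-ins chosen to reproduce the reported 99%/34% rates, not the report's actual test-set counts.

```python
# Reading TPR and FPR off a confusion matrix; counts are illustrative.
tp, fn = 990, 10   # subscribers classified correctly / missed
fp, tn = 34, 66    # customers mislabeled as subscribers / classified correctly

tpr = tp / (tp + fn)   # true positive rate (subscriber recall) = 0.99
fpr = fp / (fp + tn)   # false positive rate = 0.34
print(tpr, fpr)
```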

If Ford GoBike would like to view the trade-off between the true positive rate of identifying a subscriber and the false positive rate of labeling a customer as a subscriber, a ROC curve is shown in Figure 6 below. Notably, the model can achieve a high true positive rate (greater than 0.9) while keeping the false positive rate relatively low (less than 0.3). Furthermore, the AUC of 0.83 indicates that the model is adequate at discriminating between subscriber and customer (Figure 6).
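A ROC curve like the one in Figure 6 can be produced with scikit-learn as sketched below; the labels and decision scores here are synthetic stand-ins, not the report's data.

```python
# Sketch of computing a ROC curve and AUC, assuming scikit-learn.
# y_true/scores are synthetic; real usage would pass the model's
# predicted probabilities for the test set.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=200)               # 1 = subscriber
scores = y_true + rng.normal(scale=0.7, size=200)   # noisy decision scores

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
print(auc)   # well above 0.5 for informative scores
```

Plotting `fpr` against `tpr` gives the curve; the AUC summarizes it in one number.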

Figure 6: ROC Curve of logistic regression model

Overall, the logistic model offers high accuracy (roughly 95%) in predicting the subscription type of bike trips, so Ford GoBike can confidently use it to determine the subscription type of future trips. In addition, Ford GoBike can use the statistical inference from the model to plan growth, maintenance, and marketing of the bike share program based on subscription type likelihood.

Support Vector Machine Model with Linear Kernel

For the linear SVM, a quick grid search showed that accuracy did not vary much as C changed. The model was therefore set up with default parameters except for dual, which was set to solve the primal optimization problem instead, since the number of samples was much larger than the number of features. The model was trained on the training data and then used to predict the classification of the test data, yielding a test accuracy of 94.67%. Figures 7 and 8 below show the resulting confusion matrix and the receiver operating characteristic (ROC) curve with its area under the curve (AUC). With 94.67% test accuracy, the linear SVM performs better than the 86% baseline, letting Ford GoBike predict subscription type more accurately (roughly a 9 percentage point increase). Like logistic regression, it achieves a high TPR (> 0.9) while keeping the FPR low (< 0.35), and its AUC of 0.82 indicates that the model differentiates well between subscriber and customer. Overall, Ford GoBike can be confident in the predictions made by this model.
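The dual/primal choice described above maps onto scikit-learn's `LinearSVC(dual=False)`. The sketch below uses synthetic data; only the primal-solver choice is taken from the report.

```python
# Sketch of the linear-SVM setup: LinearSVC with dual=False, i.e.
# solving the primal problem because n_samples >> n_features.
# Data is synthetic and purely illustrative.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] - X[:, 1] > 0).astype(int)   # linearly separable labels

clf = LinearSVC(dual=False)               # primal solver
clf.fit(X, y)
print(clf.score(X, y))                    # near-perfect on separable data
```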

Figure 7: Confusion matrix for linear SVM on the test set

Figure 8: ROC and AUC for linear SVM on the test set

To identify which features were most important in separating the data, the model coefficients were evaluated; they form the vector orthogonal to the separating hyperplane, so their absolute sizes relative to one another indicate feature importance. Figure 9 below shows a bar chart of the coefficient of each feature. The sign of the dot product between a data point and this coefficient vector determines the class: a positive dot product implies the positive class (subscriber) and a negative dot product implies the negative class (customer). From the figure, local user has the most influence toward the positive class (subscribers), with weekday and commute hours significantly behind but still slightly influential. On the other hand, duration and duration difference have the most influence toward the negative class (customers), with holiday and gmap distance significantly behind. Notably, the positive coefficients (weekday, local user, commute hours) resemble a consumer who lives in the bay area and uses the bikesharing service during weekday commute hours, while the negative coefficients (duration, holiday) resemble a tourist who uses the service for long periods during holidays. The SVM thus identifies a subscriber as most likely a local resident using the service to commute, and a customer as most likely a tourist using it for sightseeing.

Figure 9: Linear SVM feature coefficient vectors
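Ranking features by the absolute size of the linear-SVM coefficients can be sketched as below. The coefficient values are illustrative only; their signs mimic Figure 9 (positive toward subscriber, negative toward customer).

```python
# Ranking features by |coefficient|; values are illustrative stand-ins,
# with signs mimicking Figure 9 (positive = subscriber, negative = customer).
coef = {
    "local user": 2.1, "weekday": 0.6, "commute hours": 0.4,
    "duration": -1.3, "holiday": -0.5, "gmap distance": -0.2,
}

ranked = sorted(coef, key=lambda name: abs(coef[name]), reverse=True)
print(ranked[0])   # local user
print(ranked[1])   # duration
```

With a fitted `LinearSVC`, the same ranking would be applied to `clf.coef_[0]`.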

RBF Kernel

Two logarithmic grids were set up to search for the pair of gamma and C giving the highest accuracy: gamma from 10^-5 to 10^2 and C from 10^-3 to 10^4. An SVM model with the RBF kernel was then tuned over these ranges with a 5-fold cross-validation grid search.

The highest accuracy from the grid search was achieved with a gamma of 0.01 and a C of 1000. Nevertheless, closer inspection showed that accuracy differed only slightly across the grid, and it does not make sense to pick a high penalty parameter for an accuracy boost of only 0.002. Furthermore, a large C fits the training data very closely and risks overfitting and poor performance on the test set. All things considered, the final values for gamma and C were chosen to be 0.01 and 1, respectively.
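The grid search above can be sketched with scikit-learn's `GridSearchCV`, using the same logarithmic gamma/C ranges and 5-fold cross-validation; the data here is synthetic and purely illustrative.

```python
# Sketch of the RBF-kernel gamma/C grid search with 5-fold CV,
# assuming scikit-learn. Data is synthetic.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = (X[:, 0] ** 2 + X[:, 1] > 1).astype(int)   # nonlinear boundary

param_grid = {"gamma": np.logspace(-5, 2, 8),  # 10^-5 ... 10^2
              "C": np.logspace(-3, 4, 8)}      # 10^-3 ... 10^4
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Inspecting `search.cv_results_` rather than just `best_params_` is what reveals that nearby grid cells differ by only ~0.002, justifying the smaller C.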

The final SVM was retrained with the parameters above and used to predict the classification of each test observation. From the confusion matrix in Figure 10, the accuracy was measured to be 94.6%, with a subscriber recall of 99.3% and a customer recall of 64.3%.

Figure 10: Confusion matrix for SVM (RBF Kernel) on test set

Figure 11: ROC and AUC for SVM (RBF kernel) on the test set

Figure 11 shows the ROC and AUC computed for the RBF-kernel SVM; both the curve and the AUC are very similar to those obtained with the linear-kernel SVM.

A performance summary of all the models evaluated can be seen in Table 3.

Models                 Accuracy   Recall (Subscriber)   Recall (Customer)   AUC
Logistic Regression    94.8%      99.2%                 66.0%               0.83
SVM (linear kernel)    94.7%      99.2%                 65.0%               0.82
SVM (RBF kernel)       94.6%      99.3%                 64.3%               0.82

Table 3: Summary performance of all models on test data set

From the table above, it is apparent that the logistic regression model slightly edged out both SVM models in accuracy. The RBF-kernel SVM was slightly higher in subscriber recall, while logistic regression performed best in customer recall and ROC AUC. Additionally, the logistic regression model took much less time to build and train than the SVM models, and it is much easier to interpret. Its lead in customer recall matters most, because identifying the patterns that characterize customers would help Ford GoBike develop strategies to convert them into subscribers. All things considered, logistic regression should be used by the company to identify subscriber and customer patterns.

7 Discussion

Comparing the results from the two classification methods, i.e., logistic regression and SVM, it is evident that logistic regression provides higher interpretability as well as slightly higher accuracy in predicting whether a consumer is a customer or a subscriber.

Although the logistic regression model offers desirable results, it has some shortcomings. Most of its independent variables are binary, so some lower-level interpretability is lost. For example, since the day of the week is transformed into a binary weekday/weekend variable, the model cannot reveal which specific day has the most subscribers. Another shortcoming is that the model could not capture the google map distance for people who started and ended at the same station, since that distance is always 0; distance information was therefore lost for that segment of riders.

Even though logistic regression offered high accuracy and interpretability, several improvements could be made in modeling the bike share data. First, random forest and boosting models could be used to increase prediction accuracy; alternatively, a classification tree could be attempted to gain more interpretability of the factors involved in predicting a subscriber. The team would also have benefited from being more methodical about data processing: issues were solved as they came along, whereas brainstorming ahead of time and planning the data transformations could have saved a large amount of time.

In terms of challenges encountered along the way, one of the biggest was finding the right dataset and evaluating its completeness: whether it had enough features for the analysis or whether additional datasets were needed for a more complete picture. Once that was sorted out, the next challenge was identifying the right features and processing data that was inconsistent in missing values and formatting, to make it meaningful and interpretable. The chosen dataset was also quite large, which added to the model run time. For visualization, finding the right visuals to interpret and explain the conclusions was another challenge, especially with binary variables in the dataset.

Given the opportunity to collect data from scratch, there are a few recommendations for the new data. For one, collecting demographic information about the users would provide opportunities to extract information on more specific market segments; it would allow Ford GoBike to develop its bike program and pricing models with age groups, economic status, and other population characteristics in mind. Another helpful feature would be a description of outlier events: bike trips longer than 24 hours were previously considered outliers and thrown out, and a description of why those trips took so long would allow better decision making when modeling the data. Lastly, the subscription type dependent variable could be more specific; for example, adding a previous subscriber category could give Ford GoBike information on why some people are no longer subscribers.

Throughout the project, the team learned about each step in the life cycle of analyzing data. From data processing, we learned how to merge datasets from multiple sources and transform data into a usable format, and we learned the importance of validating datasets to confirm that they make sense. For instance, we originally thought the zip code field belonged to the bike station, when in fact it belongs to the user account. In terms of analysis, we found that the SVM took an immense amount of time to fine-tune yet provided accuracy similar to the logistic regression model; simple models can thus perform as well as, or better than, more complicated ones. Overall, the project provided us with the opportunity to learn about data processing and analysis.

8 Appendix

Column Name         Data Type   Description
name                String      Name of the bikeshare station
lat                 Float       Latitude of bikeshare station
long                Float       Longitude of bikeshare station
dock count          Integer     Number of docks in bikeshare station
city                String      City where the bikeshare station is located
installation date   String      Date of installation of bikeshare station

Table 4: Features in stations.csv

Column Name       Data Type   Description
station id        Integer     Unique ID representing bikeshare station
bikes available   Integer     Number of bikes available to rent at the specified bikeshare station and time
docks available   Integer     Number of docks available for parking at the specified bikeshare station and time
time              String      Time when the data is collected, with format YYYY/MM/DD HH:MM:SS (though the
                              second is dummy data since the API only provided real-time data to the minute)

Table 5: Features in status.csv

Column Name          Data Type   Description
id                   Integer     Unique ID representing the trip
duration             Integer     Trip duration in seconds
start date           String      Time when the bike left the dock of the start station, with format
                                 YYYY/MM/DD HH:MM
start station name   String      Name of station where the trip begins
start station id     Integer     ID of station where the trip begins
end date             String      Time when the bike arrived at the dock of the end station, with format
                                 YYYY/MM/DD HH:MM
end station name     String      Name of station where the trip ends
end station id       Integer     ID of station where the trip ends
bike id              Integer     Unique ID representing the bike being used in the trip
subscription type    String      Either subscriber (user subscribed to the bikeshare system) or customer
                                 (user paying per trip)
zip code             Mixed       Zip code provided by user during account registration

Table 6: Features in trip.csv
