Bike Sharing Final Report (By Arigami, Chen, and Cotton)

California State University, Fullerton
Submitted to Dr. Bhaskar
ISDS 415, Spring 2015
Bike Sharing Systems
Submitted By: Wakasa Arigami, Szu Tung Chen, Keoka Cotton

Executive Summary

Bike sharing programs were first created in 1964, but due to a lack of proper monitoring the first program was not successful. More than 30 years later, the concept was reintroduced and has since gained popularity for reasons such as health, environmental awareness, and affordability. We obtained a dataset from UCI containing bike sharing data for Maryland for the years 2011 and 2012 to predict whether a bike sharing program could be successfully implemented in one of the following California cities: Huntington Beach, Bakersfield, and Berkeley.

After reviewing the data set, we eliminated variables which we found to be inconsequential. The largest difference in climate between Maryland and California was the temperature, so we removed unnecessary data from the original dataset in three steps. First, we set a threshold for data elimination of 40°F. Second, we collected non-normalized climate data for Maryland, removed the days which were below the threshold, and normalized the remaining climate data; the lowest normalized temperature in this new data set was 0.5135. Third, we removed from the UCI data set any days which had a normalized temperature below 0.50, which left us with 47% of our observations (345 of 731 days).

Next, we used charts to get a visual representation of the data in order to detect any possible outliers which still needed to be removed. Afterwards, we tried various combinations of variables in a regression analysis in order to find the best regression model. We ended up with two models: one which forecasts the counts of the bike sharing system during its introduction phase and one which forecasts counts after the introduction phase. Finally, we applied our models to the three cities and found that the best Californian city for implementing a bike sharing program would be Huntington Beach.

Contents

Executive Summary
I. Introduction
II. Background Information
    A. History of Bike Sharing Programs
    B. Benefits of Bike Sharing Systems
    C. City Backgrounds
        1. Huntington Beach
        2. Bakersfield
        3. Berkeley
III. UCI Bike Sharing Data Exploration
    A. Data Set Variable Definitions
    B. Variable Reduction
    C. Data Visualization
    D. Data Reduction
    E. Model Selection
IV. Model Adoption
    A. Huntington Beach
    B. Bakersfield
    C. Berkeley
    D. City Comparison for Introduction and Growth Phase
V. Limitation of Analysis
VI. Conclusion
Bibliography

I. INTRODUCTION

The bike sharing data set used in this report was acquired from UCI’s Machine Learning Repository. The data set covers the years 2011 and 2012. It contains both hourly and daily counts for bike rentals, as well as data on weather and climate. The data was originally collected from bike sharing systems located in the state of Maryland.

II. BACKGROUND INFORMATION

A. HISTORY OF BIKE SHARING PROGRAMS

Bike sharing programs were first created in 1964 in the western Netherlands, but due to a lack of proper monitoring and user accountability, the program failed within a matter of days (History of Bike Sharing, n.d.). In 1996 the bike sharing program was reintroduced in England, but this time users were held accountable for their rentals through the use of technology. Users were required to use a special card which tracked who had checked out each bike and when each bike was checked out and returned.

Bike sharing systems are similar to traditional bike sharing programs, except that the entire process is technology based. “Bike sharing systems are [a] new generation of traditional bike rentals where [the] whole process from membership, rental, and return back has become automatic” (Fanaee-T & Gama, 2013). Bike sharing programs are continuing to grow in popularity. According to The Bike-Sharing World Map, as of April 2015 there were 861 cities worldwide participating in a bike sharing program, and 252 more cities were in the process of planning or building one (Meddin & DeMaio, n.d.).

B. BENEFITS OF BIKE SHARING SYSTEMS

Bike sharing systems are an ideal mode of transportation in numerous metropolitan areas and work best for shorter point-to-point distances. Not only are bike sharing programs an affordable form of transportation, they also appeal to people who are environmentally or health conscious. These benefits make bike sharing systems an ideal method of transportation both for city residents who commute to work and for students who commute to school. Bike sharing systems are also well suited to tourists or people who want to take spur-of-the-moment trips, because stations are set up at different points within a city, which allows users to visit city landmarks, go shopping, have lunch, and take part in other leisure activities.

C. CITY BACKGROUNDS

We decided to use the aforementioned data set because it contains information which will help us determine which city in California has the most suitable conditions for implementing a bike sharing program. For the purposes of this report, we chose to explore cities in California which have different climate patterns and which are located in different parts of the state. The first city is Huntington Beach, located in Southern California. The second is Bakersfield, located in Central California, and the third is Berkeley, located in Northern California. The next section of this report gives details on factors such as demographics, climate, and population for each of the selected cities.

1. HUNTINGTON BEACH

Huntington Beach is a city located on the coast of Southern California. As of 2014, it has a population of over 199,000 residents. According to the Huntington Beach government website, each year the city receives over 16 million beach visitors (Demographic Data, n.d.). When it comes to entertainment, typical households in this city spend more on “recreational equipment and supplies” than on any other form of entertainment (Demographic Data, n.d.). The climate in Huntington Beach tends to be “sunny, dry and cool” (Climate, n.d.). The weather tends to be pleasant year round, with temperatures varying between 65°F and 80°F (Climate, n.d.). During the summer, temperature highs tend to fall below 85°F, and temperature lows in the winter tend to be 40°F or higher (Climate, n.d.). Given these demographics and this climate, Huntington Beach is a good candidate for investigating whether a bike sharing program could be successfully implemented.

2. BAKERSFIELD

Bakersfield is a city located between Southern California and Northern California. From 1970 to 2010, the population of Bakersfield grew by 400%. According to 2013 Census Bureau figures, the population of Bakersfield was 363,630, making it the 9th largest city in California (Bakersfield, California, n.d.). The climate in Bakersfield is comfortable, with very little rain. The average high temperature during the summer is around 90°F, and the average low temperature during the winter is around 40°F (Historical Weather For 2013 in Bakersfield, California, USA, 2013). Because of these weather factors, Bakersfield might be a suitable city in which to implement a bike sharing system and, in turn, to improve the welfare of local residents.

3. BERKELEY

Berkeley is located in Northern California. The population of the city is 117,000, which makes it relatively small compared to other cities in California. However, Berkeley is well known for its academic institutions, such as the University of California, Berkeley. In addition, the city has a high rate of civic involvement in local politics (City of Berkeley, 2015). According to U.S. climate data (2015), the city gets as cold as 42°F during winter and reaches as high as 75°F during summer. The yearly average temperature is around 60°F. The climate tends to be cool and dry, but the city gets a lot of rain during the winter. Although the population of the city is not large, its small size may encourage residents to use a bike sharing program, since buildings and destination points are close together compared to larger cities.

III. UCI BIKE SHARING DATA EXPLORATION

A. DATA SET VARIABLE DEFINITIONS

Table 1: Bike Sharing - Day Original Data Set (Fanaee-T & Gama, 2013)

After acquiring a bike sharing data set from the UCI Machine Learning Repository, we examined the variables (Table 1). The definitions for each of the variables were obtained from our data set and are included in Table 2 below.

Table 2: Variable Definitions (Fanaee-T & Gama, 2013)

B. VARIABLE REDUCTION

After analyzing the variables in the UCI data set, our first step was to reduce the number of variables used in our analysis. We decided not to use several variables, for various reasons. The first and second variables that we excluded were the “year” and “month” variables, because the data set already has a variable for the date. The third and fourth variables that we decided not to use are “workingday” and “weekday”. Instead, we used a dummy variable titled “weekday_dummy” to represent the days of the week: the value 0 represents Saturday and Sunday, and the value 1 represents Monday through Friday. The fifth variable that we excluded is “weathersit”, which we replaced with a dummy variable titled “weathersit_dummy”. In this dummy variable, the value 0 represents days which are clear or partly cloudy and the value 1 represents days which are not clear (i.e. days which have mist, rain, snow, etc.). The sixth excluded variable is “atemp”. We eliminated this variable and used only the “temp” variable because it is the actual temperature (normalized) and gives a more accurate picture of the temperatures and rental patterns. The final two variables that we eliminated were the “casual” and “registered” variables. We decided that it is not important to our analysis to know whether a person is a casual or registered user of the bike sharing program. In total, we eliminated eight variables from the original data set and introduced two dummy variables. Our adjusted UCI dataset is shown below in Table 3.
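The reduction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names (`yr`, `mnth`, `cnt`, and so on), the weekday coding (0 = Sunday through 6 = Saturday), the weathersit coding (1 = clear or partly cloudy), and the sample row are all assumed here, and the function name `reduce_row` is hypothetical.

```python
# Columns dropped in the variable reduction step (names are assumed
# to follow the UCI file; the report calls the first two "year" and "month").
DROPPED = {"yr", "mnth", "workingday", "weekday",
           "weathersit", "atemp", "casual", "registered"}

def reduce_row(row):
    """Drop the eight excluded variables and add the two dummy variables."""
    reduced = {name: value for name, value in row.items() if name not in DROPPED}
    # Assumption: weekday is coded 0 = Sunday ... 6 = Saturday.
    reduced["weekday_dummy"] = 0 if row["weekday"] in (0, 6) else 1
    # Assumption: weathersit == 1 means clear or partly cloudy.
    reduced["weathersit_dummy"] = 0 if row["weathersit"] == 1 else 1
    return reduced

# Hypothetical example row (values are illustrative only).
sample = {"dteday": "2011-01-01", "season": 1, "yr": 0, "mnth": 1,
          "holiday": 0, "weekday": 6, "workingday": 0, "weathersit": 2,
          "temp": 0.34, "atemp": 0.36, "hum": 0.80, "windspeed": 0.16,
          "casual": 300, "registered": 700, "cnt": 1000}
reduced_sample = reduce_row(sample)
```

For the sample row, a Saturday with mist, both the weekend and the not-clear conditions are flagged: `weekday_dummy` is 0 and `weathersit_dummy` is 1.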

Table 3: Dummy Variables

C. DATA VISUALIZATION

Data visualization allows us to understand the data as well as detect possible outliers which need to be removed. Exhibit 1 below shows how many bikes were rented per day in 2011 and 2012. The chart shows a positive trend with strong seasonality: bike sharing is popular in warmer months, while the number of rentals decreases during the winter, and the overall number increased over the two years.

Historical Bike Count (Exhibit 2) compares the two years. The green line is the number of bikes rented during 2011, while the blue line is for 2012. This exhibit shows that the total number of bikes rented increased from 2011 to 2012. This trend may continue in the future.

Exhibit 1: Bike Count

Exhibit 2: Historical Bike Count for 2011 and 2012

Exhibit 3 (shown below) is a histogram showing the frequency of the daily number of bikes rented. The counts are roughly normally distributed, with a peak in the 4001-5000 bin. On the other hand, some bins, such as 1-1000 and 8001-9000, show a significantly lower frequency. For further analysis, we may need to eliminate those days.

Exhibit 3: Number of Bikes Rented
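The binning behind a histogram like Exhibit 3 can be sketched as follows. This is a minimal illustration, assuming 1000-wide bins that start at 1 (1-1000, 1001-2000, and so on) and daily counts of at least 1; the function names and the sample counts are hypothetical, not taken from the report's data.

```python
from collections import Counter

def bin_label(count, width=1000):
    """Return the label of the bin that contains a daily count (count >= 1)."""
    start = ((count - 1) // width) * width + 1
    return f"{start}-{start + width - 1}"

def histogram(counts, width=1000):
    """Tally how many days fall into each bin."""
    return Counter(bin_label(c, width) for c in counts)

# Hypothetical daily counts, for illustration only.
sample_counts = [985, 1000, 1001, 4200, 4500, 4999, 8300]
sample_histogram = histogram(sample_counts)
```

Bins with very few days, such as the lowest and highest ones in the exhibit, are the ones to inspect before deciding whether to drop them.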

To make sure the data set does not have any significant outliers, we used a heat map (Table 4).

Table 4: Heat Map for Temperature, Humidity, Wind speed, and Count

In conclusion, the dataset contains seasonality and a trend, and the heat map shows no outliers or extreme values. However, we may need to consider days that have lower counts for bikes rented in the model selection process.

D. DATA REDUCTION

The most significant difference in climate between Maryland and California is temperature. During the winter, Maryland gets much colder than California, while average summer temperatures are similar. Maryland reaches an average high temperature of 89°F in July and an average low temperature of 29°F in January. In California, the average high temperature in July is 92°F, which is close to Maryland, but the average low temperature in January is 39°F (U.S. climate data, 2015). Therefore, the UCI bike sharing data set, which is based on Maryland, needs to be adjusted to the climate in California. We removed unnecessary data from the original dataset in three steps.

First, we determined the threshold value for data elimination. The threshold value is the temperature used to detect days in the Maryland dataset which need to be removed. We set the threshold value at 40°F because the average low temperature in California is 39°F, while in Maryland it is 29°F (U.S. climate data, 2015).

Next, since temperatures in the original dataset are normalized, we used separate climate data for Maryland covering the same period as the original dataset, January 2011 to December 2012. With this non-normalized temperature data, we identified the days with temperatures below the threshold value of 40°F. We then normalized the entire new dataset and removed the days that were below the threshold. After the elimination, the lowest normalized temperature in the new data set was 0.5135.

Finally, we removed days where the normalized temperature was lower than 0.50 in the UCI bike sharing dataset. Although the UCI dataset uses Celsius while our climate data uses Fahrenheit, the normalization process eliminates this difference. After the elimination, 345 rows (days) remained out of the 731 observations in the original dataset. We normalized the temperature for the remaining 345 rows, so the UCI dataset is fully adjusted to the climate in California.
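The claim that normalization removes the Celsius/Fahrenheit difference can be checked with a short sketch. It assumes min-max normalization computed over the same set of days in both units, which is an assumption about the method rather than something the report states; the temperatures below are hypothetical, and the 0.50 cutoff and the 345 of 731 row counts are the only values taken from the report.

```python
def min_max(values):
    """Min-max normalize a list of values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

# Hypothetical daily temperatures in Celsius, for illustration only.
temps_c = [-5.0, 0.0, 10.0, 20.0, 35.0]
temps_f = [c_to_f(t) for t in temps_c]

norm_c = min_max(temps_c)
norm_f = min_max(temps_f)   # matches norm_c up to floating-point rounding

# Keep only days at or above the report's normalized cutoff of 0.50.
kept = [t for t in norm_c if t >= 0.50]

# Share of observations remaining after the report's reduction.
remaining_share = 345 / 731
```

Because the unit conversion is a linear rescaling, it cancels out in min-max normalization. This holds only when the minimum and maximum come from the same days in both data sets.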

E. MODEL SELECTION

We tried different combinations of variables in the regression analysis in order to develop the best regression model. First, we discarded the variables holiday, weekday, working day, and atemp from the original bike sharing data set, and used the total count instead of the casual count and registered count. After several attempts and careful consideration, we used only the variables season, weathersit, temperature, and wind speed, and came up with two regression models for forecasting the counts of a bike sharing system.

The first regression model forecasts the counts for bike sharing systems during the introduction phase of the system, which is the first three months. In order to improve the R Square and MAPE, we deleted outliers with a count under 1000. After we deleted those outliers from the first three months, the number of observations was 79 instead of 90. The result is shown in Exhibit 4.

Exhibit 4: Regression Model for Introduction Phase of Bike Sharing System

As you can see in Exhibit 4, the R Square of this model is nearly 60%, which is reasonable, and the p-values of the independent variables are acceptable except for weathersit. We also ran the regression model without weathersit and found that the R Square remained around 60%, but the p-values of season and wind speed became larger. Therefore, we decided to keep the weathersit variable.

In conclusion, the equation of the regression model will be:

Forecast Count = 649.14 + 181.62 * (season) - 241.95 * (weathersit) + 4099.82 * (temperature) - 852.53 * (wind speed)
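The introduction-phase equation can be written as a small function. This is a minimal sketch: only the coefficients come from the report, while the function name, the assumption that temperature and wind speed are supplied in normalized form, and the sample inputs are chosen here for illustration.

```python
def forecast_intro_count(season, weathersit, temperature, wind_speed):
    """Introduction-phase forecast using the coefficients reported above."""
    return (649.14
            + 181.62 * season
            - 241.95 * weathersit
            + 4099.82 * temperature
            - 852.53 * wind_speed)

# Hypothetical day: season 1, clear weather (dummy value 0),
# normalized temperature 0.5, normalized wind speed 0.2.
intro_example = forecast_intro_count(season=1, weathersit=0,
                                     temperature=0.5, wind_speed=0.2)
print(round(intro_example, 2))
```

The signs of the coefficients match the intuition in the text: warmer days raise the forecast, while windier or less clear days lower it.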

In order to better verify our model, we decided to use this equation on the historical data and compare the historical fitted value with actual value. Exhibit 5 illustrates the comparison between historical fitted value and actual value.

Exhibit 5: Historical Fitted Value and Actual Value in Introduction Phase of Bike Sharing System

From Exhibit 5, we can see that this model is suitable for forecasting the count for bike sharing systems in the introduction phase. The fitted values roughly follow the actual count data, with a MAPE of 14.59% and a MAD of 235.75. Both error measures are acceptable, so we decided to use this model as the final model to forecast the count for bike sharing systems in the introduction phase for Huntington Beach, Bakersfield, and Berkeley.
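The two error measures quoted above can be computed as in the sketch below. It uses the standard definitions of mean absolute percentage error (MAPE) and mean absolute deviation (MAD); the report does not show its own calculation, so treating these definitions as the ones used is an assumption, and the sample values are hypothetical.

```python
def mape(actual, fitted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100 * sum(abs(a - f) / a for a, f in zip(actual, fitted)) / len(actual)

def mad(actual, fitted):
    """Mean absolute deviation between actual and fitted values."""
    return sum(abs(a - f) for a, f in zip(actual, fitted)) / len(actual)

# Hypothetical actual and fitted daily counts, for illustration only.
actual_counts = [1000, 2000]
fitted_counts = [900, 2200]

example_mape = mape(actual_counts, fitted_counts)
example_mad = mad(actual_counts, fitted_counts)
```

MAPE expresses the error relative to each day's actual count, while MAD stays in units of bikes, which is why the report quotes both.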

The second regression model forecasts the counts of a bike sharing system after the introduction phase, which we call the growth phase. Our regression models cover only an introduction phase and a growth phase because the original bike sharing data set has only two years’ worth of data. Given this time limitation, we do not think we are able to develop a third regression model tracking the maturity phase of a bike sharing system. We again deleted outliers with a count under 1000, so the number of observations was 610 instead of 641. The result is shown below in Exhibit 6.

Exhibit 6: Regression Model for Growth Phase of Bike Sharing System

As you can see in Exhibit 6, the R Square of this model is only 25%. However, this is the highest R Square we could obtain for the second regression model. A low R Square does not necessarily mean the model is bad, so we need to examine the model further using other measures. Despite the low R Square, all variables are statistically significant.

In conclusion, the equation of the regression model will be:

Forecast Count = 2733.82 + 177.95 * (season) - 1960.69 * (weathersit) + 4064.74 * (temperature) - 1573.05 * (wind speed)
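To show how the growth-phase coefficients can be read, the sketch below evaluates the equation for a clear day and a not-clear day with all other inputs held fixed. Only the coefficients come from the report; the function name, the dictionary layout, and the sample inputs are hypothetical.

```python
# Growth-phase coefficients as reported above.
GROWTH_COEFFICIENTS = {
    "intercept": 2733.82,
    "season": 177.95,
    "weathersit": -1960.69,
    "temperature": 4064.74,
    "wind_speed": -1573.05,
}

def forecast_growth_count(season, weathersit, temperature, wind_speed):
    """Growth-phase forecast using the coefficients reported above."""
    c = GROWTH_COEFFICIENTS
    return (c["intercept"]
            + c["season"] * season
            + c["weathersit"] * weathersit
            + c["temperature"] * temperature
            + c["wind_speed"] * wind_speed)

# Hypothetical day: season 2, normalized temperature 0.6, wind speed 0.2.
clear_day = forecast_growth_count(2, 0, 0.6, 0.2)
not_clear_day = forecast_growth_count(2, 1, 0.6, 0.2)

# The gap between the two forecasts equals the weathersit coefficient.
weather_effect = clear_day - not_clear_day
```

With everything else equal, a not-clear day lowers the growth-phase forecast by about 1960.69 rentals, a much larger weather penalty than the 241.95 in the introduction-phase equation.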

In order to better verify our model, we decided to use this equation on the historical data and compare the historical fitted value with actual value. Exhibit 7 illustrates the comparison between historical fitted value and actual value.

Exhibit 7: Historical Fitted Value and Actual Value in Growth Phase of Bike Sharing System

From Exhibit 7, we found that this model does not forecast the count of the bike sharing system in the growth phase well. Although the second model is not very accurate, it roughly follows the trend of the actual count data. This model has a MAPE of 23.49% and a MAD of 1093.19, and both error measures are the lowest we could obtain. Therefore, we chose this regression model as our final second model to forecast the count of bike sharing systems in the growth phase for Huntington Beach, Bakersfield, and Berkeley.

IV. MODEL ADOPTION

We assumed that Huntington Beach, Bakersfield, and Berkeley will start the bike sharing system in the beginning of 2016.

A. HUNTINGTON BEACH

Exhibit 8 shows the forecast count of bike sharing system in Huntington Beach in the introduction phase in 2016.

Exhibit 8: Forecast Count of Bike Sharing System in Huntington Beach in Introduction Phase

The average forecast count of the bike sharing system in the introduction phase in Huntington Beach is 2175. Exhibit 9 shows the forecast count of the bike sharing system in Huntington Beach in the growth phase from April 2016 to December 2017.

Exhibit 9: Forecast Count of Bike Sharing System in Huntington Beach in Growth Phase

The average count of the bike sharing system in growth phase in Huntington Beach is 5246.

B. BAKERSFIELD

The forecast count of bike sharing system in Bakersfield in the introduction phase in 2016 is shown in Exhibit 10.

Exhibit 10: Forecast Count of Bike Sharing System in Bakersfield in Introduction Phase

The average count of the bike sharing system in introduction phase in Bakersfield is 1649.

Exhibit 11: Forecast Count of Bike Sharing System in Bakersfield in Growth Phase

The average count of the bike sharing system in growth phase in Bakersfield is 5027.85 (see Exhibit 11).

C. BERKELEY

The forecast count of bike sharing system in Berkeley in the introduction phase in 2016 is shown in Exhibit 12.

Exhibit 12: Forecast Count of Bike Sharing System in Berkeley in Introduction Phase

The average count of the bike sharing system in introduction phase in Berkeley is 1674.

Exhibit 13: Forecast Count of Bike Sharing System in Berkeley in Growth Phase

The average count of the bike sharing system in growth phase in Berkeley is 4048 (see Exhibit 13).

D. CITY COMPARISON FOR INTRODUCTION AND GROWTH PHASE

The comparison between the count in introduction phase and the count in growth phase in these three cities is in Table 5.

City                Count in Introduction Phase    Count in Growth Phase
Huntington Beach    2175                           5246
Bakersfield         1649                           5027
Berkeley            1674                           4048

Table 5: City Comparison for Introduction and Growth Phase

As you can see in Table 5, Huntington Beach has the highest average forecast count in both the introduction phase and the growth phase among these three cities. In conclusion, we recommend implementing a bike sharing system in Huntington Beach in 2016, because it has the highest forecast ridership of the three cities and could therefore generate more profit than Bakersfield or Berkeley.
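The comparison in Table 5 can also be expressed as a short ranking. The values below are the average forecast counts from Table 5; the variable names and the data layout are hypothetical.

```python
# Average forecast counts from Table 5: (introduction phase, growth phase).
city_forecasts = {
    "Huntington Beach": (2175, 5246),
    "Bakersfield": (1649, 5027),
    "Berkeley": (1674, 4048),
}

# City with the highest average count in each phase.
best_intro = max(city_forecasts, key=lambda city: city_forecasts[city][0])
best_growth = max(city_forecasts, key=lambda city: city_forecasts[city][1])

# Cities ordered from highest to lowest growth-phase count.
growth_ranking = sorted(city_forecasts,
                        key=lambda city: city_forecasts[city][1],
                        reverse=True)
```

Berkeley is slightly ahead of Bakersfield in the introduction phase but falls behind it in the growth phase, while Huntington Beach leads in both.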

V. LIMITATION OF ANALYSIS

The original UCI bike sharing dataset is large and clean, with no missing rows; however, this project has some limitations. First, the climate of Maryland, which is where the UCI dataset is from, is distinct from the climates of cities in California: Maryland is relatively colder. We adjusted the dataset to the California climate, but this led to a large amount of the original data being eliminated. Since the accuracy of data mining and analysis depends heavily on the quality and quantity of data, this might have lowered the accuracy and made both the model selection process and the model adoption process more difficult. Moreover, cities in California do not always provide complete historical weather and climate information. Their datasets often have missing variables and rows, which could also have prevented us from reaching the best accuracy.

Next, the original UCI bike sharing dataset only has data related to weather and holidays/weekends. It does not contain variables describing what types of people use the bike sharing program. We assume the popularity of a bike sharing program could depend on age, ethnicity, and characteristics of the city. If a city is relatively small, people may use the program more often, since the distance between a station where the program is offered and a user’s destination may be short. This would increase the number of bikes rented. On the other hand, if a city is large and the distance between a station and a user’s destination is long, a user may choose to take a car instead of renting a bike. Nevertheless, in this research we did not consider demographic information or the characteristics of the cities.

VI. CONCLUSION

This project set out to choose the city in California that is most suitable for a bike sharing program, based on the UCI bike sharing dataset. The three California cities considered were Berkeley, Bakersfield, and Huntington Beach. Through visualization, we found that the UCI dataset contains very clean data with no missing values. After this analysis, the dataset was adapted to the climate in California. Next, we built two multiple linear regression models: one for the introduction phase and one for the growth phase. The introduction-phase model achieved an acceptable R Square and MAPE, the measures we used to assess model accuracy; the growth-phase model had a lower R Square (about 25%), but its error measures were the lowest we could obtain. Finally, by applying the models to the climate data of the three cities, we concluded that Huntington Beach is the most appropriate location for introducing a bike sharing program.

BIBLIOGRAPHY

Bakersfield, California. (n.d.). Retrieved from United States Census Bureau: http://quickfacts.census.gov/qfd/states/06/0603526.html

Bakersfield. (n.d.). Retrieved from Wikipedia: http://en.wikipedia.org/wiki/Bakersfield,_California

City of Berkeley. (2015). About Berkeley. Retrieved from https://www.cityofberkeley.info/

Climate. (n.d.). Retrieved from City of Huntington Beach, California: http://www.huntingtonbeachca.gov/about/climate/

Demographic Data. (n.d.). Retrieved from City of Huntington Beach, California: http://www.huntingtonbeachca.gov/business/demographics/

Fanaee-T, H., & Gama, J. (2013). Bike Sharing Dataset Data Set. Retrieved from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

Historical Weather For 2013 in Bakersfield, California, USA. (2013). Retrieved from Weather Spark: https://weatherspark.com/history/29736/2013/Bakersfield-California-United-States

History of Bike Sharing. (n.d.). Retrieved from College of Charleston: http://bike.cofc.edu/bike-share-program/history.php

Meddin, R., & DeMaio, P. (n.d.). The Bike Sharing Map. Retrieved from Bike Sharing Map: www.bikesharingmap.com

U.S. Climate Data. (2015). Climate Berkeley - California. Retrieved from http://www.usclimatedata.com/climate/berkeley/california/united-states/usca0087