Indiana University Bloomington CSCI-B565 Data Mining January 23, 2016 IU Bus Route Optimization Group Name : Octo-Duck Authors: Bo Henderson Arpit Khandelwal Kevin O’Neill Chathuri Peli Kankanamalage Supervisor: Prof Mehmet M. Dalkilic


  • Indiana University Bloomington

    CSCI-B565

    Data Mining

    January 23, 2016

    IU Bus Route Optimization

    Group Name: Octo-Duck

    Authors: Bo Henderson, Arpit Khandelwal, Kevin O’Neill, Chathuri Peli Kankanamalage

    Supervisor: Prof. Mehmet M. Dalkilic

  • Contents

    1 Executive Summary

    2 Data
      2.1 What we have
      2.2 Understanding Data
        2.2.1 Routes
        2.2.2 Ridership
        2.2.3 Work Records
        2.2.4 Schedules
        2.2.5 Stops
      2.3 Problems with Data
        2.3.1 Data Inconsistencies

    3 Solution
      3.1 Technologies Used
      3.2 Data Schema
        3.2.1 Route Modeling
        3.2.2 Adding Weather data from NOAA
        3.2.3 Database Structure
      3.3 Data Processing
        3.3.1 Confirmation of Route Progression
        3.3.2 Generation of Route Tables
        3.3.3 Determination of the Time Between Stops
        3.3.4 Determining weather effects
        3.3.5 Java Program
      3.4 Data Cleaning

    4 Results
      4.1 Average Ridership, Average Dwell time and Average time between stops
        4.1.1 Time Between Stops Throughout The Day
      4.2 How Weather affects the timing of Bus stops and ridership

    5 Discussion
      5.1 Implications of Travel Time Throughout the Day
        5.1.1 Problem with Bus Sensor Location
      5.2 Ideas to improve IU Bus System
      5.3 Future Works

  • 1 Executive Summary

    The Indiana University bus system is the major source of transportation for most IU students. Students depend on the service for most of their day-to-day activities. The IU bus system contains 4 major routes, which are designed to cover the most populated areas around the university. With increasing numbers of students admitted to the university, demand for a better and more efficient IU bus service is growing. The IU Bus system has provided us with their data for the Fall 2014 and Spring 2015 semesters. This data contains schedule information, real-time bus location information from DoubleMap, ridership information, work records and service books of drivers. The IU bus system asked us to analyze this data in order to accomplish the following goals:

    • Determine the difference between the scheduled and actual arrival time of the buses.

    • Determine the effect of various factors on stop timings including weather and passenger counts.

    • Determine the average travel time between any two stops.

    • Provide suggestions to optimize the service.

    In this report, we hope to provide some insight into the data and show our efforts toward accomplishing these goals.

    Our results show that there could be improvements to the way data is collected and stored. One possible way to improve the timing of routes would be to build a pedestrian bridge between Kelly and Wells.

    We found that dwell times at major stops, and travel times between them, increase in correlation with breaks between classes. However, the dwell times at the turnaround points of routes and the travel times from these stops show little relation to the time of day. We also found that average daily ridership increased by 5% when there was precipitation and by 9% when the average daily temperature was below freezing.

    2 Data

    The IU Bus system provided us with data files that contain data related to their routes, schedules, stops, ridership, service books and work records.

    2.1 What we have

    • Access database files containing DoubleMap data for Fall 2014 and Spring 2015

    • Access database files containing ridership data for Fall 2014 and Spring 2015

    • intervaldata2014-2015.tsv.gz containing raw data

    • GPS coordinates for stops

    • Service books for Fall 2014 and Spring 2015

    • Weather data from NOAA

    2.2 Understanding Data

    2.2.1 Routes

    The IU bus system is broadly divided into 4 major routes, namely Route A, Route B, Route E and Route X. Each of these routes is then sub-divided according to schedule and term (Fall or Spring). The basic objective is to increase the number of passengers carried while maintaining a minimum frequency of buses. For each major route there are also temporary or detour routes, identified as A Route S-U, A Route F, A Route Detour and so on. Each route is uniquely identified by a route ID. For this project we are only interested in the main routes for Monday-Thursday.


  • 2.2.2 Ridership

    The Ridership database records how many people each bus served. These counts are recorded by the bus drivers as inbound counts and outbound counts, so instead of a separate count for each stop we have a cumulative number covering a group of stops. The 3rd & Jordan stop is the pivot point between the inbound and outbound directions. All passengers boarding between the starting point and 3rd & Jordan are included in the inbound count, and riders boarding after this stop are included in the outbound count.

    2.2.3 Work Records

    The Work Records table contains information about the drivers: the bus they are driving, the schedule and the shift they are working. This table gives us a way to link the buses with their corresponding schedule on any particular day.

    2.2.4 Schedules

    The IU Bus system has scheduled a time at which the buses arrive at all major stops. These times are a reference point for the expected location of a bus at any given time of the day, provided it is not delayed. The actual map contains more stops than are defined in the schedule. For the A route, there are only 4 major stops: Stadium, 3rd & Jordan, Wells Library and IMU. Different schedules are defined for each bus. For the A route there are 7 buses, defined as A1, A2, A3, ..., A7. A schedule id encodes the day of the week, route name, bus number for that route and shift number. For example, the schedule id T-A1.2 means Tuesday, A route, 1st bus, 2nd shift.
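
    The schedule id layout described above can be illustrated with a small parser. This is only a sketch: the function name parse_schedule_id and the regular expression are our own assumptions, inferred from the single example T-A1.2, and the day code is returned as-is because the report only documents that T stands for Tuesday.

```python
import re
from typing import NamedTuple


class ScheduleId(NamedTuple):
    day: str    # day-of-week code exactly as it appears in the id, e.g. "T"
    route: str  # route letter, e.g. "A"
    bus: int    # bus number within the route
    shift: int  # shift number


# Assumed layout: <day code>-<route letter><bus number>.<shift number>
_PATTERN = re.compile(r"^([A-Za-z]+)-([A-Z])(\d+)\.(\d+)$")


def parse_schedule_id(schedule_id: str) -> ScheduleId:
    """Split a schedule id such as 'T-A1.2' into its four components."""
    match = _PATTERN.match(schedule_id.strip())
    if match is None:
        raise ValueError(f"unrecognized schedule id: {schedule_id!r}")
    day, route, bus, shift = match.groups()
    return ScheduleId(day, route, int(bus), int(shift))


print(parse_schedule_id("T-A1.2"))
```

    Ids that do not follow the assumed layout raise a ValueError rather than being guessed at.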

    2.2.5 Stops

    Each stop is identified by a unique id. Some stops are defined separately as inbound and outbound. The names of outbound stops end with parentheses. For example, "Fisher Court" is the inbound stop while "Fisher Court ()" is the outbound stop.

    Additionally, we have stops of two categories: major time-stops and minor time-stops. We only have the scheduled time for the major time-stops, and these are usually the main points for passengers to board or disembark the buses. The bus will stop at a minor time-stop only if a passenger requests a stop, or if there are passengers waiting to board the bus at that stop.

    2.3 Problems with Data

    In this project, most of our time was spent understanding the data. We had to understand what kind of data is present, what the relationships between the different data entities are, and how the data files map to one another. The first challenge was working with data in many different formats. Instead of a single database, we were given csv files, Access databases, Excel spreadsheets and some images. These differing formats meant that the same data appeared in different forms in different sources. For example, the intervaldata2014-2015.tsv file and the DoubleMap data contain the same data, but in different forms. In the DoubleMap data, the "stop" column holds the name of the stop, but in intervaldata2014-2015.tsv that column holds the stop id. We see similar problems with other data files as well.

    2.3.1 Data Inconsistencies

    When analyzing the data, we came across some data inconsistencies. For some days, data for the first bus does not begin until the middle of the day. For the most part, this behavior occurs at the beginning of the semester. Another problem we see is that some trips end after one or two stops and then start over again after an unusually long recorded time period. These might be erroneously entered data. Also, it was explained to us that a bus needs to be registered in the DoubleMap application in order to record data. For some days, drivers may forget to register their buses at the start of their shifts. Such inconsistencies affected the variance results and needed to be dealt with.

  • Another strange behavior is that the B, E and X routes did not record the time for the starting stop. For example, "Fisher Court" is the starting stop of the B route, but in intervaldata2014-2015.tsv the first entry for the B route is from ZBT, which is the next stop. We see similar behavior for the E and X routes as well.

    3 Solution

    Once we had a better understanding of the existing data, we designed solutions for achieving the project goals. We used a relational database since the data model itself is relational. We initially tried an unstructured database, MongoDB, but concluded that MySQL was the best fit for the problem. We added the basic database tables for raw data such as "intervaldata", "stopIds", "routeIds", "schedules" and so on. Then we used a Python script to create a round trip table for any given route. Once we had the refined data in the database, we wrote a Java program to output results to csv files that were then analyzed in R.

    3.1 Technologies Used

    • MySQL was used for data storage

    • Java, Python and R were used for data processing and analysis

    3.2 Data Schema

    3.2.1 Route Modeling

    Each route was modeled using a table in which the first two columns specify the date and bus id. Two columns follow for each stop on the route, one representing the dwell at that stop and the other representing the trip to the next stop. Two such tables were generated for each route: one was populated with time stamps giving the time of day each event occurred, and the other was populated with integers giving the duration of each event in seconds. For each route, the two tables are aligned so that column n of row m in one table describes the same event as column n of row m in the other. These tables were generated by loops.py from a MySQL table containing the raw data from intervaldata.tsv.

    3.2.2 Adding Weather data from NOAA

    We downloaded csv files containing weather data from NOAA for August 2014 through May 2015, kept only the useful information (precipitation and temperature), and added it to the MySQL database.

    3.2.3 Database Structure

    We initially inserted all the raw data we got from the IU Bus System into the MySQL database. We developed a Python script which generates round trip information from the raw data table, "intervaldata". It generates two csv files, one containing time difference data and the other containing timestamp data. We then inserted them into the MySQL database as shown in Figure 5. The database also contains schedule information and the weather data acquired from NOAA.

  • Figure 1: Tables for A Route

    Figure 2: Tables for B Route

    Figure 3: Tables for E Route

    Figure 4: Tables for X Route

    Figure 5: Data Tables for route information

  • 3.3 Data Processing

    To achieve the project goals, we processed the refined data in the database using Java and R.

    3.3.1 Confirmation of Route Progression

    route progression.R contains a function that pulls all the interval data for a particular route id and filters out the data for stops which are visited infrequently. The threshold for this filter is a user-definable parameter, but its default value of 1000 works for any of the major routes. For each of the most frequently visited stops in a given route id, it then identifies the two stops that most frequently come after it. One of these two is the stop itself, which defines the dwell time, and the other is the next stop in the route. This program defines the order of stops in a route without requiring any background information about the route other than its route id and, optionally, a stop to define the beginning of the loop. This function lets us verify that the progression of stops indicated by the data agrees with what is expected. A few unexpected route progressions were identified by this function.

    For example, the Monday-Thursday X route had the expected stadium stop (→ 76 →) in the Fall, but in the Spring the stop at the stadium was also registered as stop id 68, which resulted in stop 76 being counted twice (→ 76 → 68 → 76 →).

    Identifying deviations like this early led to much smoother data analysis later on. Additionally, for each stop id, this function produces a frequency table of the stops visited after the current one. It was observed that buses would occasionally skip stops, but these situations were rare and were not indicative of the buses' usual behavior.
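
    The idea behind this function can be sketched in a few lines of Python. This is not the R code itself: the name infer_progression, its parameters and the input format (a time-ordered list of stop ids reported by a single bus) are assumptions made for illustration. The sketch takes the most frequent follower other than the stop itself as the next stop, which matches the description above whenever the stop itself is among its two most frequent followers.

```python
from collections import Counter, defaultdict


def infer_progression(events, min_visits=1000, start=None):
    """Infer stop order from a time-ordered list of stop ids reported by one bus."""
    visits = Counter(events)
    # Drop stops that are visited too rarely to be part of the regular loop.
    kept = [stop for stop in events if visits[stop] >= min_visits]

    # Count which stop is reported right after each stop.
    followers = defaultdict(Counter)
    for current, following in zip(kept, kept[1:]):
        followers[current][following] += 1

    # The most frequent follower other than the stop itself is taken as the
    # next stop; the self-transition corresponds to dwelling at the stop.
    next_stop = {}
    for stop, counts in followers.items():
        others = [(n, s) for s, n in counts.items() if s != stop]
        if others:
            next_stop[stop] = max(others)[1]
    if not next_stop:
        return []

    # Walk the loop from the chosen start until a stop repeats.
    current = start if start is not None else min(next_stop)
    order = []
    while current is not None and current not in order:
        order.append(current)
        current = next_stop.get(current)
    return order


# Illustrative data only: a three-stop loop with one rarely seen stop (99).
loop = [76, 76, 1, 1, 2, 2]
events = loop * 3 + [99] + loop * 2
print(infer_progression(events, min_visits=5, start=76))
```

    The per-stop follower counts built inside the function play the role of the frequency table mentioned above.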

    3.3.2 Generation of Route Tables

    loops.py is a command line tool that we created to generate csv files from our intervaldata MySQL table. The tool requires the route id of interest along with a list of stop ids in the correct order (generated using route progression.R). The header of the output csv file has 'to' and 'from' labels for each stop in the route, in addition to date and bus id labels. A MySQL query is generated that extracts only the rows from intervaldata which exactly match what is expected based on the given list of ordered stop ids. The strictness of this query ensures that every piece of data considered fits into one of the columns of the output file. The tool writes one row at a time; each time it encounters data that does not fit the column being considered, it inserts a blank and checks the following column. The remainder of a row is filled with blanks when a new date or bus id is encountered, which ensures that each row contains data for a single day and a single bus.
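
    The row-filling behavior can be illustrated as follows. This is a minimal sketch under stated assumptions rather than the contents of loops.py: the function build_rows and the event layout (date, bus id, from stop, to stop, seconds) are hypothetical, a dwell is assumed to be recorded as an event whose from and to stops are equal, and each stop is assumed to appear only once in the ordered list. Only the duration table is built here; the timestamp table would be filled the same way.

```python
def build_rows(events, stops):
    """Arrange time-ordered events into one row per loop of the route.

    events: iterable of (date, bus_id, from_stop, to_stop, seconds)
    stops:  ordered stop ids for one loop of the route
    """
    # Expected (from, to) pair for each data column:
    # the dwell at a stop, followed by the trip to the next stop.
    expected = []
    for i, stop in enumerate(stops):
        expected.append((stop, stop))
        expected.append((stop, stops[(i + 1) % len(stops)]))
    width = len(expected)

    def finish(key, cells):
        # Pad the unfinished part of the row with blanks.
        return list(key) + cells + [""] * (width - len(cells))

    rows = []
    key, cells = None, []
    for date, bus, src, dst, seconds in events:
        if (src, dst) not in expected:
            continue  # a strict query would never return such a row
        col = expected.index((src, dst))
        # Start a new row on a new date or bus, or when the event belongs
        # to a column that the current row has already passed.
        if key != (date, bus) or col < len(cells):
            if key is not None:
                rows.append(finish(key, cells))
            key, cells = (date, bus), []
        cells.extend([""] * (col - len(cells)))  # blanks for skipped columns
        cells.append(seconds)
    if key is not None:
        rows.append(finish(key, cells))
    return rows


# Illustrative events for a two-stop loop.
sample = [
    ("2014-09-01", "b1", 1, 1, 30),
    ("2014-09-01", "b1", 1, 2, 120),
    ("2014-09-01", "b1", 2, 1, 150),
    ("2014-09-01", "b1", 1, 1, 25),
    ("2014-09-02", "b1", 1, 2, 110),
]
for row in build_rows(sample, [1, 2]):
    print(row)
```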

    3.3.3 Determination of the Time Between Stops

    time between.R contains a function that predicts the time it takes to travel between any two adjacent stops on a route given the time of day. It reads in the two csv files generated by loops.py and defines a maximum allowable number of seconds and an earliest allowable timestamp to filter out erroneous or irrelevant data. For the cleaned data, it creates a cubic spline to approximate how the entire route is affected throughout the day. The file also contains a function that takes this approximation and scales it to fit whichever column of data the user is interested in. It outputs a time prediction and plots the approximation on top of the actual data. Examples of such plots can be found in figures such as Figure 16.
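
    The scaling step can be sketched as a least-squares fit of a scale and a shift. How the R code performs the scaling is not stated here, so the least-squares choice, the name fit_scale_shift and its inputs are assumptions; the route-wide curve is taken as already evaluated at the same times of day as the observations, and the numbers below are made up for illustration.

```python
def fit_scale_shift(shape, observed):
    """Return (a, b) minimizing the squared error of observed[i] ~ a * shape[i] + b.

    shape:    route-wide curve evaluated at each observation's time of day
    observed: durations (seconds) for one column, at the same times
    """
    n = len(shape)
    if n != len(observed) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_s = sum(shape) / n
    mean_y = sum(observed) / n
    var_s = sum((s - mean_s) ** 2 for s in shape)
    if var_s == 0:
        raise ValueError("shape is constant, so the scale is undetermined")
    cov = sum((s - mean_s) * (y - mean_y) for s, y in zip(shape, observed))
    a = cov / var_s           # stretch applied to the route-wide curve
    b = mean_y - a * mean_s   # shift applied after stretching
    return a, b


# Illustrative values: the column is exactly twice the shape plus 3 seconds.
a, b = fit_scale_shift([1.0, 2.0, 3.0, 4.0], [5.0, 7.0, 9.0, 11.0])
print(a, b)
```

    A prediction for a given time of day is then a times the curve value at that time, plus b.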

    3.3.4 Determining weather effects

    There are several R functions within rd weath.r that were used to determine the average variance for each stop time and the average ridership, as well as the average ridership and variance when there is significant precipitation. Another function, precipMean, calculates the mean daily ridership given any precipitation threshold.
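
    A threshold-based mean of this kind might look like the sketch below. It is a Python analogue written for illustration, not a port of precipMean: the names precip_mean and percent_change, the input format and the sample numbers are all assumptions.

```python
def _mean(values):
    return sum(values) / len(values) if values else None


def precip_mean(days, threshold):
    """Mean daily ridership on days above and at-or-below a precipitation threshold.

    days: iterable of (precipitation_inches, daily_ridership)
    Returns (mean_wet, mean_dry); either is None if no day falls in that group.
    """
    days = list(days)
    wet = [riders for precip, riders in days if precip > threshold]
    dry = [riders for precip, riders in days if precip <= threshold]
    return _mean(wet), _mean(dry)


def percent_change(new, base):
    """Relative change of new with respect to base, in percent."""
    return 100.0 * (new - base) / base


# Made-up numbers, for illustration only.
sample = [(0.0, 100), (0.1, 110), (0.0, 100), (0.5, 100)]
wet, dry = precip_mean(sample, threshold=0.025)
print(wet, dry, percent_change(wet, dry))
```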


  • 3.3.5 Java Program

    From the Java program, we got the following results.

    • Time difference for each scheduled stop in each route, for both semesters.

    • Inbound, outbound and total ridership for each scheduled stop in each route during both semesters.

    • Total precipitation for each scheduled stop in each route during both semesters.

    • Average dwell times and average time it takes to travel between two stops for each route during both semesters.

    The above results were generated by joining multiple tables in the database. After retrieving the actual times at which the bus stopped, the challenge was figuring out which schedule each trip belonged to. Data retrieved from the database is assigned to two map objects, each of which uses a start time as the key and holds an array of the other stop times, in the same order as the schedule, as the value. One map contains the schedule times and the other contains the actual arrival times of the buses on a given route. Comparing the keys of the two maps allows us to identify which schedule an actual time belongs to. Once the schedule is identified for the starting stop, we assume the other stop times follow, since we took the data from the route table.

    The Java program generates CSV files containing the information listed above so that our R code can process them to find the average time differences and the effects weather had on the time differences.
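
    The two-map comparison can be sketched as below, in Python rather than Java to keep the examples in one language. How the keys are compared is not spelled out above, so matching each actual start time to the nearest scheduled start within a tolerance is an assumption, as are the names match_schedules and differences and the use of seconds after midnight.

```python
def match_schedules(scheduled, actual, tolerance=600):
    """Map each actual start time to the closest scheduled start time.

    Both arguments map a start time (seconds after midnight) to the list of
    times at the following stops. Actual trips with no scheduled start within
    `tolerance` seconds are left unmatched.
    """
    matches = {}
    if not scheduled:
        return matches
    for start in actual:
        nearest = min(scheduled, key=lambda s: abs(s - start))
        if abs(nearest - start) <= tolerance:
            matches[start] = nearest
    return matches


def differences(scheduled, actual, matches):
    """Actual minus scheduled time at each following stop, per matched trip."""
    return {
        a_start: [a - s for a, s in zip(actual[a_start], scheduled[s_start])]
        for a_start, s_start in matches.items()
    }


# Illustrative times only: two scheduled trips and two observed trips.
scheduled = {28800: [29100, 29400], 30600: [30900, 31200]}
actual = {28860: [29220, 29400], 40000: [40300, 40600]}
matches = match_schedules(scheduled, actual)
print(matches)
print(differences(scheduled, actual, matches))
```

    The second observed trip starts too far from any scheduled start and is therefore left out of the comparison.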

    3.4 Data Cleaning

    Because we discovered many inconsistencies in the data, we needed to clean and transform the data so it could be processed. When calculating average dwell times and average travel times between two stops, some entries contained times outside a reasonable range (greater than 1800 seconds). We filtered out these entries when we executed the queries.
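
    The filter was applied in the database queries; the sketch below shows the same rule in Python for illustration. The 1800-second cutoff comes from the text above, while the function name average_duration, the treatment of missing values and the rejection of negative durations are assumptions.

```python
MAX_SECONDS = 1800  # cutoff stated in the text


def average_duration(durations, max_seconds=MAX_SECONDS):
    """Average of the durations that fall inside the accepted range.

    Missing values (None) and, as an added assumption, negative values are
    ignored. Returns None when nothing is left after filtering.
    """
    kept = [d for d in durations if d is not None and 0 <= d <= max_seconds]
    return sum(kept) / len(kept) if kept else None


print(average_duration([30, 60, 5000, None]))
```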

  • Day  Average Inbound     Average Outbound    Average Total
         Fall      Spring    Fall      Spring    Fall      Spring
    1    39.4677   33.7125   39.4677   17.7031   64.1516   51.4156
    2    43.2041   40.3374   43.2041   32.5932   78.7226   72.9306
    3    42.6085   38.2825   42.6085   29.8300   77.2175   68.1125
    4    43.0337   40.0299   43.0337   32.9440   79.4872   72.9739
    5    41.2488   38.0362   41.2488   30.6050   75.2662   68.6413
    6    33.3286   27.5380   33.3286   25.8522   64.2431   53.3902
    7    38.1283   25.1026   38.1283   20.9039   77.6903   46.0066

    Table 1: Day-wise Ridership for A-Route

    Day  Average Inbound     Average Outbound    Average Total
         Fall      Spring    Fall      Spring    Fall      Spring
    1    18.4121   18.4947   18.0503   19.5532   36.4623   38.0479
    2    24.4549   24.2020   24.5329   25.1163   48.9878   49.3182
    3    24.9090   24.8081   26.9764   24.9230   51.8853   49.7310
    4    24.8842   24.1246   26.6702   25.8378   51.5544   49.9624
    5    25.0823   24.9858   26.1708   24.8713   51.2531   49.8571
    6    21.1352   20.7314   20.4535   18.8857   41.5887   39.6171
    7    26.1803   21.6288   22.2066   17.8221   48.3869   39.4509

    Table 2: Day-wise Ridership for E-Route

    4 Results

    4.1 Average Ridership, Average Dwell time and Average time between stops

    We calculated average ridership for all the routes by day of the week. Tables 1, 2, 3 and 4 show the average ridership results. We also calculated the average dwell time for each stop. Figure 15 shows the average dwell time for each stop in each route. For the A route, 10th & Woodlawn has the highest dwell time compared to the other stops. For the B route, 10th & Jordan has the highest dwell time, and for the E route, 10th & Woodlawn has the highest dwell time. For the X route, the Stadium has the highest dwell time.

    We also calculated the average travel time between two adjacent stops for each route. Figure 10 shows the results. For the A route, the highest travel time is from Stadium to Alumni Center. For the B route, the highest travel time is from Fisher Court to ZBT. For the E route, Evermann Redbud to Bicknell takes the longest. For the X route, Stadium(X) to 7th & Woodlawn has the highest travel time.

    Day  Average Inbound   Average Outbound   Average Total
    1    9.1071            8.5268             17.6339
    2    23.0754           26.5284            49.6037
    3    22.1055           24.2944            46.3999
    4    21.1468           24.8453            45.9922
    5    21.3376           23.5704            44.9080
    6    14.9111           16.2666            31.1777
    7    7.0089            6.4018             13.4107

    Table 3: Day-wise Ridership for B-Route Spring

  • Day  Average Inbound   Average Outbound   Average Total
    2    17.5022           16.4880            33.9902
    3    17.3141           16.0594            33.3735
    4    17.3289           15.8987            33.2276
    5    16.2712           15.4155            31.6867
    6    14.0027           12.1891            26.1917

    Table 4: Day-wise Ridership for X-Route Spring

    Figure 6: Travel Time Between Stops for A

    Figure 7: Travel Time Between Stops for B

    Figure 8: Travel Time Between Stops for E

    Figure 9: Travel Time Between Stops for X

    Figure 10: Travel Time Between Stops

  • Figure 11: Dwell Time Between Stops for A

    Figure 12: Dwell Time Between Stops for B

    Figure 13: Dwell Time Between Stops for E

    Figure 14: Travel Time Between Stops for X

    Figure 15: Dwell Time Between Stops

  • 4.1.1 Time Between Stops Throughout The Day

    Figures 16, 17, 18, and 19 were generated by time between.R, with the y axis showing the number of seconds it took to travel between GPS checkpoints. The red line shows the predicted travel time between stops. It has the same shape across all stops in a given route, but it is shifted and stretched to fit the data at any particular stop.

    4.2 How Weather affects the timing of Bus stops and ridership

    We downloaded weather data from NOAA for August 2014 through May 2015. This data was matched with the Monday-Thursday data for the Fall and Spring semesters that we obtained from DoubleMap. See Figure 20 for the A and B routes, and Figure 21 for the E and X routes. After analyzing the effects of the weather across all times of day, we see that there is not much change in the arrival time versus the scheduled time. This can be explained by the fact that, across the days on which it rains, it does not rain at the same time of day, so we should expect the effect of precipitation to even out over time. However, as you can see in Figure 22, there is a significant difference in the total ridership for the day. We found that the most significant change occurred when there was more than 0.025 inches of rain or snow. When the total precipitation for the day reached that amount, ridership increased by about 5%. On Fridays, we found that ridership actually dropped by 17.8% when there was more than 0.025 inches of precipitation. Since there are not as many classes on Friday, students most likely choose to stay at home on these days. Also shown in this chart is the effect of temperature on ridership. The results show that when the average temperature of the day was below freezing, ridership increased by 9.18%.


  • 5 Discussion

    5.1 Implications of Travel Time Throughout the Day

    As is evident from any of Figures 16, 17, 18, or 19, neither the time it takes to travel between stops nor the dwell times are static throughout the day. Travel times experience regular spikes throughout the day, and these spikes are believed to correspond to breaks between classes. This belief is supported by the observations that the spikes are about an hour apart and that the biggest spikes in travel times occur in early afternoon, when campus is most crowded. Furthermore, every route displays spikes in travel time at about the same times of day; notice the similarity in shape between the time-prediction curves for the A route (Figure 17) and the B route (Figure 16).

    The dwell time at Briscoe (Figure 17) shows a lack of data points around 20 seconds; instead, they are clustered around 10 seconds or are greater than 30 seconds. This is believed to be due to the bus frequently not stopping at Briscoe. It is a minor stop, so the bus is able to skip it frequently, and the relative lack of traffic near this stop means the bus does not have to slow down if the driver sees that the stop is empty. In contrast, the Jordan Hall stop (Figure 16) barely shows this gap at all. This is unsurprising, as it is a major stop that can be skipped only rarely, and even when it can be skipped, 3rd Street is a busy street with several stop lights, so buses proceed slowly no matter what.

    The travel times to and from Fisher Court are much more irregular than any other travel time in the B route. For example, Figure 18 shows the travel time from Fisher Court to ZBT. Fisher Court is the turnaround point for the B route, and this is believed to contribute to this part of the route being much more independent of the time of day. The predictive power of our spline curve is similarly poor for other end points such as the stadium.

    In contrast to the time-independent nature of the ends of routes, the Kelly to Wells travel time (Figure 19) displays very sharp spikes. This strengthens our claim that the spikes correspond to the time between classes, because the student crosswalk between Kelly and Wells is very crowded at these times. Traffic comes to a stop for the 5-15 minutes between classes, contributing to travel times upwards of 5 minutes between two stops that are otherwise very close to each other.

    5.1.1 Problem with Bus Sensor Location

    The data shows that the average travel time between the stadium and alumni hall is about 15 minutes, while the dwell time at the stadium is, on average, less than a minute. This goes against what was expected, as the bus is known to wait at the stadium between route loops. This leads us to believe that the bus is waiting outside the radius of stop 67. We propose that the bus sensor be moved closer to where the buses actually wait, to ensure that the data being collected is representative of what is actually happening.

    5.2 Ideas to improve IU Bus System

    • Build a pedestrian bridge over 10th street between SPEA/Kelly and Wells.

    • Collect data in a better way. Instead of relying on what drivers enter, data collection could be automated, which would give a more accurate data set.

    • Instead of having data in multiple places in multiple formats, unify the data schema, which would greatly reduce the complexity of the existing data models.

    • The current data model is complex. We feel that its complexity could be reduced.

    5.3 Future Works

    We could not work on the optimization tasks due to time constraints. We believe our system can be extended to achieve the optimization goals in the future.

  • • We identified how factors like ridership and weather affect the time variances. A prediction system could take these factors into account and predict the variance under given conditions.

  • Figure 16: Dwell time at Jordan Hall throughout the day (B Route)

  • Figure 17: Dwell time at Briscoe throughout the day (A Route)

  • Figure 18: Time to get from Fisher Court to ZBT throughout the day (B Route)

  • Figure 19: Time to get from Kelly School to Wells throughout the day (A Route)

  • Figure 20: Effects of Weather on the Timing of the Bus, Routes A and B

  • Figure 21: Effects of Weather on the Timing of the Bus, Routes E and X

  • Figure 22: Effects of Weather on Total Ridership