
Data Processing Methodology for Big Data in Public Transportation

Anila Cyril 1, Dr. Raviraj H. Mulangi 2, Dr. Varghese George 3

1 Research Scholar, 2 Assistant Professor, 3 Professor

1,2,3 Department of Civil Engineering, National Institute of Technology, Surathkal

June 11, 2018

Abstract

Performance evaluation, optimization and increasing the patronage of public transport are key requirements of a sustainable transportation system. In recent years, big data sets related to passenger and operational characteristics of public transport have been generated and archived through technological advancements like electronic ticketing machines (ETM). However, the ETM data has not been explored thoroughly for transportation planning, although it is nowadays collected and compiled by public transport undertakings on a regular basis. ETM is a big data source with average monthly passenger transactions of approximately one million, which can be audited to obtain passenger demand, operator performance and service effectiveness. The link volume required for bus scheduling can be effectively mapped on the public transport network using this data. The Origin-Destination (OD) matrix of the bus commuters, which is the key to travel demand modelling, derived from this data can provide a firm basis for planning and decision making in the transit industry. Also, the load profile, load rate of bus stops and other statistical details pertaining to performance and line planning can be drawn from the ETM data. This paper explores the possibility of using the archived fare transaction data from tickets sold using ETM in Kerala State Road Transport Corporation (KSRTC) buses for passenger demand estimation and proposes a methodology for data processing and determination of the OD matrix using MATLAB software and the Python programming language.

International Journal of Pure and Applied Mathematics, Volume 120, No. 6, 2018, 649-660. ISSN: 1314-3395 (on-line version). url: http://www.acadpubl.eu/hub/. Special Issue.

Key Words: big data; data processing; OD matrix; link load; electronic ticketing machine; public transport

1 INTRODUCTION

Rapid urbanization and motorization tend to bring about great change in the connectivity and travel needs of people. Transportation planners and engineers have to forecast how travel demand responds to changes in the attributes of the public transportation system and of the people using it. Estimates of public transport demand are important for strategic and operational planning of public transport. In addition, accurate prediction of traffic loads and proper assignment on the network links are requisite for public transport planning. Developed countries have formulated many approaches for demand modeling and transport planning, but developing countries, which are becoming significant in the world, are still struggling to develop efficient modeling techniques due to a lack of sufficient data and allied resources. In recent years, researchers have focused on identifying and evaluating various approaches for automated data collection and analysis, which will curb the cost and time requirements of transportation planning procedures.

Recently in India, most public transport operators have installed Electronic Ticketing Machines (ETM) for issuing tickets to passengers and for fare collection. With the use of ETM, a large amount of data related to the number of passengers, fare collected, operated kilometers and passenger kilometers is generated and stored. This data is available for the entire public transport network and fleet. The ETM data can be used to estimate passenger demand, improve the operational profit and evaluate performance. The archived ETM data is commonly used for performance evaluation and for improving the operational profit, but the potential of the ticket data for passenger demand estimation and forecasting has not been explored. Also, tools need to be developed for extracting the data compiled using ETM. This paper explores the possibility of using the archived fare transaction data from tickets sold using ETM in Kerala State Road Transport Corporation (KSRTC) buses for passenger demand estimation and proposes a methodology for data preprocessing and mining.

2 REVIEW OF LITERATURE

A. Big Data Sources

The use of Automatic Data Collection (ADC) systems [4,17,19] for daily transport operations has generated a very large amount of data that can be used for various applications. Commonly, ADC data are used for performance evaluation [17], improving the operational profit, decision making, planning and operation [15], but the potential of these data sources for demand estimation [12] is least explored. Various researchers have summarized the available sources of big data for transportation operations [4,6,19]. Common ADC sources include Automatic Passenger Count (APC) systems [4], Automatic Fare Collection (AFC) systems [1] and Automatic Vehicle Location (AVL) systems [12]. Recently, various other sources like electronic ticket machine data [10], smart card data [9,14], mobile phone data [6,13], GPS locations of vehicles and people, information systems, Bluetooth data and social media check-ins are used as sources of travel data.

Reference [8] presented a decision support system framework for finding travel strategies by collecting, aggregating and analyzing the data from unstructured and structured sources like SIRI and GTFS. The proposed methodology can be applied to a spatial and temporal coverage of public transport services and can be used to develop policies to improve the modal split. Transport modeling using big data was proposed [6] to understand travel behaviour and to allow what-if analysis. Smart card data [14] was used by various researchers for analyzing travel behaviour [9], travel patterns [11,20], and demand estimation [2,5,7,12-13]. Open big data from ticketing websites proved to be useful for characterizing the travel patterns in the Chinese high-speed rail system [18].

B. Electronic Ticket Machine

An Electronic Ticket Machine is a handheld device which weighs 800 g. The parameters on the ticket are almost the same as those of a railway ticket. The ticket contains details like trip number, route code, ticket number, origin station, destination, number of full tickets, half tickets, luggage, physically challenged passengers, time of ticket issue, fare and ticket type. The advantages of ETM data over other surveys are listed below [10, 1]:

• Reduced cost and time requirements.

• A large sample size is obtained, as the fare transactions of all passengers are recorded. Therefore, the observations may be 100%.

• Bias in data collection is ruled out.

• More frequent estimation and data collection can be carried out on any day.

• Data sets can be interpreted for any time period based on the availability of data.

• The data can be used for day-to-day public transport operational planning as well as to complement transportation planning.

The ETM data falls short in explaining details regarding travel purpose, transfers, and user perception and attitude. Thus ETM data can be used to supplement survey data or can even replace the survey in appropriate cases.
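The ticket fields listed at the start of this subsection can be pictured as a simple record type. The following is a minimal sketch only: the field names, types and the sample values are illustrative assumptions, not the actual layout of the KSRTC ETM files. The `passengers` property follows the rule used later in the paper, where full, half and physically challenged tickets are summed to get the passengers travelling on one ticket.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TicketRecord:
    """One ETM ticket transaction (field names are hypothetical)."""
    trip_number: int
    route_code: str
    ticket_number: str
    origin: str
    destination: str
    full_tickets: int
    half_tickets: int
    luggage: int
    physically_challenged: int
    issue_time: str      # time of ticket issue, kept as plain text here
    fare: float
    ticket_type: str

    @property
    def passengers(self) -> int:
        # Passengers on one ticket: full + half + physically challenged
        return self.full_tickets + self.half_tickets + self.physically_challenged


# Made-up sample ticket, for illustration only
ticket = TicketRecord(1, "R1", "000123", "Stage A", "Stage C",
                      2, 1, 0, 0, "10:55", 30.0, "ordinary")
print(ticket.passengers)
```

Keeping the record immutable (`frozen=True`) is a design choice of this sketch, reflecting that archived transactions are read but not modified during analysis.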

3 DATA COLLECTION AND METHODOLOGY

A. Data Collection


This study investigates various aspects related to demand modeling based on data obtained from Electronic Ticket Machines (ETM). The ETM data has not been explored thoroughly for transportation planning, although it is nowadays collected and compiled by public transport undertakings on a regular basis. The data used in the study is part of the ticket sales transactions of the Kerala State Road Transport Corporation (KSRTC) maintained at 6 interzonal bus depots in Trivandrum city. The KSRTC operates its services for Trivandrum city mainly from 6 depots, namely Trivandrum City Depot, Vikas Bhavan, Peroorkada, Pappanamcode, Vizhinjam and Kaniyapuram. The ticketing details of all the buses and routes were collected from the respective depots for the period between 2011 and 2013.

B. Data Processing Methodology

The data used in the study is a set of transactions from the six city depots of the Kerala State Road Transport Corporation (KSRTC) for the years 2011-2013. The database contains trip tables along with the details of fare stages. The ETM data contains missing or illogical values, which have to be dealt with before proceeding with the analysis. Several factors are to be considered in the data preparation step. One such factor is the type of data available. Some operators may not have Automatic Vehicle Location (AVL) data. Even when AVL data is not available, the boarding and alighting of passengers can be determined at a route level or zonal level. However, exact bus stop level accuracy has to be compromised. In this study, AVL data is not available and therefore the study focuses on fare stage level details.
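The handling of missing or illogical values is not detailed in the paper. As one possible illustration, the sketch below drops records that fail a few basic sanity checks. The rules themselves (empty fare stage, identical origin and destination, missing or negative passenger counts) and the dictionary keys are assumptions made for this example, not the authors' actual cleaning criteria.

```python
PASSENGER_KEYS = ("full", "half", "phys_challenged")  # hypothetical column names


def is_valid(record: dict) -> bool:
    """Return True when a ticket record passes the assumed sanity checks."""
    origin = (record.get("origin") or "").strip()
    destination = (record.get("destination") or "").strip()
    if not origin or not destination:
        return False              # missing fare stage
    if origin == destination:
        return False              # illogical: no travel between stages
    try:
        counts = [int(record[key]) for key in PASSENGER_KEYS]
    except (KeyError, TypeError, ValueError):
        return False              # missing or non-numeric passenger count
    return all(count >= 0 for count in counts)


def clean(records):
    """Keep only the records that pass is_valid."""
    return [record for record in records if is_valid(record)]


# Made-up sample records, for illustration only
raw = [
    {"origin": "A", "destination": "B", "full": "2", "half": "0", "phys_challenged": "0"},
    {"origin": "A", "destination": "A", "full": "1", "half": "0", "phys_challenged": "0"},
    {"origin": "", "destination": "B", "full": "1", "half": "0", "phys_challenged": "0"},
    {"origin": "B", "destination": "C", "full": "-1", "half": "0", "phys_challenged": "0"},
    {"origin": "B", "destination": "C", "full": "x", "half": "0", "phys_challenged": "0"},
]
cleaned = clean(raw)
```

In practice, the rejected records would likely be logged and inspected rather than silently discarded, since a high rejection rate may point to a parsing problem and not to bad data.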

A single ETM data text file represents all the trips made by a single bus in a day. The raw ETM data is in text format, with the first two rows holding the waybill number and the date of the fare transactions, as in Figure 1. This text file is converted into Microsoft Excel format with the first two rows removed and the column names added, as depicted in Figure 2. The converted Excel files are then compiled based on the date, and the data for each route is further recompiled based on the route code. The route-wise data can be combined again for the level of aggregation required. The following steps are involved in the data processing.

1) Recompiling ETM Data into Excel Files:

a) Step 1: Read the text files from the directory. Figure 1 shows the format of raw ETM data.

Fig. 1. Format of Text File.

b) Step 2: The data values in each line of the file are recompiled and entered into the relevant columns. The data in each file is stored as a separate array. The file name for each array includes details on ticket collection, which are obtained from the first row of the text file. The array is stored as an Excel file as shown in Figure 2. This procedure is repeated until the last file in the directory has been processed.

Fig. 2. Recompiled Excel File.
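Steps 1 and 2 can be sketched in Python as follows. Since the raw file layout of Figure 1 is not reproduced here, the comma delimiter, the column names in `COLUMNS` and the sample text are assumptions of this example. CSV output from the standard library stands in for the Excel files used in the paper.

```python
import csv
import io

# Hypothetical column names; the real ETM layout may differ
COLUMNS = ["trip_no", "route_code", "ticket_no", "origin", "destination",
           "full", "half", "luggage", "phys_challenged", "time", "fare",
           "ticket_type"]


def parse_etm_text(text: str, delimiter: str = ","):
    """Split raw ETM text into a header (first two rows) and ticket rows."""
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("expected waybill number and date in the first two rows")
    header = {"waybill": lines[0].strip(), "date": lines[1].strip()}
    reader = csv.reader(lines[2:], delimiter=delimiter)
    rows = [dict(zip(COLUMNS, (cell.strip() for cell in row)))
            for row in reader if row]          # skip blank lines
    return header, rows


def write_csv(rows, stream):
    """Write the ticket rows to a text stream with the column names added."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample file content, for illustration only
sample = (
    "WB12345\n"
    "01-01-2012\n"
    "1,R1,0001,A,B,2,0,0,0,10:55,20,ORD\n"
    "1,R1,0002,A,C,1,1,0,0,10:57,30,ORD\n"
)
header, rows = parse_etm_text(sample)
buffer = io.StringIO()
write_csv(rows, buffer)
```

To process a whole directory, the same two functions would be applied to each text file in turn, with the waybill number and date from `header` used to name the output file.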

c) Step 3: The converted Excel files are again recompiled based on the trip date. The Excel files having the same date are merged to obtain the data for a single day.
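The merging in Step 3, together with the route-wise recompilation mentioned earlier, amounts to grouping records by a key. A minimal sketch is given below, assuming each converted file is available as a pair of a header dictionary (with a `date` entry) and a list of row dictionaries (with a `route_code` entry). These structures and key names are assumptions of the example.

```python
from collections import defaultdict


def merge_by_date(files):
    """Merge the rows of all files that share the same trip date.

    files: iterable of (header, rows) pairs, one per bus per day.
    """
    by_date = defaultdict(list)
    for header, rows in files:
        by_date[header["date"]].extend(rows)
    return dict(by_date)


def split_by_route(rows):
    """Group one day's rows by route code."""
    by_route = defaultdict(list)
    for row in rows:
        by_route[row["route_code"]].append(row)
    return dict(by_route)


# Made-up sample data: two buses on one day, one bus on the next day
files = [
    ({"date": "01-01-2012"}, [{"route_code": "R1", "ticket_no": "1"}]),
    ({"date": "01-01-2012"}, [{"route_code": "R2", "ticket_no": "2"},
                              {"route_code": "R1", "ticket_no": "3"}]),
    ({"date": "02-01-2012"}, [{"route_code": "R1", "ticket_no": "4"}]),
]
daily = merge_by_date(files)
routes = split_by_route(daily["01-01-2012"])
```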

2) Date-stamping the Excel Files: The recompiled Excel files for each date from the previous data processing step are date-stamped for further analysis. The procedure for date-stamping the Excel files is given in Figure 3. The date-stamped Excel files are then stamped with the depot ID as shown in Figure 4.

Fig. 3. Procedure of Date-stamping the Excel Files

Fig. 4. Depot ID and date-stamped recompiled excel file.
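The exact stamping procedure of Figure 3 is not reproduced here. One plausible reading is that every ticket row receives two extra columns, the trip date and the depot ID. The sketch below follows that reading; the date format `%d-%m-%Y`, the column names and the depot identifier are assumptions made for illustration.

```python
from datetime import datetime


def stamp(rows, date_text, depot_id, date_format="%d-%m-%Y"):
    """Return copies of the rows with date and depot ID columns added."""
    # Normalise the date to ISO format so stamped files sort chronologically
    iso_date = datetime.strptime(date_text, date_format).date().isoformat()
    return [{**row, "date": iso_date, "depot_id": depot_id} for row in rows]


# Made-up sample row and depot identifier, for illustration only
stamped = stamp([{"origin": "A", "destination": "B"}], "01-02-2012", "DEPOT1")
```

Stamping each row, instead of relying on the file name alone, keeps the date and depot available after files from several days or depots are combined.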


3) Determination of Origin-Destination matrix: The OD matrix describes the movement of people in an area. The various uses of OD matrices are given in Figure 5.

Fig. 5. Various Uses of OD Matrix in Public Transportation (Source: Adapted and modified from [3])

The Excel files obtained from the previous data processing step are further processed to create the origin-destination (OD) matrix. The data for each route is filtered, and then the columns full ticket, half ticket and physically challenged are summed to get the total number of passengers travelling on a single ticket. Then, for each trip number of a route, the unique values in the origin column are identified by removing all duplicate values and entered in the first column of the spreadsheet, and the unique values of the destination column are selected and entered in the first row, excluding the first cell. The source data is filtered, and the sum of the passengers from each origin to each destination is entered in the respective cell to obtain the OD matrix from the recompiled ETM data, as given in Table I. The route-wise OD matrix gives the number of passengers boarding and alighting the bus at various stops.

TABLE I. SAMPLE ORIGIN DESTINATION MATRIX OF TRIP 1 OF ROUTE 1
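The OD matrix construction described above can be sketched as follows for the rows of a single trip. The column names are hypothetical, and the origins and destinations are listed in order of first appearance. That ordering is an assumption of this example; in practice the fare stage sequence of the route would define the order.

```python
from collections import defaultdict


def od_matrix(rows):
    """Build an OD matrix from the ticket rows of one trip.

    Returns (origins, destinations, matrix), where matrix[i][j] is the
    number of passengers travelling from origins[i] to destinations[j].
    """
    totals = defaultdict(int)
    origins, destinations = [], []
    for row in rows:
        origin, destination = row["origin"], row["destination"]
        # Passengers on one ticket: full + half + physically challenged
        passengers = (int(row["full"]) + int(row["half"])
                      + int(row["phys_challenged"]))
        totals[(origin, destination)] += passengers
        if origin not in origins:
            origins.append(origin)            # unique origins, first column
        if destination not in destinations:
            destinations.append(destination)  # unique destinations, first row
    matrix = [[totals.get((o, d), 0) for d in destinations] for o in origins]
    return origins, destinations, matrix


# Made-up sample tickets, for illustration only
trip_rows = [
    {"origin": "A", "destination": "B", "full": "2", "half": "0", "phys_challenged": "0"},
    {"origin": "A", "destination": "C", "full": "1", "half": "1", "phys_challenged": "0"},
    {"origin": "B", "destination": "C", "full": "3", "half": "0", "phys_challenged": "0"},
    {"origin": "A", "destination": "B", "full": "1", "half": "0", "phys_challenged": "0"},
]
origins, destinations, matrix = od_matrix(trip_rows)
```

For these four sample tickets the origins are A and B, the destinations are B and C, and the matrix holds 3 passengers for A to B, 2 for A to C and 3 for B to C.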

Calculation of Link Load: A link is the road segment connecting two bus stops or destinations. It is the smallest element in a road network. Once all the link loads of a road network are obtained, any information regarding the volume of traffic can be derived. The link load pertaining to each route is also used to identify the rarely used as well as the most used links. The boarding and alighting data in the OD matrix is used to calculate the link load as given in equation 1 and equation 2.

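\[
\text{Load on the first link} = \sum_{\substack{j=1 \\ \forall\, i=1}}^{n} T_{ij} \tag{1}
\]

\[
\text{Load on the } m\text{th link} = L_{m-1} + \sum_{\substack{j=1 \\ \forall\, i=m}}^{n} T_{ij} - \sum_{\substack{i=1 \\ \forall\, j=m-1}}^{n} T_{ij} \tag{2}
\]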

where $T_{ij}$ is the number of passengers from origin $i$ to destination $j$ and $L_{m-1}$ is the load of the $(m-1)$th link. Moreover, the OD matrix for each trip is used to check the passenger load across all stops, as explained in Table II. The loads are plotted, as depicted in Figure 6, to establish a passenger load profile with respect to the distance travelled from the departure stop to the end of the route.
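Equations 1 and 2 can be sketched in Python as below. The sketch assumes that the rows of the OD matrix are ordered by boarding stage along the route and that the columns are ordered by alighting stage starting from the second stage, so that column $m-1$ holds the passengers alighting at the start of link $m$. This reading of the index $j = m-1$ is an interpretation, and the sample matrix is made up for illustration.

```python
def link_loads(T):
    """Compute the passenger load on each link from a square OD matrix T.

    T[i][j]: passengers boarding at stage i+1 and alighting at stage j+2
    (0-based Python indices; an assumed indexing convention).
    """
    n = len(T)
    loads = []
    for m in range(1, n + 1):               # 1-based link number, as in the equations
        boarding = sum(T[m - 1])             # sum over j of T_ij with i = m
        if m == 1:
            loads.append(boarding)           # equation 1
        else:
            alighting = sum(row[m - 2] for row in T)        # sum over i of T_ij with j = m-1
            loads.append(loads[-1] + boarding - alighting)  # equation 2
    return loads


# Made-up OD matrix for a route with four stages (three links)
T = [
    [4, 1, 2],
    [0, 3, 1],
    [0, 0, 5],
]
loads = link_loads(T)
```

For the sample matrix the loads are 7, 7 and 8. The last value can be checked by hand: the passengers still on board over the final link are those alighting at the last stage, 2 + 1 + 5 = 8.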

TABLE II. LOAD PROFILE WITH OCCUPIED & EMPTY SEAT KILOMETRE

Fig. 6. Load profile with occupied & empty seat km for 10:55 am
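The contents of Table II are not reproduced here, so the following sketch only illustrates one common way of deriving occupied and empty seat kilometres from link loads. It assumes that occupied seat km is the number of occupied seats on a link multiplied by the link length, and that empty seat km is the number of unoccupied seats multiplied by the link length. The seat capacity, the link lengths and the function name are hypothetical.

```python
def seat_km_profile(loads, link_lengths_km, seat_capacity):
    """Return the load, occupied seat km and empty seat km for each link."""
    profile = []
    for load, length in zip(loads, link_lengths_km):
        occupied_seats = min(load, seat_capacity)   # standees do not occupy seats
        empty_seats = max(seat_capacity - load, 0)
        profile.append({
            "load": load,
            "occupied_seat_km": occupied_seats * length,
            "empty_seat_km": empty_seats * length,
        })
    return profile


# Made-up loads, link lengths (km) and seat capacity, for illustration only
profile = seat_km_profile([5, 5], [2.0, 3.0], seat_capacity=48)
```

Plotting `load` against the cumulative link length would give a load profile of the kind described above.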


4 CONCLUSION

The explosion of big data has contributed enormously to transport planning and operations, especially for performance evaluation and determination of travel patterns. The data obtained from various sources has to be preprocessed in different ways and then analyzed to get the estimates pertaining to transportation planning and management. This study proposes a methodology for processing data from electronic ticket machines. Further, the methodology for deriving the OD matrix of the bus commuters, which is the key to travel demand modelling, is developed. The route OD matrices by stops and the passenger load profiles help in developing new local or express routes and short run or zonal services. The boarding and alighting counts and the passenger load by road segment (link load) can be used for determining the bus size and the frequency of bus service required for the route. This OD matrix can be further analyzed and processed for demand estimation of the bus, and for planning and decision making in the transit industry. The present study has demonstrated the effectiveness of using ETM data for OD matrix determination. Further work includes the estimation of public bus transport demand using the OD matrix.

References

[1] A.A. Nunes, T.G. Dias and J.F. Cunha, Passenger Journey Destination Estimation from Automated Fare Collection System Data Using Spatial Validation, IEEE Transactions on Intelligent Transportation Systems, Vol 17, No 1, 2016.

[2] A. Andreoni and M.N. Postorino, A Multivariate ARIMA Model to Forecast Air Transport Demand, Association for European Transport and contributors, 2006.

[3] A. Ceder, Public Transit Planning and Operation: Modeling, Practice and Behavior, CRC Press, 2016.

[4] A. Cui, Bus Passenger Origin-Destination Matrix Estimation Using Automated Data Collection Systems, M.S. Report, University of California at Berkeley, 1996.


[5] A. Wijeweera and M.B. Charles, Determinants of Passenger Rail Demand in Perth Australia: A Time Series Analysis, Applied Econometrics and International Development, Vol 13-2, pp. 221-234, 2013.

[6] C. Anda, A. Erath and P.J. Fourie, Transport modelling in the age of big data, International Journal of Urban Sciences, 21, 2017, pp. 19-42.

[7] C.H. Tsai, C. Mulley and G. Cliffton, Forecasting Public Transport Demand for the Sydney Greater Metropolitan Area: A Comparison of Univariate and Multivariate Methods, Australasian Transport Research Forum 2013 Proceedings, 2-4 October 2013, Brisbane, Australia.

[8] G. Guido, D. Rogano, A. Vitale, V. Astarita and D. Festa, "Big Data for public transportation: a DSS framework", in 5th IEEE International Conference on Models and Technologies for Intelligent Transportation Systems (MT-ITS), 2017, pp. 872-877.

[9] J. Maktoubian, M. Noori, M.G. Mouziraji and M. Amini, "Analyzing Large-Scale Smart Card Data to Investigate Public Transport Travel Behaviour Using Big Data Analytics", Journal of Information Technology & Software Engineering, vol. 07, no. 04, 2017.

[10] J. Meal and D. Carter, Use of Electronic Ticket Machine Data in Transport Planning Models, European Transport Conference, 1998.

[11] R.K.W. Lee and T.S. Kam, Time-Series Data Mining in Transportation: A Case Study on Singapore Public Train Commuter Travel Patterns, International Journal of Engineering and Technology, 6 (5), pp. 431-438, 2014. Research Collection School of Information Systems.

[12] L.M. Matias and O. Cats, Toward a Demand Estimation Model Based on Automated Vehicle Location, Transportation Research Record: Journal of the Transportation Research Board, No. 2544, 2016, pp. 141-149.


[13] M.G. Demissie, S. Phithakkimukoon, T. Sukhvibul, F. Antunes, R. Gomes and C. Bento, Inferring Passenger Travel Demand to Improve Urban Mobility in Developing Countries Using Cell Phone Data: A Case Study of Senegal, IEEE Transactions on Intelligent Transportation Systems, 2016.

[14] N.V. Oort, T. Brands and E. Romph, Short-Term Prediction of Ridership on Public Transport with Smart Card Data, Transportation Research Record: Journal of the Transportation Research Board, No. 2535, 2015, pp. 105-111.

[15] N.V. Oort and O. Cats, Improving public transport decision making, planning and operations by using big data: Cases from Sweden and the Netherlands, IEEE 18th International Conference on Intelligent Transportation Systems (ITSC), pp. 19-24.

[16] O. Cyprich, V. Konen and K. Kilianov, "Short-Term Passenger Demand Forecasting Using Univariate Time Series Theory", Promet Traffic & Transportation, 2013, 25 (6), pp. 533-541.

[17] P.G. Furth, B. Hemily, T.H.J. Muller and J.G. Strathman, Using Archived AVL-APC Data to Improve Transit Performance and Management, Transit Cooperative Research Program (TCRP) Report 113, published by Transportation Research Board, Washington, 2006.

[18] S. Wei, J. Yuan, Y. Qiu, X. Luan, S. Han, W. Zhou and C. Xu, "Exploring the potential of open big data from ticketing websites to characterize travel patterns within the Chinese high-speed rail system", PLOS ONE, vol. 12, no. 6, p. e0178023, 2017.

[19] W. Wang, J.P. Attanucci and N.H.M. Wilson, Bus Passenger Origin-Destination Estimation and Related Analyses Using Automated Data Collection Systems, Journal of Public Transportation, Vol 14, No 4, 2011.

[20] X. Ma, Y. Wu, Y. Wang, F. Chen and J. Liu, "Mining smart card data for transit riders travel patterns", Transportation Research Part C: Emerging Technologies, vol. 36, pp. 1-12, 2013.
