LICENSE PLATE SURVEY FOR TRAFFIC ANALYSIS:
IMPROVING ACCURACY WITH CORRECTION ALGORITHMS
A THESIS SUBMITTED TO THE GRADUATE DIVISION OF THE UNIVERSITY OF
HAWAI‘I AT MĀNOA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE
DEGREE OF
MASTER OF SCIENCE
IN
CIVIL ENGINEERING
MAY 2012
By
Alireza Abrishamkar
Thesis Committee:
Panos D. Prevedouros, Chairperson
Peter G. Flachsbart
Michelle H. Teng
I would love to dedicate this thesis to my lovely parents;
and to three spiritual treasures I was so lucky to get to know in my life:
Dr. Chamran, Edoardo, and Mr. Ahmadian
ACKNOWLEDGEMENTS
First off, I would like to express my sincere gratitude and appreciation to my advisor Dr.
Panos Prevedouros for his support, encouragement and advice during the course of this thesis
and my entire Master’s program. I believe the lessons I have learnt from his rectitude and
demeanor during these three years are no less important than what I have learnt from him in
Traffic Engineering.
I would like to sincerely thank Dr. Flachsbart and Dr. Teng for serving on my thesis
committee, providing guidance and support for finalizing this thesis, and for all of their valuable
help during my Master’s studies.
I would also like to thank my dear colleagues and friends in the Traffic and Transportation Lab
(TTL): Alyx (Xin) Yu, Lambros Mitropoulos, Laxman KC, Kevin Jenkins, Maja Caroee, James
Tokishi, Natasha Soriano, and Myles Gota, for all their help, and all the nice times we had
together.
I am indeed completely indebted to my parents for everything in my life including this thesis,
and I would like to express my sincerest appreciation to them. Finally, I thank my brothers, Afshin and
Amin, for all their help and support.
ABSTRACT
Vehicle tracking methods are widely used for a variety of purposes including collection of
travel time and duration of stay data. The collected data are used for planning and management
purposes. The type of data depends on the method of data collection. Tracking methods are
usually classified into active and passive. In this research they are classified into two categories,
discrete and continuous. Among all methods, the discrete method of license plate matching is the
most prevalent for data collection.
The purpose of this research is to discuss the accuracy of the manual license plate matching
method for vehicle tracking and travel time data collection, and to provide correction algorithms to
improve the results. The impacts of recordation style and visual similarities between characters
(letters and numbers) on the matching errors are investigated. The correction algorithms are
compared and evaluated.
The application of correction algorithms – specifically those that are more constrained to
filter out false matches – can considerably increase the percentage of matched license plates. To
a lesser degree, this processing can improve the statistical values of the license plate datasets
such as average, standard deviation and median of travel time and duration of stay in a location.
This study also found evidence that a significant portion of the letters recorded incorrectly are
mistaken for visually similar letters, which by itself underlines the human factor
in the accuracy of the method. Digits are not significantly likely to be mistaken for one another,
because they are visually dissimilar.
The workload of recordation also proved to be significant: the more letters that must be
recorded, the more errors result.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
ABSTRACT
List of Tables
List of Figures
List of Equations
CHAPTER 1 INTRODUCTION
1.1 Background
1.2 Purpose and Objectives
1.3 Definitions
1.4 Thesis Description
CHAPTER 2 VEHICLE TRACKING AND TRAVEL TIME DATA COLLECTION
2.1 Active and Passive Vehicle Tracking Methods
2.2 Static and Mobile Vehicle Tracking Methods
2.3 Discrete and Continuous Vehicle Tracking Methods
2.3.1 GPS
2.3.2 Bluetooth and Radio Frequency Identification (RFID)
2.3.3 License Plate
CHAPTER 3 LICENSE PLATE MATCHING ERRORS
3.1 License Plate Correction ‐ Edit Distance
3.2 Human Memory Factor
CHAPTER 4 METHODOLOGY
4.1 License Plate Format
4.2 License Plate Matching
4.3 Processing of Unmatched Data
4.3.1 Full Correction Algorithm
4.3.2 Algorithm for Correction of Similar Characters (“Similar Algorithm”)
4.4 Discussion on Processing Results
CHAPTER 5 DATA COLLECTION AND ANALYSIS
5.1 Data Collection
5.1.1 Dataset 1 (ABC1): ITE
5.1.2 Dataset 2 (C123): HAVO 2009
5.1.3 Dataset 3 (ABC123): HAVO 2007 – 1
5.1.4 Dataset 4 (ABC123): HAVO 2007 – 2
5.2 Individual Analyses
5.2.1 Algorithm Comparison
5.2.2 Evaluation of Impact of Similarity on Errors
5.2.3 Analyses of each Dataset
5.2.3.1 Dataset 1 (ABC1): ITE
5.2.3.2 Dataset 2 (C123): HAVO 2009
5.2.3.3 Dataset 3 (ABC123): HAVO 2007 ‐ 1
5.2.3.4 Dataset 4 (ABC123): HAVO 2007 ‐ 2
5.3 Aggregate Analyses
5.3.1 Processing Time of Algorithms
5.3.2 Influence on Percentage of Matched Vehicles
5.3.3 Influence on Statistical Indices
5.4 Evaluation of Impact of Similarity after One Iteration and Redefinition of Similar Characters
CHAPTER 6 CONCLUSION
REFERENCES
Appendix A Algorithms
Appendix B Mistakes Matrices by Algorithms A to D
List of Tables
Table 1. Qualitative Comparison of Travel Time Data Collection of Different Techniques for License Plate Method
Table 2. Travel Time Data Collection of Different Techniques for License Plate Method
Table 3. Similar Letters.
Table 4. Similar Digits.
Table 5. Sample of Substituted Letters Using the Full Correction Algorithm C.
Table 6. Data Collection Specifications
Table 7. Uniform Mistakes Matrix for 900 Mis‐recorded Numbers
Table 8. Hypothesized Mistakes Matrix for Letters. Yellow Cells are the Intersection of Similar Letters.
Table 9. Hypothesized Mistakes Matrix for Digits. Yellow Cells are the Intersection of Similar Digits.
Table 10. Letters Mistakes Matrix for Dataset 1, by Algorithm E
Table 11. Ratio of Similar Character Misreadings to Total Count of Misreadings
Table 12. Deviation of the Mistake Matrices by Algorithms A to D based on E
Table 13. Numbers Mistakes Matrix for Dataset 1, by Algorithm E
Table 14. Ratio of Similar Character Misreadings to Total Count of Misreadings
Table 15. Deviation of the Mistake Matrices by Algorithms A to D based on E
Table 16. Letters Mistakes Matrix for Dataset 2, by Algorithm E
Table 17. Ratio of Similar Character Misreadings to Total Count of Misreadings
Table 18. Deviation of the Mistake Matrices by Algorithms A to D based on E
Table 19. Numbers Mistakes Matrix for Dataset 2, by Algorithm E
Table 20. Ratio of Similar Character Misreadings to Total Count of Misreadings
Table 21. Deviation of the Mistake Matrices by Algorithms A to D based on E
Table 22. Letters Mistakes Matrix for Dataset 3, by Algorithm E
Table 23. Ratio of Similar Character Misreadings to Total Count of Misreadings
Table 24. Deviation of the Mistake Matrices by Algorithms A to D based on E
Table 25. Numbers Mistakes Matrix for Dataset 3, by Algorithm E
Table 26. Ratio of Similar Character Misreadings to Total Count of Misreadings
Table 27. Deviation of the Mistake Matrices by Algorithms A to D based on E
Table 28. Letters Mistakes Matrix for Dataset 4, by Algorithm E
Table 29. Ratio of Similar Character Misreadings to Total Count of Misreadings
Table 30. Deviation of the Mistake Matrices by Algorithms A to D based on E
Table 31. Numbers Mistakes Matrix for Dataset 4, by Algorithm E
Table 32. Ratio of Similar Character Misreadings to Total Count of Misreadings
Table 33. Deviation of the Mistake Matrices by Algorithms A to D based on E
Table 34. Processing Time of Different Correction Algorithms
Table 35. Ratio of Processing Time for ‘Similar Algorithm’ to other Full Algorithms
Table 36. Contribution of each Algorithm to Percentage of Matched Vehicles – Letters and Digits Separately
Table 37. Ratio of Contribution to Number of Initial Unmatched License Plates
Table 38. Average for the Duration of Stay
Table 39. Standard Deviation for the Duration of Stay
Table 40. Median for the Duration of Stay
Table 41. Updated Blank Mistakes Matrices for Letters
Table 42. Updated Blank Mistakes Matrices for Digits
Table 43. Updated Letters Mistakes Matrix for Dataset 1, by Algorithm E (Second Iteration)
Table 44. Updated Numbers Mistakes Matrix for Dataset 1, by Algorithm E (Second Iteration)
Table 45. Updated Letters Mistakes Matrix for Dataset 2, by Algorithm E (Second Iteration)
Table 46. Updated Numbers Mistakes Matrix for Dataset 2, by Algorithm E (Second Iteration)
Table 47. Updated Letters Mistakes Matrix for Dataset 3, by Algorithm E (Second Iteration)
Table 48. Updated Numbers Mistakes Matrix for Dataset 3, by Algorithm E (Second Iteration)
Table 49. Updated Letters Mistakes Matrix for Dataset 4, by Algorithm E (Second Iteration)
Table 50. Updated Numbers Mistakes Matrix for Dataset 4, by Algorithm E (Second Iteration)
List of Figures
Figure 1: A Depiction of GPS‐based Active Tracking System.
Figure 2: Static and Mobile Vehicle Tracking Technologies.
Figure 3: Discrete‐Continuous Spectrum for Classification of Vehicle Tracking Technologies.
Figure 4: Correction Algorithms that are used after Initial Matching. Gray Boxes Show the Main Algorithms.
Figure 5: Flowchart of Initial License Plate Matching Procedure.
Figure 6: Unconstrained Full Correction Algorithms.
Figure 7: Constrained Full Correction Algorithms.
Figure 8: Correction Algorithm for Similar Letters and Digits.
Figure 9: Data Collection at Waipio Peninsula Soccer Complex
Figure 10: Data Collection at the Entrance of Hawaii Volcanoes National Park (2009)
List of Equations
Equation 1
Equation 2
Equation 3
Equation 4
Equation 5
CHAPTER 1
INTRODUCTION
1.1 Background
Recording the license plate of a vehicle together with the time of each observation is a simple
technique to track the vehicle and obtain travel time data. Currently the majority of vehicle
tracking systems are GPS‐based. A GPS system can provide data about vehicle whereabouts
instantly and with high accuracy; however, there are certain limitations. One limitation is that
such a system cannot work underground or in tunnels. Another is that it can be expensive if
the number of tracked vehicles is high and there is no need to know their whereabouts in real
time. In such cases, besides the expensive data collection, data reduction and processing are also
heavier and more costly, since usually a huge volume of data is collected while only a small
portion of it is really needed; therefore, tracking technologies and methods that do not collect
data continuously are preferred for local applications.
One of the most widely used methods for travel time data collection is license plate
matching. License plate matching is also used for origin‐destination studies and transportation
planning. By collecting the location and consequently the path of a sufficiently large number of
vehicles, travel patterns can be recognized; the average speeds on different segments
of the path can also be obtained and used for traffic management and planning purposes.
License plate recordation at the entrance and exit of parking lots is a common way to do
parking studies. In addition, license plate recognition is widely used in toll collection
stations and to some extent for law enforcement.
License plate collection and matching can be performed using several methods. The choice
of method depends on the amount of data to be collected, weather conditions, available budget,
required accuracy, vehicle speed, type of the route, etc. The methods range from completely
manual to completely automatic. More details about these methods, together with the advantages,
disadvantages and related issues for each of them, are given in Chapter 3.
One of the major issues with license plate matching techniques is their accuracy. The way
accuracy is evaluated depends on the method of data collection and matching. When the
license plate is captured by cameras to be input to image‐processing algorithms and software,
the whole license plate number is recorded. But when it is collected manually, usually a subset
of characters ‐ typically the last four digits [1] ‐ is recorded, in order to expedite the recordation
procedure and allay concerns about monitoring private property. The latter method increases the
chance of non‐identical license plates being matched. Previous studies of the
accuracy of license plate matching normally focus on the probability of spurious matches
caused by the few characters that are not recorded, and they assume that the recorded
characters are correct [2] (more in Chapter 3). In automatic recordation, the accuracy of the
recorded characters is the subject of study [3].
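To give a feel for why recording fewer characters raises the chance of spurious matches, the following Python sketch estimates the expected number of pairs of different vehicles that end up with the same recorded code. It is only an illustration under a strong assumption that is not taken from the thesis: recorded codes are treated as independent and uniformly distributed over all possible codes, which real plate numbering does not guarantee. The function names and the sample size of 500 vehicles are hypothetical.

```python
from math import comb


def code_space(n_letters: int, n_digits: int) -> int:
    """Number of distinct codes made of the given count of letters and digits."""
    return (26 ** n_letters) * (10 ** n_digits)


def expected_spurious_pairs(n_vehicles: int, n_codes: int) -> float:
    """Expected number of pairs of distinct vehicles sharing one recorded code.

    Assumes codes are independent and uniform over n_codes (an idealization).
    """
    return comb(n_vehicles, 2) / n_codes


if __name__ == "__main__":
    n = 500  # hypothetical number of vehicles observed in a survey
    for label, letters, digits in [("last four digits", 0, 4),
                                   ("three letters + three digits", 3, 3)]:
        k = code_space(letters, digits)
        print(label, k, expected_spurious_pairs(n, k))
```

Under these assumptions, 500 vehicles recorded by four digits give about 12.5 expected coinciding pairs, while full six-character codes give about 0.007, which is why studies of partial-plate surveys concentrate on spurious matches.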
1.2 Purpose and Objectives
The purpose of this research is to investigate the accuracy of license plate matching
methods for vehicle tracking and travel time data collection, and to provide correction algorithms
to improve the results. The main focus is on the manual license plate method, where data are
collected by human observers with pen, paper and watches, and are matched and processed
by computer applications. The impacts of recordation style and visual similarities between
characters (letters and numbers) on the matching errors are investigated. After the unmatched
data are processed, the influence of this enhancement on the main statistics of the data is
measured. Finally, the correction algorithms are compared and evaluated.
1.3 Definitions
Vehicle tracking is closely intertwined with electronic devices and computer software.
A vehicle tracking system combines the installation of an electronic device in a vehicle, or fleet
of vehicles, with purpose‐designed computer software at one or more operational bases to
enable the owner or a third party to track vehicle location and other operational, passenger or
freight data.
Two parameters are essential: the “location” of the tracked vehicle at certain points in
“time”.
Travel time is broadly defined as “the time necessary to traverse a route between any two
points of interest” [4]. Travel time is one of the most common types of data collected by
tracking vehicles; in fact, a simple method of vehicle tracking is sometimes no more than
travel time data collection. By observing a vehicle at various locations, the travel time
between different points (segments of the whole route) is determined; combined, these data
amount to tracking the vehicle along the route.
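As a minimal illustration of how travel time follows from matched observations, the Python sketch below pairs plates recorded at an upstream and a downstream station and subtracts the observation times. The function name, the use of seconds since the start of the survey as timestamps, and the rule of keeping only the first sighting are assumptions made for this example, not the procedure of this thesis; only exact matches are considered.

```python
def match_travel_times(upstream, downstream):
    """Return {plate: travel time} for plates seen at both stations.

    Each argument is a list of (plate, time_in_seconds) tuples. The first
    upstream sighting of a plate is paired with the first downstream sighting
    that occurs at or after it; plates must match exactly.
    """
    first_up = {}
    for plate, t in upstream:
        first_up.setdefault(plate, t)  # keep only the first upstream sighting

    travel_times = {}
    for plate, t in downstream:
        if plate in first_up and plate not in travel_times and t >= first_up[plate]:
            travel_times[plate] = t - first_up[plate]
    return travel_times


if __name__ == "__main__":
    up = [("ABC123", 0), ("XYZ789", 90)]
    down = [("XYZ789", 570), ("ABC123", 435), ("JKL456", 600)]
    print(match_travel_times(up, down))
```

A plate seen at only one station, such as JKL456 here, simply stays unmatched; the same subtraction applied to the entrance and exit of a single site gives a duration of stay.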
In this research vehicle tracking is the process of acquiring the duration of stay of vehicles at
specific locations, usually parking lots.
1.4 Thesis Description
This thesis addresses the following topics:
• Overview of the methods of vehicle tracking, and categorizations of relevant
technologies and methods.
• Applications and accuracy of license plate matching as a common discrete vehicle
tracking method.
• Definition of possible errors (human vs. random) involved in manual license plate
recordation.
• Discussion on the similarities and differences between errors in manual license plate
recordation and automatic recordation/processing.
• Development of algorithms to improve license plate matching for both types of possible
errors in manual license plate recordation.
• Evaluation of the algorithms for different license plate recordation styles (e.g. the whole
license plate vs. four characters of it).
• Evaluation of the algorithms for letters versus digits.
• Evaluation of the algorithms in terms of performance, accuracy and processing speed.
Following this introductory chapter, the methods of vehicle tracking are presented and
classified in Chapter 2. Common technologies for vehicle tracking and their applications in
travel time data collection are discussed.
In Chapter 3 the accuracy, errors and improvements regarding license plate matching are
discussed and previous works are reviewed.
Chapter 4 describes the methodology used to develop correction algorithms that reduce the
errors and improve license plate matching. It also explains the structure of each algorithm and
how it operates. These algorithms increase matching percentage by processing those license
plates that have one mis‐recorded character. This chapter also describes how the results of
different algorithms can be interpreted.
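The idea of recovering matches among plates that have one mis‐recorded character can be sketched in Python as follows. This is not a reproduction of the thesis algorithms, which are described in Chapter 4: the sketch considers only single‐character substitutions between codes of equal length, ignores observation times, and accepts a correction only when exactly one candidate exists. That uniqueness rule and all names are assumptions made for illustration.

```python
def differs_by_one(a: str, b: str) -> bool:
    """True if a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def correct_unmatched(unmatched_in, unmatched_out):
    """Pair unmatched entry plates with unmatched exit plates one character apart.

    A pair is accepted only if the entry plate has exactly one such candidate
    among the exit plates not already used; ambiguous cases are left unmatched.
    """
    pairs = []
    used = set()
    for plate in unmatched_in:
        candidates = [p for p in unmatched_out
                      if p not in used and differs_by_one(plate, p)]
        if len(candidates) == 1:
            pairs.append((plate, candidates[0]))
            used.add(candidates[0])
    return pairs


if __name__ == "__main__":
    entries = ["ABC123", "DEF456", "GHJ789"]
    exits = ["A8C123", "DEF466", "DEF956", "XYZ000"]
    print(correct_unmatched(entries, exits))
```

In this example ABC123 is paired with A8C123, DEF456 is left alone because two exit records are one character away from it, and GHJ789 has no candidate at all. Rejecting ambiguous cases is one simple way to limit false matches, at the cost of recovering fewer plates.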
Chapter 5 describes the four datasets that were used, and presents the results of the data
analyses performed on them. The performance of proposed correction algorithms is evaluated
based on these results.
Chapter 6 presents the general conclusions made based on the data analyses.
CHAPTER 2
VEHICLE TRACKING AND TRAVEL TIME DATA COLLECTION
In this chapter the methods of vehicle tracking are presented. The categorization that
clarifies the connection between vehicle tracking methods and travel time data collection,
namely the discrete‐continuous spectrum, is described, and the position and role of the license
plate survey in these categories are discussed.
There are several categorizations of vehicle tracking methods and technologies. The most
common one defines two categories with overlapping applications, namely, Active Tracking and
Passive Tracking.
2.1 Active and Passive Vehicle Tracking Methods
Active Tracking – also known as online tracking or real‐time tracking – comprises a
system that locates the vehicle by means of electronic location sensors, generally at
predefined regular points in time, and another system that transmits the data directly to a Data
Management Center (DMC). Depending on the scale of the tracking system, the DMC can vary
from a PC or a smart phone to a large data center. The received data can either be recorded in
long‐term memory for future use or be merely monitored. Although the tracker unit installed
in the vehicle may also store the collected data, the main point of active tracking is immediate
access to the data by the DMC. An active tracking system is used when a vehicle must be
monitored in transit.
Commercial vehicle tracking systems usually use a cellular data service (e.g. GPRS or SMS)
or satellite communication to send the collected data to the computers at the DMC (Figure 1).
Some active tracking systems allow for two‐way communications.
Figure 1: A Depiction of GPS‐based Active Tracking System.
A passive tracking system does not transmit the location data to a DMC immediately after they
are collected; instead, it records the data for future reference. If the data are not needed right
away, a passive system is usually adopted since it reduces costs.
Passive tracking systems provide a more cost‐effective approach to vehicle tracking. In this
approach, when users want to review the recorded data, they need to access the GPS or other
tracking system installed in the vehicle and manually download the data via an appropriate
interface. Such systems are more common in transportation planning studies.
2.2 Static and Mobile Vehicle Tracking Methods
Another categorization classifies vehicle tracking technologies into two classes, static and
mobile, as shown in Figure 2 [5]: “The static includes technologies like camera systems,
transponders and dual‐loop detectors. The mobile includes technologies like GPS and cell
phones. Transponders may figure in both classifications because it has characteristics of both,
yet the need for readers on the road makes it static. All the static technologies are tied to the
road … but they cannot be on any road. Budget limitation would not allow static technologies
on all roads.”
Figure 2: Static and Mobile Vehicle Tracking Technologies.
2.3 Discrete and Continuous Vehicle Tracking Methods
Tracking methods are also categorized into discrete and continuous.
In the continuous method the location of the tracked vehicle is either stored or reported
(as in passive and active methods, respectively) at predefined points in time, usually at
regular intervals; the tracking procedure is thus time‐based. For example, in common
GPS‐based tracking systems, the location coordinates are collected at predefined
intervals (e.g., one second), and the vehicle can be anywhere on‐route or off‐route.
In the discrete method, tracking is performed based on the location of the vehicle, meaning
that the time is recorded at various preselected locations if the vehicle appears there; thus
the procedure is location‐based. For instance, license plate tracking is a discrete method since
there are no predefined regular points in time at which data are recorded for any specific
vehicle; instead, if the vehicle is observed at any preselected location, the time of
observation is recorded.
Typically the discrete method produces a smaller volume of data because the time
interval between two consecutive data records is longer.
Both continuous and discrete methods can be either active or passive. However, in the
discrete method, the manner of data collection is more diverse: data can be collected via
GPS‐based systems, Bluetooth technology, license plate surveys, Radio Frequency Identification
(RFID) technology, etc. Data collected from a license plate survey for a group of vehicles are not
usually raw and online; instead, they are recorded for further processing and reduction; thus the
license plate survey is a passive method.
The categorization into discrete and continuous is not completely binary. The data collected
by continuous methods and technologies can be filtered so that only coordinates and time
values for specific predefined locations are kept for discrete applications. Moreover, some
continuous tracking‐capable technologies such as Bluetooth and RFID can be set to record data
only at specific locations, i.e., when the Bluetooth transmitter device inside a vehicle passes
through a gate where the reader is mounted. In these cases there is no need for subsequent
data reduction from continuous to discrete.
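The reduction of continuous data to discrete records described above can be sketched in Python. The example keeps, for each preselected location, only the first time a tracked vehicle comes within a given radius of it. Planar coordinates in meters, the radius, the checkpoint names, and the function name are all hypothetical choices for this illustration; a real GPS trace would need geographic coordinates and a suitable distance formula.

```python
from math import hypot


def to_discrete(track, checkpoints, radius):
    """Reduce a continuous track to one timestamp per checkpoint.

    track       -- time-ordered list of (t, x, y) samples
    checkpoints -- dict mapping a checkpoint name to its (x, y) position
    radius      -- distance within which the vehicle counts as "at" a checkpoint
    Returns {name: first time the vehicle was within radius of that checkpoint}.
    """
    first_seen = {}
    for t, x, y in track:
        for name, (cx, cy) in checkpoints.items():
            if name not in first_seen and hypot(x - cx, y - cy) <= radius:
                first_seen[name] = t
    return first_seen


if __name__ == "__main__":
    # One sample per second along a straight 200 m segment (hypothetical data).
    track = [(0, 0.0, 0.0), (1, 50.0, 0.0), (2, 100.0, 0.0),
             (3, 150.0, 0.0), (4, 200.0, 0.0)]
    checkpoints = {"A": (0.0, 0.0), "B": (200.0, 0.0)}
    print(to_discrete(track, checkpoints, radius=10.0))
```

Five continuous samples collapse to two records, one per checkpoint, which is the same kind of location‐based data a license plate survey produces directly.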
Figure 3 depicts this classification as a spectrum. On one end of the spectrum, license plate
matching is observed as a completely discrete method and on the other end of it are GPS
technologies.
[Figure 3 content. Technologies, listed from the more discrete end to the more continuous end:
LP matching, voice dispatch, SMS, passive RFID, active RFID, mobile phone, Bluetooth, GPS.
Sample discrete applications: travel time data collection, toll collection, taxi dispatch
management, police, fleet management, bus schedule control.
Sample continuous applications: logistics, urban logistics (e.g. construction machines
management), fleet management, bus schedule control, asset and cargo tracking.]
Figure 3: Discrete‐Continuous Spectrum for Classification of Vehicle Tracking Technologies.
12
2.3.1 GPS
GPS‐based systems are the most widely used systems for vehicle tracking. In this system, the
location of a tracked vehicle is calculated at any given time by trilateration from the signals
received from several satellites. Contact with at least four satellites is required for normal
operation. These devices can locate vehicles anywhere that GPS signal coverage exists. Since the
signals are received from satellites, the in‐vehicle GPS device requires a clear line‐of‐sight path
to the satellites; the wider the sky view, and therefore the higher the number of satellites in
contact, the better the system performs. The signal coverage may be too low in some spots to
allow proper functionality. For example, some basic GPS receivers cannot operate properly in
deep valleys or near tall buildings, where the sky view angle is limited, not to mention inside
tunnels and underground.
2.3.2 Bluetooth and Radio Frequency Identification (RFID)
RFID is a technology that uses radio waves to transfer data between a reader and an
electronic tag attached to an object for the purpose of identification and tracking. There are
two types of tags, namely, passive and active. The former does not broadcast a signal by itself
but the latter does. This results in different read ranges for the two types of tags. The read
range of an active tag is 300 ft. or more. These tags can be used for continuous tracking if the
tracking field is limited. The read range of passive tags is from three to over 20 ft., depending
on the wave frequency used, which is practically inadequate for continuous vehicle tracking.
RFID tags have been used on tolled highways worldwide. Toll operators use RFID tags to derive
volume, speed, travel time and origin‐destination data. [17, 18]
Bluetooth technology was originally designed as a short‐range wireless connectivity
solution for personal, portable, and hand‐held electronic devices. The Bluetooth radio operates
on a license‐free, globally available Industrial, Scientific and Medical (ISM) band [13]. The
typical working distance of Bluetooth ranges from 10 m to 100 m [14]. The Bluetooth tracking
method is relatively new and, judging by the limited references to it in the literature, not
widely used. It has been used for tracking construction vehicles in dense urban areas where
GPS is limited. [15]
For vehicle tracking and travel time data collection purposes, attaching Bluetooth
transmitters or RFID tags to targeted vehicles and installing RFID/Bluetooth readers at the
required locations provides the same kind of data as the license plate method.
However, since the signature is digital, the matching process is much easier, faster, cheaper and
less labor‐ or processing‐intensive. If the data collection phase is done well, then the matching
process is of nearly perfect accuracy. On the other hand, this method also requires the
cooperation of the tracked vehicles’ owners, at least to receive the transmitters or tags and keep
them in their vehicles. Bluetooth and Active RFID systems have more capabilities in vehicle
tracking, mostly due to their longer range. Passive RFID has more limited applications and is
good for providing entry and exit information. [18]
2.3.3 License Plate
This method is widely used for collection of travel time and duration of stay data for local
applications such as parking studies, corridor studies, etc. Its first phase is the observation of
vehicles at specific locations and the collection of their license plates. Then the pool of license
plates needs to be matched to identify each vehicle’s travel pattern. The difference between the
observation times at two locations gives the travel time and, knowing the distance between
the locations, the average speed between the two points can be calculated. Combining the
travel time data for all segments for each vehicle results in vehicle
tracking over the monitored network.
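As a minimal sketch of this last step, the following Python fragment derives per‐segment travel times for one vehicle from its matched sightings. The function name `segment_travel_times`, the station labels, and the timestamps are hypothetical and serve only as an illustration; they are not taken from the survey data.

```python
from datetime import datetime


def segment_travel_times(sightings):
    """Return (from_station, to_station, seconds) for consecutive sightings.

    `sightings` is a list of (station, datetime) pairs for ONE matched plate.
    """
    ordered = sorted(sightings, key=lambda s: s[1])  # order by observation time
    return [
        (a[0], b[0], (b[1] - a[1]).total_seconds())
        for a, b in zip(ordered, ordered[1:])
    ]


# Hypothetical sightings of a single plate at three stations.
sightings = [
    ("A", datetime(2011, 1, 29, 10, 0, 0)),
    ("B", datetime(2011, 1, 29, 10, 5, 30)),
    ("C", datetime(2011, 1, 29, 10, 12, 0)),
]
for origin, destination, seconds in segment_travel_times(sightings):
    print(origin, destination, seconds)
```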
There are four basic techniques for collecting and processing license plates:
1) Manual: collecting license plates via pen and paper or audio tape recorders and manually
entering license plates and recorded times into a computer.
2) Portable Computer: collecting license plates in the field using portable computers that
automatically provide an arrival time stamp. There is software that facilitates this process.
3) Video with Manual Transcription: collecting license plates in the field using video
cameras or camcorders and manually transcribing license plates using human observers. This
minimizes field crew size and is required in harsh climate locations.
4) Video with Character Recognition: collecting license plates in the field using video, and
then automatically transcribing license plates and arrival times into a computer using
computerized license plate character recognition. This is the typical type of processing used by
tolling authorities for collecting the toll charge or for recording toll violators.
The license plate matching method in general, regardless of the applied technique, has the
following advantages:
• Ability to obtain travel times from a large sample of vehicles, which is useful in
understanding variability of travel times and destinations among vehicles within the traffic
stream.
• Data collection equipment is relatively portable.
The license plate matching method, regardless of the applied technique, has the following
disadvantages:
• Travel time data limited to locations where observation occurs.
• Sampling surveys can achieve only limited geographic coverage on a single day.
• Manual and portable computer‐based methods are less practical for high‐speed
freeways or long sections of roadway with a low percentage of through‐traffic.
• Accuracy of license plate reading is an issue for manual and portable computer‐based
methods.
• Skilled data collection personnel required for collecting license plates and/or operating
electronic equipment. [4]
Table 1 and Table 2 reproduce the FHWA comparison of the different license plate
matching techniques. [4]
Table 1. Qualitative Comparison of Travel Time Data Collection of Different Techniques for License Plate Method
Table 2. Travel Time Data Collection of Different Techniques for License Plate Method
CHAPTER 3
LICENSE PLATE MATCHING ERRORS
One of the major issues with license plate matching is its accuracy. The way accuracy is
evaluated depends on the method of data collection and license plate matching.
Automatic License Plate Recognition (ALPR) systems take a snapshot of the whole license
plate and extract the license plate character and number set by using Optical Character
Recognition (OCR) algorithms. The performance of the OCR algorithms is critical and is usually
the major cause of errors. The accuracy of the extracted characters is therefore usually the focus
when this method is evaluated. Another possible cause of error in ALPR systems is the detection
of the license plates before recognition of their characters. ALPR systems need to detect a vehicle
first and then take a picture of it together with its license plate; then the image‐processing
software needs to locate the license plate in the image, and only after that can the OCR software
extract the number.
This process is more challenging for trucks. A survey on I‐40 indicated that only 82% of
trucks on that route had their license plates installed in the normal place in the middle of the
bumper. Most LPR cameras are aimed at the bumper area [6]. This is a disadvantage of the
automatic method because some license plates are missed.
The advantage of license plate recording by a crew is that even if a plate is placed
behind the windshield, it can be detected and recorded. Vehicles may be missed by ALPR
systems, but this does not influence the ratio of captured license plates to detected vehicles.
Therefore, high values for this ratio do not necessarily indicate a good ALPR system. If the
recorded license plates are retained in a long‐term memory, manual verification can be done
later with almost 100% accuracy. However, this is normally possible only for fractions of the
data, and excessive manual verification is not in line with the purpose of these systems.
Manual license plate recordation is used for different purposes than ALPR. Its major
application is in surveys to collect origin‐destination and travel time data. When license plates
are collected manually, usually only a subset of the characters is recorded to expedite the
recordation procedure. This increases the chance that non‐identical license plates are matched,
because two different plates can share the same recorded subset of characters.
The major focus of previous studies conducted on the accuracy of manual license plate
matching is normally on the probability of false matches because of the few characters that are
not recorded. These studies show that statistically reliable estimates of travel parameters can
be obtained without the recording of entire license plate numbers. Makowski found that
although only the last three digits of the license plate numbers were recorded, statistically
reliable values were obtained [7]. The characters that are recorded or in some cases verified
manually are usually considered to be completely correct.
3.1 License Plate Correction ‐ Edit Distance
In both automatic and manual methods of license plate matching, there are typically some
erroneous and unmatched license plates together with some mistakenly matched ones. The
techniques used to identify possible wrong matches include adding constraints
and using other available information such as the calculated speed between the two points of
observation. For example, if a license plate was observed at two points ten miles apart within
two minutes, this indicates a wrong match.
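The speed constraint in this example can be sketched as a simple plausibility check. This is only an illustration: the function names and the 100 mph ceiling (`max_speed_mph`) are assumptions made here, not values from the study.

```python
def implied_speed_mph(distance_miles, t_first_s, t_second_s):
    """Average speed (mph) implied by two sightings of the same plate."""
    elapsed_s = t_second_s - t_first_s
    if elapsed_s <= 0:
        raise ValueError("the second observation must be later than the first")
    return distance_miles / (elapsed_s / 3600.0)


def is_plausible_match(distance_miles, t_first_s, t_second_s, max_speed_mph=100.0):
    """Reject a match whose implied speed exceeds the assumed ceiling."""
    return implied_speed_mph(distance_miles, t_first_s, t_second_s) <= max_speed_mph


# The example from the text: ten miles apart, two minutes between sightings,
# which implies a speed of about 300 mph.
print(is_plausible_match(10.0, 0.0, 120.0))
# The same distance covered in fifteen minutes implies about 40 mph.
print(is_plausible_match(10.0, 0.0, 900.0))
```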
When dealing with batches of vehicles moving on the same road, statistical outliers are
sometimes used to filter out wrong matches. There are many ways to define and find
outliers. For parking lot and duration of stay surveys, using average speed is not appropriate and
the identification of outliers is less feasible. [8, 9]
When wrong matches are found or unmatched license plates need to be retried, Edit
Distance algorithms are used to find the nearest possible alternatives. [10] Edit distance, or
more specifically the Levenshtein distance, is a metric for measuring the amount of difference
between two strings. It is defined as the minimum number of edits needed to transform one
string into the other, where the allowable edit operations are insertion, deletion, or
substitution of a single character.
When two license plates match in the initial matching procedure, their edit distance is zero.
If a character of the license plate is missed during recordation and all other recorded
characters are correct, then only one insertion is needed to match the recorded string, and
thus the Levenshtein distance equals one.
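For illustration, the Levenshtein distance defined above can be computed with the standard dynamic‐programming recurrence. The sketch below is a generic textbook implementation, not the code used in this study; the plate strings are made‐up examples.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


print(levenshtein("ABC123", "ABC123"))  # identical plates
print(levenshtein("ABC23", "ABC123"))   # one character missed during recordation
```

The first call corresponds to an exact match (distance zero) and the second to the missed‐character case described above (one insertion).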
There is a more specific edit distance for equal length strings called Hamming distance. [16]
It measures the minimum number of substitutions required to change one string into the other,
or the number of errors that transformed one string into the other. For example the Hamming
distance between ‘ABC123’ and ‘DEC143’ is three. The Hamming distance between ‘ABC123’
and ‘CAB123’ is also three.
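A minimal sketch of the Hamming distance, checked against the plate examples used in this section (the function name `hamming` is chosen here for illustration):

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming("ABC123", "DEC143"))
print(hamming("ABC123", "CAB123"))
```

Because it only counts differing positions, the measure treats every substitution as equally likely, whatever the characters involved.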
The literature includes several methods and recommendations for finding the best match,
which has the smallest Edit Distance, but when only one character is assumed invalid, these
methods do not differentiate between the possible matches. For example, the Hamming
distances between ‘FNG’ and both ‘EMC’ and ‘OJI’ are three; neither is considered closer to
‘FNG’, but common sense suggests that the ‘OJI’ option is less likely to be a misrecordation of
‘FNG’ than the ‘EMC’ option. [3]
This study assumes that only one of the characters is mistakenly recorded; therefore the
Hamming distance is always one. Higher distances were deemed unlikely; moreover, if license
plates with a Hamming distance of two are matched, the probability of false matches increases.
Oliveira‐Neto et al. (2009) suggest that a probability matrix should be created to provide
additional help in choosing the better match when the edit distances are equal. [3] This
study is a step in this direction and tries to find similar characters that are more likely to be
mistaken by a human recorder.
3.2 Human Memory Factor
Human short‐term memory temporarily stores and manages information. Short‐term
memory has a span of seven chunks of information, plus or minus two. A chunk is referred to as
an integrated piece of information [11]. A chunk can be a digit, a letter, a simple shape, etc.
Studies show that the format of these chunks of information has influence on the ability of
the brain to remember them. For the case of license plate recordation it is directly related to
the format of recordation, e.g., last three digits, one letter and two digits in the middle, etc.
Research conducted on the memorability of license plates showed that the more digits and
letters in a license plate are mixed, the more difficult it is to memorize it [12].
Moreover, while recording the license plates if traffic is heavy and the number of vehicles is
large, the license plates of several vehicles need to be memorized, and they can easily surpass
memory capacity resulting in missed vehicles or wrong license plates.
In this study four datasets with three different recordation formats were used to investigate
this issue.
CHAPTER 4
METHODOLOGY
This chapter (i) describes the method for initial license plate matching; (ii) describes the
methodology for creating algorithms that reduce flawed and unmatched data; and, (iii) explains
the approach taken to compare the results from different algorithms. License plates consist of
characters, which include letters and numbers. Special characters such as %, $, and ! are not
found in license plates. The subsequent discussion focuses on characters, with separate
discussion on letter and number matching.
Several correction algorithms were developed. They were used to improve the percentage
of matched license plates by finding those that were mistakenly recorded. Two hypothetical
reasons for the mistakes were considered, and correction algorithms were created to evaluate
them vis‐à‐vis each other. Figure 4 summarizes the correction algorithms that are described in
this chapter. The structure of the correction algorithms for license plate letters and digits is
very similar; only their targeted characters are different.
Figure 4: Correction Algorithms that are used after Initial Matching. Gray Boxes Show the Main Algorithms.
4.1 License Plate Format
Since all of the data were collected in Hawaii and the majority of the license plates are
composed of three letters followed by three numbers, the focus was on this type of license
plate. The format of the data records differed among the four license plate datasets available
for analysis, as shown below.
• All three letters and three digits were recorded (ABC123) – Two sets
• Last letter and all three digits were recorded (C123) – One set
• All three letters and the first digit were recorded (ABC1) – One set
The latter two schemes were adopted in order to avoid the monitoring of private property.
For vehicles with license plate format other than ABC123, the whole plate was recorded. These
four datasets were used in analyses described in Chapter 5.
4.2 License Plate Matching
In order to match the license plates an algorithm was created with Visual Basic for
Applications (VBA) which utilizes the license plate datasets created in Microsoft Excel.
First, each dataset was sorted, for each data collection station, by the time of entry (primary
sorting) and exit (secondary sorting). Although recorded license plates are
automatically sorted by time of entry/exit as they are written down, this sorting is usually not
perfect, particularly when the traffic volume or vehicle speed is high. In these cases data
collectors usually aid each other to minimize missed vehicles. Normally one person reads aloud
the license plate numbers to be written down by the other person, or they both write down
every other license plate number, creating two separate lists. In the latter case the data must
be merged and sorted.
Figure 5 depicts the initial matching process of license plates, where the license plates that
are already identical are matched. Ideally, if all license plates were matched, then no correction
would be necessary, but this is rarely the case when several hundred observations are taken in
the field. The algorithm starts from the first license plate of the entering vehicles (“Ins” list) and
searches for the first license plate among the exited vehicles (“Outs” list) that exactly matches
it. If the exit time is later than the entry time, then the exited vehicle – whose license plate is
labeled OUT‐LP – is considered to be the same as the entered one – labeled as IN‐LP. After that
the matched OUT‐LP is removed from the original list and is added to the matched list. The
process continues until, for every IN‐LP, either a matching OUT‐LP is found or the whole
OUT‐LP list has been searched, whichever comes first.
Next, the percentage of matched license plates is calculated and IN‐LPs for which no match
is found are saved in the “Unmatched IN‐LPs” list to be processed by the correction algorithms.
For each matched license plate the duration of stay is calculated, and then the average,
standard deviation, and the median of duration of stay for all IN‐LPs is computed.
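The initial matching procedure described above can be sketched as follows. The study implemented it in VBA on Excel data; the Python below is only an illustrative re‐expression under stated assumptions: records are hypothetical (plate, time in seconds) pairs already sorted by time, and an OUT‐LP is accepted only if it is identical and observed later than the IN‐LP.

```python
from statistics import mean, median, stdev


def initial_match(ins, outs):
    """Match each IN-LP to the first identical, later OUT-LP still available.

    ins, outs: lists of (plate, time_s) sorted by time.
    Returns (matches, unmatched_ins), where matches holds (plate, t_in, t_out).
    """
    remaining = list(outs)  # work on a copy so the caller's list is untouched
    matches, unmatched = [], []
    for plate, t_in in ins:
        for idx, (out_plate, t_out) in enumerate(remaining):
            if out_plate == plate and t_out > t_in:
                matches.append((plate, t_in, t_out))
                del remaining[idx]  # a matched OUT-LP cannot be reused
                break
        else:
            unmatched.append((plate, t_in))  # kept for the correction algorithms
    return matches, unmatched


# Hypothetical records, times in seconds from the start of the survey.
ins = [("ABC123", 0), ("DEF456", 60), ("GHJ789", 120)]
outs = [("DEF456", 30), ("ABC123", 600), ("DEF456", 900)]

matches, unmatched = initial_match(ins, outs)
durations = [t_out - t_in for _, t_in, t_out in matches]
print(f"matched: {100 * len(matches) / len(ins):.1f}%")
print(mean(durations), median(durations), stdev(durations))
```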
Figure 5: Flowchart of Initial License Plate Matching Procedure.
4.3 Processing of Unmatched Data
Two types of error are considered as the possible cause for unmatched license plates.
Hypothesized reason for errors | Suitable algorithm for correction
Random error in character recognition or recordation | Full correction algorithm
Misreading a character due to its similarity to another one | Algorithm for correction of similar characters
For random misspelling of the characters, it is assumed that when the data collector was
recognizing or recording the license plate, one letter or digit was randomly wrong. It is also
assumed that a letter could be wrongly substituted only for another letter, and a number only
for another number. Errors that mix numbers and letters are assumed not to occur. In other
words, it is assumed that, for example, B may be noted instead of T, and 2 instead of 4, but not
6 instead of E. This assumption is “safe” in Hawaii, which has a typical license plate of ABC 123,
but less so in states that use a string of mixed letters and numbers in their license plates, e.g.,
1ABC234, with no spaces, in California.
After the unmatched license plates were separated from those that were matched by the
initial license plate matching algorithm, the “Full Correction Algorithms” we developed were
applied to find matches for mistakenly recorded license plates. These algorithms found the
characters that were mistakenly recorded and the correct characters that resulted in matched
license plates. The characters were retained for all of the new matches, and were analyzed to
see if a considerable number of mistaken records were due to visual similarity among
characters. If so, a much faster “similar algorithm” was developed to match the unmatched
license plates in place of the computing‐intensive (slow) “full correction algorithm”. The two
algorithms require a markedly different volume of computation. However, full algorithms
technically give more complete results since they search both similar and dissimilar characters.
4.3.1 Full Correction Algorithm
Figure 6 and Figure 7 depict the matching process of unmatched license plates by “Full
Correction Algorithms” that search through all characters – whether similar or dissimilar.
Unconstrained algorithms do not use additional information with the license plate numbers.
Constrained algorithms use information such as vehicle classification to filter out incorrect
matches. Figure 6 shows the unconstrained algorithms and Figure 7 shows the constrained
ones. The full correction algorithms comprise five algorithms, A through E,
as explained below.
Figure 6: Unconstrained Full Correction Algorithms.
Figure 7: Constrained Full Correction Algorithms.
While performing the character substitutions it is never known which character in the
license plate is misrecorded, so the algorithm needs to check all possibilities. One algorithm
does this for the letters and another does this for the digits as shown in Figure 4.
The five algorithms displayed in Figure 6 and Figure 7 were formed as follows:
A Repeated matches are excluded
B Repeated matches are included
C Used (matched) OUT‐LPs are retained
D The OUT‐LP that yields the closest duration of stay to the median is used
E The OUT‐LP that yields the closest duration of stay to the median, AND has the same
vehicle class as IN‐LP is used
Algorithm A does not count repeated matches for a given IN‐LP; the first match found in
the OUT‐LP list is used. After that, searching stops and the matched OUT‐LP is removed from
the list so that it cannot be matched again with another IN‐LP. This is a fast algorithm.
Algorithm B continues the search among all of the OUT‐LPs. It can find more than
one match for each IN‐LP, but all of the matched OUT‐LPs are removed from the list so that
they are not matched again with other IN‐LPs later. This is a slower algorithm.
Algorithm C operates similarly to algorithm B, with the exception that the matched
OUT‐LPs are retained in the list, so that they can be matched again. When a match is found for
the IN‐LP, it is not considered as the only correct one. Instead, the search continues until all of
the characters (letters or digits) have been replaced with all of the possible replacement
characters, and all of the OUT‐LPs have been checked for each of the replacements in each
IN‐LP. There are 9 possible replacements if the character is a digit and 25 if it is a letter. After
each replacement is performed, the OUT‐LP list is searched to see if any of its data entries
match the modified IN‐LP. If no match is found, the current character in the IN‐LP is replaced
with the next possible replacement. For instance, if the character X in XBC123 is currently being
replaced and no match is found for it, in the next step it will be replaced by Y and then the
OUT‐LP list is searched again. If no match is found by replacing the second letter with all 25
alternatives, then the same process is repeated for the next (third) letter, C in this example.
This is the slowest algorithm and may result in one OUT‐LP matching several IN‐LPs.
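The exhaustive substitution step can be sketched as follows. This is an illustrative Python re‐expression rather than the VBA used in the study: it first builds every plate that differs from the IN‐LP in exactly one position (letters replaced only by letters, digits only by digits) and then checks the OUT‐LP list against that set, which yields the same candidates as replacing one character at a time and searching the list after each replacement. The function names and sample plates are hypothetical.

```python
import string


def single_substitutions(plate):
    """Yield every plate that differs from `plate` in exactly one position.

    Letters are replaced only by other letters (25 options each) and
    digits only by other digits (9 options each).
    """
    for i, ch in enumerate(plate):
        if ch in string.ascii_uppercase:
            pool = string.ascii_uppercase
        elif ch in string.digits:
            pool = string.digits
        else:
            continue  # leave any other character untouched
        for replacement in pool:
            if replacement != ch:
                yield plate[:i] + replacement + plate[i + 1:]


def full_correction_candidates(in_plate, out_plates):
    """Return the OUT-LPs reachable from the IN-LP by one substitution.

    The OUT-LP list is not modified, as in algorithm C.
    """
    variants = set(single_substitutions(in_plate))
    return [plate for plate in out_plates if plate in variants]


print(len(list(single_substitutions("ABC123"))))  # 3 * 25 + 3 * 9 variants
print(full_correction_candidates("ABC123", ["ABC128", "XYZ999", "ABD123", "A8C123"]))
```

In the second call, A8C123 is not returned because it would require replacing a letter with a digit.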
Algorithm D operates similarly to algorithm C, but if an IN‐LP has more than one possible
match, then the one that yields a duration of stay closest to the median duration of stay
(calculated from all of the data matched in the initial matching phase) is selected as the
correct match. This is a more accurate algorithm.
Algorithm E operates similarly to algorithm D, but in addition to checking the duration of stay,
it verifies that the vehicle classes of the pair match as well. This algorithm protects against
gross errors in vehicle plate matching.
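The selection rules of algorithms D and E can be sketched as a single function with an optional vehicle‐class constraint. The record layout (plate, time in seconds, vehicle class), the function name `pick_best`, and the requirement that the exit be later than the entry are assumptions made for this illustration.

```python
def pick_best(in_record, candidates, median_stay_s, require_same_class=False):
    """Choose the candidate OUT-LP whose duration of stay is closest to the median.

    in_record:  (plate, t_in_s, vehicle_class)
    candidates: list of (plate, t_out_s, vehicle_class) found by substitution
    require_same_class=False mimics algorithm D, True mimics algorithm E.
    """
    _, t_in, in_class = in_record
    eligible = [
        c for c in candidates
        if c[1] > t_in and (not require_same_class or c[2] == in_class)
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda c: abs((c[1] - t_in) - median_stay_s))


# Hypothetical IN-LP with three candidate OUT-LPs; the median stay is one hour.
in_record = ("ABC123", 0, "car")
candidates = [
    ("ABC128", 3000, "truck"),
    ("ABD123", 5400, "car"),
    ("ABC133", 2000, "car"),
]
print(pick_best(in_record, candidates, 3600))                           # rule D
print(pick_best(in_record, candidates, 3600, require_same_class=True))  # rule E
```

With these made‐up records the two rules disagree: the candidate closest to the median belongs to a different vehicle class, so the constrained rule falls back to the closest candidate of the same class.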
If the final results from these algorithms are in agreement with each other, then the
random error for the whole correction procedure is low. If each of the algorithms finds
different matches for a considerable portion of the IN‐LPs and the derived statistics differ
substantially, then the random error is substantial.
4.3.2 Algorithm for Correction of Similar Characters (“Similar Algorithm”)
Figure 8 depicts the matching process of license plates if only similar characters are
searched. Again it is assumed that a letter could only be mistaken for another letter, and a
number for another number. This algorithm is suitable if misreading of the characters is
assumed to be mostly due to similarity among characters.
Table 3 and Table 4 show the likely similarities. The similarity between letters and digits was
initially decided upon based on common sense and a list of higher frequency mistakes in
Automatic License Plate Recognition (ALPR) systems [6]. In ALPR systems, since OCR software
“translates” the picture of the license plate into data, visual similarities between characters are
important and mistakes between similar characters usually have higher frequencies. Therefore,
careful use of the higher frequency mistakes of those systems was applicable to this study. No
applicable reference from the field of psychology could be found in the literature.
After processing the unmatched data for each dataset by the full algorithms, a table is
created that shows which letters or digits have been possibly misrecorded, and the substituting
character that yields a matched license plate.
Table 5 shows a sample for letters. For instance, number “4” at the intersection of row E
and column F in the table indicates that in four cases by converting letter “E” in the unmatched
license plates list to “F” a match was found. It indicates that possibly letter F was wrongly read
as E.
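A table of this kind can be tallied from the corrected matches as sketched below. The input pairs are hypothetical (recorded plate, matching plate) examples, not data from the study, and the function name is chosen for illustration only.

```python
from collections import Counter


def substitution_counts(corrections):
    """Count (recorded_char, substituting_char) pairs over corrected matches.

    corrections: iterable of (recorded_plate, matching_plate) pairs. Only pairs
    of equal length that differ in exactly one position are counted.
    """
    counts = Counter()
    for recorded, matching in corrections:
        if len(recorded) != len(matching):
            continue
        diffs = [(r, m) for r, m in zip(recorded, matching) if r != m]
        if len(diffs) == 1:
            counts[diffs[0]] += 1
    return counts


corrections = [
    ("ABE123", "ABF123"),  # E recorded, F gives a match
    ("EGH456", "FGH456"),  # E recorded, F gives a match
    ("ABC123", "ABC128"),  # 3 recorded, 8 gives a match
    ("ABC123", "XYC123"),  # two differences: ignored
]
counts = substitution_counts(corrections)
print(counts[("E", "F")], counts[("3", "8")])
```

Each key corresponds to one cell of the table: the first element is the row (recorded character) and the second the column (substituting character).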
Table 3. Similar Letters.
Reference Letter:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Similar Letters:
R G P F E C N J I I N M Q R O B I V U M
R T W W D F N
L H P
Table 4. Similar Digits.
Reference Digit:
1 2 3 4 5 6 7 8 9 0
Similar Digits:
7 3 2 6 5 1 3 5 6
8 8 8 5 6 9
9 9 6 8
0 9 0
Table 5. Sample of Substituted Letters Using the Full Correction Algorithm C.
Rows: Substituted Letter (Misrecorded Letter). Columns: Substituting Letter Resulted in Matching (A to Z).
Nonzero counts in each row, listed in column order (rows I, K, L and Q have no entries):
A: 1 1 1 1
B: 2 1 1 1 2
C: 3 1 1 1 1 1 1 1
D: 1 1 2 2 1 9 1 1 1
E: 1 4 1 2
F: 1 2 1 1 1 1
G: 1 1 1 1
H: 4
J: 1 1 1 1 2 2 1 2 1 1
M: 2
N: 2 2 1 4 3 1 1 1 1
O: 2 2
P: 6 2 1 3 1
R: 2 1 1 9 1 1
S: 1 1
T: 1 1 1 1 1 1
U: 1 2 1
V: 1 1 2 1 1
W: 1 1
X: 1 1 3 1
Y: 1 3 1 4
Z: 1 2
Again, the place of the misrecorded letter or digit is never known in the license plate and
the algorithms check all possibilities. However, the number of possibilities is lower, as the
substitution is done only for similar characters. This significantly improves computing
performance for large databases.
Figure 8: Correction Algorithm for Similar Letters and Digits.
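The reduction in search space obtained by restricting substitutions to similar characters can be illustrated with a short sketch. The map `SIMILAR` below is a small hypothetical example and does not reproduce the contents of Table 3 or Table 4; the function name is likewise chosen for illustration.

```python
# Hypothetical similarity map (NOT the study's tables): each character maps to
# the characters it is assumed to be confused with. Letters map only to letters
# and digits only to digits.
SIMILAR = {
    "E": "F", "F": "E",
    "O": "Q", "Q": "O",
    "1": "7", "7": "1",
    "3": "8", "8": "3",
}


def similar_substitutions(plate, similar=SIMILAR):
    """Yield plates obtained by replacing one character with a similar one."""
    for i, ch in enumerate(plate):
        for replacement in similar.get(ch, ""):
            yield plate[:i] + replacement + plate[i + 1:]


candidates = list(similar_substitutions("AEC173"))
print(candidates)
print(len(candidates))
```

For a six‐character plate of the ABC123 type, the exhaustive search tries 3 × 25 + 3 × 9 = 102 single‐character substitutions, whereas this restricted search tries only the few substitutions listed in the map.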
4.4 Discussion on Processing Results
The results are compared based on the following criteria:
1) Volume of computations and processing speed: The duration of the correction process is
measured in minutes and seconds, under the same processing conditions, to see how
much time a similar character correction algorithm can save. The same comparison is also made
for the different data recordation styles (ABC123, C123, and ABC1). Finally, the correction time
is compared to the initial matching time, to see whether the extra time spent on correction is
substantial in the whole license plate matching process.
2) Impact on match percentage: For each dataset the numbers of unmatched license plates are
compared before and after the correction algorithms are applied. These are also compared
among the different algorithms. Based on these results the contribution of each algorithm is
calculated as the percentage reduction in unmatched license plates.
3) Impact on statistics: For each dataset the average duration between two successive
observations of vehicles and its standard deviation are compared before and after the
correction algorithms are applied. Higher percentages indicate greater importance of the
algorithms for traffic surveys.
The results for the full algorithms are also compared with each other. Here, the assumption
is that the results should be close to each other, meaning that the matches that are found
depend neither on the direction in which the algorithms read and match the data, nor on the
way the algorithms deal with previously matched license plates. Theoretically, if the correct
matches are found by the correction algorithms, the statistical results from all three algorithms
should be the same.
CHAPTER 5
DATA COLLECTION AND ANALYSIS
The first section of this chapter describes the collection procedures and the specifications of
the four sets of data, which were analyzed using the algorithms developed as part of this
research.
The results from the analyses are interpreted in the next sections of this chapter. Some of
the analyses are done for each dataset individually in order to evaluate the influence of data
collection format (e.g., all letters and digits recorded); some are done for every algorithm in
each dataset, and some analyses are done with the aggregate data from all datasets.
5.1 Data Collection
5.1.1 Dataset 1 (ABC1): ITE
This dataset was collected in the format of three letters and one digit (ABC1) at the
entrance of Waipio Peninsula Soccer Complex. The data were needed for a parking analysis
study for the Institute of Transportation Engineers (ITE).
Data collection started at 7:00 AM and continued until 7:00 PM on Saturday, January 29,
2011. Because of the fairly long duration of data collection, it could not be done continuously
by one person. Therefore, the entering and exiting vehicle datasets were each collected by two
people, one person collecting data from 7:00 AM to 12:00 PM and one from 12:00 PM to 7:00
PM. Four people were involved, three with good to perfect vision and one wearing glasses who
collected the exiting vehicles from 12:00 PM to 7:00 PM. The sunset time at the location was
6:20 PM. Therefore, the final 40 minutes of data collection were done in comparatively dimmer
light. However, the volume of data collected during this period was very small compared to the
whole dataset size: one vehicle entered the park and 30 vehicles exited, out of which 20 were
correctly recorded and matched in the initial matching process. Therefore, no significant impact
may be attributed to this issue.
The data collectors stood approximately 5 to 10 ft from the edge of the road (as shown in
Figure 9), close enough to read the license plates conveniently while remaining
reasonably safe.
Figure 9: Data Collection at Waipio Peninsula Soccer Complex
2182 vehicles were recorded entering the Park and 2206 vehicles exiting. 435 (19.9%) of the
entered vehicles could not be matched initially.
However, the maximum number of matched vehicles cannot exceed the minimum of all
entered vehicles and all exited vehicles:
Maximum that can possibly be matched = Min (Entered, Exited)
For this dataset, since the number of entered vehicles is smaller than the number of exited
vehicles, this adjustment does not change the result.
Table 6 shows the summary specifications of this dataset.
5.1.2 Dataset 2 (C123): HAVO 2009
This dataset was collected in the format of one letter and three digits (C123) at the entrance
of Hawaii Volcanoes National Park (HAVO) in 2009. The data were needed for parking analysis.
Data collection started at 10:00 AM and continued until 4:30 PM on Monday, August 17,
2009. Entering and exiting vehicles were each recorded by one data collector, one with good
vision and one wearing glasses. For short periods of time (around 10‐15 minutes) a third
person substituted for one of the data collectors.
The data collectors stood approximately 5 to 10 ft from the edge of the road (as shown in
Figure 10), close enough to read the license plates conveniently while remaining
reasonably safe.
797 vehicles were recorded entering the Park and 751 vehicles exiting. 423 of the entered
vehicles (56.3% of the maximum possible, 423 ÷ Min(797, 751)) could be matched initially and
43.7% could not.
Figure 10: Data Collection at the Entrance of Hawaii Volcanoes National Park (2009)
Table 6 shows the summary specifications of this dataset.
5.1.3 Dataset 3 (ABC123): HAVO 2007 – 1
This dataset was collected in the format of three letters and three digits (ABC123) at the
entrance of Hawaii Volcanoes National Park (HAVO) in 2007. The data were needed for parking
analysis.
Data collection started at 10:00 AM and continued until 3:00 PM on Saturday, August 11,
2007. Entering vehicles were recorded by two data collectors together and exiting vehicles
were recorded by one collector during the five hours of data collection. The person who
collected the exiting vehicles wore glasses.
The exact distance of the data collectors from the road is not known for this dataset, but it
is estimated to be close to that of Dataset 2: between 5 and 10 ft.
771 vehicles were recorded entering the Park, 523 vehicles exiting. 303 (57.9% of maximum
possibility = 303÷Min (771, 523)) of the entered vehicles could be matched initially and 42.1%
couldn’t.
Table 6 shows the summary specifications of this dataset.
5.1.4 Dataset 4 (ABC123): HAVO 2007–2
This dataset was collected in the format of three letters and three digits (ABC123) at the
entrance of Hawaii Volcanoes National Park (HAVO) in 2007. The data were needed for parking
analysis.
Data collection started at 10:00 AM and continued until 3:00 PM on Sunday, August 12,
2007. Entering vehicles were recorded by two data collectors together and exiting vehicles
were recorded by one collector during the five hours of data collection. The person who
collected the exiting vehicles wore glasses.
The exact distance of the data collectors from the road is not known for this dataset, but it
is estimated to be close to that of Dataset 2: between 5 and 10 ft.
863 vehicles were recorded entering the Park, 654 vehicles exiting. 360 (55.0% of maximum
possibility = 360÷Min (863, 654)) of the entered vehicles could be matched initially and 45.0%
couldn’t.
Table 6 shows the summary specifications of this dataset.
Table 6. Data Collection Specifications
5.2 Individual Analyses
5.2.1 Algorithm Comparison
This analysis was done to investigate the variability of the results when our algorithms are
applied to the same set. The indices of comparison are Difference Percentage, Root‐Mean‐
Square Deviation (RMSD), Coefficient of Variation of Root‐Mean‐Square Deviation (CV(RMSD)),
and Normalized Root‐Mean‐Square Deviation (NRMSD).
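The percentage of difference among the algorithms is calculated based on the following
formula:
$$\text{Difference Percentage} = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J}\left|M^{X}_{ij}-M^{E}_{ij}\right|}{\sum_{i=1}^{I}\sum_{j=1}^{J}M^{E}_{ij}}$$ Equation 1
Where,
$M^{X}_{ij}$ is the count of misreadings of the ith character as the jth character in the Mistakes
Matrix generated by Algorithm X. I and J are the dimensions of the Mistakes Matrix, 26 for
letters and 10 for numbers. Algorithm E is the Reference Algorithm (its matrix is $M^{E}$),
considered to be the most comprehensive and accurate because it is the most constrained.
The other indices are calculated as follows:
$$\text{RMSD} = \sqrt{\frac{\sum_{i=1}^{I}\sum_{j=1}^{J}\left(M^{X}_{ij}-M^{E}_{ij}\right)^{2}}{\sum_{i=1}^{I}\sum_{j=1}^{J}M^{E}_{ij}}}$$ Equation 2
$$\text{CV(RMSD)} = \frac{\text{RMSD}}{\sum_{i=1}^{I}\sum_{j=1}^{J}M^{E}_{ij}\,/\,(I \times J)}$$ Equation 3
$$\text{NRMSD} = \frac{\text{RMSD}}{M^{E}_{\max}-M^{E}_{\min}}$$ Equation 4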
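As an illustration, the four comparison indices can be computed from two Mistakes Matrices as sketched below. This is a hedged sketch, not the thesis code: the function name `deviation_indices` is hypothetical, and the normalisations (sum of the reference matrix for the Difference Percentage and RMSD, mean over all I×J cells for CV(RMSD), and max minus min of the reference matrix for NRMSD) are assumptions chosen because they reproduce the values reported in the comparison tables.

```python
import math


def deviation_indices(m_x, m_e):
    """Compare matrix m_x (Algorithm X) against the reference m_e (Algorithm E).

    Both arguments are equally sized lists of rows of non-negative counts.
    Returns the indices as fractions (multiply by 100 for percentages).
    """
    rows, cols = len(m_e), len(m_e[0])
    cells = [(m_x[i][j], m_e[i][j]) for i in range(rows) for j in range(cols)]
    total_e = sum(e for _, e in cells)
    if total_e == 0:
        raise ValueError("reference matrix is empty")

    abs_diff = sum(abs(x - e) for x, e in cells)
    sq_diff = sum((x - e) ** 2 for x, e in cells)

    rmsd = math.sqrt(sq_diff / total_e)      # assumed normalisation
    mean_e = total_e / (rows * cols)         # mean over all I x J cells
    value_range = max(e for _, e in cells) - min(e for _, e in cells)
    return {
        "difference": abs_diff / total_e,
        "rmsd": rmsd,
        "cv_rmsd": rmsd / mean_e,
        "nrmsd": rmsd / value_range if value_range else float("nan"),
    }


# Toy 2x2 example (not thesis data).
reference = [[0, 2], [1, 0]]
candidate = [[0, 1], [1, 1]]
print(deviation_indices(candidate, reference))
```

An algorithm whose matrix is identical to the reference yields zero for all four indices, which is the pattern seen when algorithms D and E agree.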
5.2.2 Evaluation of Impact of Similarity on Errors
It is assumed that, if there is no specific cause for bias, the mistakenly recorded characters
should be uniformly distributed in the Mistakes Matrices. For example, consider the 10x10
Mistakes Matrix for the digits, which holds 90 possible cases of digit misreading (the
diagonal is blank because it corresponds to correct readings). With a dataset of 900
mistakenly recorded characters, we would expect a frequency of 10 for each
possible case in the matrix, so the matrix should be similar to Table 7. However, if the
resultant matrix is considerably far from uniformity, this indicates a bias and therefore a cause.
In this research, having noticed that the Mistakes Matrices are not uniform, it was assumed
that similarity between characters is probably the cause and that similar characters are more
likely to be mistakenly recorded.
Table 7. Uniform Mistakes Matrix for 900 Mis‐recorded Numbers
To test this hypothesis the ratio of similar cases (shown as yellow cells in Table 8 and Table
9) to all cases of mistakes was calculated. For the letters, 30 cases (4.6%) among 26x26‐26=650
cases of mistake in the Mistakes Matrix, and for the digits, 22 cases (24.4%) among 10x10‐
10=90 cases of mistake were considered to be mistaken recording of similar characters. Again,
if the effect of similarity was insignificant, we would observe around 4.6% of mistake counts to
occur in yellow cells for letters; however, the analyses show that in practice it was not the case
and yellow cells typically contained several times that percentage. For the case of digits the
difference was smaller and less significant.
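The shares quoted above follow directly from the matrix dimensions. A minimal sketch (the helper names are hypothetical):

```python
def off_diagonal_cells(n: int) -> int:
    """Number of possible misreading cases in an n x n Mistakes Matrix."""
    return n * n - n


def similar_share(similar_cells: int, n: int) -> float:
    """Fraction of misreading cases that were marked as similar (yellow)."""
    return similar_cells / off_diagonal_cells(n)


def uniform_cell_count(total_mistakes: int, n: int) -> float:
    """Expected count per cell if mistakes were spread uniformly."""
    return total_mistakes / off_diagonal_cells(n)


print(f"letters: {similar_share(30, 26):.1%} of {off_diagonal_cells(26)} cases")
print(f"digits:  {similar_share(22, 10):.1%} of {off_diagonal_cells(10)} cases")
print(f"uniform digit cell for 900 mistakes: {uniform_cell_count(900, 10):.0f}")
```

Under uniformity, the same shares would also be the expected fraction of mistake counts falling in the yellow cells.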
      1   2   3   4   5   6   7   8   9   0
 1    -  10  10  10  10  10  10  10  10  10
 2   10   -  10  10  10  10  10  10  10  10
 3   10  10   -  10  10  10  10  10  10  10
 4   10  10  10   -  10  10  10  10  10  10
 5   10  10  10  10   -  10  10  10  10  10
 6   10  10  10  10  10   -  10  10  10  10
 7   10  10  10  10  10  10   -  10  10  10
 8   10  10  10  10  10  10  10   -  10  10
 9   10  10  10  10  10  10  10  10   -  10
 0   10  10  10  10  10  10  10  10  10   -
In order to check the significance of the difference between the practical results and the assumed
case of uniformity (similarity insignificance), a Chi‐Square (χ2) test was performed as follows.
The null and alternative hypotheses were defined as:
H0: Similarity has insignificant effect and observations are distributed by chance.
Ha: Similarity has significant effect and observations are skewed because of it.
$$\chi^{2} = \sum \frac{(O-E)^{2}}{E} = \frac{(O_S-E_S)^{2}}{E_S} + \frac{(O_D-E_D)^{2}}{E_D}$$ Equation 5
Where,
ES = Expected value for similar cases (yellow cells in Mistakes Matrices) = percentage of similar
cases [1] x data count [2]
OS = Observed value for similar cases = ∑ (similar cases) = ∑ (yellow cells)
ED = Expected value for dissimilar cases (white cells in Mistakes Matrices) = data count ‐ ES
OD = Observed value for dissimilar cases = ∑ (dissimilar cases) = ∑ (white cells) = data count ‐ OS
Based on calculated χ2 value and for one degree of freedom the probability value for
acceptance of null hypothesis was calculated.
[1] Percentage of similar cases is 4.6% for letters, and 24.4% for digits.
[2] Data count in these formulae is the total number of mistakes in recordation, which is the sum of all cells in
the Mistakes Matrix.
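The test described above can be sketched in a few lines. The function name `similarity_chi_square` is hypothetical; the P‐value uses the fact that, for one degree of freedom, the upper tail of the χ2 distribution equals erfc(sqrt(χ2/2)), so only the standard library is needed.

```python
import math


def similarity_chi_square(observed_similar, total_mistakes, similar_share):
    """Two-category chi-square test (similar vs. dissimilar), 1 degree of freedom.

    similar_share is the fraction of matrix cells marked as similar,
    e.g. 30/650 for letters or 22/90 for digits.
    """
    expected_similar = similar_share * total_mistakes
    expected_dissimilar = total_mistakes - expected_similar
    observed_dissimilar = total_mistakes - observed_similar

    chi2 = ((observed_similar - expected_similar) ** 2 / expected_similar
            + (observed_dissimilar - expected_dissimilar) ** 2 / expected_dissimilar)
    # Upper-tail probability of chi-square with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Digits of Dataset 1: 26 of 72 mistakes fall in similar (yellow) cells.
chi2, p = similarity_chi_square(26, 72, 22 / 90)
print(f"chi-square = {chi2:.3f}, p-value = {p:.3f}")
```

With these inputs the sketch returns the χ2 and P‐value reported for the digits of Dataset 1 (5.306 and 0.021).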
Table 8. Hypothesized Mistakes Matrix for Letters. Yellow Cells are the Intersection of Similar Letters.
[Blank 26x26 matrix with rows and columns A to Z; the yellow cell shading is not reproduced here.]
Table 9. Hypothesized Mistakes Matrix for Digits. Yellow Cells are the Intersection of Similar Digits.
[Blank 10x10 matrix with rows and columns 1 to 9 and 0; the yellow cell shading is not reproduced here.]
In the following section the results of the main correction algorithms for each dataset are
analyzed. In the interest of conciseness, in this section only the Mistakes Matrices created by
the main algorithm (Algorithm E) are shown; the rest of the matrices can be found in the
Appendix.
Under each Mistakes Matrix, a summary table shows the count of the non‐empty cells and the
summation of the cells for both the similar data (yellow cells) and the whole table. Their ratio
is also shown as a percentage. A second table shows the results of the χ2 test, including the
P‐value for acceptance of the null hypothesis.
5.2.3 Analyses of each Dataset
5.2.3.1 Dataset 1 (ABC1): ITE
Table 10. Letters Mistakes Matrix for Dataset 1, by Algorithm E
It is obvious that the null hypothesis is rejected and similar letters have a significantly higher
frequency (67÷9=7.4 times more) of being mistaken.
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A 1 1 1 1 1
B 1 1 1 3
C 1 1 1 2 1 2
D 2 3 1 1 12 2
E 1 5 1 1 3
F 1 3 1 1 1 1
G 1 2 1 1 1
H 4
I
J 2 2 2 2 1 1 5 1
K
L
M 1
N 2 1 2 3 1 1 1
O 2 1
P 1 8 2 2 7 1 1
Q
R 2 1 1 1 1 1 8 1 1 2
S 3 1 1 1
T 2 2 1 1 1 1
U 1 1 1 2 1
V 2 1 1 2 1 2 1
W 1 1 1 1
X 1 1 1 2
Y 3 1 4 1 5
Z 1 2
                        Count     Sum
All Data                  110     197
Similar Data               21      67
Similar to All Ratio    19.1%   34.0%

                      Similar   Dissimilar
Observed                   67          130
Expected                    9          188
Chi-Square            386.652
P-Value                 0.000
Comparison of the algorithms
Table 11. Ratio of Similar Character Misreadings to Total Count of Misreadings
Algorithm           A       B       C       D       E
Count (All)       164     214     339     197     197
Count (Sim)        56      53      79      67      67
Sim/All Ratio   34.1%   24.8%   23.3%   34.0%   34.0%
Table 11 indicates that the counts of mistakes found by the five algorithms are not close
for this dataset, except for algorithms D and E, the most comprehensive and accurate
algorithms, whose counts are exactly the same. The large difference among the results of the
algorithms is a negative sign: it indicates that the number of false matches is probably not
negligible for some algorithms, in particular for algorithms A, B and C, which are not
constrained. The difference is also larger than in the correction for the digits, which
indicates a greater number of false matches and a poorer performance of the algorithms for
letters. The Sim/All Ratio is nevertheless fairly close for three of the algorithms: A, D and E.
Table 12. Deviation of the Mistake Matrices by Algorithms A to D based on E
Algorithm            A        B        C       D
% Difference     55.3%    68.5%    74.1%    0.0%
RMSD             88.7%   101.0%   127.8%    0.0%
CV(RMSD)        304.4%   346.6%   438.7%    0.0%
NRMSD             7.4%     8.4%    10.7%    0.0%
Table 12 shows that the Mistakes Matrices created by algorithms D and E are exactly the
same. Checking the classes of the vehicles being matched therefore did not add anything to
the process: the vehicles matched by algorithm D already had matching classes.
The percentage of difference and the other indices for algorithms A to C are higher than those
for the digits (discussed in the next part), which indicates more false matches for letters.
Table 13. Numbers Mistakes Matrix for Dataset 1, by Algorithm E
The null hypothesis is rejected at the 5% significance level. Similar digits have a higher
frequency of being mistaken, but they are only about 1.5 times as frequent as the rest of the
cases.
1 2 3 4 5 6 7 8 9 0
1 3 3 1
2 1 4 2 1
3 2 1 2 1 1 1 1
4 2 1 1 1
5 1 1
6 2 1 1 2 4 2
7 3 1 2 2 1 2
8 1 2 1 2 2 2 1
9 2 1 2 1 1
0
                        Count     Sum
All Data                   44      72
Similar Data               12      26
Similar to All Ratio    27.3%   36.1%

                      Similar   Dissimilar
Observed                   26           46
Expected                   18           54
Chi-Square              5.306
P-Value                 0.021
Comparison of the algorithms
Table 14. Ratio of Similar Character Misreadings to Total Count of Misreadings
Algorithm           A       B       C       D       E
Count (All)        61      67      84      72      72
Count (Sim)        21      25      31      26      26
Sim/All Ratio   34.4%   37.3%   36.9%   36.1%   36.1%
Table 14 indicates that the counts of mistakes found by the five algorithms are
closer than in the correction for the letters. Again, the counts of mistakes found
by algorithms D and E are exactly the same. The smaller difference among the
results of the algorithms, and specifically the fairly close Sim/All ratios, can indicate
a smaller number of false matches and a better performance of the algorithms for digits.
Table 15. Deviation of the Mistake Matrices by Algorithms A to D based on E
Algorithm            A        B        C       D
% Difference     34.7%    31.9%    16.7%    0.0%
RMSD             69.7%    65.6%    47.1%    0.0%
CV(RMSD)         96.8%    91.1%    65.5%    0.0%
NRMSD            17.4%    16.4%    11.8%    0.0%
Similar to the case for letters, Table 15 shows that the Mistakes Matrices created by
algorithms D and E are exactly the same, indicating that checking the classes of the vehicles
being matched did not add anything to the process. No general judgment on the accuracy of
the algorithms can be made based on this dataset. For the letters, algorithm C performed least
accurately among the unconstrained algorithms but for the digits it performed better.
Percentages of difference indicate that the majority of the cells in the Mistakes Matrices
created by these four algorithms are identical to those of E.
5.2.3.2 Dataset 2 (C123): HAVO 2009
Table 16. Letters Mistakes Matrix for Dataset 2, by Algorithm E
The null hypothesis is not rejected. Similar letters do not have a significantly higher
frequency of being mistaken.
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A 1 1 1 1
B 1 1 1
C 1 1 1
D 1 1 1
E 1 1
F 1 1 1
G 1
H 1
I 1
J 1 1
K 1 1
L
M 1
N 1 1
O 1 1
P 1
Q
R 1
S 1
T 1 1 1 1
U 1 1 1
V 1
W 1
X 3 1
Y 2 1
Z 1 1 1 1 1 1 1
                        Count     Sum
All Data                   53      56
Similar Data                3       3
Similar to All Ratio     5.7%    5.4%

                      Similar   Dissimilar
Observed                    3           53
Expected                    3           53
Chi-Square              0.070
P-Value                 0.791
Comparison of the algorithms
Table 17. Ratio of Similar Character Misreadings to Total Count of Misreadings
Algorithm           A       B       C       D       E
Count (All)        58      62      76      71      56
Count (Sim)         3       3       3       4       3
Sim/All Ratio    5.2%    4.8%    3.9%    5.6%    5.4%
Table 17 indicates that the counts of mistakes found by the five algorithms are
fairly close and that false matches are probably not numerous. The Sim/All ratio is very low,
suggesting that similarity has an insignificant impact on the mistakes count. However, the
difference between the results from algorithms D and E is considerable. For this set
of data a four‐class scheme was used to classify vehicles and a great majority of vehicles were
personal vehicles; nonetheless, considering the class of vehicles while matching them seems to
be beneficial to some extent. If there is a good spread among classes, this constraint can filter
out a greater portion of wrong matches.
Table 18. Deviation of the Mistake Matrices by Algorithms A to D based on E
Algorithm            A        B        C        D
% Difference     53.6%    50.0%    42.9%    26.8%
RMSD             73.2%    70.7%    65.5%    51.8%
CV(RMSD)        883.5%   853.6%   790.3%   624.8%
NRMSD            24.4%    23.6%    21.8%    17.3%
The difference percentages in Table 18 show that the Mistakes Matrices resulting from the
algorithms are fairly close. They are lower than those for the digits, which indicates fewer
false matches for letters. CV(RMSD) is the second highest among all datasets while RMSD is not
so big; the reason is a small denominator, $\sum_{i}\sum_{j}M^{E}_{ij}/(I \times J)$, which is
caused by the sparseness of the Mistakes Matrix created by Algorithm E. This becomes obvious by
observing Table 16, which is mostly filled with “ones”; dividing RMSD by such a small mean cell
count results in a large CV(RMSD) value. It indicates that the data
are not enough for judgment between the algorithms.
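As a worked illustration of the sparseness effect, take Algorithm A for the letters of this dataset. Assuming that the denominator of CV(RMSD) is the mean over all 26x26 cells of the reference matrix (an assumption that matches the reported value to within rounding), the 56 recorded mistakes give:

```latex
\text{CV(RMSD)} = \frac{\text{RMSD}}{\sum_{i}\sum_{j} M^{E}_{ij} / (I \times J)}
               = \frac{0.732}{56 / (26 \times 26)}
               \approx \frac{0.732}{0.0828}
               \approx 8.84
```

This is about 884%, in line with the 883.5% reported in Table 18; a moderate RMSD becomes a very large CV(RMSD) only because the mean cell count is so small.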
Table 19. Numbers Mistakes Matrix for Dataset 2, by Algorithm E
Null hypothesis is not rejected and similar digits do not have a significantly higher frequency
of being mistaken.
1 2 3 4 5 6 7 8 9 0
1 1 1 1
2 1 1 1 1
3 3
4 1 2 1 1 2 1
5 1 2 1 1 2
6 2 1 1 3
7 1 1 1 2
8 1 1 1
9 1 1 1
0
                        Count     Sum
All Data                   33      43
Similar Data                7      12
Similar to All Ratio    21.2%   27.9%

                      Similar   Dissimilar
Observed                   12           31
Expected                   11           32
Chi-Square              0.279
P-Value                 0.597
Comparison of the algorithms
Table 20. Ratio of Similar Character Misreadings to Total Count of Misreadings
Algorithm           A       B       C       D       E
Count (All)        44      56      69      52      43
Count (Sim)         9      18      20      14      12
Sim/All Ratio   20.5%   32.1%   29.0%   26.9%   27.9%
Similar to the case for the letters, Table 20 shows that the difference between the counts
from algorithms D and E is not negligible, again suggesting that considering the class of
vehicles while matching them is beneficial.
Table 21. Deviation of the Mistake Matrices by Algorithms A to D based on E
Algorithm            A        B        C        D
% Difference     67.4%    72.1%    79.1%    20.9%
RMSD             87.6%    95.2%   107.8%    45.7%
CV(RMSD)        203.7%   221.5%   250.8%   106.4%
NRMSD            29.2%    31.7%    35.9%    15.2%
Table 21 shows that the differences between the Mistakes Matrices of numbers resulting from the
algorithms are not large, suggesting that the false matches are not many. However, the high
value of CV(RMSD) may indicate a degree of sparseness in Table 19, although it is smaller than
that of the letters.
5.2.3.3 Dataset 3 (ABC123): HAVO 2007–1
Table 22. Letters Mistakes Matrix for Dataset 3, by Algorithm E
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A
B 1
C
D 1
E 1 1 1
F 1 1
G 1
H 1 2
I
J 1 1
K
L
M
N 1 1
O 1
P
Q
R
S 1
T 1
U
V 2
W 1
X
Y 1
Z
                        Count     Sum
All Data                   20      22
Similar Data                4       5
Similar to All Ratio    20.0%   22.7%

                      Similar   Dissimilar
Observed                    5           17
Expected                    1           21
Chi-Square             16.393
P-Value                 0.000
It is obvious that the null hypothesis is rejected and similar letters have a significantly higher
frequency (4.9 times more) of being mistaken.
Comparison of the algorithms
Table 23. Ratio of Similar Character Misreadings to Total Count of Misreadings
Algorithm           A       B       C       D       E
Count (All)        24      24      24      26      22
Count (Sim)         6       6       6       7       5
Sim/All Ratio   25.0%   25.0%   25.0%   26.9%   22.7%
Table 23 indicates that the counts of mistakes found by the five algorithms are
close, suggesting that the false matches are probably negligible. However, the difference
between the results of algorithms D and E suggests that considering the class of vehicles during the
matching process is beneficial.
Table 24. Deviation of the Mistake Matrices by Algorithms A to D based on E
Algorithm             A         B         C         D
% Difference      27.3%     27.3%     27.3%     18.2%
RMSD              52.2%     52.2%     52.2%     42.6%
CV(RMSD)        1604.7%   1604.7%   1604.7%   1310.2%
NRMSD             26.1%     26.1%     26.1%     21.3%
The difference percentages in Table 24 show that the Mistakes Matrices resulting from the
algorithms are close, indicating a small number of false matches. However, this
dataset is comparatively small and Table 22 is extremely sparse: almost all of the cells are filled
with “ones”. The data are probably insufficient for a comparison between the Mistakes
Matrices created by the algorithms.
Table 25. Numbers Mistakes Matrix for Dataset 3, by Algorithm E
The null hypothesis is not rejected and similar digits do not have a significantly higher
frequency of being mistaken.
1 2 3 4 5 6 7 8 9 0
1 1
2 1 1 2 1
3 1 2 1 1 1
4 1 1 3 2 1
5 2 1
6 1 1 1
7 2 1
8 2 1
9 1 1
0
                        Count     Sum
All Data                   26      34
Similar Data                5       6
Similar to All Ratio    19.2%   17.6%

                      Similar   Dissimilar
Observed                    6           28
Expected                    8           26
Chi-Square              0.851
P-Value                 0.356
Comparison of the algorithms
Table 26. Ratio of Similar Character Misreadings to Total Count of Misreadings
Algorithm           A       B       C       D       E
Count (All)        29      30      40      35      34
Count (Sim)         5       5       8       7       6
Sim/All Ratio   17.2%   16.7%   20.0%   20.0%   17.6%
Table 27. Deviation of the Mistake Matrices by Algorithms A to D based on E
Algorithm            A        B        C        D
% Difference     26.5%    29.4%    29.4%     2.9%
RMSD             56.9%    59.4%    64.2%    17.1%
CV(RMSD)        167.3%   174.7%   188.7%    50.4%
NRMSD            19.0%    19.8%    21.4%     5.7%
Table 26 indicates that the numbers of mistakes found by the five algorithms are
not greatly different, suggesting that the false matches are probably not many. The slight
difference between the results from algorithms D and E in both Table 26 and Table 27 suggests
that considering the class of vehicles during the matching process does not make a great
contribution to correction for this dataset.
5.2.3.4 Dataset 4 (ABC123): HAVO 2007–2
Table 28. Letters Mistakes Matrix for Dataset 4, by Algorithm E
It is obvious that the null hypothesis is rejected and similar letters have a significantly higher
frequency (7.8 times more) of being mistaken.
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A
B 2 1
C 2
D 1
E 2
F 1 2
G
H 1
I
J 1
K
L
M 1 1
N 1 1
O 1
P 1
Q
R
S
T 1 2
U 1
V 1 2
W 1
X 1
Y
Z
                        Count     Sum
All Data                   22      28
Similar Data                8      10
Similar to All Ratio    36.4%   35.7%

                      Similar   Dissimilar
Observed                   10           18
Expected                    1           27
Chi-Square             61.512
P-Value                 0.000
Comparison of the algorithms
Table 29. Ratio of Similar Character Misreadings to Total Count of Misreadings
Algorithm           A       B       C       D       E
Count (All)        28      29      31      30      28
Count (Sim)         9       9      10      10      10
Sim/All Ratio   32.1%   31.0%   32.3%   33.3%   35.7%
Table 30. Deviation of the Mistake Matrices by Algorithms A to D based on E
Algorithm            A         B        C        D
% Difference     14.3%     17.9%    10.7%     7.1%
RMSD             37.8%     42.3%    32.7%    26.7%
CV(RMSD)        912.5%   1020.2%   790.3%   645.2%
NRMSD            18.9%     21.1%    16.4%    13.4%
Table 29 indicates that the numbers of mistakes found by the five algorithms are
close, suggesting that the false matches are few. The small difference between the results from
algorithms D and E in both Table 29 and Table 30 suggests that considering the class of vehicles
during the matching process does not make a great contribution to correction for this dataset.
The small RMSDs and large CV(RMSD)s indicate insufficiency of the data for a comparison
among the algorithms.
Table 31. Numbers Mistakes Matrix for Dataset 4, by Algorithm E
The null hypothesis is not rejected and similar digits do not have a significantly higher
frequency of being mistaken.
1 2 3 4 5 6 7 8 9 0
1 1 1
2 1 1
3
4 1 1 1
5 1 1 2 1
6 1 2 1 2
7 2 1 1 1
8 1
9 1 1 1
0
                        Count     Sum
All Data                   23      27
Similar Data                6       9
Similar to All Ratio    26.1%   33.3%

                      Similar   Dissimilar
Observed                    9           18
Expected                    7           20
Chi-Square              1.155
P-Value                 0.282
Comparison of the algorithms
Table 32. Ratio of Similar Character Misreadings to Total Count of Misreadings
Algorithm           A       B       C       D       E
Count (All)        26      30      41      31      27
Count (Sim)         8       8      15      10       9
Sim/All Ratio   30.8%   26.7%   36.6%   32.3%   33.3%
Table 33. Deviation of the Mistake Matrices by Algorithms A to D based on E
Algorithm            A        B        C        D
% Difference     48.1%    63.0%    66.7%    14.8%
RMSD             74.5%    83.9%   108.9%    47.1%
CV(RMSD)        276.1%   310.7%   403.2%   174.6%
NRMSD            37.3%    41.9%    54.4%    23.6%
Table 32 indicates that the counts of mistakes found by the five algorithms are not
much different and that the false matches are probably not many. The results from Algorithms D
and E are fairly close, both in Table 32 and Table 33. This suggests that considering the class of
vehicles during the matching process does not make a great contribution to correction for this
dataset.
5.3 Aggregate Analyses
5.3.1 Processing Time of Algorithms
The processing time is measured in minutes and seconds for two different cases: 1)
if the search is made only among the similar characters, 2) if the search is made among all characters.
All of the algorithms A to E search among all characters, whether similar or dissimilar, and they
are in the second category. In fact, similar and dissimilar characters in the Mistakes Matrices
are simply separated based on manual color coding (Table 8 and Table 9) and not by the
algorithms. Therefore, for this comparison an auxiliary algorithm was created. This algorithm only
searches among the defined similar characters and neglects dissimilar ones. The resultant
Mistakes Matrix by this algorithm would only have its yellow cells filled. However, its Mistakes
Matrix is not used or shown in this text, because this algorithm was created only to measure its
processing speed; this speed (represented by processing time) is compared to that of each of
the main algorithms (A to E). This algorithm is named “Similar Algorithm” in Table 34 and Table
35.
Table 34. Processing Time of Different Correction Algorithms
Processing Time (Min:Sec)
Dataset       Entry   Unmatched   Original   Checked     Similar   A        B       C         D        E
              Count               Matching   Character             No‐Rep   Rep     Rep‐All   Median   Class
ITE           2182    435         89:45      Letter      52 sec    16:25    20:19   21:12     25:23    41:35
                                             Digit       32 sec    2:26     2:42    2:42      3:23     5:32
HAVO 2009     797     374         27:16      Letter      3 sec     1:46     1:54    1:52      2:43     3:21
                                             Digit       27 sec    2:01     2:11    2:10      3:11     3:34
HAVO 2007‐1   771     468         8:42       Letter      12 sec    4:55     5:00    5:00      7:30     10:30
                                             Digit       25 sec    1:52     1:55    1:55      2:43     4:01
HAVO 2007‐2   863     503         14:11      Letter      16 sec    6:28     6:40    6:35      10:47    12:46
                                             Digit       34 sec    2:29     2:33    2:33      3:52     5:01
Table 35. Ratio of Processing Time for ‘Similar Algorithm’ to other Full Algorithms
Dataset       Checked      A        B       C         D        E
              Characters   No‐Rep   Rep     Rep‐All   Median   Class
ITE           Letter       5.3%     4.3%    4.1%      3.4%     2.1%
              Digit        21.9%    19.8%   19.8%     15.8%    9.6%
HAVO 2009     Letter       2.8%     2.6%    2.7%      1.8%     1.5%
              Digit        22.3%    20.6%   20.8%     14.1%    12.6%
HAVO 2007‐1   Letter       4.1%     4.0%    4.0%      2.7%     1.9%
              Digit        22.3%    21.7%   21.7%     15.3%    10.4%
HAVO 2007‐2   Letter       4.1%     4.0%    4.1%      2.5%     2.1%
              Digit        22.8%    22.2%   22.2%     14.7%    11.3%
Table 34 shows that the processing time for the ‘Similar Algorithm’ is a small fraction of that
of the full algorithms. It can do the corrections at least four times and up to 60 times faster,
depending on collected data format and whether letters or digits are being corrected. Although
this speed is welcome when doing corrections for very large datasets, it is more of a trade‐off
because not all of the mistakes are due to similarity and therefore this algorithm will miss some
of the other mistakes and lose opportunities for more matches.
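The ratios in Table 35 can be recomputed from the times in Table 34. The sketch below assumes the time strings are written either as "N sec" or as "min:sec"; the helper names `to_seconds` and `time_ratio` are hypothetical.

```python
def to_seconds(text: str) -> int:
    """Convert a processing time such as '52 sec' or '16:25' to seconds."""
    text = text.strip()
    if text.endswith("sec"):
        return int(text[:-3])
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def time_ratio(similar: str, full: str) -> float:
    """Processing time of the 'Similar Algorithm' relative to a full algorithm."""
    return to_seconds(similar) / to_seconds(full)


# ITE dataset, letters: 'Similar Algorithm' versus Algorithm A (No-Rep).
ratio = time_ratio("52 sec", "16:25")
print(f"ratio = {ratio:.1%}, speed-up = {1 / ratio:.0f}x")
```

The reciprocal of each ratio is the speed-up factor of the ‘Similar Algorithm’ over the corresponding full algorithm.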
5.3.2 Influence on Percentage of Matched Vehicles
The number of unmatched license plates that were matched after correction by one of the
algorithms is called Algorithm Contribution in this text. Percentage of Algorithm Contribution is
the ratio of Algorithm Contribution to number of unmatched license plates.
Matched/unmatched license plates are calculated based on maximum number of possible
matches:
Maximum that can possibly be matched = Min (Entered, Exited)
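A small sketch of how the contribution figures combine. The function names are hypothetical, and subtracting the overlap (plates that both the letter correction and the digit correction managed to match) is an assumption inferred from the totals reported for each dataset.

```python
def total_contribution(letter: int, digit: int, overlap: int) -> int:
    """Plates matched by letter or digit correction, counting overlaps once."""
    return letter + digit - overlap


def contribution_percentage(contribution: int, initial_unmatched: int) -> float:
    """Algorithm Contribution as a share of the initially unmatched plates."""
    return contribution / initial_unmatched


# ITE dataset, Algorithm A: 165 letter matches, 77 digit matches, 39 overlapping.
total = total_contribution(165, 77, 39)
print(total, f"{contribution_percentage(total, 435):.1%}")
```

For the ITE dataset this gives 203 plates, or 46.7% of the 435 initially unmatched plates.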
Table 36. Contribution of each Algorithm to Percentage of Matched Vehicles – Letters and Digits Separately
Algorithm Contribution
Dataset       Initial     Entered    Exited     Corrected    Only      A        B     C         D         E
              Unmatched   Vehicles   Vehicles   Characters   Similar   No‐Rep   Rep   Rep‐All   Median    Class
              LPs Count                                                                         Checked   Checked
ITE           435         2182       2206       Letter       73        165      138   199       199       199
                                                Digit        35        77       76    93        93        93
                                                Overlap      ‐         39       40    52        48        48
                                                Total        ‐         203      174   240       244       244
HAVO 2009     348         797        751        Letter       3         58       56    68        71        56
                                                Digit        17        54       50    60        64        53
                                                Overlap      ‐         17       10    21        19        11
                                                Total        ‐         95       96    107       116       98
HAVO 2007‐1   324         771        523        Letter       6         24       24    24        26        22
                                                Digit        8         33       33    43        45        37
                                                Overlap      ‐         5        5     1         1         1
                                                Total        ‐         52       52    66        70        58
HAVO 2007‐2   388         863        654        Letter       10        28       28    30        30        28
                                                Digit        13        30       29    36        35        31
                                                Overlap      ‐         5        4     2         3         3
                                                Total        ‐         53       53    64        62        56
Table 37. Ratio of Contribution to Number of Initial Unmatched License Plates
Percentage of Algorithm Contribution
Dataset       Corrected    Only      A        B       C         D         E
              Characters   Similar   No‐Rep   Rep     Rep‐All   Median    Class
                                                                Checked   Checked
ITE           Letter       16.8%     37.9%    31.7%   45.7%     45.7%     45.7%
              Digit        8.0%      17.7%    17.5%   21.4%     21.4%     21.4%
              Total        ‐         46.7%    40.0%   55.2%     56.1%     56.1%
HAVO 2009     Letter       0.9%      16.7%    16.1%   19.5%     20.4%     16.1%
              Digit        4.9%      15.5%    14.4%   17.2%     18.4%     15.2%
              Total        ‐         27.3%    27.6%   30.7%     33.3%     28.2%
HAVO 2007‐1   Letter       1.9%      7.4%     7.4%    7.4%      8.0%      6.8%
              Digit        2.5%      10.2%    10.2%   13.3%     13.9%     11.4%
              Total        ‐         16.0%    16.0%   20.4%     21.6%     17.9%
HAVO 2007‐2   Letter       2.6%      7.2%     7.2%    7.7%      7.7%      7.2%
              Digit        3.4%      7.7%     7.5%    9.3%      9.0%      8.0%
              Total        ‐         13.7%    13.7%   16.5%     16.0%     14.4%
The algorithms can reduce the number of unmatched license plates from almost 50% to a
negligible amount depending on the format of license plate data collection.
5.3.3 Influence on Statistical Indices
Although the number of matched license plates may increase by more detailed algorithms, the
statistical indices derived from the revised set of data may not change noticeably. The index of
“duration of stay” was selected for comparisons based on Average, Standard Deviation, and
Median.
Table 38. Average for the Duration of Stay
Average Duration of Stay (hr:min)
Dataset       Initial Value         Checked      Only      A        B      C         D        E
              (before correction)   Characters   Similar   No‐Rep   Rep    Rep‐All   Median   Class
ITE           2:04                  Letter       2:04      2:08     2:09   2:09      2:06     2:06
                                    Digit        2:05      2:07     2:07   2:07      2:07     2:07
HAVO 2009     2:09                  Letter       2:09      2:10     2:09   2:08      2:08     2:08
                                    Digit        2:09      2:10     2:09   2:10      2:08     2:07
HAVO 2007‐1   1:41                  Letter       1:41      1:38     1:38   1:38      1:38     1:37
                                    Digit        1:40      1:40     1:40   1:39      1:40     1:40
HAVO 2007‐2   1:49                  Letter       1:49      1:49     1:50   1:49      1:49     1:49
                                    Digit        1:47      1:46     1:46   1:46      1:46     1:46
The range of alteration of average values for the five algorithms is fairly low. It varies from
2.3% (in the second dataset) to about 4% (in the third dataset).
Table 39. Standard Deviation for the Duration of Stay
Standard Deviation of Duration of Stay (hr:min)
Dataset       Initial Value         Checked      Only      A        B      C         D        E
              (before correction)   Characters   Similar   No‐Rep   Rep    Rep‐All   Median   Class
ITE           1:26                  Letter       1:27      1:31     1:31   1:30      1:27     1:27
                                    Digit        1:29      1:30     1:30   1:30      1:30     1:30
HAVO 2009     1:30                  Letter       1:29      1:31     1:30   1:30      1:30     1:29
                                    Digit        1:30      1:31     1:31   1:31      1:30     1:30
HAVO 2007‐1   1:11                  Letter       1:11      1:11     1:11   1:11      1:11     1:11
                                    Digit        1:11      1:11     1:10   1:10      1:10     1:10
HAVO 2007‐2   1:02                  Letter       1:03      1:03     1:03   1:03      1:03     1:03
                                    Digit        1:03      1:02     1:02   1:02      1:02     1:02
The range of alteration of standard deviation values for the five algorithms is fairly low. It
varies from 1.4% (in the third dataset) to about 4.5% (in the first dataset).
Table 40. Median for the Duration of Stay
Median Duration of Stay (hr:min)
Dataset       Initial Value         Checked      Only      A        B      C         D        E
              (before correction)   Characters   Similar   No‐Rep   Rep    Rep‐All   Median   Class
ITE           1:57                  Letter       1:57      2:00     2:01   2:00      1:59     1:59
                                    Digit        1:57      1:58     1:58   1:58      1:58     1:58
HAVO 2009     1:55                  Letter       1:56      1:58     1:55   1:55      1:55     1:55
                                    Digit        1:55      1:57     1:55   1:57      1:55     1:52
HAVO 2007‐1   1:38                  Letter       1:38      1:35     1:35   1:35      1:35     1:34
                                    Digit        1:38      1:38     1:38   1:37      1:37     1:36
HAVO 2007‐2   1:48                  Letter       1:48      1:47     1:48   1:48      1:47     1:47
                                    Digit        1:46      1:43     1:43   1:43      1:43     1:43
The range of alteration of median values for the five algorithms is fairly low. It varies from
3.4% (in the first dataset) to about 5.2% (in the second dataset).
Algorithms D and E use the match that yields the duration of stay closest to the median if more
than one match is found for a license plate, and therefore inherently reduce the range of
change, particularly for the median and average. The data support this. For algorithm E the
corrections of average, standard deviation and median are 1.8%, 1.2% and 2.2%, respectively
(average of all datasets).
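The tie‐breaking rule used by Algorithms D and E can be sketched as follows. This is an assumption‐laden illustration: the function name is hypothetical, durations are taken in minutes, and the reference median is assumed to be computed from the durations of the plates that are already matched.

```python
from statistics import median


def pick_closest_to_median(candidate_durations, reference_durations):
    """Among several candidate matches for one plate, keep the one whose
    duration of stay is closest to the median of the reference durations."""
    ref_median = median(reference_durations)
    return min(candidate_durations, key=lambda d: abs(d - ref_median))


# Three candidate matches (minutes) for one plate; reference median is 120.
print(pick_closest_to_median([30, 110, 400], [100, 120, 140]))
```

Because every ambiguous plate is resolved towards the median, corrections made this way can only pull the average and median of the corrected data towards their initial values, which is consistent with the small changes reported for these two algorithms.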
The largest corrections by Algorithm E are 4.0%, 1.6% and 4.1% for average, standard
deviation and median, respectively which are not negligible for such large datasets. Even if
these values are not high enough to necessitate the use of the algorithm, they are more reliable
since the population size has increased after a considerable portion (11% to 56% for the four
datasets) of the unmatched license plates were matched by this algorithm and the statistics are
based on larger population.
Whether the amount of changes is worth the correction depends on the context of usage.
Such statistics may be accepted and used without any correction, and with 0% to 5%
uncertainty. However, for origin‐destination studies, vehicle tracking and travel pattern
recognition, higher matched percentages are usually needed. For such purposes, one missed
license plate may make the tracking profile of a vehicle almost unusable. At toll collection
stations, and more so, for law enforcement purposes where Automatic License Plate
Recognition (ALPR) is used, near perfect accuracy is required and theoretically all license plates
must be matched.
5.4 Evaluation of Impact of Similarity after One Iteration and Redefinition of Similar Characters
When all of the datasets were analyzed, their Mistakes Matrices were summed and based
on the total summation the high frequency Mistakes among both similar characters and
dissimilar characters were found. Also some updates were performed on the Mistakes matrices
since seven new similar cases were found. Table 41 and Table 42 show the updated blank
Mistakes Matrices for letters and digits. Some of the similar mistake cases had high frequencies
as assumed (remained yellow in the tables), some did not (changed to red from yellow in the
original table). On the other hand some of the dissimilar mistake cases had high frequencies
compared to other cells (changed to blue from white). Also seven similar mistake cases were
found to be missed (changed to green from white) in the first iteration of the analyses.
The missed similar mistake cases that were found after the first iteration were added to the
list of similar cases. Low frequency similar cases that existed in the list were retained. After
adding the new cases (green cells), all of the analyses were repeated for all datasets with
algorithm E. The results in the following pages show that χ2 values increased. The only dataset
that had insignificant effect for similarity in the first iteration (Dataset 2 – HAVO 2009)
produced a P‐value smaller than 5% and therefore the general result after this analysis is that
all of the datasets reject the null hypothesis and indicate a significant effect of similarity
between letters for mistaken recordation of license plates. For digits, the null hypothesis could
not be rejected in general; only one dataset produced a significant result.
Table 41. Updated Blank Mistakes Matrices for Letters
[Blank 26x26 matrix with rows and columns A to Z; the cell colours are not reproduced here.]
Legend:
Yellow: Similar letters with high frequencies in 1st iteration
Red: Similar letters with low frequencies in 1st iteration
Green: Similar letters which were missed in 1st iteration
Blue: Dissimilar letters with high frequencies in 1st iteration
White: Other possibilities
Table 42. Updated Blank Mistakes Matrices for Digits
[Blank 10x10 matrix with rows and columns 1 to 9 and 0; the cell colours are not reproduced here.]
Legend:
Yellow: Similar digits with high frequencies in 1st iteration
Red: Similar digits with low frequencies in 1st iteration
Green: Similar digits which were missed in 1st iteration
Blue: Dissimilar digits with high frequencies in 1st iteration
White: Other mistakes
Dataset 1 (ABC1):
For this dataset the data were collected in two shifts, and only in one of them (the second one) was a data collector with weaker eyesight (wearing glasses) involved. Therefore, to investigate a probable dependency of the frequencies of human mistakes (among similar letters) on vision, the two shifts were compared after the second correction iteration. The results showed the impact of good vision to be small, since it reduced the percentage of human mistakes by only 1.8%. On the other hand, in the second shift the percentage of human mistakes increased by only 1.2%.
Table 43. Updated Letters Mistakes Matrix for Dataset 1, by Algorithm E (Second Iteration)
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A 1 1 1 1 1
B 1 1 1 3
C 1 1 1 2 1 2
D 2 3 1 1 12 2
E 1 5 1 1 3
F 1 3 1 1 1 1
G 1 2 1 1 1
H 4
I
J 2 2 2 2 1 1 5 1
K
L
M 1
N 2 1 2 3 1 1 1
O 2 1
P 1 8 2 2 7 1 1
Q
R 2 1 1 1 1 1 8 1 1 2
S 3 1 1 1
T 2 2 1 1 1 1
U 1 1 1 2 1
V 2 1 1 2 1 2 1
W 1 1 1 1
X 1 1 1 2
Y 3 1 4 1 5
Z 1 2
All Data: Count 110, Sum 197
Similar Data: Count 33, Sum 95
Similar to All Ratio: Count 30.0%, Sum 48.2%
Observed: Similar 95, Dissimilar 102
Expected: Similar 13, Dissimilar 184
Chi-Square: 536.418
P-Value: 0.000
Table 44. Updated Numbers Mistakes Matrix for Dataset 1, by Algorithm E (Second Iteration)
1 2 3 4 5 6 7 8 9 0
1 3 3 1
2 1 4 2 1
3 2 1 2 1 1 1 1
4 2 1 1 1
5 1 1
6 2 1 1 2 4 2
7 3 1 2 2 1 2
8 1 2 1 2 2 2 1
9 2 1 2 1 1
0
All Data: Count 44, Sum 72
Similar Data: Count 15, Sum 29
Similar to All Ratio: Count 34.1%, Sum 40.3%
Observed: Similar 29, Dissimilar 43
Expected: Similar 21, Dissimilar 51
Chi-Square: 4.546
P-Value: 0.033
Dataset 2 (C123):
Table 45. Updated Letters Mistakes Matrix for Dataset 2, by Algorithm E (Second Iteration)
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A 1 1 1 1
B 1 1 1
C 1 1 1
D 1 1 1
E 1 1
F 1 1 1
G 1
H 1
I 1
J 1 1
K 1 1
L
M 1
N 1 1
O 1 1
P 1
Q
R 1
S 1
T 1 1 1 1
U 1 1 1
V 1
W 1
X 3 1
Y 2 1
Z 1 1 1 1 1 1 1
All Data: Count 53, Sum 56
Similar Data: Count 7, Sum 9
Similar to All Ratio: Count 13.2%, Sum 16.1%
Observed: Similar 9, Dissimilar 47
Expected: Similar 4, Dissimilar 52
Chi-Square: 7.678
P-Value: 0.006
Table 46. Updated Numbers Mistakes Matrix for Dataset 2, by Algorithm E (Second Iteration)
1 2 3 4 5 6 7 8 9 0
1 1 1 1
2 1 1 1 1
3 3
4 1 2 1 1 2 1
5 1 2 1 1 2
6 2 1 1 3
7 1 1 1 2
8 1 1 1
9 1 1 1
0
All Data: Count 33, Sum 43
Similar Data: Count 10, Sum 15
Similar to All Ratio: Count 30.3%, Sum 34.9%
Observed: Similar 15, Dissimilar 28
Expected: Similar 12, Dissimilar 31
Chi-Square: 0.752
P-Value: 0.386
Dataset 3 (ABC123):
Table 47. Updated Letters Mistakes Matrix for Dataset 3, by Algorithm E (Second Iteration)
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A
B 1
C
D 1
E 1 1 1
F 1 1
G 1
H 1 2
I
J 1 1
K
L
M
N 1 1
O 1
P
Q
R
S 1
T 1
U
V 2
W 1
X
Y 1
Z
All Data: Count 20, Sum 22
Similar Data: Count 5, Sum 6
Similar to All Ratio: Count 25.0%, Sum 27.3%
Observed: Similar 6, Dissimilar 16
Expected: Similar 1, Dissimilar 21
Chi-Square: 14.655
P-Value: 0.000
Table 48. Updated Numbers Mistakes Matrix for Dataset 3, by Algorithm E (Second Iteration)
1 2 3 4 5 6 7 8 9 0
1 1
2 1 1 2 1
3 1 2 1 1 1
4 1 1 3 2 1
5 2 1
6 1 1 1
7 2 1
8 2 1
9 1 1
0
All Data: Count 26, Sum 34
Similar Data: Count 7, Sum 9
Similar to All Ratio: Count 26.9%, Sum 26.5%
Observed: Similar 9, Dissimilar 25
Expected: Similar 10, Dissimilar 24
Chi-Square: 0.097
P-Value: 0.756
Dataset 4 (ABC123):
Table 49. Updated Letters Mistakes Matrix for Dataset 4, by Algorithm E (Second Iteration)
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A
B 2 1
C 2
D 1
E 2
F 1 2
G
H 1
I
J 1
K
L
M 1 1
N 1 1
O 1
P 1
Q
R
S
T 1 2
U 1
V 1 2
W 1
X 1
Y
Z
All Data: Count 22, Sum 28
Similar Data: Count 11, Sum 14
Similar to All Ratio: Count 50.0%, Sum 50.0%
Observed: Similar 14, Dissimilar 14
Expected: Similar 2, Dissimilar 26
Chi-Square: 82.917
P-Value: 0.000
Table 50. Updated Numbers Mistakes Matrix for Dataset 4, by Algorithm E (Second Iteration)
1 2 3 4 5 6 7 8 9 0
1 1 1
2 1 1
3
4 1 1 1
5 1 1 2 1
6 1 2 1 2
7 2 1 1 1
8 1
9 1 1 1
0
All Data: Count 23, Sum 27
Similar Data: Count 8, Sum 11
Similar to All Ratio: Count 34.8%, Sum 40.7%
Observed: Similar 11, Dissimilar 16
Expected: Similar 8, Dissimilar 19
Chi-Square: 1.846
P-Value: 0.174
CHAPTER 6
CONCLUSION
The purpose of this research was to investigate the accuracy of license plate matching methods for vehicle tracking and travel time data collection, to provide correction algorithms that improve the results, and to investigate the role of human mistakes caused by similarity between recorded characters. Four datasets were used in three different recordation styles: ABC1, C123 and ABC123. Five algorithms were developed to process the unmatched license plates in the datasets by substituting mistakenly recorded letters and digits with the correct ones. The most comprehensive and accurate algorithm is the constrained Algorithm E. It checks that the vehicle classification agrees between the pair being matched and, if more than one match is found, it uses the one that yields the duration of stay closest to the dataset median.
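The selection rule described for Algorithm E can be illustrated with a minimal Python sketch. It is not the thesis implementation (which is an Excel VBA macro); the record layout, the names `ExitRecord` and `choose_match`, and the sample values are hypothetical and serve only to show the two constraints: same vehicle class, then the duration of stay closest to the median.

```python
from dataclasses import dataclass


@dataclass
class ExitRecord:
    plate: str
    vehicle_class: str
    time: float  # minutes since the start of the survey (assumed unit)


def choose_match(entry_class, entry_time, candidates, median_stay):
    """Pick one exit record for a corrected entry plate.

    Keeps only candidates of the same vehicle class that exit at or after the
    entry time, then returns the one whose duration of stay is closest to the
    dataset median. Returns None if no candidate qualifies.
    """
    eligible = [c for c in candidates
                if c.vehicle_class == entry_class and c.time >= entry_time]
    if not eligible:
        return None
    return min(eligible, key=lambda c: abs((c.time - entry_time) - median_stay))


# Hypothetical candidates produced by substituting one character of an entry plate
candidates = [
    ExitRecord("DBC1", "car", 130.0),
    ExitRecord("PBC1", "bus", 95.0),
    ExitRecord("RBC1", "car", 100.0),
]
best = choose_match("car", 10.0, candidates, median_stay=85.0)
print(best.plate)
```

In this example the bus record is excluded by the class check, and of the two remaining candidates the one with a 90-minute stay is preferred over the one with a 120-minute stay because it lies closer to the assumed 85-minute median.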
This research showed that after the initial matching phase is done, a considerable increase in the percentage of matched license plates can be attained. For the most comprehensive and accurate algorithm introduced here, this gain ranges from 11% to 56% depending on the style of recordation of the license plates (e.g. ABC1 or C123) and on whether letters or digits are being corrected.
To a smaller degree, the algorithms can improve the summary statistics of the license plate recordation datasets, such as the average, standard deviation and median of travel time and/or duration of stay. Based on the four case studies in this research, the travel time values can change between 0% and 5% after the processing of unmatched license plates. The highest corrections by Algorithm E are 4.0%, 1.6% and 4.1% for average, standard deviation and median, respectively.
Using the classification of vehicles did improve the matching process. The Mistakes Matrices by Algorithms D and E, which are identical except that D is not constrained by vehicle class, were 0% to 25% different. The classification scheme had four or five vehicle classes, depending on the dataset. The better the spread of the classes among the collected vehicles, the more helpful the classification data can be for increasing the accuracy of re‐matching. Since this spread varied among the datasets used in this study, knowledge of the vehicle classes was more beneficial for some datasets than for others. If the FHWA scheme with 13 classes is used, better results may be obtained since it breaks up the personal vehicle class. Classification is more useful in an application like HAVO, where buses and minibuses are a substantial share of traffic. The same may not be true on, e.g., the H‐1 freeway, where about 98% of the traffic consists of light duty vehicles.
This study also shows that a significant portion of the letters recorded mistakenly are visually similar letters, which by itself demonstrates the human factor in the accuracy of the method. Digits, however, are not significantly more likely to be mistaken due to visual similarity.
The five most frequent mistakes based on the aggregate data are the following.
For the letters (average repetition for all mistakes = 5.2):
D‐P: 100
D‐R: 72
F‐E: 59
X‐Y: 50
U‐V: 48
For the digits (average repetition for all mistakes = 21.0):
6‐8: 55
4‐6: 52
1‐7: 48
2‐8: 42
2‐3: 38
For example, letters D and P were recorded mistakenly instead of each other 100 times, while the average frequency of mistakes for all possible cases (650 cells in the Mistakes Matrices) was only 5.2.
All top five cases for the letters are similar letters; so are three of the top five cases for the digits.
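A ranking of this kind could be produced along the following lines. This is a minimal Python sketch, not the procedure used in the thesis: it assumes that the count for a pair is the sum of both directions (D recorded as P plus P recorded as D), which the text does not state explicitly, and the sample counts are invented for illustration only. The cell count follows the text: 26 × 25 = 650 off-diagonal cells for letters.

```python
import string
from collections import Counter


def top_confusion_pairs(mistakes, alphabet, k=5):
    """Rank unordered character pairs by how often they were confused.

    `mistakes` maps (recorded, correct) tuples to counts. Both directions of a
    pair are added together (an assumption of this sketch). Also returns the
    average count per off-diagonal cell of the Mistakes Matrix.
    """
    pair_counts = Counter()
    for (recorded, correct), count in mistakes.items():
        pair_counts[tuple(sorted((recorded, correct)))] += count
    cells = len(alphabet) * (len(alphabet) - 1)  # 650 for the 26 letters
    average_per_cell = sum(mistakes.values()) / cells
    return pair_counts.most_common(k), average_per_cell


# Hypothetical counts for illustration only (not taken from the thesis data)
sample = {("D", "P"): 12, ("P", "D"): 7, ("R", "D"): 8, ("X", "Y"): 4, ("A", "Z"): 1}
pairs, average = top_confusion_pairs(sample, string.ascii_uppercase, k=2)
print(pairs, average)
```

Comparing each pair count with the per-cell average is what makes a value such as 100 stand out against an average of 5.2.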
The style of recordation also proved to be significant: the more letters that must be recorded, the more human errors result. For this reason it is suggested that, if the recorders are not close enough to the road during license plate recordation, the last four characters of the license plates (the C123 style) be recorded, since this style showed lower recordation errors.
Finally, the “similar algorithm”, which searches only among similar characters, is recommended, especially for letters. Most of the time this algorithm found around 50% of the matches, with 100% as its maximum, while its processing time was only 1.5% to 11% of that of the most comprehensive algorithm (E). It is highly recommended for large databases.
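The idea behind the “similar algorithm” can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis code: the similar-pair list contains only the five most frequent letter pairs reported above (the full list of similar cases is not reproduced here), and the function names and sample plates are hypothetical.

```python
# Illustrative subset: the five most frequent similar letter pairs reported above
SIMILAR_PAIRS = [("D", "P"), ("D", "R"), ("F", "E"), ("X", "Y"), ("U", "V")]


def build_similar_map(pairs):
    """Map each character to the set of characters it is easily confused with."""
    similar = {}
    for a, b in pairs:
        similar.setdefault(a, set()).add(b)
        similar.setdefault(b, set()).add(a)
    return similar


def similar_candidates(plate, similar):
    """Yield plates that differ from `plate` in exactly one position, where the
    replacement is a character listed as similar to the original one."""
    for pos, ch in enumerate(plate):
        for alt in sorted(similar.get(ch, ())):
            yield plate[:pos] + alt + plate[pos + 1:]


def rematch(unmatched_plate, exit_plates, similar):
    """Return the first similar-character variant found among the exit plates."""
    for candidate in similar_candidates(unmatched_plate, similar):
        if candidate in exit_plates:
            return candidate
    return None


similar = build_similar_map(SIMILAR_PAIRS)
candidates = list(similar_candidates("DXA1", similar))
found = rematch("DXA1", {"PXA1", "ABC1"}, similar)
print(candidates, found)
```

For the plate in this example only three candidates are generated, whereas trying every other letter in each of the three letter positions would generate 75. Fewer candidates per unmatched plate is consistent with the much shorter processing time reported for the similar-only search.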
84
REFERENCES
1. Pline, J.L., Traffic Engineering Handbook. 4th ed1992: Prentice‐Hall Inc.
2. Hauer, E., Correction of license plate surveys for spurious matches. Transportation
Research Part A: General, 1979. 13(2): p. 71–78.
3. Oliveira‐Neto, F.M., L.D. Han, and M.K. Jeong, Tracking Large Trucks in Real Time with
License Plate Recognition and Text‐Mining Techniques. Transportation Research Record:
Journal of the Transportation Research Board, 2009. 2121: p. 121–127.
4. Turner, S.M., et al., Travel Time Data Collection Handbook, 1998, Office of Highway
Information Management, Federal Highway Administration.
5. Gómez‐Torres, N.R. and D.M. Valdés‐Díaz, Detection Technologies for Dynamic Origin‐
Destination Matrices and Heavy Vehicles’ Road Selection Studies, in Seventh LACCEI
Latin American and Caribbean Conference for Engineering and Technology (LACCEI’2009)
“Energy and Technology for the Americas: Education, Innovation, Technology and
Practice”2009: San Cristóbal, Venezuela.
6. Han, L.D., Myong‐KeeJeong, and F.M. Oliveira‐Neto, License Plate Recognition, 2009,
National Transportation Research Center Incorporated (NTRCI) ‐ University
Transportation Center.
7. Makowski, G.G. and K.C. Sinha, A statistical procedure to analyze partial license plate
numbers. Transportation Research Part A: General, 1976. 10(2): p. 131‐132.
85
8. Neto, F.M.O., Matching Vehicle License Plate Numbers Using License Plate Recognition
and Text Mining Techniques, 2010, University of Tennessee, Knoxville: Tennessee
Research and Creative Exchange.
9. Clark, S.D., S. Grant‐Muller, and H. Chen, Cleaning of Matched License Plate Data.
Transportation Research Record: Journal of the Transportation Research Board,
2002(1804): p. 1‐7.
10. Wagner, R.A. and M.J. Fischer, The String‐to‐String Correction Problem. Journal of the
ACM (JACM), 1974. 21(1).
11. Miller, G., The Magical Number Seven, Plus or Minus Two. Psychological Review, 1956.
63: p. 81‐97.
12. Jan Maarten Schraagen, a.K.v.D., Designing a licence plate for memorability.
Ergonomics, 2005. 48(7): p. 796‐806.
13. C. Bisdikian, An overview of the Bluetooth wireless technology, IEEE Communications
Magazine 2001. 39: p. 86 ‐ 94.
14. J. Hallberg, M. Nilsson, K. Synnes, Positioning with Bluetooth, ICT 2003: 10th
International Conference on Telecommunications, Feb 2003, Papeete, French 2, 2003. p.
954 ‐ 958.
15. M. Lu, W. Chen, X Shen, H. Lam, J. Liu, Positioning and tracking construction vehicles in
highly dense urban areas and building construction site. Automation in Construction,
2007. 16(5): p. 647–656
86
16. A. M. Steane, Error Correcting Codes in Quantum Theory. Physical Review Letters, 1996
77(5): p. 793‐797
17. J. Landt, The history of RFID. IEEE Potentials, 2005. 24(4): p. 8‐11
18. Erick C. Jones, Christopher A. Chung, RFID In Logistics: A Practical Introduction, 2008,
CRC Press
87
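Appendix A
Algorithms
Initial Matching Algorithm
Option Explicit

' Initial matching: pairs each plate in column 2 (time in column 3) with the
' first unused identical plate in column 6 (time in column 7) whose time is
' not earlier. The column roles are inferred from the code itself.
Sub find_plates_new()
    Dim i As Long, j As Long, k As Long
    Dim sngStartTime As Single
    Dim sngTotalTime As Single

    sngStartTime = Timer

    ' Keep a copy of column 6, because matched plates are overwritten below
    For k = 5 To 2187
        Cells(k, 15) = Cells(k, 6)
    Next k

    For j = 5 To 2187
        ' Default result; replaced below if a match is found
        Cells(j, 5) = "No match found!"
        For i = 5 To 2211
            If Cells(j, 2) = Cells(i, 6) Then
                If Cells(j, 3) <= Cells(i, 7) Then
                    Cells(j, 5) = Cells(i, 7)
                    Cells(i, 6) = "Used and excluded"
                    Exit For
                End If
            End If
        Next i
    Next j

    sngTotalTime = Timer - sngStartTime
    MsgBox "Time taken: " & Round(sngTotalTime, 2) & " seconds"
    Cells(4, 4) = Round(sngTotalTime, 2)
End Sub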
Algorithm A
Option Explicit

' Algorithm A: for each unmatched plate, substitutes each of the first three
' characters with every letter listed in Cells(1..26, 16) and searches for the
' resulting plate. charactervalue() is defined elsewhere (not shown here) and
' is expected to map a letter to its position in the Mistakes Matrix.
Sub letter_table_no_repeat()
    Dim i As Long, j As Long, k As Long, l As Long
    Dim hplace As Long, vplace As Long
    Dim plate As String
    Dim beforechange As String, afterchange As String
    Dim matched As Boolean
    Dim sngStartTime As Single
    Dim sngTotalTime As Single

    sngStartTime = Timer
    For k = 5 To 2186
        If Cells(k, 5) = "No match found!" Then
            plate = Cells(k, 2)
            matched = False
            For j = 1 To 3
                beforechange = Mid(plate, j, 1)
                For l = 1 To 26
                    afterchange = Cells(l, 16)
                    ' Substitute only position j. The original Replace() call
                    ' changed every occurrence of the character in the plate.
                    Cells(k, 14) = Left(plate, j - 1) & afterchange & Mid(plate, j + 1)
                    For i = 5 To 2210
                        If Cells(k, 14) = Cells(i, 6) And Cells(k, 3) <= Cells(i, 7) Then
                            Cells(k, 5) = Cells(i, 7)
                            Cells(i, 6) = "used and excluded"
                            ' Count the mistake in the Mistakes Matrix
                            hplace = charactervalue(beforechange)
                            vplace = charactervalue(afterchange)
                            If hplace > 0 And vplace > 0 Then
                                Cells(vplace + 13, hplace + 17) = Cells(vplace + 13, hplace + 17) + 1
                            End If
                            matched = True
                            Exit For
                        End If
                    Next i
                    If matched Then Exit For
                Next l
                If matched Then Exit For
            Next j
        End If
    Next k

    sngTotalTime = Timer - sngStartTime
    MsgBox "Time taken: " & Round(sngTotalTime, 2) & " seconds"
    Cells(4, 4) = Round(sngTotalTime, 2)
End Sub
Appendix B
Mistakes Matrices by Algorithms A to D
Dataset 1 (ABC1): ITE
Algorithm A: Removed matched LPs
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A 1 1 1 1
B 2 1 1 1 2
C 3 1 1 1 1 1 1 1
D 1 1 2 2 1 9 1 1 1
E 1 4 1 2
F 1 2 1 1 1 1
G 1 1 1 1
H 4
I
J 1 1 1 1 2 2 1 2 1 1
K
L
M 2
N 2 2 1 4 3 1 1 1 1
O 2 2
P 6 2 1 3 1
Q
R 2 1 1 9 1 1
S 1 1
T 1 1 1 1 1 1
U 1 2 1
V 1 1 2 1 1
W 1 1
X 1 1 3 1
Y 1 3 1 4
Z 1 2
All Data: Count 102, Sum 164
Similar Data: Count 18, Sum 56
Similar to All Ratio: Count 17.6%, Sum 34.1%
Observed: Similar 56, Dissimilar 108
Expected: Similar 8, Dissimilar 156
Chi-Square: 324.872
P-Value: 0.000
Algorithm B: Retained matched LPs
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A 1 1 1 1
B 1 2 1 1 2
C 1 3 1 1 1 3 1 1
D 1 1 1 2 2 1 8 1 3 1 1
E 5 2 1 1 1
F 1 2 1 1 1 1 1
G 2 1 2 1
H 4
I
J 1 4 2 2 2 2 3 1 1
K
L
M 2
N 2 2 1 1 5 2 1 1 1 1 1
O 3 2
P 6 3 1 3 1 1 1
Q
R 2 2 2 2 6 1 1 1 2
S 1 1 1 1 1 1 1
T 3 1 1 1 1 1 1
U 2 1 1 1 1
V 1 1 1 2 2
W 1 1 1
X 3 1 1 3 1
Y 2 1 1 5 6
Z 2 1 1 1 1 1
All Data: Count 127, Sum 214
Similar Data: Count 19, Sum 53
Similar to All Ratio: Count 15.0%, Sum 24.8%
Observed: Similar 53, Dissimilar 161
Expected: Similar 10, Dissimilar 204
Chi-Square: 197.387
P-Value: 0.000
Algorithm C: Full correction
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A 1 1 1 1 1
B 1 2 1 1 1 2 3
C 3 3 2 1 5 1 4 1 1 1
D 2 1 1 3 2 1 12 2 1 3 1 1
E 1 7 2 1 1 3
F 4 1 3 1 2 2 1 2 1 1 2
G 1 2 3 2 1 1
H 4
I
J 1 5 2 2 3 3 1 3 1 1 5 2 2
K
L
M 2
N 2 2 2 2 6 3 1 1 2 1 1 1
O 3 2
P 1 9 3 2 7 1 1 1 1
Q
R 2 2 2 1 1 2 10 1 1 1 2
S 1 3 1 1 1 1 1 1 1 1
T 1 3 1 2 3 1 1 1 2
U 2 1 1 1 3 1
V 3 1 1 2 2 2 1
W 2 1 1 1 1 1
X 9 1 2 1 1 3 1
Y 3 1 1 8 1 7
Z 3 4 1 1 1 2 2 1
All Data: Count 165, Sum 339
Similar Data: Count 22, Sum 79
Similar to All Ratio: Count 13.3%, Sum 23.3%
Observed: Similar 79, Dissimilar 260
Expected: Similar 16, Dissimilar 323
Chi-Square: 268.943
P-Value: 0.000
Algorithm D: Closest to mean
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A 1 1 1 1 1
B 1 1 1 3
C 1 1 1 2 1 2
D 2 3 1 1 12 2
E 1 5 1 1 3
F 1 3 1 1 1 1
G 1 2 1 1 1
H 4
I
J 2 2 2 2 1 1 5 1
K
L
M 1
N 2 1 2 3 1 1 1
O 2 1
P 1 8 2 2 7 1 1
Q
R 2 1 1 1 1 1 8 1 1 2
S 3 1 1 1
T 2 2 1 1 1 1
U 1 1 1 2 1
V 2 1 1 2 1 2 1
W 1 1 1 1
X 1 1 1 2
Y 3 1 4 1 5
Z 1 2
All Data: Count 110, Sum 197
Similar Data: Count 21, Sum 67
Similar to All Ratio: Count 19.1%, Sum 34.0%
Observed: Similar 67, Dissimilar 130
Expected: Similar 9, Dissimilar 188
Chi-Square: 386.652
P-Value: 0.000
Algorithm A: Removed matched LPs
1 2 3 4 5 6 7 8 9 0
1 1 1 2 1
2 1 5 2 1
3 1 2 1 2 1 1
4 2 1 1 1
5 1 1
6 2 1 1 2 4 1
7 3 1 1
8 1 2 1 2 1 3
9 2 1 1 1 1
0
All Data: Count 40, Sum 61
Similar Data: Count 11, Sum 21
Similar to All Ratio: Count 27.5%, Sum 34.4%
Observed: Similar 21, Dissimilar 40
Expected: Similar 15, Dissimilar 46
Chi-Square: 3.291
P-Value: 0.070
Algorithm B: Retained matched LPs
1 2 3 4 5 6 7 8 9 0
1 1 1 2 1
2 1 6 2 1
3 1 2 1 2 1 1
4 2 1 1 1
5 1 1
6 2 1 1 2 4 1
7 3 1 1 1
8 1 2 1 1 2 2 2
9 3 1 1 1 2 1
0
All Data: Count 43, Sum 67
Similar Data: Count 12, Sum 25
Similar to All Ratio: Count 27.9%, Sum 37.3%
Observed: Similar 25, Dissimilar 42
Expected: Similar 16, Dissimilar 51
Chi-Square: 6.008
P-Value: 0.014
Algorithm C: Full correction
1 2 3 4 5 6 7 8 9 0
1 1 3 3 1
2 1 6 2 1 1
3 2 1 2 1 2 1 1
4 2 1 1 1
5 1 1
6 2 1 1 2 4 2
7 3 2 2 2 1 2
8 1 2 1 2 2 3 3
9 3 1 2 1 2 1
0
All Data: Count 47, Sum 84
Similar Data: Count 13, Sum 31
Similar to All Ratio: Count 27.7%, Sum 36.9%
Observed: Similar 31, Dissimilar 53
Expected: Similar 21, Dissimilar 63
Chi-Square: 7.061
P-Value: 0.008
Algorithm D: Closest to mean
1 2 3 4 5 6 7 8 9 0
1 3 3 1
2 1 4 2 1
3 2 1 2 1 1 1 1
4 2 1 1 1
5 1 1
6 2 1 1 2 4 2
7 3 1 2 2 1 2
8 1 2 1 2 2 2 1
9 2 1 2 1 1
0
All Data: Count 44, Sum 72
Similar Data: Count 12, Sum 26
Similar to All Ratio: Count 27.3%, Sum 36.1%
Observed: Similar 26, Dissimilar 46
Expected: Similar 18, Dissimilar 54
Chi-Square: 5.306
P-Value: 0.021
Dataset 2 (C123): HAVO 2009
Algorithm A: Removed matched LPs
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A 1 1 1
B 1 1 1 1
C 1 1 1
D 1
E 1 1
F 1 1 1
G 1 1
H 1
I 1
J 1
K 1 1 1
L
M 1 1 1
N 1 1
O 1
P 1 1
Q
R 1 1
S 1
T 1 1 1
U 1 1 1
V 1 1 1
W 1 1
X 3 1
Y 2 1
Z 1 1 1 1 1
All Data: Count 55, Sum 58
Similar Data: Count 3, Sum 3
Similar to All Ratio: Count 5.5%, Sum 5.2%
Observed: Similar 3, Dissimilar 55
Expected: Similar 3, Dissimilar 55
Chi-Square: 0.041
P-Value: 0.840
Algorithm B: Retained matched LPs
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A 1 1 1
B 1 1 1 1
C 1 1 1
D 1
E 1 1
F 1 1 1
G 1 1
H 1
I 1
J 1
K 1 1 1
L
M 1 1 1
N 1 1
O 1
P 1 1
Q
R 1 1
S 1
T 1 1 1
U 1 1 1 1
V 1 1 1
W 1 1
X 4 1
Y 2 1
Z 1 1 1 1 1 1 1
All Data: Count 58, Sum 62
Similar Data: Count 3, Sum 3
Similar to All Ratio: Count 5.2%, Sum 4.8%
Observed: Similar 3, Dissimilar 59
Expected: Similar 3, Dissimilar 59
Chi-Square: 0.007
P-Value: 0.933
Algorithm C: Full correction
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A 1 1 1 1 1 1
B 1 1 1 1 1
C 1 1 1
D 1 1 1
E 1 1
F 1 1 1
G 1 1 1
H 1
I 1
J 1 1
K 1 1 2 1
L
M 1 1 1 1
N 1 1
O 1 1
P 1 1
Q
R 1 1
S 1
T 1 1 1
U 1 1 2 1
V 1 1 1
W 1 1 1
X 4 1
Y 2 1
Z 1 1 1 1 1 1 1
All Data: Count 70, Sum 76
Similar Data: Count 3, Sum 3
Similar to All Ratio: Count 4.3%, Sum 3.9%
Observed: Similar 3, Dissimilar 73
Expected: Similar 4, Dissimilar 72
Chi-Square: 0.077
P-Value: 0.781
Algorithm D: Closest to mean
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A 1 1 1 1 1
B 1 1 1 1 1
C 1 1 1
D 1 1 1
E 1 1
F 1 1 1
G 1 1
H 1
I 1
J 1 1
K 1 1 2
L
M 1 1
N 1 1
O 1 1
P 1 1
Q
R 1 1
S 1
T 1 1 1 1
U 1 1 2 1
V 1 1
W 1 1 1 1
X 3 1
Y 2 1
Z 1 1 1 1 1 1 1
All Data: Count 66, Sum 71
Similar Data: Count 4, Sum 4
Similar to All Ratio: Count 6.1%, Sum 5.6%
Observed: Similar 4, Dissimilar 67
Expected: Similar 3, Dissimilar 68
Chi-Square: 0.167
P-Value: 0.683
Algorithm A: Removed matched LPs
1 2 3 4 5 6 7 8 9 0
1 1 2 1
2 1 1 2
3 1 4
4 1 2 1 2 1 1 1
5 2 1
6 3 1 1 2
7 1 1 1 2
8 1 1
9 1 1 1 2
0
All Data: Count 31, Sum 44
Similar Data: Count 5, Sum 9
Similar to All Ratio: Count 16.1%, Sum 20.5%
Observed: Similar 9, Dissimilar 35
Expected: Similar 11, Dissimilar 33
Chi-Square: 0.379
P-Value: 0.538
Algorithm B: Retained matched LPs
1 2 3 4 5 6 7 8 9 0
1 1 1 1 2 1
2 1 1 2
3 1 4
4 1 2 1 3 1 1 1
5 2 1 1 1 1
6 4 1 1 3
7 1 1 1 2
8 1 1 2 1 1
9 1 2 1 1
0
All Data: Count 39, Sum 56
Similar Data: Count 11, Sum 18
Similar to All Ratio: Count 28.2%, Sum 32.1%
Observed: Similar 18, Dissimilar 38
Expected: Similar 14, Dissimilar 42
Chi-Square: 1.797
P-Value: 0.180
Algorithm C: Full correction
1 2 3 4 5 6 7 8 9 0
1 1 1 1 2 1
2 2 1 1 2 1
3 2 4
4 1 2 1 4 1 1 1
5 2 1 1 1 2
6 1 5 2 2 3
7 1 1 2 2
8 1 1 2 1 1
9 1 1 2 1 2
0
All Data: Count 43, Sum 69
Similar Data: Count 11, Sum 20
Similar to All Ratio: Count 25.6%, Sum 29.0%
Observed: Similar 20, Dissimilar 49
Expected: Similar 17, Dissimilar 52
Chi-Square: 0.770
P-Value: 0.380
Algorithm D: Closest to mean
1 2 3 4 5 6 7 8 9 0
1 1 1 2 1
2 1 1 1 1 1 1
3 3
4 1 2 1 1 2 1 1 1
5 1 2 1 1 2
6 2 1 1 3
7 1 1 1 1 2 1
8 1 1 1 1
9 1 1 1
0
All Data: Count 41, Sum 52
Similar Data: Count 9, Sum 14
Similar to All Ratio: Count 22.0%, Sum 26.9%
Observed: Similar 14, Dissimilar 38
Expected: Similar 13, Dissimilar 39
Chi-Square: 0.173
P-Value: 0.677
Dataset 3 (ABC123): HAVO 2007 ‐ 1
Algorithm A: Removed matched LPs
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A
B 1
C
D 1
E 1 1 2
F 1 1
G 1
H 1 2 1
I
J 1
K
L
M
N 1 1 1
O 1
P
Q
R
S 1
T 1
U
V 3
W
X
Y 1
Z
All Data: Count 20, Sum 24
Similar Data: Count 4, Sum 6
Similar to All Ratio: Count 20.0%, Sum 25.0%
Observed: Similar 6, Dissimilar 18
Expected: Similar 1, Dissimilar 23
Chi-Square: 22.653
P-Value: 0.000
Algorithm B: Retained matched LPs
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A
B 1
C
D 1
E 1 1 2
F 1 1
G 1
H 1 2 1
I
J 1
K
L
M
N 1 1 1
O 1
P
Q
R
S 1
T 1
U
V 3
W
X
Y 1
Z
All Data: Count 20, Sum 24
Similar Data: Count 4, Sum 6
Similar to All Ratio: Count 20.0%, Sum 25.0%
Observed: Similar 6, Dissimilar 18
Expected: Similar 1, Dissimilar 23
Chi-Square: 22.653
P-Value: 0.000
Algorithm C: Full correction
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A
B 1
C
D 1
E 1 1 2
F 1 1
G 1
H 1 2 1
I
J 1
K
L
M
N 1 1 1
O 1
P
Q
R
S 1
T 1
U
V 3
W
X
Y 1
Z
All Data: Count 20, Sum 24
Similar Data: Count 4, Sum 6
Similar to All Ratio: Count 20.0%, Sum 25.0%
Observed: Similar 6, Dissimilar 18
Expected: Similar 1, Dissimilar 23
Chi-Square: 22.653
P-Value: 0.000
Algorithm D: Closest to mean
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A
B 1
C
D 1
E 1 1 2
F 1 1
G 1
H 1 2 1
I
J 1 1
K
L
M
N 1 1 1
O 1
P
Q
R
S 1
T 1
U
V 3
W 1
X
Y 1
Z
All Data: Count 22, Sum 26
Similar Data: Count 5, Sum 7
Similar to All Ratio: Count 22.7%, Sum 26.9%
Observed: Similar 7, Dissimilar 19
Expected: Similar 1, Dissimilar 25
Chi-Square: 29.390
P-Value: 0.000
Algorithm A: Removed matched LPs
1 2 3 4 5 6 7 8 9 0
1
2 1 1 2 1
3 1 2 1 1
4 1 1 1 1 1
5 1 1
6 1 1 1 1
7 2 1
8 1 1
9 1 1 1
0
All Data: Count 26, Sum 29
Similar Data: Count 4, Sum 5
Similar to All Ratio: Count 15.4%, Sum 17.2%
Observed: Similar 5, Dissimilar 24
Expected: Similar 7, Dissimilar 22
Chi-Square: 0.815
P-Value: 0.367
Algorithm B: Retained matched LPs
1 2 3 4 5 6 7 8 9 0
1
2 1 1 2 1
3 1 2 1 1
4 1 1 1 1 2
5 1 1
6 1 1 1 1
7 2 1
8 1 1
9 1 1 1
0
All Data: Count 26, Sum 30
Similar Data: Count 4, Sum 5
Similar to All Ratio: Count 15.4%, Sum 16.7%
Observed: Similar 5, Dissimilar 25
Expected: Similar 7, Dissimilar 23
Chi-Square: 0.983
P-Value: 0.322
Algorithm C: Full correction
1 2 3 4 5 6 7 8 9 0
1 1
2 1 1 2 1
3 1 2 1 1
4 1 1 2 3 3
5 2 1
6 1 1 1 1
7 2 1
8 2 2 1 1
9 1 1 1
0
All Data: Count 29, Sum 40
Similar Data: Count 6, Sum 8
Similar to All Ratio: Count 20.7%, Sum 20.0%
Observed: Similar 8, Dissimilar 32
Expected: Similar 10, Dissimilar 30
Chi-Square: 0.428
P-Value: 0.513
Algorithm D: Closest to mean
1 2 3 4 5 6 7 8 9 0
1 1
2 1 1 2 1
3 2 2 1 1 1
4 1 1 3 2 1
5 2 1
6 1 1 1
7 2 1
8 2 1
9 1 1
0
All Data: Count 26, Sum 35
Similar Data: Count 5, Sum 7
Similar to All Ratio: Count 19.2%, Sum 20.0%
Observed: Similar 7, Dissimilar 28
Expected: Similar 9, Dissimilar 26
Chi-Square: 0.374
P-Value: 0.541
Dataset 4 (ABC123): HAVO 2007 ‐ 2
Algorithm A: Removed matched LPs
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A
B 2 1
C 1 1
D 1
E 2
F 1 2
G 1
H 1
I
J 1
K
L
M 1 1
N 1
O 1
P 1
Q
R
S
T 1 2
U 1
V 1 2
W 1
X 1
Y
Z
All Data: Count 23, Sum 28
Similar Data: Count 7, Sum 9
Similar to All Ratio: Count 30.4%, Sum 32.1%
Observed: Similar 9, Dissimilar 19
Expected: Similar 1, Dissimilar 27
Chi-Square: 48.195
P-Value: 0.000
Algorithm B: Retained matched LPs
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A
B 2 1
C 1 1
D 1
E 2
F 1 2
G 1
H 1
I
J 1
K
L
M 1 1
N 1
O 1
P 1
Q
R
S
T 1 2
U 1
V 1 2
W 1
X 1
Y 1
Z
All Data: Count 24, Sum 29
Similar Data: Count 7, Sum 9
Similar to All Ratio: Count 29.2%, Sum 31.0%
Observed: Similar 9, Dissimilar 20
Expected: Similar 1, Dissimilar 28
Chi-Square: 45.978
P-Value: 0.000
Algorithm C: Full correction
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A
B 2 1
C 1 2
D 1
E 2
F 1 2
G 1
H 1
I
J 1
K
L
M 1 1
N 1 1
O 1
P 1
Q
R
S
T 1 2
U 1
V 1 2
W 1
X 1
Y 1
Z
All Data: Count 25, Sum 31
Similar Data: Count 8, Sum 10
Similar to All Ratio: Count 32.0%, Sum 32.3%
Observed: Similar 10, Dissimilar 21
Expected: Similar 1, Dissimilar 30
Chi-Square: 53.807
P-Value: 0.000
Algorithm D: Closest to mean
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A
B 2 1
C 1 2
D 1
E 2
F 1 2
G 1
H 1
I
J 1
K
L
M 1 1
N 1 1
O 1
P 1
Q
R
S
T 1 2
U 1
V 1 2
W 1
X 1
Y
Z
All Data: Count 24, Sum 30
Similar Data: Count 8, Sum 10
Similar to All Ratio: Count 33.3%, Sum 33.3%
Observed: Similar 10, Dissimilar 20
Expected: Similar 1, Dissimilar 29
Chi-Square: 56.201
P-Value: 0.000
Algorithm A: Removed matched LPs
1 2 3 4 5 6 7 8 9 0
1 1 1 1
2 1 1
3
4 1 1 1 1
5 1 2 1 1
6 1 1 1
7 1 1
8 1 1
9 1 1 1 2
0
All Data: Count 24, Sum 26
Similar Data: Count 7, Sum 8
Similar to All Ratio: Count 29.2%, Sum 30.8%
Observed: Similar 8, Dissimilar 18
Expected: Similar 6, Dissimilar 20
Chi-Square: 0.563
P-Value: 0.453
Algorithm B: Retained matched LPs
1 2 3 4 5 6 7 8 9 0
1 2 1 1
2 1 2
3
4 1 1 1 1
5 1 1 1 1 1
6 1 1 1 1 1
7 1 1
8 1 1
9 1 1 1 2
0
All Data: Count 27, Sum 30
Similar Data: Count 7, Sum 8
Similar to All Ratio: Count 25.9%, Sum 26.7%
Observed: Similar 8, Dissimilar 22
Expected: Similar 7, Dissimilar 23
Chi-Square: 0.080
P-Value: 0.777
Algorithm C: Full correction
1 2 3 4 5 6 7 8 9 0
1 2 1 1
2 1 2
3
4 1 1 2 1
5 1 1 2 1 4
6 1 1 3 1 2
7 1 1 1 1
8 2 1
9 1 1 1 2
0
All Data: Count 29, Sum 41
Similar Data: Count 8, Sum 15
Similar to All Ratio: Count 27.6%, Sum 36.6%
Observed: Similar 15, Dissimilar 26
Expected: Similar 10, Dissimilar 31
Chi-Square: 3.272
P-Value: 0.070
Algorithm D: Closest to mean
1 2 3 4 5 6 7 8 9 0
1 1 1
2 1 1
3
4 1 1 1
5 1 1 2 1
6 1 2 1 2
7 2 1 1 1
8 2 1
9 1 1 1 2
0
All Data: Count 25, Sum 31
Similar Data: Count 6, Sum 10
Similar to All Ratio: Count 24.0%, Sum 32.3%
Observed: Similar 10, Dissimilar 21
Expected: Similar 8, Dissimilar 23
Chi-Square: 1.025
P-Value: 0.311