Data for Machine Learning
Data generation and simulation of a logistics operation for machine learning
Erik Hedman
Bachelor of Science in Engineering, Computer Game Programming
2017
Luleå tekniska universitet
Institutionen för system- och rymdteknik
Abstract
In the logistics business, a priority is to deliver packages at the right time to the right place. Mistakes
can happen in any task where a human makes a decision. In this project, a simulation of a logistics
operation is developed and used to generate data for machine learning algorithms. This project is one
part of a bigger project. The algorithm will be trained to discover abnormalities in the flow of packages,
with the goal of reducing the number of wrongfully handled packages. Machine learning algorithms and
their training are parts of the bigger project and will not be covered in this paper. This project was
initiated by the IT consulting company Data Ductus.
Abbreviations and Terms
AI – Artificial Intelligence
IoT – Internet of Things
IT – Information Technology
Contents
1 Introduction
1.1 Data Ductus
1.2 Goals & Purpose
1.3 Limitations
1.4 Background
1.4.1 Artificial Intelligence
1.4.2 Machine Learning
1.4.3 The Swedish Mail Format
1.5 Social, Ethical and Environmental Considerations
1.6 Method
1.6.1 Python
2 Design and Implementation
2.1 Setting up the database
2.2 Generating the data
2.3 Simulating the data
3 Results
3.1 The database
3.2 Data generator
3.3 Simulation
3.4 Result summary
4 Discussion
5 Conclusion
Appendix
1 Introduction
Logistics is the implementation of an operation, i.e. the management of the flow of packages
between a start and an end location. The goal of the operation is to get the packages to the right
location at the right time. If a package is delivered to the wrong location, the business risks losing
its customers' trust. A late package can lead to a bad reputation. Thus, making sure packages
get to the right place at the right time is crucial to the operation.
In the postal system, each package undergoes a few different tasks. First, the package is prepared
with a name and address. The package is placed in a mailbox that is later emptied by a postal
worker and delivered to the closest mail terminal. The packages get sorted twice, once by the first
three digits of the postal code and once by the last two. The first sort is a rough sort and the
second is a more precise one. The package is then sent to one of the 750 dispensing offices and later
on to the destination mailbox [13].
Within this system mistakes can happen [5], which can lead to delivery delays or even misplaced
packages. A product that can decrease the number of wrongfully handled packages would
support such a business. The product must be able to understand the logistics operation and how
the packages should be handled, in order to understand when a mistake is made and how to correct it.
To avoid coding a solution for every mistake that can be made, an application that can learn is
necessary. If done correctly, this sort of software could be applied to many different logistics
businesses.
In recent years, machine learning has become more popular for making predictions from data
[6], due to the computing power of current computers. Creating this kind of artificial intelligence
requires data for training and testing. The data, in this case, will be in the form of letters with
the Swedish mailing address format. When using only manufactured data to train a model, it is
important that the data is created to look as real as possible. Otherwise, the transition
from simulation to a real-world situation can be difficult.
This report is based on a project done for the IT consulting firm Data Ductus with the goal of improving
logistics operations using software that can detect and correct mistakes made in a supply chain.
1.1 Data Ductus
Data Ductus is a multinational IT consulting firm specialised in technically advanced solutions. They
help their customers succeed with their businesses by combining deep technical expertise and
business knowledge, creating tailor-made solutions with tangible benefits.
Their services include system development and integration, network service management solutions
& orchestration, IoT expertise, as well as operation, management and support.
They can provide services within a vast range of industries due to their highly skilled engineers and
project managers and are known for their ability to adapt and meet changing business requirements
quickly.
With offices in both Sweden and the US, they have offered their services globally since 1989. Their
customers range from large international groups to small start-ups.
1.2 Goals & Purpose
The goals of this project were to:
1. Generate data in the form of letters.
2. Create a simulation of a flow of packages.
3. Feed the generated data to the simulation.
4. Merge the system with a model that observes and learns from the simulation. This model is
the product of two other projects.
5. Make the simulation work for both batch and real-time streaming.
The purpose of the project is to create software that can find and correct errors in a logistics system.
These errors are the outcome of human mistakes. The software will be designed as machine
learning software, i.e. a model that will learn how the system works and what a correctly sent letter
looks like versus an incorrect one.
The model will be created such that it can be applied to real-world logistics operations. Thus, the
software shall reduce the number of wrongly handled packages.
1.3 Limitations
The main limitations were the time limit and the lack of prior knowledge about the subject. The
implementation time was greatly reduced by all the research that was necessary to understand the
subject and how to go about the implementation. Since this is one part of a bigger project, it also had
to be compatible with the other parts; the formatting of the data input was influenced by the other
parts of the project.
1.4 Background
This section covers the background information of the most important concepts used in this paper. A
big subject such as machine learning requires a bit of background to understand.
1.4.1 Artificial Intelligence
A machine that can accomplish tasks that would require intelligence if done by a human is often
regarded as artificially intelligent [8]. The Turing test is an approach to defining intelligence [1]. The
test consists of a human interrogator and a machine. The machine must answer some written
questions. If the interrogator can’t tell whether the response comes from a computer or a human,
then the computer passes the test. The computer requires four skills to pass the test:
Natural language processing to communicate with the interrogator.
Knowledge representation to remember what is observed.
Automated reasoning to use its memory to answer questions.
Machine learning to detect patterns and adapt to new circumstances.
The Turing test avoids physical contact on purpose because physical contact is unnecessary for
displaying intelligence. The total Turing test is an extension of the original Turing test which adds a
video signal so that the interrogator can test the subject’s visual abilities; this test also includes a
hatch for the interrogator to pass objects to the subject. To pass the total Turing test, the machine
will need two additional attributes:
Computer vision to see objects that the interrogator hands the subject.
Robotics to receive and manipulate the objects that are given from the interrogator.
The six attributes described above represent most of AI, although AI researchers have spent little time
focusing on the test. It is believed to be more important to understand the underlying parts of
intelligence than to design a machine specifically made to complete these tasks [1].
1.4.2 Machine Learning
Machine learning is everywhere. Spam filters, recommenders and self-driving cars are examples of
what machine learning can accomplish. Simply put, machine learning means that a machine improves its
performance on future tasks after making observations of its environment. Building a
machine learning solution is a complex task, so it is a good idea to follow a specific workflow when
creating one. A usual workflow consists of the following steps:
1 Defining the problem
Describe the problem and list similar problems and assumptions. Explain why the problem
needs to be solved. Describe how to solve the problem.
2 Preparing the data
Search for the available data, see what can be removed and if something is missing. Get the
data in the right format. Scale the data if needed.
3 Spot check algorithms
Test a lot of different algorithms to check which work for your data.
4 Improving the Results
After the spot check, run an analysis on the parameters of the top algorithms, to push the
algorithms to the limit.
These four steps can be applied to most machine learning problems. The first advantage of building a
machine learning application is its learning capability. If trained on some datasets, the
program will eventually learn to represent the data as different features [14]. The old approach, in
which a data scientist analyses the data and defines the features manually, requires more time in
some cases and might not be possible at all. In recent years, machine learning has been used to find
relevant features in otherwise tangled datasets. Such feature finding can be used, for example, in face
recognition and speech recognition.
Another advantage is parameter tuning. An advanced neural network can have more than a million
tuneable parameters. A human couldn’t possibly tune such a large number manually to find the
optimal values. Therefore, learning algorithms such as gradient descent can be used to
find the best tuning. A disadvantage of machine learning is that there is no guarantee that all
problems can be solved with a machine learning algorithm.
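As a small illustration of the automatic parameter tuning described above, the following sketch uses gradient descent to tune a single parameter. The function name, the learning rate and the toy data are assumptions made for this example and are not part of the project.

```python
def fit_slope(xs, ys, learning_rate=0.01, steps=1000):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of (1/n) * sum((w*x - y)^2) with respect to w.
        gradient = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        # Step against the gradient to reduce the error.
        w -= learning_rate * gradient
    return w


# Toy data generated from y = 3x, so the tuned parameter should approach 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]
w = fit_slope(xs, ys)
```

The same update rule is applied to every parameter of a larger model, which is why it scales to parameter counts that no human could tune by hand.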
The large amount of data sometimes required to train a model can be troublesome to work with or
collect. Fortunately, machine learning algorithms vary widely in complexity: some require less data
and some require more.
1.4.3 The Swedish Mail Format
There are five attributes that describe where the letter should arrive: name, street, street number,
zip code and city [9].
Figure 1: Letter case.
An explanation of the attributes in Figure 1:
The name is the name of the receiver, in this case “Mottagare Mottagarsson”.
The street is the street the letter should arrive at, which is directly connected with
the last two digits of the zip code.
The street number specifies which number on the street the letter should arrive at; this is
the end of the delivery.
The zip code is the attribute with the most information. The first two digits describe a
city/area, the third gives the delivery form and the last two digits specify a bundle of
streets [10].
The city/area is the city/area that the letter will arrive at; this attribute relates to the first two
digits of the zip code.
Sometimes the details of the sender are also written on the letter, in the same form as those of the
receiver.
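The structure of the zip code described above can be sketched as a small helper that splits a five-digit code into its three parts. The names `ZipParts` and `split_zip_code` and the example value are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class ZipParts(NamedTuple):
    area: str           # first two digits: city/area
    delivery_form: str  # third digit: delivery form
    street_bundle: str  # last two digits: bundle of streets


def split_zip_code(zip_code: str) -> ZipParts:
    """Split a five-digit Swedish zip code into its three parts."""
    digits = zip_code.replace(" ", "")  # accept "972 41" as well as "97241"
    if len(digits) != 5 or not digits.isdigit():
        raise ValueError(f"expected five digits, got {zip_code!r}")
    return ZipParts(digits[:2], digits[2], digits[3:])


parts = split_zip_code("972 41")  # arbitrary example value
```

A check like the one in `split_zip_code` is also enough to detect a zip code of the wrong length.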
1.5 Social, Ethical and Environmental Considerations
Machine learning is a hot topic of discussion, primarily because of automation. Automation has
already replaced a lot of physical labour. However, most of the machines that replaced these
jobs were programmed for a specific task, so only jobs that are predictable in nature
could be taken over. In more recent years, a lot of research has been done on “intelligent” machines.
This is where machine learning enters the picture.
The creation of smarter software will result in humans being replaced in even more advanced jobs.
This is a huge problem for the economy if no solution is found. Automation is inevitable and
must be adapted to society. In an article [11], the author mentions one approach: an all-automated
economy in which everyone has a guaranteed base income. In the same article, the
author also mentions the Peltzman effect as a counterargument to the claim that lazy people
will not do any work if they get money anyway. The Peltzman effect is defined in [12] as follows: “The
Peltzman Effect is the hypothesised tendency of people to react to a safety regulation by increasing
other risky behaviour, offsetting some or all of the benefit of the regulation.”
All data that is used to train the model is manufactured so that no one feels that their names or
addresses are compromised.
1.6 Method
The implementation process for this project consisted mostly of research on different subjects within
machine learning and AI. The first phase was to understand what machine learning is and how it can
be applied to logistics. This included deciding which machine learning library to use and how to
use it. The next step was to decide what type of data to generate and how to introduce
human mistakes into the data. The last two steps of the project were to build the simulation of the
data flow and merge the system with the AI.
1.6.1 Python
After the research phase of the project, it was necessary to create a prototype as fast as possible
with the machine learning libraries available. Python was chosen because it allows code to be
implemented quickly and because of its applicable machine learning libraries, such as scikit-learn
and TensorFlow. We ended up using scikit-learn.
2 Design and Implementation
The general design of the project is to generate data in the form of letters, simulate the letters in a
logistics operation and save data to be used by a machine learning algorithm. The letters get their
properties from a database containing zip codes, cities and other information needed for a letter.
Errors representing human mistakes are introduced in the simulation so that the AI can learn what a
mistake can look like. This project can be divided into three parts that will be explained in more
detail.
Figure 2: The general design concept for the simulation. Output 1 is the raw data generated.
Output 2 is in the form of dispatched letters.
2.1 Setting up the database
The database consists of a list of names [15], a list of cities with the related zip codes and a list of
cities with the related streets. The database is built to simplify an expansion in data so that it’s easy
to add more locations and names. A list of links for the data centres used in the simulation is also
saved in the database.
2.2 Generating the data
Data is generated from the database in the form of letters in the Swedish mail format. The generator
loads the configuration from the database as demonstrated in Figure 2. Four arguments are
required: the number of letters to be generated, the percentage of critical errors, the percentage of
non-critical errors and the output file name. Critical errors are “mistakes” that make the letter
defective so that it cannot be sent. Letters with non-critical errors can still be sent in the simulation.
The letters with critical errors are tagged so that the AI can learn to sort out inadequate letters. The
errors that are introduced in the data are the following:
Missing zip code (critical)
Missing street (critical)
Missing street number (critical)
Wrong zip code length (critical)
Missing name
Missing city
To explain one iteration of the “Data Generator” part in Figure 2, we start with a list
of addresses that were generated from the database. First, letters are
generated by randomly picking an address in the list and the name of someone who “lives” there. A
user-defined number of letters will now be tampered with. For each error to be generated, a letter
that does not already have an error is picked at random. When an unaltered letter is found,
the information on that letter is changed according to one of the available errors. The data is now
fully generated and is sent to the simulation.
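The generation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the function and field names, the example addresses and the choice to shorten the zip code by one digit for the "wrong zip code length" error are all hypothetical, and writing the output file is left out. Names are drawn independently of the address here, which simplifies the "lives there" relation.

```python
import random

CRITICAL_ERRORS = ["missing_zip", "missing_street", "missing_street_number", "wrong_zip_length"]
NON_CRITICAL_ERRORS = ["missing_name", "missing_city"]


def apply_error(letter, error):
    """Tamper with one letter in place according to the named error."""
    if error == "missing_zip":
        letter["zip_code"] = ""
    elif error == "missing_street":
        letter["street"] = ""
    elif error == "missing_street_number":
        letter["street_number"] = ""
    elif error == "wrong_zip_length":
        letter["zip_code"] = letter["zip_code"][:-1]  # drop the last digit
    elif error == "missing_name":
        letter["name"] = ""
    elif error == "missing_city":
        letter["city"] = ""


def generate_letters(addresses, names, count, critical_pct, non_critical_pct, rng):
    """Generate `count` letters and tamper with the requested share of them."""
    letters = []
    for _ in range(count):
        letter = dict(rng.choice(addresses))  # copy so the address list is untouched
        letter["name"] = rng.choice(names)
        letter["legitimate"] = True
        letters.append(letter)

    n_critical = count * critical_pct // 100
    n_non_critical = count * non_critical_pct // 100
    # Sampling distinct indices guarantees that no letter gets two errors.
    chosen = rng.sample(range(count), n_critical + n_non_critical)
    for i in chosen[:n_critical]:
        apply_error(letters[i], rng.choice(CRITICAL_ERRORS))
        letters[i]["legitimate"] = False  # tagged so a model can learn to reject it
    for i in chosen[n_critical:]:
        apply_error(letters[i], rng.choice(NON_CRITICAL_ERRORS))
    return letters


# Hypothetical example data.
addresses = [
    {"street": "Storgatan", "street_number": "1", "zip_code": "97231", "city": "Luleå"},
    {"street": "Kungsgatan", "street_number": "7", "zip_code": "90325", "city": "Umeå"},
]
names = ["Anna Andersson", "Erik Eriksson"]
letters = generate_letters(addresses, names, 100, 20, 10, random.Random(1))
```

Passing a seeded `random.Random` instance makes a generated data set reproducible, which can be convenient when comparing models trained on the same data.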
2.3 Simulating the data
The simulation is built such that each city is represented by a post centre and each post centre is
connected to the city’s streets. Links between post centres are loaded from the database and
decide where the post centres can send their letters. Letters are loaded into the system from the
generated data. The system removes all invalid letters and sends each legitimate letter to a random
post centre. This step represents the mailbox discharge, where a postal worker empties a mailbox and
delivers the mail to the closest sorting terminal. Within each post centre, the letters are sorted by
zip code.
Figure 3: The design of the main simulation loop.
If the first two digits of the zip code match the post centre’s zip code digits, then the letter is
marked to be sent to its delivery address in the city. If the zip codes don’t match, the letter is marked
to be sent to another post centre with the correct zip code.
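The sorting decision above can be sketched as a single function. The function name, the return values and the prefix table are assumptions made for this example, and the sketch assumes that every post centre can forward directly to every other one.

```python
def mark_destination(letter_zip, centre_prefix, centres_by_prefix):
    """Decide where a post centre should send a letter.

    Returns ("deliver", None) for local delivery, or ("forward", centre_name)
    when another post centre handles the letter's zip code prefix.
    """
    prefix = letter_zip[:2]
    if prefix == centre_prefix:
        return ("deliver", None)
    return ("forward", centres_by_prefix[prefix])


centres_by_prefix = {"97": "Luleå", "90": "Umeå"}  # hypothetical prefix table
local = mark_destination("97231", "97", centres_by_prefix)
remote = mark_destination("90325", "97", centres_by_prefix)
```

A prefix that is missing from the table raises a `KeyError` in this sketch; a fuller version would need to decide how to treat such letters.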
The system loops through all the post centres’ sorted letters, and with an error chance that is
decided by the user, a “mistake” is created. The following errors can be manufactured:
The marked address is changed to a different street within the city
The marked address’s street number is changed to a different random number
The marked address is changed to a random different sorting centre
Sorted letters that are tampered with are flagged so that the AI can learn that the delivery is incorrect.
The sorted letters are saved in a file for the AI to observe. All sorted letters are sent to their marked
address whether it is the correct one or not. This process is repeated several times, as Figure 3 shows.
The number of iterations is defined by the user.
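A minimal sketch of how such a “mistake” could be introduced and flagged is given below. The function name, the dictionary keys and the example values are hypothetical; only the three error kinds and the user-defined error chance come from the description above.

```python
import random


def maybe_tamper(marked, error_chance, streets_in_city, other_centres, max_number, rng):
    """With probability error_chance (0-100), alter the marked destination.

    Returns a copy of `marked` with a `correct_delivery` flag added.
    """
    result = dict(marked, correct_delivery=True)
    if rng.random() * 100 >= error_chance:
        return result  # no mistake this time
    kind = rng.choice(["street", "street_number", "centre"])
    if kind == "street":
        result["end_street"] = rng.choice(
            [s for s in streets_in_city if s != marked["end_street"]]
        )
    elif kind == "street_number":
        result["end_street_number"] = rng.choice(
            [n for n in range(1, max_number + 1) if n != marked["end_street_number"]]
        )
    else:
        result["end_city"] = rng.choice(
            [c for c in other_centres if c != marked["end_city"]]
        )
    result["correct_delivery"] = False  # flagged so the model can learn from it
    return result


# Hypothetical example values.
rng = random.Random(7)
marked = {"end_street": "Storgatan", "end_street_number": 5, "end_city": "Luleå"}
streets = ["Storgatan", "Kungsgatan", "Skolgatan"]
centres = ["Luleå", "Umeå", "Kiruna"]
untouched = maybe_tamper(marked, 0, streets, centres, 200, rng)
tampered = maybe_tamper(marked, 100, streets, centres, 200, rng)
```

Excluding the current value from each random choice ensures that a tampered letter really differs from the correctly marked one, so the flag never contradicts the data.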
The simulation of one letter in one loop, as described in Figure 3, is explained here. The first step is
the data generator. The size of the address list is checked, and a random number between 0 and the
size of the list is used to choose the destination of the letter. The generator randomly introduces
errors into the generated letters; in this case no errors were generated. The system checks whether the
letter has adequate information, and if so the letter is distributed to one of the existing sorting
centres. The sorting centre is chosen at random.
The letter is sorted by its zip code in the sorting centre. Additional information is added to the letter:
the current location and the end location. An error in sorting can be made; the chance
of this happening is user defined. In this example the letter gets an error, and one of the three
kinds of error is chosen at random. An error in the form of a wrong city location is generated, so the
newly added end location is changed to match a random different city. The post centre now sends the
letter to the wrong city. The sent letter is saved
as data for the machine learning algorithm. The letter arrives in the wrong city and goes through the
same process again. This time no errors were generated in the sorting process, and the letter is sent
to the correct city. Data is saved for the machine learning algorithm. Arriving at the correct city’s
sorting centre, the letter is now sent to the right address and is correctly delivered. Data is saved.
Each time letters are sent from the sorting centres, data is saved to a file for examination by
the machine learning algorithm.
3 Results
In this section, the results of the project are described in three parts: the database, the data
generator and the simulation.
3.1 The database
The database consists of 100 different names, 21 cities and 44 streets. The number of street
numbers can be assigned by the user. With 200 street numbers, the total number of unique
addresses is 184 800 (21 × 44 × 200).
3.2 Data Generator
Data is generated and saved as a CSV file. One file is saved per loop, and the data saved to the file is
used by the machine learning model. The file consists of a header that describes each column; each line
under the header represents one letter. The error percentage can be tuned between 0 and 100.
3.3 Simulation
The simulation can load data that is generated in the form of letters. Letters that are loaded into the
system are simulated as a logistics operation. An error chance can be tuned between 0 and 100 and
determines the chance that a letter is sent to the wrong address from a post centre; all post centres
have the same percentage. The letter then gets a different end location than the address on the letter.
The simulation saves all letters sent in the system to a CSV file. One file is saved per loop, and the
data saved to the file is used by the machine learning model. The data saved for each transfer is the
following:
Name
Surname
Street
Street number
Zip code
City
Legitimate - Boolean
Start street
Start street number
Start zip code
Start city
End street
End street number
End zip code
End city
Correct Delivery – Boolean
The first seven parameters describe the information on the letter that is sent. The next four
parameters tell us the start position of the letter. Parameters with the “End” prefix tell us the
position the letter is going to have after the delivery; this may not be the correct position. The last
parameter is a true or false statement which says whether the transfer is a correct one or not.
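The sixteen fields listed above could be written to a CSV file with a header line as in the following sketch. The column spellings, the function name and the example row are assumptions made for illustration; the sketch writes to an in-memory buffer rather than to a file.

```python
import csv
import io

COLUMNS = [
    "name", "surname", "street", "street_number", "zip_code", "city", "legitimate",
    "start_street", "start_street_number", "start_zip_code", "start_city",
    "end_street", "end_street_number", "end_zip_code", "end_city",
    "correct_delivery",
]


def write_transfers(transfers, out):
    """Write a header line followed by one line per transfer."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(transfers)


# Hypothetical transfer: a letter forwarded to the wrong city.
transfer = {
    "name": "Mottagare", "surname": "Mottagarsson",
    "street": "Storgatan", "street_number": 1, "zip_code": "97231", "city": "Luleå",
    "legitimate": True,
    "start_street": "", "start_street_number": "", "start_zip_code": "90325",
    "start_city": "Umeå",
    "end_street": "", "end_street_number": "", "end_zip_code": "98131",
    "end_city": "Kiruna",
    "correct_delivery": False,
}
buffer = io.StringIO()
write_transfers([transfer], buffer)
lines = buffer.getvalue().splitlines()
```

Replacing the buffer with a file opened using `open(path, "w", newline="")` would produce one file per loop, as described above.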
3.4 Result Summary
The simulation was tested using a data set of 100 000 letters with 20% critical errors, 10% non-critical
errors and an error chance of 50% in the simulation. With these settings, the simulation was run ten
times with ten iterations of the main loop. Table 1 shows the correlation between time and the number
of letters generated.
The correlation between time and the number of letters simulated in one loop is shown in Table 2. The
ten test runs can be found in an attached file named “Sim10Iterations.xlsx”. The average time for each
iteration is given in Table 3.
The tests show that the data generator and the simulation work and fulfil almost all the
requirements set out in the problem formulation. Data is generated and simulated with errors
introduced. The data can be applied to a machine learning algorithm in batch. The functionality that
was not completed by the deadline was real-time streaming.
Size       Time
10000      0.162
100000     1.65
1000000    16.5
Table 1: The left column “Size” is the number of letters generated; the right column “Time” shows the time it
took to generate the data in seconds.

Size       Time
10000      0.316
100000     3.35
1000000    39.5
Table 2: The left column “Size” is the number of letters simulated; the right column “Time” shows the time it
takes for the simulation to do one iteration, i.e. send all letters in the system one time.

Iteration  Time
1          5.99
2          8.33
3          10.4
4          9.53
5          11.9
6          10.3
7          12.4
8          10.1
9          10.1
10         12.7
Table 3: Times for each main loop iteration using a data set size of 100 000. This table is the average of 10 test
runs.
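To make the scaling in Tables 1 and 2 easier to compare, the measurements can be converted into letters per second. The sketch below only restates the table values; it assumes that the times in Table 2 are given in seconds, as in Table 1.

```python
# (letters, seconds) pairs copied from Table 1 and Table 2.
generation = [(10_000, 0.162), (100_000, 1.65), (1_000_000, 16.5)]
simulation = [(10_000, 0.316), (100_000, 3.35), (1_000_000, 39.5)]


def letters_per_second(measurements):
    """Return the throughput for each (size, time) measurement."""
    return [size / seconds for size, seconds in measurements]


generation_rate = letters_per_second(generation)
simulation_rate = letters_per_second(simulation)
```

Under that assumption, generation stays at roughly 60 000 to 62 000 letters per second for all three sizes, which indicates close to linear scaling, while one simulation iteration drops from about 31 600 to about 25 300 letters per second as the data set grows.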
4 Discussion
One thing that worked well in this project was the communication with the two other
projects which built the machine learning model. We had a lot of fruitful discussions about how the
pieces should be put together. This was especially valuable in the research phase, where we gave
each other suggestions about good research material.
So much time went into research that it held back the implementation
and development of the simulation; on the other hand, the research was necessary to understand
such a complex subject as machine learning. The time invested early in the research phase
made it easier to understand what sort of data a machine learning algorithm can use. The reason
why I did so much research was that I wanted to learn as much as possible about machine learning,
because I felt that I could get insight into how to format the data and I didn’t want to miss out on
better ways to work with machine learning. I ended up doing more research than needed for my
specific task, but I look at it not as wasted time but as good self-improvement.
A lot of improvements can be made to this project. First of all, the structure of the database was not
thought through well enough. The database can be redesigned to be easier to understand and
more efficient. A simple improvement would be to rework the different files into a single file with a
more logical structure. The list of names currently stores each name as a full name, which should
instead be split into two categories, first name and surname. This was considered during the
implementation, but improving it was not a priority given the time available. The letter generation
works well, and its functionality requirements are met.
Using the solution is a bit troublesome: if the user wants to tune any of the parameters of the data
generator or the simulation, it is necessary to go into the code. This can be solved by implementing a
user interface, which would make the solution easier to use. With such an improvement, other
functions could also be introduced, such as easier integration with different machine learning
algorithms and more options for how the data should be saved.
In comparison to a real logistics operation, the simulation is small and does not contain much detail.
A lot of improvements can be made in this area. Shipping vehicles can be put into the system so that
the delivery time can be accounted for in the simulation. This small change makes the system
much more complex and can hopefully lead to a more applicable AI. Transport timings and transport
damage could be monitored, and fuel optimisation problems could be examined.
4.1 Future Work
If I were to continue working on the project, I would add some additional functions. One feature I
would add is errors in the form of misreading. For example, if the address on a letter contained the
digit one or seven, the character could be “badly written” and misinterpreted, with the result that the
letter is sent to the wrong address.
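A misreading error of this kind could be sketched as follows. This feature is not implemented in the project; the confusion table, the function name and the example zip code are hypothetical.

```python
import random

LOOKALIKES = {"1": "7", "7": "1"}  # hypothetical table of easily confused digits


def misread(text, chance, rng):
    """Swap each look-alike digit with probability `chance` (0-100)."""
    return "".join(
        LOOKALIKES[ch] if ch in LOOKALIKES and rng.random() * 100 < chance else ch
        for ch in text
    )


rng = random.Random(3)
unchanged = misread("97231", 0, rng)
all_swapped = misread("97231", 100, rng)
```

The table could be extended with other pairs of characters that are easy to confuse in handwriting.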
Automation of the whole process would also be implemented. As mentioned earlier in the thesis, the
link between the simulation and the AI only works in batch mode. Adding the machine learning
algorithm to the main loop could automate the process. Another feature that I want to implement is
the sender’s details on the letters. Right now the letters only contain information about the
receiver, so the AI cannot establish any relation between receiver and sender.
5 Conclusion
The project was on the right track for success, but it needs some more work to fulfil all of its
goals. The simulation works in batch mode and the data that is generated can be applied to a
machine learning algorithm, but since real-time streaming, which was one of the goals, is not
implemented in the simulation, the project cannot be considered finished. A bit more work is needed to
develop the simulation further so that the solution can be used for incremental learning. In this
project, a logistics operation simulation is introduced. The simulation generates data as letters and
letter deliveries. The data produced is later used by a machine learning algorithm to learn how to
correctly send a letter.
References
[1] Russell S, Norvig P. (2010). Artificial Intelligence: A Modern Approach (3rd edition). New Jersey:
Pearson Education.
[2] Fasli M. (2014). Analyzing and modeling complex and big data | Professor Maria Fasli |
TEDxUniversityofEssex. TEDx Talks. Available at:
https://www.youtube.com/watch?v=8DqQCZMawNg [Accessed 31/03 2017].
[3] Creative Punch. (2014). Artificial Dataset Generation for Machine Learning with Python and
Numpy / Theano. Available at:
http://creative-punch.net/2014/08/artificial-dataset-machine-learning-python/ [Accessed 07/04 2017].
[4] Rief M, Shafait F, Dengel A. (2012). Dataset Generation for Meta-Learning. Available at:
http://www.dfki.de/KI2012/PosterDemoTrack/ki2012pd15.pdf [Accessed 07/04 2017].
[5] Human Error. Available at: https://en.wikipedia.org/wiki/Human_error [Accessed 12/04 2017].
[6] Foote K. (2016). Machine Learning: From Then Until Now. Available at:
http://www.dataversity.net/machine-learning-now/ [Accessed 18/04 2017].
[7] Brownlee J. (2013). How to Prepare Data For Machine Learning. Machine Learning Process.
Available at: http://machinelearningmastery.com/how-to-prepare-data-for-machine-learning/
[Accessed 18/04 2017].
[8] Copeland J. (2000). What is Artificial Intelligence?. Available at:
http://www.alanturing.net/turing_archive/pages/reference%20articles/what%20is%20ai.html
[Accessed 20/04 2017].
[9] Swedish Standards Institute. Brevets yttre. Available at:
http://www.sis.se/Documents/TK/TK%20322/Brevets_Yttre.pdf [Accessed 26/04 2017].
[10] Postnummer i Sverige. Available at: https://sv.wikipedia.org/wiki/Postnummer_i_Sverige
[Accessed 26/04 2017].
[11] Ford M. (2015). Rise of the Machines: The Future has a Lot of Robots, Few Jobs for Humans.
Available at:
https://www.wired.com/brandlab/2015/04/rise-machines-future-lots-robots-jobs-humans/
[Accessed 27/04 2017].
[12] Specht P. (2007). The Peltzman Effect: Do Safety Regulations Increase Unsafe Behavior?
Available at: http://www.asse.org/assets/1/7/fall07-feature02.pdf [Accessed 27/04 2017].
[13] Adminen. (2013). Brevets väg genom postsystemet. Available at:
http://www.startsverige.nu/brevets-vag-genom-postsystemet/ [Accessed 04/05 2017].
[14] Bupe C. (2015). What are the advantages and disadvantages of machine learning? Available at:
https://www.quora.com/What-are-the-advantages-and-disadvantages-of-machine-learning
[Accessed 05/05 2017].
[15] Joe. ListOfRandomNames. Available at: http://listofrandomnames.com/index.cfm?textarea
[Accessed 12/05 2017].
Appendix
Sim10Iterations.xlsx