
UPTEC F 19026

Examensarbete (degree project) 30 hp, June 2019

Classifying Material Defects with Convolutional Neural Networks and Image Processing

Jawid Heidari


Faculty of Science and Technology, UTH Unit
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Fax: 018 – 471 30 00
Web page: http://www.teknat.uu.se/student

Abstract

Classifying Material Defects with Convolutional Neural Networks and Image Processing

Jawid Heidari

Fantastic progress has been made within the field of machine learning and deep neural networks in the last decade. Deep convolutional neural networks (CNN) have been hugely successful in image classification and object detection. These networks can automate many industrial processes and increase efficiency; one such process is image classification with various CNN models. This thesis addressed two different approaches to the same problem. The first approach implemented two CNN models to classify images. The large pre-trained VGG model was retrained using so-called transfer learning, training only the top layers of the network. The other model was a smaller one with customized layers. The trained models are an end-to-end solution: the input is an image, and the output is a class score. The second strategy implemented several classical image processing algorithms to detect the individual defects present in the pictures. This method worked as a rule-based object detection algorithm. The Canny edge detection algorithm, combined with two mathematical morphology operations, formed the backbone of this strategy. Sandvik Coromant operators gathered the approximately 1000 microscopical images used in this thesis. Sandvik Coromant is a leading producer of high-quality metal cutting tools. During the manufacturing process, some unwanted defects occur in the products. These defects are analyzed by taking images with a conventional microscope at 100x and 1000x magnification. The three essential defect types investigated in this thesis are defined as Por, Macro, and Slits. Experiments conducted during this thesis show that CNN models are a promising approach to classifying impurities and defects in the metal industry. The validation accuracy reached circa 90 percent, and the final evaluation accuracy was around 95 percent, which is an acceptable result. The pre-trained VGG model achieved much higher accuracy than the customized model.
The Canny edge detection algorithm, combined with dilation, erosion, and contour detection, produced a good result: it detected the majority of the defects present in the images.

ISSN: 1401-5757, UPTEC F 19026
Examiner: Tomas Nyberg
Subject reader: Niklas Wahlström
Supervisor: Mikael Björn


Acknowledgements

First of all, I would like to thank my supervisor Mikael Björn at Sandvik Coromant for guidance and support. I would also like to express my gratitude to the quality lab personnel who helped me with data collection and data categorization. It has been a great time to work at a large international company; thank you for this fantastic opportunity! I would also like to thank my subject reader Dr. Niklas Wahlström at Uppsala University. Without all of your feedback and guidance, this thesis would not have become what it is today. Furthermore, I would like to thank my fellow thesis workers Emil Fröjd and Alving Ljung. I have enjoyed our discussions throughout the thesis; thank you for your ideas and help.


Popular Science Summary (Populärvetenskaplig sammanfattning)

In recent years, great progress has been made in the field of machine learning. Deep convolutional networks, which partly mimic the neurons of the brain in their functionality and learning, have been particularly successful in image processing, image classification, and object detection. These networks are very complex in their structure and consist of many different layers and components. The large cost-reduction potential attracts ever more researchers, industrial companies, and IT companies such as Tesla, Google, and Facebook, which have themselves driven development in the field. For example, Tesla increasingly uses machine learning for its self-driving cars. This is no simple task, and many challenges still remain. Another problem that many large industrial companies, such as ABB and Sandvik, face is automating various processes to save money and reduce repetitive work with the help of machine learning. In this thesis, deep convolutional networks and image processing are applied to classify and detect three material defects that occur in the manufacturing of cutting tools at Sandvik Coromant. These defects arise naturally during the manufacturing process and degrade product quality. Currently, the different types of defects are detected manually by the company's operators. Sandvik Coromant aims to automate this process with machine learning, in the hope of minimizing the time from start to finished product. The testing department has collected around 1000 images of three different types of defects: Por, Macro, and Slits. The images were taken with conventional microscopes capable of up to 1000x magnification. This thesis consists of several parts. The first is to collect and prepare new images. The second is to select and train suitable deep networks; for this project, two different networks were trained.
The first is a large, general pre-trained network called VGG19, and the second is a smaller network created specifically for this project. The aim is to compare the results of these two networks and choose the best one. The experiments carried out during the thesis show that deep convolutional networks are a good approach for classifying material defects. The large pre-trained network performed best, with an accuracy of about 95 percent. The limiting factor was the quantity and quality of the data, which were not sufficient; future projects will need larger and better datasets. This thesis is the first of its kind and can therefore be regarded as a pilot study, whose results can serve as an indication for future work. Much more work remains before defect detection can be automated at Sandvik Coromant.


List of Abbreviations

ACC Final Accuracy

ADAM Adaptive Moment Estimation

AI Artificial Intelligence

ANN Artificial Neural Network

CNN Convolutional Neural Network

DNN Deep Neural Network

GD Gradient Descent

GPU Graphical Processing Unit

ML Machine Learning

PPV Positive Predictive Value

R-CNN Region Proposal CNN

RMSprop Root Mean Square Propagation

RNN Recurrent Neural Network

SGD Stochastic Gradient Descent

SSD Single Shot MultiBox Detector

TPR True Positive Rate

YOLO You Only Look Once

List of Symbols

β1, β2 The exponential decay rates for the ADAM optimizer

η The first coefficient of momentum

γ Learning rate for SGD

m, v The first and second corrected moments of the gradient loss for ADAM

ŷ The predicted values from a neural network

∇ The nabla operator


ψ The gradient direction for Canny filter

σ The activation function

a A vector representation of a fully connected neural network layer before the activation function

b The bias term in neural network

C Number of channels in an image

D1, D2 Image height and width

Dk Kernel size (K1xK2)

Din Image input size (D1xD2)

Dout Feature map size (the output size after the convolution X ∗ K)

E The cost/error function for the backpropagation

G Gaussian kernel for edge detection

g The gradient on current mini-batch in ADAM

g1, g2 The Gaussian kernel size for Canny filter

Gx Sobel filter for horizontal derivative in Canny algorithm

Gy Sobel filter for vertical derivative in Canny algorithm

I Gradient intensity matrix for Canny algorithm

K Number of classes in the softmax function

K The general kernel used in CNN

k 1D kernel

K1,K2 Kernel height and width

l A layer in a network

m, v The first and second moments of the gradient loss for ADAM

nl−1, nl, nl+1 Represents the neurons in three successive layers, used in backpropagation algorithm

P Number of channels in a kernel

p Zero padding size

S Stride size

W The weight matrix in a fully connected layer

w A specific weight in the network

X A general input in a machine learning model

y The true labels/values

Z The feature map after convolution operation

z A vector representation of a fully connected neural network layer after the activation function


Contents

1 Introduction 9
1.1 Background 9
1.2 Short Introduction about Data Set 10
1.3 Project Goal and Limitations 11

2 Theory of Machine Learning and Neural Networks 12
2.1 General Introduction to Machine Learning 12
2.1.1 Supervised Learning 12
2.1.2 Unsupervised Learning 12
2.1.3 Reinforcement Learning 12
2.2 Training, Validation and Test Sets 13
2.3 Fully Connected Networks 13
2.4 Activation Functions 15
2.4.1 Sigmoid Function 15
2.4.2 Tanh Function 15
2.4.3 ReLU Function 16
2.4.4 Softmax Function 16
2.5 Forward Propagation and Loss Functions 17
2.5.1 Square Loss or L2-Norm 17
2.5.2 Cross Entropy Loss 17
2.6 Backpropagation and Partial Derivatives 17
2.7 Optimization Methods in Neural Networks 18
2.8 Convolutional Neural Networks 20
2.8.1 Convolutional Layers 20
2.8.2 Pooling Layers 21
2.9 Regularization 22
2.9.1 Dropout 22
2.9.2 Early Stopping 22
2.10 Artificial Neural Networks 23

3 Classifying Material Defects using Neural Networks 24
3.1 Dataset 24
3.2 Data Collection and Data Cleaning 25
3.3 Data Augmentation 25
3.4 Neural Network Architectures 28
3.4.1 VGG Architecture 29
3.4.2 Customized Architecture 29
3.5 Network Training and Software 30
3.5.1 Transfer Learning 31
3.5.2 Model Performance 32

4 Classifying Material Defects using Image Processing 34
4.1 Image Processing and Object Detection 34
4.2 Canny Edge Detection 35
4.3 Morphological Operations 36
4.4 Contour Detection 36
4.5 Implementation and Software 36

5 Result 38
5.1 Defect Classification using CNN Models 38
5.2 Evaluation of CNN Models 40
5.3 Defect Detection using Image Processing 42
5.4 Comparison of CNN models and Image Processing 43

6 Discussion 46
6.1 Analysis of CNN Models 46
6.2 Analysis of Image Processing 47
6.3 Comparison of CNN Models and Image Processing 47
6.4 Limitations 48

7 Conclusion 49

8 Future Work 50


List of Figures

1.1 Images taken at 100x magnification and with white background. The left image contains only one Por and the right image contains many Por 10
1.2 Images taken at 1000x magnification and with dark background. The left image contains two Macro and the right image contains one Slits 10

2.1 The split of the total data into three subsets: training, validation and test data 13
2.2 A fully connected neural network 14
2.3 A node of a fully connected neural network in mathematical terms 14
2.4 The three different functions mentioned above 16
2.5 A convolutional neural network, with several blocks 20
2.6 A convolution operation between the input X and the kernel K producing the feature map Z 21
2.7 A max pooling with kernel size 2x2 and stride 2 22
2.8 The dropout effect in a network 22
2.9 The process of early stopping 23

3.1 The existence of pors in the images. The left image is of type A00B06 and the right image is of type A02B04 25
3.2 The horizontal flip of an image; on the left the original image and on the right the flipped image 26
3.3 The vertical flip of an image; on the left the original image and on the right the vertically flipped image 26
3.4 The rotation of an image; on the left the original image and on the right the rotated image 27
3.5 The translation of an image; on the left the original image and on the right the translated image 27
3.6 The cropping of an image; on the left the original image and on the right the zoomed image 28
3.7 The Gaussian noise added to an image; on the left the original image and on the right the noised image 28
3.8 VGG-16 neural network architecture [1] 29
3.9 Three different transfer learning strategies; the large rectangle is the main block of a model and the small rectangle is only the top layers, including the softmax classifier 31

5.1 Training accuracy and cross entropy loss of the pretrained VGG19 using transfer learning, training the top layers only. This figure shows the result without data augmentation 39
5.2 Training accuracy and cross entropy loss of the pretrained VGG19 using transfer learning, training the top layers only. This figure shows the result with data augmentation 39
5.3 Training accuracy and cross entropy loss of the customized model. This figure shows the result without data augmentation 40
5.4 Training accuracy and cross entropy loss of the customized model. This figure shows the result with data augmentation 40
5.5 The confusion matrix of the VGG19 model without data augmentation (right) and with data augmentation (left) 41
5.6 The confusion matrix of the customized model without data augmentation (right) and with data augmentation (left) 41
5.7 The four performance measures for Por (left) and Slits (right); the + signs represent the model with data augmentation. The blue chart is VGG19, and the orange one is VGG19+. The green one is the customized model (C), and the red is C+ 42
5.8 The four performance measures for Macro (left) and the final accuracy of all models (right); the + signs represent the model with data augmentation. The blue chart is VGG19, and the orange one is VGG19+. The green one is the customized model (C), and the red is C+ 42
5.9 The defects detected by image processing: both Macro and Por defects and some falsely detected defects 43
5.10 The defects detected by image processing: Macro, Por and Slits defects and some falsely detected defects at the edges of the picture on the left 43
5.11 The confusion matrix of the VGG19 model with data augmentation (left) and image processing (right) 44
5.12 The F1 score and final accuracy for the image processing approach and the VGG19 model with data augmentation. The first three charts represent the F1 score for each defect, and the last chart is the final accuracy for all three combined 45


List of Tables

3.1 The three defect categories for this thesis 24
3.2 Decomposition of VGG19 29
3.3 Decomposition of the customized network 30
3.4 Confusion table/matrix for binary classification 32

5.1 Training results for CNN models (VGG19 and customized = C) with and without data augmentation. In the table, data augmentation is denoted by a + sign to save space, so C+ means the customized model with data augmentation. The best performance metric is indicated in bold.


Chapter 1

Introduction

Artificial intelligence (AI) and machine learning (ML) have revolutionized today's society. Enormous progress has been made in this field over the last two or three decades. AI and ML are used in many different areas today, from cancer tumor detection to voice recognition; it is a billion-dollar industry. This tremendous growth is possible thanks to large computational power and the large amount of data available. AI and ML algorithms enable computers to learn from data, and even improve themselves, without being explicitly programmed. These algorithms can then make predictions about future data.

The idea of machine learning is not a new concept. The field started already around 1950, in the early age of computers. However, the breakthrough came in the 1990s with more probabilistic approaches and progress in the field of computer science. Big companies like Google and Facebook are investing large amounts of capital in developing new algorithms and platforms. Recently, Google developed TensorFlow, an open source library used for deep learning and neural networks. Recurrent neural networks (RNN) play an essential role in the field of natural language processing and text analysis, for example in text classification and voice recognition. The great success and performance of convolutional neural networks (CNN) within the field of image classification and object detection make them a promising strategy for industrial companies seeking to automate industrial processes. Another growing area for CNN models is the automotive industry, for autonomous driving cars.

1.1 Background

This thesis project was conducted at Sandvik Coromant in Gimo, Sweden, a world-leading supplier of tools, tooling solutions, and know-how to the metalworking industry. Sandvik Coromant offers products for turning, milling, drilling, and tool holding, and can also provide extensive process and application knowledge within machining. Sandvik Coromant is part of the Sandvik corporation, a global company with 43,000 employees that is active in 150 countries around the world.

The manufacturing process of metal cutting tools is very complicated. A cutting tool or cutter is used to remove material from the workpiece through shear deformation; it is, for example, used to drill metals or rocks. The process starts with cemented carbide powder and ends with a cutting tool product. During this long process, the risk of contamination and impurities is very high. These undesired substances can reduce the strength and other physical and chemical properties of the products. The company has rules and regulations that specific products must fulfill, and one of these rules concerns defect size. A defect is unwanted structural damage, contamination, or impurity introduced during the manufacturing process. Therefore, Sandvik Coromant has a testing department that controls and investigates the products. The control is performed by taking and analyzing microscopic pictures of the products with a conventional microscope; operators at the department perform this process. The testing team investigates the occurrence of many different structural defect types. Today, this detection and classification process is performed manually by operators. The process is highly tedious and time-consuming, and the company wants to improve its performance and accelerate this process with an automatic system.


1.2 Short Introduction about Data Set

The testing and quality department at Sandvik Coromant in Gimo has gathered microscopical images of many different types of defects. In this thesis, however, three types of defects are analyzed: Por, Macro, and Slits. These defects look similar, but different causes produce them; it is therefore highly essential to classify and distinguish them from each other. Por is usually under 25 micrometers and circular, see figure 1.1. Macro is over 25 micrometers and circular, but can also have other shapes, see figure 1.2. Slits is also over 25 micrometers and shaped like a long rectangle, see figure 1.2. The dataset consists of approximately 1000 images, with two different magnifications (100x and 1000x) and two different backgrounds (white and dark).

Figure 1.1: Images taken at 100x magnification and with white background. The left image contains only one Por and the right image contains many Por

Figure 1.2: Images taken at 1000x magnification and with dark background. The left image contains two Macro and the right image contains one Slits


1.3 Project Goal and Limitations

The main objective of this master's thesis is to develop machine learning and image processing algorithms to classify three different structural defects, described below.

• Por, very small defects, under 25 micrometers.

• Macro, bigger defects, from 25 micrometers.

• Slits, very small in width but long in height, from 25 micrometers.
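The three size and shape criteria above can be sketched as a simple rule on a defect's bounding-box dimensions. In the sketch below, the 25-micrometer threshold follows the definitions given above, but the aspect-ratio cutoff used to separate Slits from Macro is an illustrative assumption, not a rule taken from this thesis:

```python
def categorize_defect(width_um: float, height_um: float) -> str:
    """Toy rule-based categorization of a single defect from its
    bounding-box dimensions in micrometers. The 25 um threshold follows
    the definitions in the text; the aspect-ratio cutoff for Slits is
    an assumed value for illustration only."""
    size = max(width_um, height_um)
    aspect = size / max(min(width_um, height_um), 1e-9)
    if size < 25:
        return "Por"     # small and roughly circular
    if aspect > 3:       # long and narrow (assumed cutoff)
        return "Slits"
    return "Macro"       # large, circular or irregular

print(categorize_defect(10, 12))  # -> Por
print(categorize_defect(40, 35))  # -> Macro
print(categorize_defect(5, 60))   # -> Slits
```

In a real pipeline the dimensions would come from a detection step such as the contour detection described in chapter 4; here they are supplied directly.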

Two different strategies will be investigated to solve this problem.

1. The first approach is to implement CNN-models to classify these defects. The input is animage, and the output is a classification score.

2. The second approach is to implement an object detection algorithm, where the idea is todetect and classify the individual defects in an image.

This machine learning approach is the first project of its kind at Sandvik Coromant. The main question is this: is it possible to implement a machine learning model with sufficient performance to replace the manual process performed by operators? We will investigate this problem from both the academic and the industrial perspective.

The main potential limitations for this project are the following. Firstly, the contrast and magnification differences between images can be a problem for CNN models to capture; the optimal start would be if all the images were of the same quality, taken by the same microscope and with a similar background. Secondly, the testing department uses several lenses with different properties, which can introduce another limitation in the dataset. Thirdly, another limitation is gathering new images of the same quality. The Slits and Macro defects are not very common in the test samples, so it would be a time-consuming process to start from scratch and collect completely new images.

The rest of this thesis is divided into seven further chapters. Chapters 2 and 3 relate to the first goal of this project, defect classification using CNN models. Chapter 4 covers the second goal, the rule-based image processing algorithms used to detect individual defects in the images. The results and discussion are presented in chapters 5 and 6. Chapters 7 and 8 contain the conclusion and future work.


Chapter 2

Theory of Machine Learning and Neural Networks

2.1 General Introduction to Machine Learning

The fundamental principle of machine learning and statistical modeling is to formulate a mathematically formalized way to approximate reality and make future predictions from this approximation [2]. Machine learning is a sub-domain of computer science. The machine learns directly from the underlying structure of the input data and improves over time; it is not hardcoded with deterministic rules. Commonly, machine learning algorithms are categorized into three major categories: supervised, unsupervised, and reinforcement learning.

2.1.1 Supervised Learning

The models are trained on input data X and output data y, and the main idea is for the algorithm to learn the mapping function f from the input to the output.

y = f(X) (2.1)

In supervised learning, the learning is supervised, or watched, based on ground-truth labels or output data; a good analogy is a teacher overseeing the learning process of a student. Based on the correct labels, the algorithm iteratively makes predictions until it reaches an acceptable performance level. The two major domains in this area are classification and regression. Classification is a set of problems where the goal is to categorize data points into a set of pre-defined classes based on some attributes of the data. In practice this might involve image classification, where an algorithm is to decide whether an image contains a cat or a dog, or spam detection, where an e-mail is to be classified as spam or not depending on its contents. The outputs here are discrete values. Support vector machines and random forests are two well-known algorithms. Regression is the other major area; here the aim is to estimate some continuous hidden variable based on other observable variables, for example a stock price, an inflation rate, or a housing price. Linear regression is a famous algorithm in this field.

The first goal of this project is a supervised strategy, where CNN models will be implemented to classify material defects. Many deep learning algorithms use supervised learning, especially for image classification and object detection.
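As a minimal sketch of the supervised setting y = f(X), the example below fits a tiny nearest-centroid classifier, standing in for the heavier algorithms (support vector machines, random forests, CNNs) mentioned above; the data points and class names are invented for illustration.

```python
# Minimal illustration of supervised learning: learn f: X -> y from
# labeled examples, then predict labels for new points. A nearest-centroid
# classifier is used here purely as a simple stand-in.
from statistics import mean

def fit_centroids(X, y):
    """Compute one centroid per class from the labeled training data."""
    classes = sorted(set(y))
    return {c: tuple(mean(x[d] for x, lbl in zip(X, y) if lbl == c)
                     for d in range(len(X[0])))
            for c in classes}

def predict(centroids, x):
    """Assign x to the class with the nearest centroid (squared distance)."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], x)))

X_train = [(1.0, 1.2), (0.8, 1.0), (4.0, 4.2), (4.5, 3.9)]
y_train = ["cat", "cat", "dog", "dog"]
model = fit_centroids(X_train, y_train)
print(predict(model, (0.9, 1.1)))  # -> cat
print(predict(model, (4.2, 4.0)))  # -> dog
```

The same structure (fit on labeled data, then predict on new inputs) carries over to the CNN models used later in the thesis, only with far more expressive functions f.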

2.1.2 Unsupervised Learning

Unsupervised learning is where only input data (X) exists, with no corresponding output variables (y). The goal of unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about it. No ground truth or correct labels exist in this case. Two major sub-domains are clustering and association problems. K-means [3] is a well-known algorithm in this category.
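The clustering idea can be illustrated with a minimal one-dimensional K-means (Lloyd's algorithm): points are repeatedly assigned to their nearest center, and each center then moves to the mean of its cluster. The data and initial centers below are invented for illustration.

```python
# A minimal 1-D K-means sketch: group unlabeled points with no labels y.
def kmeans_1d(points, centers, iters=20):
    """Lloyd's algorithm in one dimension with fixed initial centers."""
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

data = [1.0, 1.2, 0.8, 8.0, 8.4, 7.6]
print(kmeans_1d(data, centers=[0.0, 10.0]))  # -> approximately [1.0, 8.0]
```

Real K-means implementations work in higher dimensions and choose initial centers more carefully, but the two alternating steps are the same.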

2.1.3 Reinforcement Learning

This category is action-based learning [4]: an agent takes actions to maximize a reward in a particular situation, and is supposed to find the best possible path to reach the reward.


During the training process, the program will favor actions that previously resulted in higher rewards given a similar state. Reinforcement learning requires no labeled data to learn from, as opposed to supervised learning; the input needed is instead a function to calculate the reward. Some applications of reinforcement learning are: path exploration, implemented in computer vision for robots to find a specific room or location; navigating through a complicated maze to find the exit; and games, where the goal is to find the optimal moves to win.

2.2 Training, Validation and Test Sets

Useful data is the backbone of a machine learning model. Many people think that machine learning is almost a kind of magic: just feed some data into a model, and it will give a fantastic result. The reality is not that simple. Good data will hopefully provide a useful result with certain tricks, but deep neural networks (DNN) need a large amount of data, not only high-quality data.

Figure 2.1: The split of the total data into three subsets, training, validation and test data

Usually, the total available data is split into three different subsets: training, validation and test sets, see figure 2.1. The percentage division depends on many factors, for example the number of hyperparameters, the quality of the data and the amount of data.
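The split in figure 2.1 can be sketched in plain Python; the 70/15/15 ratio below is an illustrative choice, not the split used in this thesis:

```python
import random

def train_val_test_split(samples, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle a dataset and split it into training, validation and test subsets."""
    items = list(samples)
    random.Random(seed).shuffle(items)       # reproducible shuffle
    n = len(items)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = items[:n_test]                    # held out for the final evaluation
    val = items[n_test:n_test + n_val]       # used to monitor training and tune
    train = items[n_test + n_val:]           # used to fit the model
    return train, val, test

train, val, test = train_val_test_split(range(1000))
print(len(train), len(val), len(test))       # 700 150 150
```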

Training data

The training dataset is the most important one. It is the dataset used to train the models; the model sees and learns from this data.

Validation data

Validation data is an important benchmark to monitor how well the training process is going for the models. It is used to cross-validate the performance of the model given some hyperparameters. Overfitting is a common problem in optimization and machine learning; the validation data is an excellent tool to check for this problem during training and to monitor the model accuracy.

Test data

The test set is used to evaluate the final model after the training and validation steps. This data is independent of the training and validation sets. The test dataset is used to obtain the definitive performance characteristics such as evaluation accuracy, sensitivity, specificity, F-measure, etc.

2.3 Fully Connected Networks

A fully connected neural network consists of a series of fully connected layers. It is the simplest variation of neural networks, illustrated in figure 2.2. The network consists of an input layer, some intermediate hidden layers, and an output layer. Each layer usually consists of a linear function followed by a non-linear transform. The input layer consists of the feature vector x. The network is categorized as a deep network if the number of hidden layers is more than one. As seen in figure 2.2, each layer consists of several neurons, where each neuron in one layer is connected to every neuron in the next layer. The weights of a layer are represented by real numbers and determine the information passing to the next layer. These weights are trained and



optimized during training to increase network performance and decrease the error function. Furthermore, each layer contains a bias term, see figure 2.3. These terms make the network more robust and generalized. A fully connected network is simply a mapping function f, as mentioned in 2.1.1.

Each hidden layer represents a matrix multiplication of the previous layer. Since every hidden layer in a fully connected neural network is an array of neurons, a vector representation is possible. In this case, the three-layer network in figure 2.2 can be seen as ŷ = a³(a²(a¹(x))), where a¹ represents the first hidden layer and a² the second. The final layer, a³ in this case, is usually called the output layer, which yields the prediction ŷ. These deep networks are highly non-linear because of the multiple non-linearities in the hidden layers, and can therefore learn very complex patterns in the given data.

[Figure: a network diagram with five input nodes, two hidden layers with bias nodes, and a single output y; consecutive layers are labelled l − 1, l and l + 1.]

Figure 2.2: A fully connected neural network

[Figure: a single node computing the weighted sum Σ of inputs x1, x2, x3 with weights w1, w2, w3 plus a bias b, followed by an activation function σ producing the output y.]

Figure 2.3: A node of a fully connected neural network in mathematical terms

The mathematical formulation for a three-layer fully connected network looks as follows. First, the input feature vector x is multiplied by a weight matrix W¹, and a bias vector b¹ is added. The result is then passed through a non-linear activation function σ¹, see figure 2.3, which yields the first activation a¹. This process repeats for the remaining layers to produce the output.

a^1 = \sigma^1(W^1 x + b^1)
a^2 = \sigma^2(W^2 a^1 + b^2)
\hat{y} = \sigma^3(W^3 a^2 + b^3)    (2.2)



The general mathematical formulation is as follows.

z^l = W^l a^{l-1} + b^l
a^l = \sigma^l(z^l)
\hat{y} = a^L = \sigma^L(z^L)    (2.3)

Here l goes from 1 to L, with a^0 = x, the input layer. z^l is the weighted input to the neurons in layer l before the activation function, a^l is the activation of the neurons in layer l after the activation function, l is the layer number, \sigma^l is the activation function, which can be different for every layer, W^l is the weight matrix of a layer, b^l is the bias, and finally L is the total number of layers in the network. All weights W and biases b are parameters that are optimized during the training process, so that the prediction ŷ approaches the actual target y. Each entry in the activation vector a represents a node in the network. The weights W then determine the strength of the links between interconnected layers. The non-linear transformation \sigma is introduced to increase the network complexity and capture a more complex data structure; otherwise, the network would only be a linear superposition of linear functions, regardless of the number of layers. The activation function is a hyperparameter, like the number of hidden layers. It can be a general function, both scalar and vector valued, but commonly it is a scalar, monotonically increasing function. Note that when \sigma is a scalar function, it is applied element-wise to vector inputs. Some standard activation functions are described later in this section.
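The layer recursion of equation 2.3 can be sketched as a loop, here in NumPy; using the sigmoid for every layer is an illustrative choice, since the activation is a hyperparameter:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward pass of equation 2.3: a^l = sigma(W^l a^(l-1) + b^l)."""
    a = x                                # a^0 = x, the input layer
    for W, b in zip(weights, biases):
        z = W @ a + b                    # weighted input z^l
        a = sigmoid(z)                   # activation a^l
    return a                             # a^L, the prediction y-hat

rng = np.random.default_rng(0)
sizes = [5, 4, 3, 2]                     # input, two hidden layers, output
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) for m in sizes[1:]]
y_hat = forward(rng.standard_normal(5), Ws, bs)
print(y_hat.shape)                       # (2,)
```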

2.4 Activation Functions

The use of activation functions has a biological motivation. The activation function is usually an abstraction representing the rate of action potential firing in a cell: it determines whether a neuron fires or not. The choice of activation function has therefore proven to be an essential factor when training deep neural networks. There exist many different types of activation functions, both linear and nonlinear. These functions map the output of a node in the network into a bounded range, for example (0, 1) or (−1, 1), to decide how strongly a node or neuron should fire. The most used ones are:

2.4.1 Sigmoid Function

The sigmoid function is a monotonic, smooth and differentiable function, defined for all real values and bounded in the interval (0, 1). The function and its derivative are defined as follows.

\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \frac{d\sigma(x)}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2}    (2.4)

The sigmoid function has several useful properties. First of all, it is highly nonlinear, which introduces complexity to the network. Secondly, it is very smooth and easy to implement. However, this function has some significant drawbacks. The primary problem is the vanishing gradient for large |x|-values: according to figure 2.4 the gradient converges toward zero for large absolute values. This results in slow error convergence during backpropagation, which is not desirable. The second problem is that the output is not zero-centered, which makes the gradient updates inefficient.
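The vanishing gradient is easy to see numerically; a small sketch of equation 2.4:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # d sigma/dx = e^-x / (1 + e^-x)^2, which equals sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))    # 0.25, the maximum of the derivative
print(sigmoid_grad(10.0))   # ~4.5e-05: the gradient has all but vanished
```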

2.4.2 Tanh Function

The hyperbolic tangent function (tanh) is closely related to the sigmoid function. It is bounded in (−1, 1) and centered around zero, see figure 2.4. Therefore, optimization is easier with this function; hence, in practice it is a good option over the sigmoid function. The function and its derivative are defined as follows.

\sigma(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad \frac{d\sigma(x)}{dx} = \frac{4}{(e^{x} + e^{-x})^2}    (2.5)



2.4.3 ReLU Function

The rectified linear unit (ReLU) has become highly popular in the ANN field in the past years. It is the standard function for hidden layers in CNN networks. The function and its derivative are defined as follows.

\sigma(x) = \max(0, x), \qquad \frac{d\sigma(x)}{dx} = \begin{cases} 0, & x < 0 \\ 1, & x > 0 \end{cases}    (2.6)

According to equation 2.6, the mathematical form of this function is straightforward and efficient. The big advantage is that it avoids the vanishing gradient problem, in contrast to the sigmoid and tanh functions; therefore, the convergence rate is much faster, as empirically shown in [5]. Another big advantage is the sparsity ReLU induces in the hidden layers, which has a regularizing effect and reduces overfitting [6]. However, ReLU also has some minor drawbacks: it can only be used in the hidden layers, and it can result in dead neurons.

Figure 2.4: The three different functions mentioned above

2.4.4 Softmax Function

For a classification problem, the desired output is a probability distribution over (0, 1). This output can be seen as class probabilities, used to decide whether a picture shows a dog, a cat or a human with a certain probability. For this purpose, the softmax activation function is implemented in the last ANN layer, the output layer. It is defined as follows.

\sigma_i(x) = \frac{e^{x_i}}{\sum_{c=1}^{C} e^{x_c}}, \qquad \frac{d\sigma_i}{dx_j} = \frac{e^{x_i}}{\sum_{c=1}^{C} e^{x_c}} \left[ \delta_{ij} - \frac{e^{x_j}}{\sum_{c=1}^{C} e^{x_c}} \right]    (2.7)

Here C is the total number of classes, and \delta_{ij} is the Kronecker delta, used to simplify the expression. Softmax is a generalized version of the sigmoid function 2.4, which is used for binary classification; softmax extends this idea to multi-class classification. That is, softmax assigns decimal probabilities to each class in a multi-class problem, and those probabilities must add up to 1.0.
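A minimal sketch of equation 2.7; shifting the scores by their maximum is a standard numerical-stability trick, not part of the definition:

```python
import math

def softmax(x):
    """Map raw scores to class probabilities that sum to 1 (equation 2.7)."""
    m = max(x)                               # shift for numerical stability
    exps = [math.exp(v - m) for v in x]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)           # the highest score gets the largest probability
print(sum(probs))      # 1.0 (up to floating point)
```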

2.5 Forward Propagation and Loss Functions

During the training process of an ANN, the first step is the forward pass, where the data is passed through the network from the input layer to the final layer. The forward pass calculates the weighted sums, applies the activation functions, predicts the outcome, and calculates the error rate, i.e., the difference between the predicted value and the actual value. A neural network can contain millions of weights; hence, weight initialization is very important to keep layer activations from exploding or vanishing during forward propagation through the network. A proper weight initialization improves the performance and the convergence speed of the network. Some common initializers are Zeros, Ones, Random Normal, and Random Uniform. Glorot Normal [7] is another popular method which has shown good results recently: it draws samples from a truncated normal distribution centered on 0, with a standard deviation adapted to the number of input and output weights of the layer.

2.5.1 Square Loss or L2-Norm

The prediction error is quantified by a loss function. A widespread loss function is the quadratic loss function, or square loss. In vector form it is defined as

E = \frac{1}{2} \|y - \hat{y}\|^2 = \frac{1}{2} (y - \hat{y})^T (y - \hat{y})    (2.8)

where y is the ground-truth value and \hat{y} is the predicted value. It is a common loss function for regression.

2.5.2 Cross Entropy Loss

A more specific loss function for classification is the cross-entropy loss. It measures the performance of a classification model based on class probabilities, which lie in the range 0 to 1. It is defined as

E = -\sum_{c=1}^{C} y_c \ln \hat{y}_c = -y^T \ln \hat{y}    (2.9)

In a binary classification problem, where C = 2, the cross-entropy loss can be simplified to

E = -y \ln \hat{y} - (1 - y) \ln (1 - \hat{y})    (2.10)

The binary cross-entropy loss 2.10 combined with the sigmoid function 2.4 is named binary cross entropy in many existing machine learning frameworks, and the softmax function 2.7 combined with 2.9 is called categorical cross entropy loss.
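As a sketch, the binary cross-entropy of equation 2.10 for a single prediction; the small eps clamp is a common practical guard, not part of the definition:

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Equation 2.10; eps guards against log(0)."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -y * math.log(y_hat) - (1.0 - y) * math.log(1.0 - y_hat)

# A confident correct prediction is cheap, a confident wrong one expensive.
print(round(binary_cross_entropy(1.0, 0.9), 4))   # 0.1054
print(round(binary_cross_entropy(1.0, 0.1), 4))   # 2.3026
```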

2.6 Backpropagation and Partial Derivatives

Backpropagation is the backbone of ANN training. It is a supervised learning technique for neural networks that calculates the gradients for all the weights in the network. The name is short for the backward propagation of errors, since the error is computed at the final layer and distributed backward through the network's layers. The goal is to minimize the prediction error calculated by a forward pass. The error is mathematically formulated in terms of a loss function E(\hat{y}, y), where \hat{y} is the predicted label and y the ground truth. This loss function is minimized with respect to the weights and biases in the network.

Figure 2.2 shows three consecutive layers of a network, denoted l − 1, l, l + 1, used here to illustrate the derivatives with respect to the weights and biases. The neurons in those layers are indexed by n_{l-1}, n_l and n_{l+1}.

The first step is to define a loss function E. The derivative with respect to a single weight in layer l is calculated using partial derivatives and the chain rule, as follows.

\frac{\partial E}{\partial w^l_{n_l n_{l-1}}}
= \frac{\partial E}{\partial z^l_{n_l}} \frac{\partial z^l_{n_l}}{\partial w^l_{n_l n_{l-1}}}
= \frac{\partial E}{\partial a^l_{n_l}} \frac{\partial a^l_{n_l}}{\partial z^l_{n_l}} \frac{\partial z^l_{n_l}}{\partial w^l_{n_l n_{l-1}}}
= \sum_{n_{l+1}} \frac{\partial E}{\partial z^{l+1}_{n_{l+1}}} \, w^{l+1}_{n_{l+1} n_l} \, \sigma'(z^l_{n_l}) \, a^{l-1}_{n_{l-1}}    (2.11)



Here z and a are defined in equation 2.3. Equation 2.11 represents the derivative with respect to one specific weight, so n_{l-1} and n_l are fixed. The summation arises because all contributions from the neurons in layer l + 1 have to be accounted for, since their values affect the error function. The term \partial E / \partial z^l_{n_l} in 2.11 is normally called the "error signal".

\delta^l_{n_l} = \frac{\partial E}{\partial z^l_{n_l}} = \sum_{n_{l+1}} \frac{\partial E}{\partial z^{l+1}_{n_{l+1}}} \, w^{l+1}_{n_{l+1} n_l} \, \sigma'(z^l_{n_l})    (2.12)

It encapsulates the error associated with each node in the network. The derivatives with respect to the biases are defined as follows.

\frac{\partial E}{\partial b^l_{n_l}} = \frac{\partial E}{\partial z^l_{n_l}} \frac{\partial z^l_{n_l}}{\partial b^l_{n_l}} = \frac{\partial E}{\partial z^l_{n_l}} = \delta^l_{n_l}    (2.13)

These four equations constitute the entire backpropagation algorithm in general matrix form [8].

\delta^L = \nabla_a E \odot \sigma'(z^L)
\delta^l = \big((w^{l+1})^T \delta^{l+1}\big) \odot \sigma'(z^l)
\frac{\partial E}{\partial b^l_{n_l}} = \delta^l_{n_l}
\frac{\partial E}{\partial w^l_{n_l n_{l-1}}} = a^{l-1}_{n_{l-1}} \delta^l_{n_l}    (2.14)

The first equation in 2.14 computes the error in the final layer, where z^L is defined in equation 2.3 and \odot denotes the Hadamard product, or element-wise multiplication. The second equation represents the error in the hidden or intermediate layers, in a recursive way. The third gives the gradients for the biases in the hidden layers. The last equation calculates the error gradients propagating between two successive layers.
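The four relations in equation 2.14 can be checked numerically on a tiny network; a sketch in NumPy with sigmoid activations and the square loss from section 2.5.1 (all sizes and random values here are arbitrary):

```python
import numpy as np

def sig(z):  return 1.0 / (1.0 + np.exp(-z))
def dsig(z): s = sig(z); return s * (1.0 - s)

def backprop(x, y, Ws, bs):
    """Gradients of E = 0.5 * ||y - a^L||^2 via the four equations in (2.14)."""
    a, zs, acts = x, [], [x]                   # forward pass, storing z^l and a^l
    for W, b in zip(Ws, bs):
        z = W @ a + b; zs.append(z)
        a = sig(z); acts.append(a)
    delta = (a - y) * dsig(zs[-1])             # delta^L = grad_a E (.) sigma'(z^L)
    dWs, dbs = [None] * len(Ws), [None] * len(bs)
    for l in range(len(Ws) - 1, -1, -1):
        dbs[l] = delta                         # dE/db^l = delta^l
        dWs[l] = np.outer(delta, acts[l])      # dE/dW^l = delta^l (a^(l-1))^T
        if l > 0:                              # recursive delta^l for hidden layers
            delta = (Ws[l].T @ delta) * dsig(zs[l - 1])
    return dWs, dbs

rng = np.random.default_rng(1)
Ws = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
bs = [rng.standard_normal(3), rng.standard_normal(2)]
x, y = rng.standard_normal(4), rng.standard_normal(2)
dWs, dbs = backprop(x, y, Ws, bs)

# numerical check of one weight against a finite difference
def loss(Ws, bs):
    a = x
    for W, b in zip(Ws, bs): a = sig(W @ a + b)
    return 0.5 * np.sum((y - a) ** 2)

eps = 1e-6
Wp = [W.copy() for W in Ws]; Wp[0][1, 2] += eps
num = (loss(Wp, bs) - loss(Ws, bs)) / eps
print(abs(num - dWs[0][1, 2]) < 1e-4)          # True: analytic matches numeric
```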

2.7 Optimization Methods in Neural Networks

The fundamental part of a machine learning algorithm is to minimize some given loss function and thereby reduce the error; this is also the case for a neural network. During the training of an ANN, the cost or loss function is optimized with respect to the weights and biases. The main idea is to find the global minimum. The ideal scenario would be to compute the first and second derivatives of the function and find the global minimum analytically, as in a calculus exam question, but that is not possible for the majority of real-world problems. As described in the earlier sections, a neural network is very complex and highly nonlinear. The solution is iterative methods, which take advantage of the gradient. There exist many different types of gradient-based algorithms. A common one in this field is gradient descent (GD), which iteratively moves in the direction of steepest descent, as defined by the negative of the gradient, to reach the minimum. In this case, it updates the weights to find a minimum. The algorithm is defined as follows.

w_{t+1} = w_t - \gamma \nabla E(w_t)    (2.15)

Backpropagation computes the gradient, and GD updates the weights. Here t is the iteration index, and γ is a hyperparameter called the learning rate. Sometimes the entire dataset is used for these updates, but for a neural network the dataset can be extensive, and using the whole dataset can be costly and inefficient. A mini-batch approach combined with GD is another training procedure, called stochastic gradient descent (SGD), where the complete dataset is divided into smaller portions. This approach is much faster and more efficient. However, SGD can induce some fluctuation in the parameter updates because of the stochasticity. Another problem with SGD and GD is that they can get stuck in, or oscillate around, a local minimum [9]. Momentum [10] is a more refined method that helps accelerate SGD in the appropriate direction and dampens oscillations.

\upsilon_t = \eta \upsilon_{t-1} - \gamma \nabla E(w_t)
w_{t+1} = w_t + \upsilon_t    (2.16)

Here η is called the momentum coefficient and υ is called the retained gradient. Typically η is set around 0.9.



Nesterov accelerated gradient (NAG) is a more precise method based on momentum [11]. It modifies the momentum algorithm to look ahead before updating; therefore, it is smarter and more accurate.

\upsilon_t = \eta \upsilon_{t-1} - \gamma \nabla E(w_t - \eta \upsilon_{t-1})
w_{t+1} = w_t + \upsilon_t    (2.17)

Here w_t − ηυ_{t−1} is used to compute an approximate derivative for the next update. This anticipatory update prevents the update from taking too big a jump, which makes the algorithm more efficient.

Many other, more efficient algorithms exist nowadays, with learning rates that adapt to the parameters; too many to mention them all. The two most used ones are RMSprop and Adam. Root Mean Square Propagation (RMSprop) is an unpublished optimization algorithm designed for neural networks; interestingly, it has never officially been published but is still widely used. Geoffrey Hinton proposed it in the lecture series "Neural Networks for Machine Learning" [12]. Without going into much detail, its main principle is an exponentially decaying learning rate based on the root mean square of the gradients, combined with momentum as in NAG; it adapts the learning rate to the dataset. Another widely used algorithm is adaptive moment estimation (Adam), proposed by D. Kingma and J. Ba in 2015 [13]. Adam adapts the parameter learning rates based on the average first moment (the mean) as in RMSprop, but also makes use of the average of the second moments of the gradients (the uncentered variance). To estimate the moments, Adam uses exponential moving averages, computed on the gradient evaluated on the current mini-batch.

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2    (2.18)

Here m and v are the moving averages, g is the gradient on the current mini-batch, and the exponential decay rates β₁ and β₂ for the moment estimates are hyperparameters, with 0.9 and 0.999 as standard values, respectively. The moving averages are initialized with zeros at the first iteration. Additionally, bias-corrected moment estimates \hat{m} and \hat{v} are introduced to account for the bias of the moving averages towards zero, which is evident in the early stages of training.

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}    (2.19)

Currently, Adam is the most used optimizer in neural networks, because it has shown good empirical results. The complete algorithm is outlined below [13].

Algorithm 1: Adaptive moment estimation (Adam) optimizer

Require: α: Stepsize
Require: β1, β2 ∈ [0, 1): Exponential decay rates for the moment estimates
Require: f(θ): Stochastic objective function with parameters θ
Require: θ0: Initial parameter vector
m0 ← 0 (Initialize 1st moment vector)
v0 ← 0 (Initialize 2nd moment vector)
t ← 0 (Initialize timestep)
while θt not converged do
    t ← t + 1
    gt ← ∇θ f(θt−1) (Get gradients w.r.t. stochastic objective at timestep t)
    mt ← β1 · mt−1 + (1 − β1) · gt (Update biased first moment estimate)
    vt ← β2 · vt−1 + (1 − β2) · gt² (Update biased second raw moment estimate)
    m̂t ← mt / (1 − β1^t) (Compute bias-corrected first moment estimate)
    v̂t ← vt / (1 − β2^t) (Compute bias-corrected second raw moment estimate)
    θt ← θt−1 − α · m̂t / (√v̂t + ε) (Update parameters)
end while
return θt (Resulting parameters)
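A sketch of Algorithm 1 in plain Python for a one-dimensional objective; the quadratic f(θ) = θ² (so the gradient is 2θ) and the stepsize are illustrative choices, while the β values are the defaults suggested in [13]:

```python
import math

def adam_minimize(grad, theta, alpha=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Algorithm 1: Adam with bias-corrected moment estimates."""
    m = v = 0.0                                  # first and second moments
    for t in range(1, steps + 1):
        g = grad(theta)                          # gradient at current parameters
        m = beta1 * m + (1 - beta1) * g          # biased first moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # biased second raw moment
        m_hat = m / (1 - beta1 ** t)             # bias corrections
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

theta = adam_minimize(lambda th: 2.0 * th, theta=5.0)
print(abs(theta) < 0.1)   # True: converged close to the minimum at 0
```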



Batch training is a standard approach for all these algorithms. This strategy is much more computationally efficient and more resistant to sample noise. The total data is randomly divided into smaller batches, and the final result is averaged over all of them.

2.8 Convolutional Neural Networks

Convolutional neural networks (CNN) are a class of deep neural networks. They are a powerful technique for image visualization, image classification, and object detection. CNN was inspired by the human vision system, specifically the primary visual cortex in the brain. When we see an object, the light receptors in the eyes send signals via the optic nerve to the primary visual cortex, the central processing unit for visual input. A newborn child does not know the difference between a car and a bus; a child learns to recognize objects from its direct environment and parents, seeing these objects a million times during childhood while the brain learns their specific features. All of this seems very simple and natural to us; however, it is a very complex process, and a computer is not as flexible and complex as the brain. A computer algorithm needs millions of pictures before it learns all the features and can generalize the input and make predictions for images it has never seen before. CNN models have proven to be very good at this task. Y. LeCun introduced the concept in 1989 [14]. The CNN architecture differs somewhat from the fully connected networks described in the previous section: convolutional neural networks do not connect every neuron in each layer to every neuron in the next layer, but instead take advantage of weight sharing. The weights are connected locally, meaning a node only connects to spatially adjacent nodes in the next layer; the network thus considers the spatial structure of the data. This concept is especially useful for images, because it is reasonable to assume that every pixel is correlated with its neighbors. The concept is similar to convolution in mathematics, especially in the fields of signal processing and Fourier analysis. The mathematical convolution of 1D (one-dimensional) discrete functions is defined below.

(k * x)[n] = \sum_{i=-\infty}^{\infty} k[i] \, x[n-i]    (2.20)

In the field of deep learning, k is called the kernel and x is the input. The convolution produces a third function, which is in some sense a similarity measure between the two.
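For finite sequences, the sum in equation 2.20 runs only over the indices where both factors are defined; a small sketch:

```python
def conv1d(k, x):
    """Full discrete convolution: (k * x)[n] = sum_i k[i] * x[n - i]."""
    out = [0.0] * (len(k) + len(x) - 1)
    for n in range(len(out)):
        for i in range(len(k)):
            if 0 <= n - i < len(x):       # skip terms where x is undefined
                out[n] += k[i] * x[n - i]
    return out

print(conv1d([1, 2], [1, 1, 1]))   # [1, 3, 3, 2]
```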

Figure 2.5: A convolutional neural network, with several blocks

Convolutional neural networks are constructed by concatenating several individual blocks that achieve different tasks. Every layer of a CNN transforms one volume of feature maps to another through a nonlinear activation function. Figure 2.5 illustrates a network with a convolutional layer, a pooling layer, and a fully connected layer. The input data are images, and the output is mainly a class score for every image.

2.8.1 Convolutional LayersAn image is represented as matrices or tensors in the computer, by their RBG values. Therefore,the convolutional filters or kernels, in this case, are matrices. The convolution operates by lettingthe kernel slide over the input while computing the output on a patch of the input at a time, takinginto account the spatial relation of the elements in the input. The equation below represents a

20

Page 24: Classifying Material Defects with Convolutional Neural Networks …1330515/... · 2019. 6. 25. · The other model was a smaller one with customized layers. The trained models are

general The convolutional operation for images.

Z_{ij} = (X * K)(i, j) = \sum_{l=1}^{K_1} \sum_{m=1}^{K_2} \sum_{n=1}^{C} X(i+l, j+m, n) \, K(l, m, n)    (2.21)

The input image is a 3D tensor of size [D_1, D_2, C], where D_1 is the height, D_2 the width and C the number of channels. The corresponding kernel is also a 3D tensor, of size [K_1, K_2, C], where K_1 is the height and K_2 the width; according to equation 2.21 the number of kernel channels equals the number of image channels. A convolutional layer applies P such kernels, one per output channel. The convolution operation depends only on the spatial coordinates, and its output is a feature map called Z.

X (input, 7×7):
0 1 1 1 0 0 0
0 0 1 1 1 0 0
0 0 0 1 1 1 0
0 0 0 1 1 0 0
0 0 1 1 0 0 0
0 1 1 0 0 0 0
1 1 0 0 0 0 0

K (kernel, 3×3):
1 0 1
0 1 0
1 0 1

X ∗ K (feature map, 5×5):
1 4 3 4 1
1 2 4 3 3
1 2 3 4 1
1 3 3 1 1
3 3 1 1 0

Figure 2.6: A convolution operation between the input X and the kernel K, producing the feature map Z

As seen in figure 2.6, the kernel slides over the entire input image and produces an output. The input image is 7x7 and the output is 5x5, i.e., reduced by 2 in each dimension. The focus point of a kernel is always the center weight; in this case, the filter misses the image edges, since the kernel would end up outside the image if its center were placed directly at an edge. To handle the edges during convolution, zero padding is introduced. Zero padding adds extra zeros around the image to increase its size; those zeros do not add any additional features. In this case, the padded input size increases to 9x9, and the output becomes 7x7, the same size as the original image, which is the desired effect. The following hyperparameters determine the size of the output image.

D_{out} = \frac{D_{in} - D_k + 2p}{S} + 1    (2.22)

Here D_in is the input size, D_k is the kernel size, p is the zero padding, and S is called the stride. The stride parameter decides how much the filter shifts per step, and thereby how much successive filter positions overlap; in this case, the stride is one. If the stride equals the kernel size, the filter windows do not overlap at all, and the spatial correlation between neighboring patches is ignored.
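The sliding-window operation and equation 2.22 can be sketched in NumPy; run on the 7x7 input and 3x3 kernel of figure 2.6 (stride 1, no padding, and, as in most CNN frameworks, without flipping the kernel), it reproduces the 5x5 feature map shown there:

```python
import numpy as np

def conv2d(X, K, stride=1):
    """Valid 2D convolution as in figure 2.6: slide K over X, no padding."""
    d_in, d_k = X.shape[0], K.shape[0]
    d_out = (d_in - d_k) // stride + 1          # equation 2.22 with p = 0
    Z = np.zeros((d_out, d_out))
    for i in range(d_out):
        for j in range(d_out):
            patch = X[i * stride:i * stride + d_k, j * stride:j * stride + d_k]
            Z[i, j] = np.sum(patch * K)         # elementwise product, summed
    return Z

X = np.array([[0,1,1,1,0,0,0],
              [0,0,1,1,1,0,0],
              [0,0,0,1,1,1,0],
              [0,0,0,1,1,0,0],
              [0,0,1,1,0,0,0],
              [0,1,1,0,0,0,0],
              [1,1,0,0,0,0,0]])
K = np.array([[1,0,1],
              [0,1,0],
              [1,0,1]])
print(conv2d(X, K).astype(int))   # the 5x5 feature map of figure 2.6
```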

The zero padding that preserves the spatial size of the input for stride S = 1 is

p = \frac{D_k - 1}{2}    (2.23)

After each convolutional layer, it is a convention to apply a nonlinear activation function. The purpose of this layer is to introduce nonlinearity into a system that has so far only computed linear operations. For more details, see the previous sections.

2.8.2 Pooling Layers

The pooling, or downsampling, layer is implemented to reduce the spatial size of the activation maps. In general, pooling layers are used after multiple stages of other layers (i.e., convolutional and non-linearity layers) to progressively reduce the computational requirements through the network, as well as to minimize the likelihood of overfitting, see figure 2.5. A pooling layer has two hyperparameters: the filter size and the stride. There are several pooling techniques; an often used one is max pooling, see figure 2.7. Max pooling operates by keeping the highest value within the filter region and discarding the rest of the values.



Input (4×6):
7 9 3 5 9 4
0 7 0 0 9 0
5 0 9 3 7 5
9 2 9 6 4 3

2×2 max pooling, stride 2 (2×3 output):
9 5 9
9 9 7

Figure 2.7: A max pooling with kernel size 2x2 and stride 2

Other options for pooling layers are average pooling and L2-norm pooling. Max pooling hasdemonstrated faster convergence and better performance in comparison to the average pooling andL2-norm [15].
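A sketch of max pooling in NumPy; applied to the 4x6 input of figure 2.7, it reproduces the 2x3 output shown there:

```python
import numpy as np

def max_pool(X, size=2, stride=2):
    """Keep the maximum of each size x size window, as in figure 2.7."""
    h = (X.shape[0] - size) // stride + 1
    w = (X.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = X[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

X = np.array([[7, 9, 3, 5, 9, 4],
              [0, 7, 0, 0, 9, 0],
              [5, 0, 9, 3, 7, 5],
              [9, 2, 9, 6, 4, 3]])
print(max_pool(X).astype(int))   # [[9 5 9]
                                 #  [9 9 7]]
```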

2.9 Regularization

Overfitting is a significant problem in the field of data science and machine learning. An overfitted model is a statistical model that contains more parameters than can be justified by the data. An overfitted model performs exceptionally well on training data but is not able to produce good predictions on test data: the model captures the noise or irrelevant information in the dataset during the training step. This phenomenon is more probable for a complex model, and deep neural networks are very complex models, which makes them prone to overfitting. Regularization is a technique which can reduce this effect: some minor modifications are introduced into the learning algorithm such that the trained model generalizes better, which in turn improves the model's performance on future predictions. Several regularization techniques exist; two important ones are described in the next sections.

2.9.1 Dropout

Dropout is nowadays a popular regularization technique in the field of neural networks [15]. The key idea is to randomly drop units from the neural network during training, by setting them to zero. This prevents units from co-adapting too much. The number of active weights decreases, and with it the model complexity.

[Figure: a fully connected network where dropout deactivates a random subset of nodes, marked ×.]

Figure 2.8: The dropout effect in a network

The dropout technique has one hyperparameter p ∈ [0, 1], the fraction of units to drop. Dropout can be applied after each layer in the network if necessary.
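A sketch of so-called inverted dropout, the common implementation (an assumption here, as the thesis does not specify the variant): the surviving activations are rescaled by 1/(1 − p) at training time, so nothing needs to change at test time:

```python
import numpy as np

def dropout(a, p, rng):
    """Zero each activation with probability p and rescale the survivors."""
    mask = rng.random(a.shape) >= p          # True = keep the unit
    return a * mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones(10000)
out = dropout(a, p=0.5, rng=rng)
print(sorted(set(out.tolist())))     # [0.0, 2.0]: dropped, or kept and rescaled
print(round(out.mean(), 1))          # 1.0: the expected activation is preserved
```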

2.9.2 Early Stopping

Early stopping is another powerful and widely used regularization technique in deep learning [16]. Usually, the available data is divided into three different categories: training data, validation data, and test data. The training data is used to train the model, and the validation set is used to validate the performance of the model during each training step. The errors on both sets are monitored at each step. When the performance on the validation set is getting worse, or has not improved for some number of subsequent epochs, the training is immediately stopped, see figure 2.9. This prevents the model from overfitting and improves generalization.
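The stopping rule can be sketched with a patience parameter (the function name and the loss values below are illustrative):

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch at which training stops: when the validation loss
    has not improved for `patience` consecutive epochs."""
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0       # new best: reset the counter
        else:
            since_best += 1
            if since_best >= patience:
                return epoch                 # patience exhausted: stop here
    return len(val_losses) - 1               # never triggered: ran to the end

# validation loss improves, then starts rising: stop 3 epochs after the best
losses = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.8]
print(early_stop_epoch(losses))   # 6
```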

Figure 2.9: The process of early stopping

2.10 Artificial Neural Networks

Artificial neural networks (ANN) were first introduced around 1943 by W. McCulloch and W. Pitts (MP) [17]. Their work was inspired by the human brain and the character of nervous activity. In their article, they created the first computational model for neural networks based on propositional logic, or threshold logic, building on the notion of the "all-or-none" character of nervous activity. D. O. Hebb worked on a theory, called Hebbian theory, which discusses the mechanism of neural plasticity and the interaction between neurons. Rochester, Holland, Haibt, and Duda developed neural networks further and performed simulations on IBM computers [18, 19]. Finally, in 1958, F. Rosenblatt introduced the perceptron model [20]; it extended MP's work by introducing the concept of association cells. A perceptron is simply a single-layer neural network in today's notation, and the multi-layer perceptron is the building block of today's deep neural networks. Even though Rosenblatt's methodology is closely linked to the current structure of deep neural networks, these networks would not become fully applicable and competitive with other, much simpler methods until many decades later. In 1969, Minsky and Papert pointed out the limitations of the computational machines that implemented perceptrons at that time. The first important factor was that a single perceptron was incapable of processing the exclusive-or circuit; the second major factor was that neural networks required vast amounts of data and computing power, which were not available at the time [21]. Another critical component that was missing was an efficient way of training the networks, first introduced in 1974 by Werbos [22]: the idea of backpropagation, where the errors are propagated backward through the network by differentiation applying the chain rule. The more general technique is called automatic differentiation, and Werbos finally applied it to neural networks in 1982 [23] in the form used today.

In recent years, neural networks have become a popular approach to many machine learning problems, especially within the field of computer vision. These networks apply to both supervised and unsupervised learning.


Chapter 3

Classifying Material Defects using Neural Networks

This chapter covers the practical implementation of two different CNN models for defect classification, related to the first goal of this project. In sections 3.1 and 3.2, the dataset and the data processing are outlined. Section 3.3 discusses the different data augmentation techniques used to enlarge the original dataset for CNN training. The neural network architectures and the training process are presented in section 3.4.

3.1 Dataset

As described in section 1.2, the total gathered dataset is circa 1000 images in three categories/defects. The defects are defined by their geometrical properties, see table 3.1.

Table 3.1: The three defect categories for this thesis

Category   Size criterion
Por        < 25 µm, [0.06, 0.6] volume %
Macro      > 25 µm and length:width < 5:1
Slits      > 25 µm and length:width > 5:1

In the original dataset, the Por category is divided into types A and B. However, during this project no distinction is made between type A and B as long as the volume percentage of defects is greater than 0.06. In fact, in some images both types co-exist, see figure 3.1, and it is sometimes difficult to make a clear distinction between them. The essential factor in this case is the total volume percentage. In figure 3.1, the volume percentage is 0.02 for type A and 0.04 for type B, giving 0.06 in total. The volume percentage is an indicator of how many defects exist per image. For the categories Slits and Macro, the vital factor is instead the size of the defect. If the defect size (height or width) is larger than 25 µm, it is considered unacceptable: the risk of product failure is much higher for larger defects, so the requirements and standards are stricter. The distinction between Macro and Slits is only the ratio of height to width. If the height:width ratio is equal to or greater than 5:1, the defect is classified as Slits, otherwise as Macro, as shown in table 3.1. Slits are thin and long, while Macros are thick and short, see figure 1.2. In total there are 440 Macro, 360 Por, and 340 Slits images.


Figure 3.1: The existence of Por defects in the images. The left image is of type A00B06 and the right image of type A02B04

3.2 Data Collection and Data Cleaning

The first phase of this project was to understand the problem, break it down into smaller pieces, and make a preliminary time plan. In this phase, the data collection and data cleaning were performed, in collaboration with colleagues in Sandvik Coromant's quality lab. The gathered data was stored in different files on local drives at the computer lab; the data was neither well structured nor sorted into categories. The three defects analyzed in this project are just a few of the defects occurring in the manufactured products, and in total there were approximately 30000 images, which made this step quite tedious and time-consuming. After this process, the collected images were classified into the three categories for this project with the help of an operator. We managed to find circa 1000 useful images, the majority of type Macro and Slits, taken at 1000x magnification. During this process some new images were also gathered, mainly of type Por at 100x magnification. The operators managed to find test samples that had a high probability of containing these three defects; Macro and Slits were very rare to find. After a few weeks of work, approximately 1100 images were available.

3.3 Data Augmentation

Deep neural networks are highly complex, with several million weights that must be optimized and fitted to the training data. It is therefore not only important to have a good model for the specific problem; it is of at least equal importance that the dataset in question is of high quality and sufficient size. In most deep learning applications, however, the number of parameters in a neural network exceeds the number of data points in the dataset, sometimes by a large margin. This becomes problematic if the network is not trained well, since it is easy for the model to overfit to the training data. Several regularization methods have been developed in the past years to address this problem, and data augmentation is one of the most widely used. The main idea is to enlarge the dataset by modifying it in various ways, so that a more extensive variety of data is shown to the network. This effectively increases the size of the dataset, reducing the risk of overfitting. In this section, several basic data augmentation methods are described. For this part, Keras, Imgaug [24] and Augmentor [25] are used.

Horizontal and Vertical Flip

Both horizontal and vertical flipping are very efficient and common data augmentation methods, see figures 3.2 and 3.3. These methods are implemented so that the trained model becomes invariant to mirrored objects. This is extra useful for this project, because the probability is high that mirrored images occur. Approximately 50 % of the images are affected by these methods, so the models are trained equally on both variants.


Figure 3.2: The horizontal flip of an image, on the left original image and the right flipped image

Figure 3.3: The vertical flip of an image, on the left the original image and on the right the vertically flipped image

Rotation

Rotation is another efficient and intuitive augmentation method. The rotation angle is uniformly distributed in [−θmax, θmax]. For the case in figure 3.4, the maximum rotation angle is set to 45 degrees. The main reason for this method is that the network has to recognize the object in any orientation. Rotation by such fine angles can create problems for some applications: the rotated image adds extra background, see the four extra corner regions created in the figure. If this background is very different from the rest of the image, it will certainly create problems, because the network can learn false features. However, these extra edges can be very useful in this case, because they have the same intensity and color as the rest of the image, so they can be seen as real edges by the network. The probability is high that future images used for prediction also contain such edges. Rotation has been applied to all the images with random rotation angles.


Figure 3.4: The rotation of an image, on the left original image and the right rotated image

Translation

Translation is another common augmentation method. It helps the network recognize the object in any part of the image; the object can also be present partially in a corner or at an edge of the image. For this reason, the object is shifted to various parts of the image, see figure 3.5. Like rotation, this method can create extra edges with noise and potentially cause problems, but this feature can be beneficial here: it creates extra edges that the network can treat as real, which helps the network generalize to future predictions that contain real edges.

Figure 3.5: The translation of an image, on the left original image and the right translated image

Zooming

Zooming is a good augmentation method for making the models invariant to the size of objects, see figure 3.6. The zoom range is [0.8, 1.0], which means zooming in by a maximum of 20 %. Having differently scaled objects of interest in the images is an important aspect of image diversity. This method may be extra helpful for this specific dataset due to the magnification issue: as mentioned earlier in this section, the images are taken at two different magnifications, 100x and 1000x. Implementing this method may cancel out that effect.


Figure 3.6: The zooming of an image, on the left the original image and on the right the zoomed image

Gaussian Noise

Adding Gaussian noise is a powerful technique that reduces overfitting to some extent and makes the models more robust to injected noise, see figure 3.7. It also makes it more difficult for the network to memorize very small pixel-level details. Gaussian noise is a common choice, but many other noise types exist, such as Laplace and Poisson noise. Adding salt-and-pepper noise, which presents itself as random black and white pixels spread through the image, is another possibility. It is similar in effect to Gaussian noise but may have a lower information distortion level.

Figure 3.7: Gaussian noise added to an image, on the left the original image and on the right the noised image
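As a minimal illustration of the flip and rotation methods above, a random augmentation step can be sketched in plain NumPy. The project itself used Keras, Imgaug and Augmentor for this; the sketch below restricts rotation to multiples of 90 degrees so that no extra background edges are introduced, and the function name is ours, not from the thesis code.

```python
import numpy as np

def augment(image, rng):
    """Apply a random flip and a random 90-degree rotation to one image.

    A minimal NumPy sketch of the flip/rotation augmentation ideas;
    real pipelines (Keras, Imgaug, Augmentor) offer far more options.
    """
    if rng.random() < 0.5:          # horizontal flip with probability 0.5
        image = np.flip(image, axis=1)
    if rng.random() < 0.5:          # vertical flip with probability 0.5
        image = np.flip(image, axis=0)
    k = rng.integers(0, 4)          # rotate by 0, 90, 180 or 270 degrees
    return np.rot90(image, k)

rng = np.random.default_rng(0)
img = np.arange(12).reshape(3, 4)   # toy "image" with distinct pixel values
out = augment(img, rng)
```

Because only flips and 90-degree rotations are used, the augmented image contains exactly the original pixel values, merely rearranged.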

3.4 Neural Network Architectures

The theory of CNNs was presented in section 2.10. As described, these networks contain many different parts; in this section, these ingredients are connected to build a complete architecture. Finding a suitable architecture is not a trivial task; there are many hyperparameters to cross-validate, and new, improved, and more complex architectures are presented every year. The ImageNet challenge [?] is an annual challenge used as the benchmark; its purpose is to classify millions of images into many different categories. During the last few years, several different successful architectures have improved the accuracy further. The breakthrough for CNN models came in 2012,


when the network called AlexNet won the competition by a big margin compared to other methods [5]. This significant progress motivated many researchers and big companies to invest time and money in this field. In the following years, new architectures were invented, and CNNs continued to dominate, decreasing the error rate each year.

Two different designs are investigated in this thesis. One of these architectures, the VGG network, was the winner of the ImageNet challenge in 2014. The other design is a customized one, much smaller compared to VGG.

3.4.1 VGG Architecture

The VGG network is a relatively compact and straightforward architecture using small 3x3 convolutional layers with stride one, stacked on top of each other in increasing depth. 2x2 max pooling with stride two is added to reduce the image volume and the number of weights. Two fully-connected layers, each with 4096 nodes, are followed by a softmax classifier. The final softmax output has 1000 nodes, corresponding to the 1000 classes of the ImageNet competition. Simonyan and Zisserman [26] proposed this network, see figure 3.8. The CNN layers have a much smaller size and receptive field in VGG-net compared to previous nets such as AlexNet, which usually had 7x7 kernels and larger receptive fields. However, several smaller kernels stacked together give the same effect: for example, three stacked 3x3 kernels have the same receptive field as one 7x7 kernel. Smaller kernels enhance the network's nonlinearity and complexity, which can capture more sophisticated features; it is a very efficient way to improve performance with fewer weights. There are two different VGG-nets, one with 16 layers called VGG16 and one with 19 layers called VGG19; both have the same structure. In table 3.2, a model summary is presented. The number of channels starts with three input channels and then goes from 64 to 512 in increasing order. An additional dropout layer is added to the network to reduce overfitting, and the original 1000-class softmax is replaced by a 3-class softmax for this thesis.

Figure 3.8: VGG-16 neural network architecture [1]

Table 3.2: Decomposition of VGG19

Layer name        Number of layers   Number of channels
Convolution       12                 [64, 128, 256, 512]
Max pooling       4                  [64, 128, 512]
Fully connected   3                  [4096, 512, 3]
Dropout           2                  [4096, 512]

3.4.2 Customized Architecture

All the architectures developed for the ImageNet competition are large networks, and they keep getting larger; VGG-net is one of the smallest. Two other vital networks are the residual


network (ResNet) developed by He et al. [27] in 2016 and DenseNet developed by Huang et al. [28] in 2017. Training these networks, with their millions of weights, requires a large amount of training data; such amounts of data are out of scope for this thesis. Instead, a smaller, customized architecture has been implemented to compare the results with VGG19. The architecture is similar to VGG but with fewer layers and with batch normalization for the hidden layers. The concept of batch normalization was introduced in 2015 by S. Ioffe et al. [29]. It came after the VGG-nets, which explains why they do not have it. Batch normalization is a powerful technique for improving the speed, performance, and stability of artificial neural networks. In table 3.3, a model summary for this customized network is presented. The number of channels starts with three input channels and then goes from 16 to 128 in increasing order. Two dropout and five batch normalization layers are added to the network to reduce overfitting. Finally, a three-class softmax function is added to classify the defects.

Algorithm 2: Batch Normalizing Transform, applied to activation x over a mini-batch
Require: values of x over a mini-batch B = {x_1, ..., x_m}; parameters to be learned: γ, β
  µ_B ← (1/m) Σ_{i=1}^{m} x_i              (mini-batch mean)
  σ²_B ← (1/m) Σ_{i=1}^{m} (x_i − µ_B)²    (mini-batch variance)
  x̂_i ← (x_i − µ_B) / √(σ²_B + ε)          (normalize)
  y_i ← γ x̂_i + β ≡ BN_{γ,β}(x_i)          (scale and shift)
return y_i = BN_{γ,β}(x_i)                 (result)

The algorithm first calculates the mean and variance for each mini-batch, then normalizes the activations, and finally performs a scale and shift. Notice that γ and β are learned during training, along with the original parameters of the network. In the algorithm, ε is a constant added to the mini-batch variance for numerical stability. BN_{γ,β} is called the Batch Normalizing Transform. Batch normalization has other great benefits: the network can use a higher learning rate without vanishing or exploding gradients, and it reduces overfitting because it has a slight regularization effect, similar to dropout. In contrast to dropout, however, no information is lost.
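The transform can be written compactly in NumPy. This is a sketch of the training-time forward pass only; the running statistics that a framework keeps for inference, and the backward pass, are omitted.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform over a mini-batch.

    x has shape (m, features); gamma and beta are learned per-feature
    parameters. A NumPy sketch of the training-time forward pass only.
    """
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # scale and shift

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
```

With γ = 1 and β = 0 the output of each feature has (up to the ε term) zero mean and unit standard deviation over the mini-batch.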

Table 3.3: Decomposition of customized network

Layer name            Number of layers   Number of channels
Convolution           5                  [16, 32, 64, 128]
Max pooling           3                  [16, 32, 64]
Fully connected       3                  [1024, 512, 3]
Dropout               2                  [64, 512]
Batch normalization   5                  [16, 32, 64, 128]

3.5 Network Training and Software

Deep learning is a trendy, state-of-the-art research field and currently the dominating area of machine learning. Therefore, a wide variety of deep learning software tools is publicly available. Most are open-source and distributed under licenses that allow commercial use. These frameworks are often driven by leading universities and companies like Google, Facebook, Microsoft, and Amazon. Theano, developed by the University of Montreal, is one of the first and most used libraries [30]. Caffe is another powerful framework, developed by UC Berkeley [31]. PyTorch is a popular framework developed by Facebook [32]. However, TensorFlow, developed by Google [33], is the most used framework for deep learning today; it runs in several languages, including Python. According to Stack Overflow questions and GitHub users, TensorFlow has the largest user community.

For this part of the project, Keras and TensorFlow are used. TensorFlow is based on flow graphs and tensor operations for the dataflow and backpropagation derivatives, which makes it very computationally efficient. It provides a powerful framework for building machine learning models with a variety of abstraction levels. Low-level APIs can be used to construct a model by defining a series of mathematical operations, which is more time-consuming and more demanding in terms of coding and learning curve. The other option is to use a high-level API such as Keras [34] to specify predefined architectures. This higher-level abstraction is easier to use but less flexible. Keras is an open-source deep learning library written in Python and is capable of using TensorFlow, Theano, and Microsoft Cognitive Toolkit (CNTK) as back-end. It is very user-friendly and allows fast experimentation with different models and hyperparameters. Extensibility and modularity are


two other great advantages of Keras. Like TensorFlow, it is capable of running on both CPU and GPU.

During this project, several different image resolutions were tested to see what level of pixel detail was necessary. The dataset contained images with different resolutions depending on the microscope, a mix of 2448x2048 and 1592x1196. The default input size for the VGG19 model is 224x224, and after some iteration a 224x224 input size was used for both networks. The first approach was to randomly crop an original image into several smaller images, but this method had a major drawback: for this classification strategy, the entire image is of interest, not only a small part of it. Therefore, the images were instead down-sampled to keep the entire image. Additionally, the customized network was tested with grayscale images. However, the pretrained VGG19 is fine-tuned and adapted for color images with three channels and does not accept grayscale input, so the grayscale test could not be performed for it. It is possible to convert the VGG19 weights for single-channel input, but this was not done in this project. It is worth mentioning that grayscale input reduces the number of trainable weights drastically, since two channels are simply dropped, which can enhance computational efficiency.

The complete dataset was divided into three sets: training data (80 %), validation data (15 %), and test data (5 %), see section 2.2. The models were trained on mini-batches of the training data. Different batch sizes in factors of 2 were tested for both training and validation; the most common were 32 and 16 for training, and 16 and 8 for validation. The images in the batches were shuffled and selected randomly by Keras. The dataset is imbalanced; there are more Macro images than Slits and Por. When training neural networks, it is important to have balanced classes so that the networks are exposed to equally many images of each class, to prevent them from becoming biased towards the class with more samples. Two different approaches were tested to solve this problem. The first was to adjust the data to the class with the least data, which in this case was Slits with around 300 images. This is called undersampling, meaning that some instances of over-represented classes are removed; the data augmentation techniques were then applied equally to each class. The other approach was to use all the available data and adjust the augmentation methods to produce a balanced final dataset. In this scenario, the Slits class was augmented much more to compensate for the smaller amount of original data. This is called oversampling, meaning that instances of under-represented classes are repeated. These two approaches are together called data sampling; many other methods exist for handling imbalanced datasets [35].
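Oversampling by repetition can be sketched in a few lines of plain Python. The class counts below mirror the dataset sizes quoted in section 3.1 (440/360/340); the function itself is a generic illustration, not the exact project code, which balanced the classes through augmentation rather than literal duplication.

```python
import random

def oversample(class_images, rng):
    """Balance classes by repeating instances of under-represented classes.

    class_images maps class name -> list of image identifiers. Extra
    instances are drawn with replacement until every class matches the
    largest one. A generic sketch of the oversampling idea.
    """
    target = max(len(images) for images in class_images.values())
    balanced = {}
    for name, images in class_images.items():
        extra = rng.choices(images, k=target - len(images))  # sample with replacement
        balanced[name] = images + extra
    return balanced

rng = random.Random(0)
data = {"Macro": list(range(440)), "Por": list(range(360)), "Slits": list(range(340))}
balanced = oversample(data, rng)
```

After balancing, every class contains as many instances as the largest class (here Macro, 440).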

3.5.1 Transfer Learning

For the VGG19 model, a method called transfer learning was used. It is a compelling approach to various machine learning problems. The concept was first introduced around 1990 by L. Pratt [36].

Figure 3.9: Three different transfer learning strategies; the large rectangle is the main block of a model and the small rectangle is only the top layers, including the softmax classifier


The main idea is to reuse a trained model to train new models for another, similar problem. This technique is especially applicable to deep learning problems, where a large amount of data and computation capability is required; the gained knowledge is recycled. It is a good approach in this case, because the original dataset (without data augmentation) is small. As mentioned in section 3.4.1, the VGG-nets are trained on ImageNet, millions of images in many different classes. In Keras, it is straightforward to use these networks as a starting point for a new model; it takes only a few lines of code. In this case, VGG19 with its trained weights was used as the base model to build a new model. The last fully connected and softmax layers were not included. Instead, a few customized layers were added to the base model, including dropout layers for regularization (see section 2.9) to avoid overfitting. A final softmax layer was added to produce class probabilities for the three different classes. This approach corresponds to strategy 3 in figure 3.9, where the base model is frozen and only the top layers are trained. A CNN model contains many different layers, all trained to recognize and learn useful features during training and then generalize to future predictions. The learning process of these CNN models, for example through visualization of feature maps, has been researched for a long time and still is. It has been shown that these models learn low-level image features, such as lines, dots, and curves, in the earlier layers of the network, while high-level features, such as common objects and shapes, are learned in the later layers. Therefore, during training, various layers were retrained for this project while all other weights were frozen. The low-level features are relatively common across many images; hence the focus was on retraining the last 5-10 layers of the VGG19-net, so that the high-level features are learned extensively.
This method is illustrated by strategy 2 in figure 3.9, where some extra layers of VGG are retrained. The customized model was, however, trained according to strategy 1 in figure 3.9: the complete model is trained from scratch.
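In Keras, the frozen-base setup (strategy 3) indeed takes only a few lines. The sketch below uses the `tensorflow.keras` API; the Dense layer size and dropout rate are illustrative assumptions, not the exact thesis configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Sketch of transfer-learning strategy 3: freeze the VGG19 convolutional
# base (ImageNet weights, top layers excluded) and train only a small
# customized top. Layer sizes here are illustrative.
base = tf.keras.applications.VGG19(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False                      # freeze all base-model weights

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),                    # regularization against overfitting
    layers.Dense(3, activation="softmax"),  # three defect classes
])
```

For strategy 2, one would additionally set `trainable = True` on the last few layers of `base` before compiling, so that the high-level features are fine-tuned as well.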

The ADAM optimizer (see section 2.7) was used for the loss minimization, and cross-entropy (see section 2.5.2) combined with softmax was implemented as the loss function. Regular SGD and the RMSprop optimizer were also tested. Additionally, the principle of early stopping (see section 2.9) was used to minimize overfitting. The ReLU activation function (section 2.4) is used almost exclusively for all the hidden layers in both networks. The models were trained locally on a PC with an internal GPU (NVIDIA Quadro M2000M) with 1029 MHz GPU clock speed and 4 GB of memory. This worked fine for image sizes smaller than 156x156 for the models used here.

Three image processing steps were performed before the input batches were sent through the network. First, pixel values were rescaled to between 0 and 1. Second, sample-wise centering: the mean value of each sample was set to 0. Third, sample-wise standard normalization: the standard deviation of each sample was set to 1.
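The three steps can be sketched for a single image in NumPy (a standalone illustration; in the project these operations were handled by the Keras input pipeline):

```python
import numpy as np

def standardize_sample(image):
    """Rescale to [0, 1], then set the sample mean to 0 and std to 1.

    A NumPy sketch of the three preprocessing steps: rescaling,
    sample-wise centering, and sample-wise std normalization.
    """
    x = image.astype(np.float64) / 255.0   # rescale pixel values to [0, 1]
    x = x - x.mean()                       # sample-wise centering
    return x / x.std()                     # sample-wise std normalization

img = np.array([[0, 64], [128, 255]], dtype=np.uint8)
out = standardize_sample(img)
```

After the transform, each individual sample has zero mean and unit standard deviation, regardless of its original brightness and contrast.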

3.5.2 Model Performance

The performance measure is an essential part of a machine learning model, since a model is trained to make good future predictions. Model accuracy is one of the most used performance metrics, but accuracy alone is not enough to judge predictive power. The null error rate, or accuracy paradox [37], describes the situation where a model performs excellently on the majority class but very poorly on the minority classes. For example, if the class Macro is dominant, occurring in 99 % of cases, then predicting that every case is Macro gives an accuracy of 99 %. The confusion matrix, or error matrix, is a better or complementary method to evaluate the final model [38]. This matrix contains as many rows and columns as the number of classes.

Table 3.4: Confusion table/matrix for binary classification

                  Predicted True          Predicted False
Actual True       True positive (TP)      False negative (FN)
Actual False      False positive (FP)     True negative (TN)

As shown in table 3.4, the confusion matrix represents the predicted classes versus the actual classes, for all classes, not only the majority class. From the confusion matrix, some important performance measures used in this project can be derived. The true positive rate (TPR), also called sensitivity or recall, is the ability of a classifier to find all positive instances. For each class, it is defined as the ratio of true positives to the sum of true positives and false negatives.


TPR = TP / (TP + FN)    (3.1)

The positive predictive value (PPV), also called precision, is the ability of a classifier not to label as positive an instance that is actually negative. For each class, it is defined as the ratio of true positives to the sum of true positives and false positives.

PPV = TP / (TP + FP)    (3.2)

The F1 score is the harmonic mean of TPR and PPV; it is defined as follows:

F1 = 2TP / (2TP + FP + FN)    (3.3)

The final accuracy (ACC) is also included in the performance analysis; it is defined as follows:

ACC = (TP + TN) / (TP + TN + FN + FP)    (3.4)

All these four metrics are used for the final model performance analyses.
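As a concrete sketch, the four metrics can be computed directly from the confusion-matrix counts. The counts in the example call are made up for illustration only.

```python
def metrics_from_counts(tp, fn, fp, tn):
    """Compute TPR, PPV, F1 and ACC from binary confusion-matrix counts,
    following equations 3.1-3.4."""
    tpr = tp / (tp + fn)                 # sensitivity / recall, eq. 3.1
    ppv = tp / (tp + fp)                 # precision, eq. 3.2
    f1 = 2 * tp / (2 * tp + fp + fn)     # harmonic mean of TPR and PPV, eq. 3.3
    acc = (tp + tn) / (tp + tn + fn + fp)  # overall accuracy, eq. 3.4
    return tpr, ppv, f1, acc

# Illustrative counts: 8 true positives, 2 false negatives,
# 2 false positives, 8 true negatives.
tpr, ppv, f1, acc = metrics_from_counts(tp=8, fn=2, fp=2, tn=8)
```

For the multi-class case here (Por, Macro, Slits), the same computation is done per class by treating that class as positive and the other two as negative.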


Chapter 4

Classifying Material Defects using Image Processing

This chapter covers the practical implementation of image processing for defect classification and defect detection, related to the second goal of this project. It is an alternative to the first method, and the final results will be compared with each other. In section 4.1, a more general description of object detection is outlined. In sections 4.2 and 4.4, the Canny edge detection algorithm and contour detection algorithms are presented. In section 4.5, the implementation details are presented.

4.1 Image Processing and Object Detection

In the field of computer vision, classification and object detection are two major categories. Both often take images as input, but the output differs. For classification, the models are often one-to-one: a model takes an image as input and outputs a class score, for example to determine whether an image shows a cat or a dog; the model does not care where the cat is in the image. In object detection, the network usually outputs two quantities: the class score and the coordinates of the object. Such models can deal with several objects in an image. Many different object detection approaches exist; some successful deep learning approaches are Region Proposals (R-CNN) [39], Single Shot MultiBox Detector (SSD) [40], and You Only Look Once (YOLO) [41]. These methods are very sophisticated and efficient but require a lot of data. Especially YOLO and SSD are fast and accurate, and those two are used commercially today in self-driving cars and other sectors.

However, this thesis focuses on a much simpler strategy, implementing classical image processing algorithms to detect the defects (objects) in the images: an edge detection algorithm combined with contour detection and some other mathematical morphological operations, described in the following sections. This approach is somewhat rule-based; the defects are categorized by their pixel-size values. Depending on whether the width or height of a defect is smaller or larger than specific threshold values, it is classified as Por, Macro, or Slits. Therefore, this method is not considered a machine learning approach. The essential idea behind gradient-based detection is that local object appearance and shape within an image are described by the distribution of intensity gradients or edge directions.

One of the most fundamental image analysis operations is edge detection. Edges are often vital clues for the analysis and interpretation of image information, both in biological vision and in computer image analysis. Edge detection became an essential part of machine vision because it allows general assumptions about the image formation process: a discontinuity in image brightness can be assumed to correspond to a discontinuity in either depth, surface orientation, reflectance, or illumination. Edges in images usually have strong links to the physical properties of the world, which can help with image segmentation and object detection. Edge detection requires computing image derivatives. However, differentiation of an image is an ill-posed problem; image derivatives are sensitive to various sources of noise [42]. Smoothing the images is a common remedy. An early approach to edge detection involved the convolution of the image f by a Gaussian kernel G or


Green's function F, followed by the detection of zero-crossings in the Laplacian response [43][44].

∇²(G ∗ f) = 0    (4.1)

This smoothing approach introduces some undesirable results, for example false edges, loss of information, and pixel displacement. Many more sophisticated methods exist today. All edge detection algorithms compute image derivatives, but they use different filters and approaches. One of them, the Canny filter, is described in more detail in the next section.

|∇f| = √( (∂f/∂x)² + (∂f/∂y)² ),    ψ = arctan( (∂f/∂y) / (∂f/∂x) )    (4.2)

The gradient of an image is a vector of its partial derivatives, as in equation 4.2. The gradient direction is perpendicular to the edge orientation; the second expression in 4.2 gives the edge direction.

4.2 Canny Edge Detection

Canny edge detection is a five-stage algorithm to detect a wide range of edges in images, developed by J. Canny [45]. The five steps are described below.

Noise Reduction

As mentioned above, edge detection involves image derivatives, which are susceptible to image noise. One way to get rid of noise in the image is therefore to apply a Gaussian blur to smooth it. This uses an image convolution with a specific Gaussian kernel. The equation for a Gaussian filter kernel of size g1 × g2 is given by:

G_ij = (1 / (2πσ²)) exp( −((i − g1)² + (j − g2)²) / (2σ²) )    (4.3)

The kernel size is an important hyperparameter which impacts the amount of image blurring and the noise sensitivity. A larger kernel reduces the sensitivity to noise.
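A Gaussian kernel in the spirit of equation 4.3 can be built in a few lines of NumPy. Two assumptions are made in this sketch: the kernel center is taken as the midpoint of a square size x size kernel, and the kernel is explicitly normalized to sum to 1 so that blurring preserves overall image brightness.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Build a normalized size x size Gaussian blur kernel (cf. eq. 4.3).

    Sketch: center taken as the kernel midpoint, values normalized
    to sum to 1 so convolution preserves overall brightness.
    """
    c = (size - 1) / 2.0                  # midpoint index of the kernel
    i, j = np.mgrid[0:size, 0:size]       # row and column index grids
    g = np.exp(-((i - c) ** 2 + (j - c) ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()                    # normalize to unit sum

k = gaussian_kernel(5, sigma=1.0)
```

The resulting kernel has its maximum at the center and sums to 1; convolving an image with it performs the smoothing step described above.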

Gradient Calculation

The gradient calculation step detects the edge intensity and direction by calculating the gradient of the image using edge detection operators. A well-known operator is the Sobel operator, developed by I. Sobel [46]. Two different Sobel filters exist, one for horizontal and one for vertical gradients.

Gx = [ −1 0 1 ; −2 0 2 ; −1 0 1 ] ∗ I,   Gy = [ −1 −2 −1 ; 0 0 0 ; 1 2 1 ] ∗ I

Gx and Gy are the horizontal and vertical derivative approximations (gradient intensity matrices) of the input image I, respectively.

|∇f| = √(Gx² + Gy²),   ψ = arctan(Gy / Gx)   (4.4)

The magnitude |∇f| and the direction ψ of the gradient are then calculated according to equation 4.4.
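A minimal NumPy sketch of this step is shown below. It applies the two Sobel kernels to a tiny synthetic image containing a vertical step edge and evaluates equation 4.4; the test image and the plain sliding-window kernel application are illustrative only.

```python
import numpy as np

def apply_kernel(img, k):
    """Slide a 3x3 kernel over the image ('valid' region only). This is a
    cross-correlation; for the Sobel kernels the missing flip only changes
    the sign of the response, not its magnitude."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for y in range(h - 2):
        for x in range(w - 2):
            out[y, x] = np.sum(img[y:y + 3, x:x + 3] * k)
    return out

# Sobel kernels from the text.
Kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
Ky = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)

# Tiny synthetic image with a vertical step edge in the middle.
img = np.hstack([np.zeros((5, 3)), np.ones((5, 3))])

Gx = apply_kernel(img, Kx)
Gy = apply_kernel(img, Ky)
mag = np.sqrt(Gx ** 2 + Gy ** 2)   # |∇f| from equation 4.4
psi = np.arctan2(Gy, Gx)           # gradient direction ψ

print(bool(mag.max() > 0))         # True: the edge produces a horizontal gradient
print(bool(np.allclose(Gy, 0)))    # True: no vertical gradient for a vertical edge
```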

Non-Maximum Suppression
Non-maximum suppression is an edge-thinning technique that suppresses extra and unwanted edges; ideally, the final image should have thin and distinct edges. This step is applied after the gradient calculation. The principle is relatively simple: the algorithm searches the gradient intensity matrix and keeps only the pixels with the maximum value along the edge directions.


Double Threshold
After non-maximum suppression, the remaining edge pixels provide a more accurate representation of the real edges in an image. However, those edges can still contain noise and substantial intensity variation. The algorithm therefore divides the pixels into three categories: strong, weak, and non-relevant. The strong pixels certainly contribute to the final edges, while the non-relevant pixels are sorted out. The intermediate (weak) pixels are processed in the next step of the algorithm to decide whether they should be part of the final edges.

Edge Tracking by Hysteresis
Based on the double threshold result, the intermediate pixels are transformed into strong or non-relevant pixels. If and only if at least one of the pixels around the one being processed is strong, the corresponding pixel is upgraded to a strong pixel; otherwise it is set to zero.
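The last two stages (double threshold and edge tracking by hysteresis) can be sketched in a few lines of NumPy. The gradient magnitudes and the two thresholds below are made up for illustration; real implementations such as `cv2.Canny` fuse all five stages.

```python
import numpy as np

def hysteresis(mag, low, high):
    """Double threshold plus edge tracking, as a sketch. Pixels >= high are
    strong; pixels in [low, high) are weak and are kept only if they are
    connected (8-neighbourhood) to a strong pixel."""
    strong = mag >= high
    weak = (mag >= low) & ~strong
    out = strong.copy()
    changed = True
    while changed:  # propagate until no weak pixel changes state
        changed = False
        ys, xs = np.nonzero(weak & ~out)
        for y, x in zip(ys, xs):
            nb = out[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
            if nb.any():
                out[y, x] = True
                changed = True
    return out

# Toy gradient magnitude image: one strong pixel and a chain of weak ones.
mag = np.array([[0, 2, 9],
                [0, 4, 0],
                [0, 3, 0]], dtype=float)
edges = hysteresis(mag, low=2, high=8)
print(edges.astype(int))
```

The whole weak chain survives because it connects back to the strong pixel in the corner; an identical chain without a strong neighbour would be discarded.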

4.3 Morphological Operations

Mathematical morphology is a tool for extracting image components useful in the representation and description of region shapes, such as boundaries, skeletons, and convex hulls. Morphological operations apply a structuring element, or kernel, to an input image, creating an output image of the same size. In a morphological operation, the value of each pixel in the output image is based on a comparison of the corresponding pixel in the input image with its neighbors. The most basic morphological operations are dilation and erosion. The mathematical formulas and definitions are omitted in this report; the fundamental operations are union, intersection, and complement plus translation, the essential components of set theory. For more theoretical definitions and proofs, see [47].

Erosion
The primary effect of this operator on a binary or grayscale image is to erode the boundaries of regions of foreground pixels (i.e., white pixels). Thus, areas of foreground pixels shrink in size, and holes within those areas grow larger. The operation takes two inputs: the image and the structuring element (kernel). It is an effective way to remove small white noise in the picture.

Dilation
Dilation is principally the opposite of erosion. This operation generally increases the size of objects, filling in holes and broken areas, and connecting regions that are separated by spaces smaller than the structuring element. On grayscale images, dilation increases the brightness of objects by taking the neighborhood maximum when passing the kernel over the image. In practice, erosion is applied first to remove the noise, which also shrinks the objects in the image; dilation is then applied to restore the object size.
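The two operations can be sketched on a binary image with a 3x3 structuring element in plain NumPy (the thesis used OpenCV's `cv2.erode` and `cv2.dilate`; the toy image below is made up):

```python
import numpy as np

def erode(img):
    """Binary erosion with a 3x3 kernel: a pixel stays 1 only if its whole
    3x3 neighbourhood is 1 (borders are zero-padded)."""
    p = np.pad(img, 1)
    h, w = img.shape
    return np.min([p[y:y + h, x:x + w] for y in range(3) for x in range(3)], axis=0)

def dilate(img):
    """Binary dilation with a 3x3 kernel: a pixel becomes 1 if any pixel in
    its 3x3 neighbourhood is 1."""
    p = np.pad(img, 1)
    h, w = img.shape
    return np.max([p[y:y + h, x:x + w] for y in range(3) for x in range(3)], axis=0)

# A solid 3x3 object plus one isolated noise pixel in the top-left corner.
img = np.array([[1, 0, 0, 0, 0],
                [0, 0, 1, 1, 1],
                [0, 0, 1, 1, 1],
                [0, 0, 1, 1, 1],
                [0, 0, 0, 0, 0]])

opened = dilate(erode(img))  # erosion removes the noise, dilation restores the object
print(int(opened[0, 0]))     # 0: the noise pixel is gone
print(int(opened.sum()))     # 9: the 3x3 object survives intact
```

Erosion followed by dilation, as here, is known as morphological opening.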

4.4 Contour Detection

A contour is a closed curve of points or line segments representing the boundary of an object in an image. In other words, contours represent the shapes of objects found in an image and are a useful tool for shape analysis, object detection, and recognition. Contour detection is usually performed after edge detection. The Ramer-Douglas-Peucker algorithm [48], developed in 1972, and the Satoshi Suzuki algorithm, developed in 1985 [49], are two contour-finding algorithms.

4.5 Implementation and Software

The first idea for this part was to implement an object detection algorithm such as YOLO [41] or R-CNN [50] to detect and draw a box around the different defects. Both classify the defects in an image and also detect their coordinates. YOLOV3 [51], which is a region-selective algorithm, was implemented for this project. These object detection algorithms require two input parameters, the input image and the coordinates of the objects in the image. Therefore, much more work is needed to prepare and process the data. A graphical object annotation program was used to annotate all the images manually, a very tedious and time-consuming process. A transfer learning approach was used in this case, with the darknet neural network framework weights for YOLOV3. A new softmax layer was added on top of the existing network for the purpose of this project. After all these endeavors, however, the final result was not very promising. Firstly, the defects were too small for the network to detect. Secondly, the training data was limited, because it was difficult to find a good data augmentation method that would both augment the data and simultaneously keep track of all the annotated boxes and their coordinates. R-CNN methods, which are better at detecting smaller objects, were another possible approach.

After all this work, a much simpler concept was implemented to detect the defects in these images: an edge- and contour-detection method was used instead. The Canny edge detection algorithm (see section 4.2), combined with some morphological operations (see section 4.3), was tested to detect these defects. This approach was selected because the images in this project are relatively simple; the form and color of the defects are very distinct from the rest of the image. However, this approach is applicable only to images without the etching process, i.e. the images with a clear white background, see figure 3.1. For those, detecting edges and contours is simple enough. OpenCV (cv2), a standard Python library for image processing, contains all these algorithms. The complete strategy, which contains many different components, is described below.

1. The original image was converted to grayscale

2. The image was cropped by about 80 % to remove the image corners

3. A Gaussian filter was applied to remove some noise

4. The image was sent through a Canny filter to detect edges

5. Dilation and erosion operations were applied to make the edges more distinct

6. The contour function in cv2 was applied to detect the contours, connecting the edges from the previous steps

7. Another cv2 function was used to find the bounding boxes of the detected contours

8. A rectangle drawing function was used to draw the bounding boxes on the images

9. The defects were classified based on the detected height and width

10. Finally, the resulting images with bounding boxes were saved locally. Additionally, the height, width, and corresponding file names were saved in an Excel spreadsheet to see how many boxes were detected per image


Chapter 5

Result

In this chapter, all the results are presented, including a performance comparison between the two CNN models implemented in this project. In section 5.1 the overall training strategy for the two CNN models is outlined, including some figures related to the training process. The model evaluation is discussed in section 5.2, with some statistical performance measures. These two sections relate to the first goal of this project, the classification task using CNN models, linked to chapters 2 and 3. Section 5.3 relates to the second goal of this project, using rule-based image processing to detect and classify defects, linked to chapter 4. In the last section, 5.4, a small comparison between these two strategies is presented.

5.1 Defect Classification using CNN Models

The training process of neural networks involves many hyperparameters that have to be tuned. A few of these parameters are the learning rate, regularization, batch size, image resolution, data augmentation techniques, etc. Parameter tuning is a trial-and-error process. It requires many experiments and iterations until a satisfying result is achieved; finding the optimal parameters for a specific problem is simply an optimization task in itself. Therefore, during this project, many iterations and experiments were performed to find hyperparameters that produced reasonably good results. The final image resolution was set to 224x224, in accordance with the VGG19 model's training images. The available GPU could not handle a higher resolution, and a lower resolution reduced the accuracy. The final batch size was set to 32 for training and 16 for validation; the smaller validation batch size was chosen because of the limited computational power of the GPU. Overfitting was a significant problem at the beginning of the training process for both CNN models. Two regularization techniques solved this problem: feature map dropout and early stopping. A scheduled learning rate was another strategy that improved convergence and training stability significantly. The initial learning rate was 1e-2 and was then reduced by a factor of 0.75 after every four steps. Five different data augmentation techniques were performed for this project, described in section 3.3. The augmentation was applied equally, and the training dataset increased by a factor of five. The models were trained using Keras with GPU-based TensorFlow as backend, both with and without data augmentation. Input samples were shuffled at the beginning of each epoch, and the model parameters were updated after each epoch. The validation process was done at the end of each epoch. Each training epoch takes around 20-80 seconds. For more details about the network implementation, see section 3.5.
The final results for five different training strategies are summarized in table 5.1, two for VGG19 and three for the customized model. Figures 5.1 and 5.2 present the cross-entropy loss and the training and validation accuracy for the VGG19 model, both with and without data augmentation. Figures 5.3 and 5.4 present the same quantities for the customized model.
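As a small illustration of the scheduled learning rate described above (initial rate 1e-2, reduced by a factor of 0.75 every four steps, assuming a step here means an epoch), the rule can be written as a plain function and, in Keras, passed to `tf.keras.callbacks.LearningRateScheduler`:

```python
def scheduled_lr(epoch, initial=1e-2, factor=0.75, step=4):
    """Step-decay rule: multiply the initial learning rate by 0.75
    for every completed block of four epochs."""
    return initial * factor ** (epoch // step)

print(round(scheduled_lr(0), 6))   # 0.01
print(round(scheduled_lr(4), 6))   # 0.0075
print(round(scheduled_lr(8), 6))   # 0.005625
```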


Table 5.1: Training results for the CNN models [VGG19 and customized = C] with and without data augmentation. In the table, data augmentation is denoted by a + sign to save space, so C+ means the customized model with data augmentation. The best performance metric is indicated by bold text.

Model        Training loss   Validation loss   Training accuracy   Validation accuracy
VGG19        0.1020          0.9012            0.9540              0.9010
VGG19+       0.0010          0.5510            0.9950              0.9503
C            0.1520          0.6540            0.9560              0.7850
C+           0.2230          0.8505            0.9430              0.7890
C grayscale  0.2530          0.9040            0.9430              0.7450

Figure 5.1: Training accuracy and cross-entropy loss of the pretrained VGG19 using transfer learning, training the top layers only. This figure shows the result without data augmentation.

Figure 5.2: Training accuracy and cross-entropy loss of the pretrained VGG19 using transfer learning, training the top layers only. This figure shows the result with data augmentation.


Figure 5.3: Training accuracy and cross-entropy loss of the customized model. This figure shows the result without data augmentation.

Figure 5.4: Training accuracy and cross-entropy loss of the customized model. This figure shows the result with data augmentation.

All the networks were trained for 100 epochs, as shown in the figures above. The training and validation accuracy increase very fast in the beginning; after approximately 20 to 30 epochs, they start to stabilize and grow very slowly. The cross-entropy loss follows the same pattern; it decreases very rapidly and then stabilizes. Figures 5.1 and 5.2 illustrate that the accuracy converges much faster for the pretrained VGG19 network compared to the customized model in figures 5.3 and 5.4. The accuracy goes from zero to almost 70 % after five to ten epochs. This rapid growth is due to the transfer learning strategy, as described in section 3.5.1. The VGG19 network is already trained on millions of images and can capture the low-level features of the training samples, because those features, edges, lines, etc., are common to all images. The customized model, however, is trained from scratch. As seen in the figures, the overall accuracy for VGG19 is significantly higher than for the customized model.

As mentioned previously, the networks were trained both with and without data augmentation to compare the overall performance and the effect of augmentation. Figures 5.2 and 5.4 show the networks trained with five different data augmentation techniques. The augmentation strategy improved the overall accuracy by circa 5 % for the VGG19 model. However, it did not greatly impact the customized model: the training accuracy increased by circa 5 %, but the validation accuracy suffered slightly. The cross-entropy loss fluctuates more as a result of data augmentation for both networks. As shown in table 5.1, VGG19 with data augmentation reached the highest accuracy and the lowest loss.
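As an illustration of how a five-fold increase of the training set can be obtained, the sketch below derives four extra variants per image using simple flips and rotations. The five techniques actually used are described in section 3.3 and may differ from these.

```python
import numpy as np

def augment(img):
    """Produce four extra variants per image (horizontal flip, vertical
    flip, 90 and 180 degree rotations), so the dataset grows by a factor
    of five including the original. Illustrative only; the thesis'
    augmentation techniques are described in section 3.3."""
    return [img,
            np.fliplr(img),
            np.flipud(img),
            np.rot90(img, 1),
            np.rot90(img, 2)]

dataset = [np.arange(9).reshape(3, 3) for _ in range(10)]
augmented = [v for img in dataset for v in augment(img)]
print(len(augmented))  # 50: five times the original 10 images
```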

5.2 Evaluation of CNN Models

During training, the best weights per epoch are saved into a final model that is used to evaluate the model and make future predictions. The models are evaluated on the test dataset, which is 5 % of the entire dataset. The complete test set contains 105 images: 50 of category Macro, 34 of Por, and 21 of Slits. It is a biased set, and as described in section 3.5.2 it is essential to have reliable performance metrics to avoid the so-called accuracy paradox. Therefore, four different performance indicators are implemented for this project: True Positive Rate (TPR), Positive Predictive Value (PPV), F1 score, and final accuracy (ACC). All four are calculated from the confusion matrix of each model. The confusion matrices are presented in figures 5.5 and 5.6, which give an indication of the number of misclassified images per class. As shown in the figures, the misclassification rate is higher for the customized model compared to VGG19, both with and without data augmentation. The Slits are most often misclassified as Macro; for the customized model, 7 of 43 Slits are labeled as Macro. Both models perform very well for Por defects, which are mostly labeled correctly. This is not a surprising result: Macro and Slits have similar features and are therefore difficult to distinguish, whereas Por defects are easily distinguishable because of their characteristic features. In the figures, the number 0 represents Por, 1 Macro, and 2 Slits. The four matrices represent the VGG19 and customized models with and without data augmentation.
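The four indicators can be computed from a confusion matrix as below. The matrix is made up for illustration (its row sums match the stated class counts, but it is not taken from the thesis figures).

```python
import numpy as np

def per_class_metrics(cm):
    """TPR (recall), PPV (precision), and F1 per class, plus overall
    accuracy, from a confusion matrix with true classes as rows and
    predicted classes as columns."""
    tp = np.diag(cm).astype(float)
    tpr = tp / cm.sum(axis=1)          # recall: TP / (TP + FN)
    ppv = tp / cm.sum(axis=0)          # precision: TP / (TP + FP)
    f1 = 2 * tpr * ppv / (tpr + ppv)   # harmonic mean of TPR and PPV
    acc = tp.sum() / cm.sum()
    return tpr, ppv, f1, acc

# Rows/columns: 0 = Por, 1 = Macro, 2 = Slits (same encoding as the figures).
# Illustrative numbers only, not the thesis results.
cm = np.array([[33, 1, 0],
               [0, 48, 2],
               [0, 7, 14]])
tpr, ppv, f1, acc = per_class_metrics(cm)
print(round(float(acc), 3))  # 0.905 for this made-up matrix
```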

Figure 5.5: The confusion matrix of the VGG19 model without data augmentation (right) and with data augmentation (left)

Figure 5.6: The confusion matrix of the customized model without data augmentation (right) and with data augmentation (left)


Figure 5.7: The four performance measures for Por (left) and Slits (right); the + signs represent the models with data augmentation. The blue chart is VGG19, and the orange one is VGG19+. The green one is the customized model (C), and the red one is C+.

Figure 5.8: The four performance measures for Macro (left) and the final accuracy of all models (right); the + signs represent the models with data augmentation. The blue chart is VGG19, and the orange one is VGG19+. The green one is the customized model (C), and the red one is C+.

Figures 5.7 and 5.8 illustrate the four performance metrics for the models used in this project. All four metrics for each defect type are summarized in one chart to make the comparison easier. The first three charts represent the metrics for Por, Macro, and Slits. The final graph shows the total accuracy for all the models, four different combinations. All these charts reflect the previous results, indicating that VGG19 performs better than the customized model for all three defects, and that VGG19 with data augmentation performs best of all. F1, which is a weighted average of TPR and PPV, shows that Por defects are recognized with almost no errors by both models; the recognition rate is close to 100 %. The overall result for Macro is also very high, especially for VGG19 with data augmentation. However, the overall results for Slits are significantly lower, especially for the customized model with data augmentation, where the F1 score is only 78 % compared to over 90 % for the other model combinations. The final accuracy is largely consistent with the other metrics. It shows that VGG19 is the clear winner, with almost 100 % accuracy.

5.3 Defect Detection using Image Processing

In this section, the results related to the second goal of this project are presented; the theory and specific implementation details are given in chapter 4. The second goal of this project was to use a machine learning approach to detect and classify the individual defects, in contrast to the first goal, where the entire image was classified. Primarily YOLOV3, a selective region algorithm based on CNNs, was implemented for this task. It is a powerful real-time object detection algorithm. However, after some experimentation with this algorithm, the produced result was not satisfying. Three main limitations made this approach challenging to overcome: firstly the limited dataset, secondly the data augmentation techniques, and thirdly that the defects are too small.

In the end, the machine learning approach with YOLOV3 was disregarded. Instead, a rule-based image processing method was chosen; for more details, see section 4.5. An edge- and contour-detection algorithm, combined with some other image processing methods, was implemented in OpenCV. This strategy is completely hard-coded and rule-based. The classification rule is based on table 3.1: if a defect is under 25 pixels, it is labeled as Por; otherwise it is labeled as Macro or Slits, depending on the length/width ratio. A defect bigger than 25 pixels with a length/width ratio of 5:1 is recognized as Slits, otherwise as Macro. A few detection results are illustrated in figures 5.9 and 5.10.
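The classification rule above can be expressed as a small function. The exact comparisons used in the thesis (longest side versus both sides, strict versus non-strict inequalities) are assumptions here.

```python
def classify_box(w, h):
    """Rule-based defect label from a bounding box, following the rule in
    the text: boxes smaller than 25 px are Por; larger boxes with an
    elongation of at least 5:1 are Slits, otherwise Macro."""
    if max(w, h) < 25:
        return "Por"
    long_side, short_side = max(w, h), min(w, h)
    if long_side >= 5 * short_side:
        return "Slits"
    return "Macro"

print(classify_box(10, 12))   # Por
print(classify_box(40, 38))   # Macro
print(classify_box(60, 9))    # Slits
```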

Figure 5.9: The defects detected by image processing: both Macro and Por defects, and some falsely detected defects

Figure 5.10: The defects detected by image processing: Macro, Por, and Slits defects, and some falsely detected defects at the edges of the picture on the left

Figures 5.9 and 5.10 show that the algorithm is capable of detecting the majority of the defects present in the images. One problem with this method, however, is that it can identify areas of the images that are not relevant. For example, in the right image of figure 5.9, the algorithm has detected three boxes as Macro that are not defects; they are just lines created by the microscope during the test. A similar problem is illustrated in the left image of figure 5.10, where three or four long boxes at the edge are detected as Slits; those are not defects either. These falsely detected boxes can create problems if the entire image is classified based on the number of boxes per defect category. During the implementation phase, the images were cropped by 80 % to reduce the effect of falsely contributed boxes at the edges and corners of the picture.

5.4 Comparison of CNN models and Image Processing

An overall comparison between defect classification using a CNN model and defect detection using image processing is difficult to make, because they approach the same problem from different points of view. One possible way to compare the two methods is to count the number of detected defects per image for the image processing algorithm and classify the image according to the majority detected defect type. For example, in figure 5.10, the right image is classified as Slits and the left image as Por. For this comparison, 38 images with a white background are used: 13 Por, 10 Macro, and 15 Slits. VGG19 with data augmentation, which performed best on the classification task, is selected for this purpose. The VGG19 model is not retrained; it is the same as in the previous results. Instead of the 105 images in the previous evaluation dataset, only 38 images have the right background for the image processing algorithm, which only works with a white background. First, the images are sent through the VGG19 model and the confusion matrix is calculated. The same procedure is repeated for the image processing algorithm. The confusion matrices are presented in figure 5.11. The F1 score and overall accuracy (ACC) are used as performance metrics; the results are presented in figure 5.12. As shown in the figure, VGG19 outperforms the image processing algorithm in all aspects. ACC is 89 % for VGG19 and 69 % for image processing. In particular, VGG19 predicts 100 % correctly for the Macro type, while image processing reaches only 67 %. Image processing performs slightly better for Por and Slits, around 76 %, but still less than VGG19, which predicts around 90 %. From the confusion matrix for image processing, it can be seen that many Macro and Slits are misclassified as Por, which reduces the performance metrics. One interesting remark about the VGG19 accuracy is that it drops by 10 % from the earlier results, from 98 to 89 %, see figure 5.8. One explanation is that the image background plays an important role in the overall accuracy, which is not highly surprising. The majority of Macro and Slits images have a dark background, which the network learns as an essential feature. Therefore, it is vital to have a homogeneous and balanced dataset.
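The majority-vote rule used for this comparison can be sketched in a few lines. The tie-breaking behavior is unspecified in the text; here `Counter.most_common` picks the label that reaches the top count first.

```python
from collections import Counter

def classify_image(box_labels):
    """Classify a whole image by the majority label among its detected
    boxes, as described above. Returns None for images without any
    detected box (behavior assumed, not stated in the text)."""
    if not box_labels:
        return None
    return Counter(box_labels).most_common(1)[0][0]

print(classify_image(["Por", "Por", "Slits"]))   # Por
print(classify_image([]))                        # None
```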

Figure 5.11: The confusion matrix of the VGG19 model with data augmentation (left) and image processing (right)


Figure 5.12: The F1 score and final accuracy for the image processing approach and the VGG19 model with data augmentation. The first three charts represent the F1 score for each defect, and the last chart is the final accuracy for all three combined


Chapter 6

Discussion

In this chapter, the discussion and analysis are presented. The chapter is organized according to the previous results chapter. Section 6.1 contains the overall analysis and discussion of the CNN models used for the classification task, related to the first goal of this project and presented in sections 5.1 and 5.2. Section 6.2 discusses the image processing strategy used for defect detection, related to the second goal of this project, with results presented in section 5.3. Section 6.3 discusses the comparison between these two goals/strategies, related to section 5.4. Finally, section 6.4 discusses the main limitations of this project.

6.1 Analysis of CNN Models

From the overall results summarized in table 5.1 in section 5.1 for the classification part of this project, the pretrained VGG19 network outperforms the smaller customized one. Its training and validation accuracy reach 99 % and 95 % respectively; the corresponding results for the customized network are 95 % and 80 %. The advantage of VGG19 comes from the use of transfer learning, where only a few top layers of an extensive existing network with millions of weights are retrained, while the rest of the network is frozen and reused. This large network is trained on a massive number of images in a variety of categories, around 1000. Transfer learning is a powerful technique, as these results show; it is especially practical when only a small amount of data is available, as in this project. As seen in figures 5.1 and 5.2, the accuracy rises sharply at the beginning of the training process, jumping from 0 to almost 70 %. This sharp increase is the effect of transfer learning, where the network already recognizes the low-level features of the images used in this project. Figures 5.3 and 5.4 show the same results for the customized model; it is clear that the convergence is not as fast as for VGG19. Additionally, these training figures illustrate that both networks converge relatively fast during the first few epochs and then stabilize and increase slowly after approximately 20-40 epochs. This fast increase is probably a reflection of the training images: the images in this project are not highly complex in structure, and it is therefore easy for the models to learn the patterns quickly.

During this project, the networks were trained both with and without data augmentation. Data augmentation is an efficient method to increase the dataset, reduce overfitting, and hopefully generalize the models. However, the effect of data augmentation was surprisingly limited for this project, especially for the customized model. The use of five augmentation techniques improved the training and validation accuracy of VGG19 by approximately 5 %, which is a good achievement, but the strategy did not greatly impact the customized model: the training accuracy increased by circa 5 %, while the validation accuracy suffered slightly. As shown in table 5.1, VGG19 with data augmentation reached the highest accuracy and the lowest loss. One explanation for why data augmentation is not beneficial for the customized model is that the model is too small and cannot capture the relevant information from the data variation. VGG19, which is a much bigger network, is capable of capturing this. I still consider data augmentation worth applying, especially for an extensive network.

The final performance measures of the two CNN models, illustrated in figures 5.7 and 5.8, reflect the training and validation accuracy. VGG19 with data augmentation produces the highest TPR, PPV, F1 score, and overall final accuracy. Its F1 score is close to 100 % for Por, 95 % for Slits, and 98 % for Macro. For the customized model, the F1 score is close to 100 % for Por, 78 % for Slits, and 87 % for Macro. Both models perform well for Por, which is not very surprising because Por has distinctive features that stand out from Macro and Slits. The customized model misclassifies many Slits as Macro and vice versa, see confusion matrix 5.6. This is not a surprising result, since Macro and Slits have many similarities: the majority of Macro and Slits images have a dark background and are taken with 1000x zoom. Additionally, the distinction between these two defects is not always very clear; sometimes even operators who have been working at Sandvik Coromant for a long time misclassify them. The risk of misclassification is thus high already when gathering the dataset, before training the networks. I am surprised that VGG19 with transfer learning and data augmentation produces such a good result despite the lack of data and the other challenges.

6.2 Analysis of Image Processing

As described in section 5.3, the first approach to this defect detection problem was a machine learning approach. YOLOV3, a CNN-based machine learning algorithm, was implemented for this purpose, but after some experimentation without a satisfactory result, the strategy was changed and a more straightforward image processing strategy was applied to the same problem. This rule-based image processing method is capable of detecting the majority of the defects, as shown in figures 5.9 and 5.10. However, the method is only applicable to images with a white background of the type seen in these two figures; it is not capable of detecting defects in images with a dark background. This limitation is due to the image processing algorithms and hyperparameters used in this project. Much experimentation and many iterations were performed during this part of the project to find a parameter combination that produced satisfactory results. Additionally, it is very challenging to find a general method that can handle both types of images, with dark and white backgrounds. Therefore, it was deliberately chosen to adapt the parameters only for pictures with a white background. This detection algorithm is thus far from perfect; it is only able to recognize interesting regions of an image based on gradient differences, and therefore it sometimes selects areas with irrelevant information. For example, it detected false defects in the right image of figure 5.9: three large rectangles that are not of interest, only an irregularity in the picture. In the left image of figure 5.10, the detected edge is not relevant information for the goal of this project either. Another major limitation of this algorithm is that the defect classification is based on pixel values and not on µm, which would be a more accurate and desirable result. During this project, a conversion method between pixels and micrometers was attempted without any fruitful result. The images used in this project were taken with different microscopes, which have different zoom qualities and lenses. Therefore, it was challenging to convert pixels to micrometers, and the pixel values are used for classification.

6.3 Comparison of CNN Models and Image Processing

As mentioned in section 5.4, a direct comparison of the CNN models for classification and the image processing detection is difficult to make, because they focus on different aspects of the same problem. A simple comparison is presented in figures 5.11 and 5.12: VGG19 with data augmentation outperforms the image processing method at classifying an image. The F1 score for VGG19 is 93 %, 100 %, and 83 % for Por, Macro, and Slits respectively; the corresponding results for image processing are 76 %, 67 %, and 73 %. The performance of the image processing algorithm is closer to that of the customized CNN model. The main purpose of defect detection is to detect the individual defects, not to classify the entire image; therefore, it is not fully fair to compare these two strategies directly. Nevertheless, the comparison is still informative. Object detection is normally a harder problem than classification.

It is worth pointing out that the main focus of this master's thesis has been the classification part using CNN networks. The image processing can be considered a side project, and the algorithm is therefore not highly optimized. One possible method is to combine the two strategies: first apply the image processing algorithm to detect the individual defects, and then send these through the CNN network for classification. This combined approach has several advantages. Firstly, the pixel-to-µm conversion can be removed, which in turn reduces the effect of the individual microscopes. Secondly, it can increase the number of training samples several times over, which would probably make data augmentation unnecessary. Thirdly, the training process becomes much more efficient, because irrelevant information and empty parts of the images are minimized. However, there was not enough time in this thesis to implement this part as well; it is left for future projects.
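The combined strategy described above can be sketched as a detect-crop-classify loop. A minimal outline in which `detect_boxes` and `cnn_classify` are placeholders standing in for the thesis's rule-based detector and trained network:

```python
# Sketch of the combined strategy: run the rule-based detector to get
# bounding boxes, crop each box, and classify the crop with the CNN.
# `detect_boxes` and `cnn_classify` are placeholders for the actual
# detector and trained network, which are not reproduced here.

def crop(image, box):
    """Crop a region from an image stored as nested lists (rows of pixels)."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]

def classify_defects(image, detect_boxes, cnn_classify):
    """Detect candidate regions, then classify each crop individually."""
    return [(box, cnn_classify(crop(image, box))) for box in detect_boxes(image)]

# Tiny illustration with a fake 4x4 image and dummy detector/classifier:
image = [[0, 0, 0, 0],
         [0, 9, 9, 0],
         [0, 9, 9, 0],
         [0, 0, 0, 0]]
results = classify_defects(
    image,
    detect_boxes=lambda img: [(1, 1, 2, 2)],  # one box: x, y, w, h
    cnn_classify=lambda patch: "Por",         # dummy class label
)
print(results)  # [((1, 1, 2, 2), 'Por')]
```

Because each crop contains a single defect, this scheme also yields one training sample per defect rather than one per image, which is where the increase in training data comes from.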

6.4 Limitations

This project has several limitations. Primarily, only images with defects (positives) were investigated; in reality, both positive and negative samples are needed. In a practical scenario this creates a problem, because a category without defects (negative) is necessary. The models in this thesis are only trained to recognize the three defect classes, not negatives; images without any defect will therefore also be classified into one of the three defect categories. This is a significant problem, but the approach was chosen deliberately for the scope of this project due to lack of data: the data gathered by the quality division at Sandvik Coromant contained only images with defects. During the first phase of the project, new images were collected, some of which did not include any defects. At the beginning of the network training process, a binary classifier was implemented to simply separate defect images from defect-free ones, positive from negative. The results were not satisfactory; the network performed no better than random guessing. One reason was the lack of good data. Another was the close similarity between Por defects and defect-free images, which the networks simply could not distinguish. To solve this, a large amount of good-quality data is probably required.

The second limitation is the contrast and zoom differences between the images. The majority of the images in the Macro and Slits categories have a dark background and are taken at 1000× zoom, while the opposite holds for the Por category: white background and 100× zoom. The optimal starting point would be if all images were of the same quality, taken with the same microscope and with a similar background.

The third limitation is that, in reality, there exist several subcategories of the three defects analyzed in this thesis. These were ignored in this project because of the available data: due to time limitations and other circumstances, it was decided to focus on the available data and to classify among the three main defect classes only.


Chapter 7

Conclusion

During this thesis, two different strategies have been implemented to solve the same problem from different viewpoints. The first strategy was to train convolutional neural networks (CNNs) to classify three categories of images containing defects that occur in metal-cutting tool production. This was done using two different CNN models, with and without data augmentation. Experiments showed that CNNs are a good strategy for classifying material defects: the best-performing model reached a final overall accuracy (ACC) of around 95 % and an F1 score of circa 95 % per category. Additionally, the experiments illustrated that transfer learning and data augmentation are two powerful techniques for enhancing network performance.
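The data augmentation mentioned above enlarges the training set with label-preserving transforms of each image. A minimal pure-Python illustration of the idea, with an image as nested lists; the actual pipelines use dedicated libraries with many more transforms (rotations, noise, contrast changes, etc.):

```python
# Data augmentation in miniature: each original image yields several
# transformed copies that keep the same defect label. Images are
# represented as nested lists here purely for illustration.

def hflip(image):
    """Mirror an image left-right."""
    return [row[::-1] for row in image]

def vflip(image):
    """Mirror an image top-bottom."""
    return image[::-1]

def augment(image):
    """One original -> four training samples with the same label."""
    return [image, hflip(image), vflip(image), hflip(vflip(image))]

img = [[1, 2],
       [3, 4]]
print(len(augment(img)))  # 4
print(hflip(img))         # [[2, 1], [4, 3]]
```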

The second goal of this project was to detect and draw boxes around the individual defects in the images, implementing rule-based defect detection with image processing algorithms. This was done using Canny edge detection, a contour detection algorithm, and some other image processing filters. The results showed that these image processing algorithms are capable of detecting the majority of the individual defects.

Finally, the comparison of the CNN models and the image processing algorithms showed that, in this case, CNN models are better at classifying entire images than rule-based image processing. However, the main focus of this project has been the CNN models; the result is therefore slightly biased toward them.


Chapter 8

Future Work

As mentioned at the beginning of this thesis, this work is the first of its kind at Sandvik Coromant. It is therefore an exploratory project investigating the potential and the limitations. The results achieved so far show great potential for the future. This domain is exciting and challenging, but there is much room for improvement related to the limitations described in section 6.4. The most significant factor is the dataset, which has been very limited in this case; with an extended and diversified dataset, the results would improve considerably. The first strategy for future work is therefore to gather completely new images with similar backgrounds and zoom levels. Preferably, the images should be taken with the same microscope, which will reduce errors and biases.

Major improvements are also necessary in the data collection at the test division. The data collection is currently highly unorganized, and it is difficult for a new person to find all the data and understand it. The defect classification by operators also appears arbitrary and individual. For future work, all images should be categorized by one person who knows the internal regulations well; this will reduce the error further. The images are currently gathered on hard drives; creating a systematic database with clear defect names and defect classes would be an optimal solution. A database would also simplify the operators' work: it is much more flexible and convenient to retrieve information from a database, and the data is also safe there in case of a system crash or other unexpected events.
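The suggested database could start out very simply. A minimal sketch using SQLite from the Python standard library; the table and column names are illustrative only, not a prescribed schema:

```python
# Sketch of the systematic image database suggested above, using
# SQLite from the Python standard library. Table and column names
# are illustrative assumptions, not a schema used in the thesis.
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path in production
conn.execute("""
    CREATE TABLE defect_images (
        id           INTEGER PRIMARY KEY,
        filename     TEXT NOT NULL,
        defect_class TEXT CHECK (defect_class IN ('Por', 'Macro', 'Slits', 'None')),
        microscope   TEXT,
        zoom         INTEGER,
        labeled_by   TEXT
    )
""")
conn.execute(
    "INSERT INTO defect_images (filename, defect_class, microscope, zoom, labeled_by)"
    " VALUES (?, ?, ?, ?, ?)",
    ("img_0001.png", "Por", "scope_a", 100, "operator_1"),
)
rows = conn.execute(
    "SELECT filename FROM defect_images WHERE defect_class = 'Por'"
).fetchall()
print(rows)  # [('img_0001.png',)]
```

Recording the microscope, zoom and annotator alongside each image would directly address the calibration and labeling-consistency problems discussed earlier.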

For a future automatic classification process, such a database is also easy to integrate into larger camera or microscope systems. The classification algorithm can then communicate with and retrieve images from the database, and flag misclassified images into a separate space in the database.

The methods developed in this thesis are a good basis for future projects. The first suggestion is that the CNN models should be retrained with better images that are more realistic and representative from a practical point of view. The second suggestion is to integrate the rule-based image processing algorithm with a CNN model.

