
Research Article
Image Classification Model Based on Deep Learning in Internet of Things

Songshang Zou,1 Wenshu Chen,2 and Hao Chen1

1College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
2College of Computer Science, University of Bristol, Bristol BS8 1QU, UK

Correspondence should be addressed to Hao Chen; [email protected]

Received 17 October 2020; Revised 12 November 2020; Accepted 25 November 2020; Published 8 December 2020

Academic Editor: Hongju Cheng

Copyright © 2020 Songshang Zou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In the environment of Internet of Things, the convolutional neural network (CNN) is an important tool and method of image classification. However, the features that are extracted by each layer of CNN are all high dimensional, and the features differ among the layers. In addition, these features contain substantial amounts of redundant information. To prevent the increase in the computational burden and the decline of the model generalization performance that are caused by high dimensionality, this paper proposes an improved image classification algorithm based on deep feature fusion, which designs and builds an 8-layer CNN. In addition, it reduces the dimensionality of the features via the principal component analysis (PCA) dimensionality reduction algorithm and fuses the features that have undergone dimensionality reduction to make the obtained features more typical and differential. The experimental results demonstrate that the proposed algorithm improves the performance of the model and achieves satisfactory accuracy.

1. Introduction

In the era of Internet of Things, image classification plays an important role in multimedia information processing. An image classification system accepts given input images and produces an output classification, for example, identifying whether a disease is present or not. As artificial intelligence technology has been widely applied, image classification and recognition technology have received increasing attention and have been utilized in an increasing number of fields, such as image information retrieval, real-time target tracking, and medical image analysis. In recent years, deep learning has attracted increasing attention [1]. The previous machine learning methods have various limitations. For example, when there are few samples, it is highly difficult to represent complex functions. When using deep learning algorithms to represent complex data distributions, nonlinear network models with deep layers can be used to learn the deep features from the data in the case of few samples. Deep learning is a type of algorithm and topological structure that can be used to solve generalization problems [2]. The combination of deep hierarchical neural networks and the GPU (graphics processing unit) has accelerated the execution of deep learning algorithms. Deep learning has advanced by leaps and bounds, and big data has propelled this development momentum. A convolutional neural network is a type of feed-forward neural network. Its artificial neurons can respond to surrounding units within the coverage area, and it is suitable for processing large-batch image datasets.

CNN continuously extracts and compresses image features and obtains higher level features. It condenses the original features repeatedly and obtains more reliable features [3]. Various tasks can be conducted with the features in the last layer, e.g., classification and regression. CNN has unique advantages in automatic speech recognition (ASR) and image processing due to its special structure of shared local weights and its similar layout to real biological neural networks [4]. Weight sharing lowers the network complexity; since an image with multidimensional input vectors can be directly input into the network, the complexity of data reconstruction in feature extraction and classification is avoided [5].


Through the study of current image classification and recognition algorithms, it is discovered that various algorithms have failed to effectively fuse the multilayered deep learning features of CNN and that they have poor accuracy.

In the environment of Internet of Things, the convolutional neural network (CNN) is an important tool and method of image classification. In order to further improve the classification accuracy of the CNN model, this paper has effectively fused deep features via a cascading strategy and has increased the diversity and the expressiveness of the extracted features to enhance the classification performance of the network model. The main contributions of this paper include the following:

(i) This paper analyzes the structure of CNN, studies the principles of activation functions, specifies the role that nonlinear activation functions play in neural networks, and shows that, facilitated by nonlinear functions, CNN has stronger feature representation performance and can realize complex image classification

(ii) To address the problems that conventional image classification algorithms that are based on deep learning cannot effectively fuse multilayered deep features and perform poorly in terms of classification accuracy, this paper proposes an improved image classification algorithm that is based on deep feature fusion and improves the diversity and expressiveness of the extracted features to improve the classification performance

(iii) By comparing the classification performances of the CNN model on the Food-101 and Places2 datasets under various activation functions, it is demonstrated that the activation function that is used in this paper can improve the classification accuracy of the model on image datasets and ensure its convergence

(iv) This paper conducts a performance analysis and evaluation of the proposed algorithm in comparison with other algorithms. The experimental results demonstrate that the algorithm that is proposed in this paper realizes higher accuracy

The remainder of this paper is organized as follows. Section 2 discusses related work. The CNN network structure and the activation function performance are analyzed in Section 3. Section 4 proposes an improved CNN image classification and recognition algorithm that is based on feature fusion. Section 5 presents the experiment results and analysis. Section 6 summarizes the conclusions of the paper and discusses future research directions.

2. Related Work

As a highly important research direction in computer vision, image classification and recognition involves knowledge from several disciplines and has been applied in multiple research fields. With the rapid development of internet technology, a substantial amount of image data are encountered in people's lives, thereby leading to increased demands for machine learning and computer vision techniques and more in-depth research [6]. According to in-depth research that has been conducted on digital image processing and deep learning, compared with other neural networks, CNN has the following strengths: the input image matches well with the topological structure of CNN. Feature extraction proceeds simultaneously with pattern classification and generation in the training process, and weight sharing can reduce the number of training parameters, thereby rendering the CNN structure simpler and more adaptive [7]. CNN is mainly used to recognize 2D images with invariance in terms of shifting, scaling, and other forms of distortion [8]. The CNN feature detection layer learns by training on the data; hence, CNN can avoid explicit feature extraction and learns implicitly from data training. In addition, because the neurons in the same feature mapping plane share the same weight, CNN can learn in parallel. This is another advantage of CNN compared to networks that have interconnected neurons. In the study of image classification, feature extraction will directly affect the classification performance of the network model. In essence, CNN is a mapping from input to output. In practical applications, it typically uses multilayer convolution and trains with a fully connected layer. The features that are learnt via one-layer convolution are typically local. In multilayer convolution, the higher the layer, the more global the learnt features are [9].

Neocognitron can be regarded as the first implementation of CNN, and it is also the first application of the receptive field in artificial neural networks. It attempts to model the visual system and enables it to complete recognition even if there is shift or slight distortion in the objects [8]. The deep learning architecture did not emerge until the last two decades. It has substantially increased the number and types of problems that can be solved by neural networks. There are 5 popular deep learning architectures: recurrent neural network (RNN), long short-term memory (LSTM)/gated recurrent unit (GRU), convolutional neural network, deep belief network (DBN), and deep stacking network (DSN). CNN is a type of multilayered neural network, and it is inspired by the animal visual cortex. The first CNN was built by Yann LeCun for handwritten character recognition. As a deep network, the early layers mainly recognize features (e.g., edges) and the subsequent layers recombine these features into higher level input [10]. Therefore, deep learning can be regarded as "deep parameter adjustment." However, it also magnifies the network shortcomings. A limited training dataset easily results in overfitting. The larger the network, the more complex the computation and the more difficult the application of the network, and the deeper the network, the more easily the gradient vanishes and the more difficult the model optimization. In the study of image classification and recognition, feature extraction will directly impact the classification capacity of the network model. The main problem that is encountered with available image classification algorithms in feature extraction is that they cannot effectively utilize various deep features that are extracted by networks [11, 12].


3. Convolutional Neural Network (CNN)

A convolutional neural network is a type of feed-forward neural network that typically specializes in the processing of image data (multiarray data). The design of the CNN structure can effectively preserve the structure of the original data and generate a layered representation. A typical CNN structure includes multilevel processing layers that are ordered from left to right. CNN typically has four types of layers: convolutional, pooling, fully connected, and classification layers. Convolutional layers and pooling layers are the core layers of the design, and they are typically utilized in the first few phases.

3.1. Convolutional Layers. In CNN, the convolutional layers are the most important layers, which are typically used for feature extraction. As parts of an image may have the same statistical properties, feature learning for an image can be conducted on randomly selected parts of subimages, and the learned features will be used as a filter to scan the entire image and to obtain the feature activation values of various positions in the image to complete feature extraction [13].

In a conventional neural network, every neuron must be connected with every pixel; consequently, numerous weights will render the network difficult to train. In contrast, the number of weights of each neuron in a neural network that has a convolutional layer is equal to the size of the convolution kernel, instead of each neuron being connected with all pixels. Thus, the number of weights is substantially reduced [14].

The convolution computation consists of two steps. Step 1 is a linear operation. It processes a group of weights that are connected with the original input image or the low-level feature map, translates the convolution kernel for multiple convolutions according to the stride l, and adds the sum of the bias b_x and the results of the multiple convolutions. Step 2 is a nonlinear operation. It uses the activation function f(x) to obtain the output feature map C_x; namely, it performs weighted summation over multiple input signals on one neuron and outputs via the activation function [15].

The feature extractor can be replaced by a trained convolution kernel. Convolution kernels differ in terms of the structural features that they extract from an image. To extract multiple features at the same position, multiple convolution kernels can be used, and CNN will output the combination of these features from the convolutional layer. The convolutional layer has two properties that can reduce the number of parameters to be calculated: weight sharing and local perception.

(a) Weight sharing

In weight sharing, the same weight is used by all neurons in the same feature mapping. If the convolution of a group of weights and the input image yields edge features, these weights can be regarded as edge feature extractors, and they can be used directly to extract edge features of other image regions [16].

(b) Local perception

Local perception is the connection of part of the network. Similar to the human visual system, CNN's perception process of an image is from local to global, and every neuron is connected with only some of the neurons that belong to the previous layer. Therefore, the information of an entire image is perceived by neurons repeating the process of activation in a small region and translating to another region.

A convolutional layer extracts features, and the most important elements in this layer are the trained convolution kernels. Kernels can detect specified shapes, colors, and contrasts, among other features, and feature maps preserve the spatial structure after extraction; therefore, the feature maps that correspond to convolution kernels represent features of a corresponding dimension, and with the increase in the number of CNN layers, the extracted features become increasingly concrete [17].

Every node in the convolutional layer and pooling layer is connected with only some nodes in the previous layer, and the input of each node in the convolutional layer is a small block of the previous layer, the size of which is determined by the size of the window of the convolution kernel. Typically, after being processed by the convolutional layer, the node matrix will become deeper, and the depth will be determined by the number of kernels [18]. Parameter sharing in the kernel enables the image contents not to be affected by the positions and can substantially reduce the number of parameters of the network model and reduce the complexity of the operation; see Figure 1 for an illustration of CNN feature extraction.

In Figure 1, the 1st convolution extracts a low-level feature, the 2nd a mid-level feature, and the 3rd a high-level feature.

3.2. Activation Function. In a neural network, every neuron node will accept the output value from the previous layer as its input value and convey this value to the next layer. The nodes in the input layer will input and transmit the property value to the next output layer. In a multilayer neural network, an activation function is used to represent the relationship between the output values of the neuron nodes in the previous layer and the input values of those in the next layer [19].

An activation function is typically a nonlinear function, through which neural networks realize stronger representation performance and overcome the finite approximation limitation that is caused by the use of a linear function.

In the early study of neural networks, the sigmoid activation function and the tanh activation function were frequently used. In the back-propagation of neural networks, the sigmoid activation function can result in gradient explosion and gradient vanishing, and because the output values of the sigmoid function are not of zero mean, the convergence is slow and the training time is substantially increased in deep neural networks. In deep network learning, learning a large amount of data typically takes a long time; hence, the convergence speed of the training model is of high importance. When a deep network is trained, zero-mean data can accelerate convergence.


The ReLU function is very fast in calculation, and its convergence speed is much faster than those of the sigmoid activation function and the tanh activation function. It can also avoid the gradient vanishing that is caused by the sigmoid function and the tanh function [20, 21].

The common activation functions include the following (a short code sketch of these functions is given at the end of this subsection):

(1) Sigmoid function

The sigmoid function is the most frequently used continuous and smooth activation function; it is also known as the logistic function. It is used in the output of neurons in a hidden layer, and it can map a real number to the range of (0, 1) for binary classification. The formula of the sigmoid function is

f(x) = \frac{1}{1 + e^{-x}} \qquad (1)

The range of Formula (1) is (0, 1), and its derivative is

f'(x) = \frac{1}{1 + e^{-x}}\left(1 - \frac{1}{1 + e^{-x}}\right) = f(x)\left(1 - f(x)\right) \qquad (2)

In Figure 2, if x = 10 or x = −10, f'(x) ≈ 0, and if x = 0, f'(x) = 0.25.

(2) Tanh function

The formula of the tanh function is

f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (3)

The range of Formula (3) is (−1, 1), and its derivative is

f'(x) = 1 - \tanh^{2}(x) \qquad (4)

In Figure 3, if x = 10 or x = −10, f'(x) ≈ 0, and if x = 0, f'(x) = 1.

(3) ReLU function

The ReLU function, which is shown in Figure 4, is the most frequently used nonlinear function in neural networks. It is continuous but not smooth, and its formula is

f(x) = \max(0, x) \qquad (5)

The range of Formula (5) is [0, +∞), and its derivative is

f'(x) = \begin{cases} 0, & x < 0, \\ 1, & x > 0, \\ \text{undefined}, & x = 0. \end{cases} \qquad (6)

The ReLU function has the following properties: unilateral inhibition, a relatively broad excitation boundary, and sparse activation.

(4) PReLU function

The formula of the PReLU function is

f(x) = \max(\alpha x, x) \qquad (7)

In Figure 5, parameter α in the PReLU function is not fixed and can be learned during training. Although it ensures that the output result follows a zero-mean distribution, it also activates all feature values on the negative semiaxis. Hence, noises will also be activated, and the final convergence will be affected.

(5) TReLU function

The formula of the TReLU function is

f(x) = \begin{cases} x, & x > 0, \\ \tanh(\alpha x), & x \le 0. \end{cases} \qquad (8)

[Figure 1: Feature extraction by CNN — low-level feature, mid-level feature, high-level feature, trainable classifier.]


In Figure 6, parameter α is a variable parameter, and it is used to control the unsaturated region of the function. We set the initial value as 1. The function is almost linear at the origin, and it can yield faster convergence.

The TReLU function overcomes the problem of gradient vanishing. As its derivative is always 1 if x > 0, the TReLU function is unattenuated if x > 0. In addition, the TReLU function preserves some gradient values in the unsaturated region of the negative semiaxis. If the activation value falls in the unsaturated region, it still can be effectively activated, and it preserves some of the effective features while more effectively activating negative value features by controlling the size of the unsaturated region with parameter α.

(6) Leaky ReLU function

The formula of the Leaky ReLU function, which is shown in Figure 7, is

f(x) = \begin{cases} ax, & x < 0, \\ x, & x > 0. \end{cases} \qquad (9)

The range of Formula (9) is (−∞, +∞), and its derivative is

f'(x) = \begin{cases} a, & x < 0, \\ 1, & x > 0, \\ \text{undefined}, & x = 0. \end{cases} \qquad (10)
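To make the behavior of these activation functions concrete, the following is a minimal NumPy sketch of sigmoid, tanh, ReLU, PReLU, TReLU, and Leaky ReLU as defined in Formulas (1)–(9). The default parameter values (α = 0.5 for PReLU, α = 1 for TReLU, a = 0.5 for Leaky ReLU) simply mirror the figures; the function names are illustrative and are not taken from any particular library.

```python
import numpy as np

def sigmoid(x):
    # Formula (1): maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Formula (3): maps any real number into (-1, 1).
    return np.tanh(x)

def relu(x):
    # Formula (5): zero for negative inputs, identity for positive inputs.
    return np.maximum(0.0, x)

def prelu(x, alpha=0.5):
    # Formula (7): identity on the positive semiaxis, slope alpha on the negative semiaxis.
    return np.where(x > 0, x, alpha * x)

def trelu(x, alpha=1.0):
    # Formula (8): identity on the positive semiaxis, tanh(alpha * x) on the negative semiaxis.
    return np.where(x > 0, x, np.tanh(alpha * x))

def leaky_relu(x, a=0.5):
    # Formula (9): like ReLU, but with a small fixed slope a for negative inputs.
    return np.where(x < 0, a * x, x)

if __name__ == "__main__":
    x = np.linspace(-10.0, 10.0, 5)
    for fn in (sigmoid, tanh, relu, prelu, trelu, leaky_relu):
        print(fn.__name__, np.round(fn(x), 4))
```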

3.3. Pooling Layer. A pooling layer typically follows a convolutional layer; hence, the output from the convolutional layer is pooled in the pooling layer. The convolutional layer extracts features while the pooling layer reduces the number of parameters. The pooling layer is mainly used to reduce the dimensionality of the features by compressing the number of data and parameters, thereby reducing overfitting and improving the fault tolerance of the model. Although the pooling layer reduces the dimensions of various feature maps, it can still preserve most important information. Located between continuous convolutional layers, the pooling layer reduces the number of data and parameters and reduces overfitting. The pooling layer has no parameters, and it downsamples the result from the previous layer, which is known as data compression.

[Figure 2: Sigmoid function.]

[Figure 3: Tanh function.]

[Figure 4: ReLU function.]

[Figure 5: PReLU function (α = 0.5).]


In Figure 8, the downsampling process includes max pooling and mean pooling operations [18, 22].

Max pooling: define a spatial neighborhood, e.g., a window of size 2 × 2, and extract the largest element from the modified feature map within the window. It has been proven that max pooling outperforms mean pooling.

Mean pooling: define a spatial neighborhood, e.g., a window of size 2 × 2, and calculate the mean value of the modified feature map within the window.

The input of every node in the pooling layer is a small block of the previous layer (which is typically a convolutional layer), and the size of this small block is determined by the size of the window of the pooling kernel. The pooling layer changes the size of the node matrix instead of its depth. For image processing, the pooling operation in this layer can be regarded as transforming a high-resolution image into a low-resolution image. After the convolutional layer and pooling layer, the number of parameters in the network model can be further reduced [23].
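As a minimal illustration of the max pooling and mean pooling operations described above, the sketch below applies both with a 2 × 2 window and stride 2 to a small feature map using plain NumPy; the helper name pool2d and the example values are illustrative only.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or mean pooling with a square window to a 2-D feature map."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

feature_map = np.array([[0.2, 0.3, 0.0, 0.0],
                        [0.5, 0.6, 0.0, 1.0],
                        [0.1, 0.0, 0.4, 0.2],
                        [0.0, 0.7, 0.3, 0.1]])
print(pool2d(feature_map, mode="max"))   # keeps the largest value in each 2x2 window
print(pool2d(feature_map, mode="mean"))  # averages each 2x2 window
```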

3.4. Fully Connected Layer. A fully connected layer has many neurons, and it is represented as a column vector (single sample). It is typically one of the latter few layers of a deep neural network in the field of computer vision, and it is used for image classification. In this layer, all neurons are connected via weights, and this layer is typically situated in the rear part of CNN. When the convolutional layers in the front part have extracted weights that are sufficient for recognizing the image, the next task is classification. At the end of CNN, typically, a cuboid is spread into a long vector and sent into the fully connected layer for classification in collaboration with the output layer [24].

A fully connected layer can be transformed into a convolutional layer and vice versa. Any convolutional layer can be converted into a fully connected layer by converting the weight into a huge matrix. In this matrix, most entries are 0, except in designated regions (local perception), and many regions share the same weight (shared weight). Any fully connected layer can also be converted into a convolutional layer.

A fully connected layer works as a "classifier" in the entire CNN. If the convolutional layer, pooling layer, and activation function layer map the original data into the feature space in a hidden layer, the fully connected layer maps the learnt "distributed feature representation" into the sample label space. In practical applications, a fully connected layer can be realized via convolutional computation: a fully connected layer whose previous layer is also fully connected can be converted into a convolution with a 1 × 1 kernel, and a fully connected layer that has a convolutional layer as its previous layer can be transformed into a global convolution with an h × w kernel, where h and w are the height and width of the convolutional result of the previous layer [25].
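This equivalence can be checked numerically. The sketch below is a minimal TensorFlow/Keras illustration (layer shapes chosen arbitrarily for the example, not taken from the paper): a Dense layer applied to a flattened h × w × c feature map produces the same output as a Conv2D layer whose kernel covers the entire h × w map and reuses the same weights.

```python
import numpy as np
import tensorflow as tf

h, w, c, k = 12, 12, 20, 100  # feature map 12x12x20 and 100 output neurons (illustrative shapes)
x = np.random.rand(1, h, w, c).astype("float32")

# Fully connected layer applied to the flattened feature map.
dense = tf.keras.layers.Dense(k, use_bias=False)
y_dense = dense(tf.reshape(x, (1, h * w * c)))

# Equivalent "global" convolution: a single kernel that covers the whole h x w map.
conv = tf.keras.layers.Conv2D(k, kernel_size=(h, w), use_bias=False)
conv.build((None, h, w, c))
# Reuse the dense weights, reshaped to (h, w, c, k); flattening uses the same row-major order.
conv.set_weights([dense.get_weights()[0].reshape(h, w, c, k)])
y_conv = conv(x)

print(np.allclose(y_dense.numpy().ravel(), y_conv.numpy().ravel(), atol=1e-4))  # True
```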

The core operation of full connection is to multiply the vectors in the matrix and, in essence, to linearly transform from one feature space to another. In CNN, full connection is typically found in the final few layers, and it calculates a weighted sum of the features that were designed previously. The previous convolution and pooling are similar to feature engineering, while the full connection in the tail part is equivalent to feature weighting [26]. An operation example in the fully connected layer is illustrated in Figure 9.

In Figure 9, the last two columns of small balls represent two fully connected layers. At the end of the last convolution, the final pooling operation is conducted, which outputs 20 images of size 12 × 12; these are converted into a 1 × 100 vector by a fully connected layer.

4. CNN Image Recognition and Classification Based on Feature Fusion

The model that is proposed in this paper includes 8 layers: 6 convolutional layers and 2 fully connected layers. The key design details are as follows.

In the feature mapping, nonlinear transformation is conducted via the ReLU activation function [27]. It activates the extracted image features and generates the corresponding output feature mapping [28]. Pooling layers are introduced behind the 1st, 3rd, and 6th convolutional layers, and the max pooling operation is used to perform feature dimension reduction on the output feature mapping. Meanwhile, to improve the training efficiency and classification performance of the network, local normalization is conducted after the convolution operation in every convolutional layer to accelerate the convergence.

[Figure 6: TReLU function (α = 1).]

[Figure 7: Leaky ReLU function (a = 0.5).]


The input to the 1st convolutional layer is of size 227 × 227 × 3. The convolution operation is performed by an 11 × 11 × 96 convolution kernel, and the convolution kernel moves 4 pixels each time, namely, stride = 4. The size of every generated feature mapping matrix is ((227 − 11)/4 + 1)² = 55².

Pooling utilizes the sampling operation, and the size of the sampling kernel is 3 × 3. It slides 2 pixels in the original image each time. After sampling, the size of the generated feature mapping matrix is ((55 − 3)/2 + 1)² = 27².

For the 2nd convolutional layer, two pixels are padded on the edges of the input feature mapping matrix, and the size of the mapping matrix becomes 31 × 31. The convolution operation is performed by a 5 × 5 × 256 convolution kernel, which moves 1 pixel each time, namely, stride = 1. The size of each generated feature mapping matrix is ((31 − 5)/1 + 1)² = 27².

For the 3rd convolutional layer, one pixel is padded on the edges of the input feature mapping matrix, and the size of the mapping matrix becomes 15 × 15. The convolution operation is conducted by 256 convolution kernels of size 3 × 3. After each convolution operation, the convolution kernel is moved 1 pixel. The size of each generated feature mapping matrix is ((15 − 3)/1 + 1)² = 13².

Pooling utilizes the sampling operation. The size of the sampling kernel is 3 × 3. The convolution kernel is moved 2 pixels each time after the convolution operation. After sampling, the size of the generated feature mapping matrix is ((27 − 3)/2 + 1)² = 13².

For the 4th convolutional layer, one pixel is padded on the edges of the input feature mapping matrix, and the size of the mapping matrix becomes 15 × 15. The convolution operation is conducted by 384 convolution kernels of size 3 × 3. After each convolution operation, the convolution kernel is moved 1 pixel. The size of each generated feature mapping matrix is ((15 − 3)/1 + 1)² = 13².

For the 5th convolutional layer, one pixel is padded on the edges of the input feature mapping matrix, and the size of the mapping matrix becomes 15 × 15. The convolution operation is conducted by 3 × 3 × 256 convolution kernels. After each convolution operation, the convolution kernel is moved 1 pixel. The size of each generated feature mapping matrix is ((15 − 3)/1 + 1)² = 13².

For the 6th convolutional layer, the convolution operation is conducted on the image by a convolution kernel of size 3 × 3 to realize feature extraction. The extracted image features are activated with the ReLU activation function to produce the corresponding output feature mapping.

Pooling utilizes the sampling operation [29]. The size of the sampling kernel is 3 × 3. It slides 2 pixels in the original image each time. After sampling, the size of the generated feature mapping matrix is ((13 − 3)/2 + 1)² = 6².
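The layer-by-layer description above can be gathered into a single model definition. The sketch below is an approximate Keras reconstruction under stated assumptions: 96, 256, 256, 384, 256, and 256 kernels for the six convolutional layers (the filter count of the 6th layer is not given in the text and is assumed), ReLU activations, local response normalization as a stand-in for the local normalization mentioned above, and two fully connected layers whose widths (4096) are also assumed. Padding and strides are chosen so that the feature-map sizes match the values computed in the text (55, 27, 13, and 6); the authors' exact hyperparameters may differ.

```python
import tensorflow as tf
from tensorflow.keras import layers

def lrn(x):
    # Local normalization after each convolution (parameters assumed).
    return tf.nn.local_response_normalization(x)

def build_model(num_classes=101):
    """Approximate reconstruction of the 8-layer CNN (6 conv + 2 fully connected) in Section 4."""
    inputs = layers.Input(shape=(227, 227, 3))
    x = layers.Conv2D(96, 11, strides=4, activation="relu")(inputs)   # 1st conv: (227-11)/4+1 = 55
    x = layers.Lambda(lrn)(x)
    x = layers.MaxPooling2D(3, strides=2)(x)                          # (55-3)/2+1 = 27
    x = layers.Conv2D(256, 5, padding="same", activation="relu")(x)   # 2nd conv: 27x27
    x = layers.Lambda(lrn)(x)
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)   # 3rd conv
    x = layers.Lambda(lrn)(x)
    x = layers.MaxPooling2D(3, strides=2)(x)                          # (27-3)/2+1 = 13
    x = layers.Conv2D(384, 3, padding="same", activation="relu")(x)   # 4th conv: 13x13
    x = layers.Lambda(lrn)(x)
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)   # 5th conv: 13x13
    x = layers.Lambda(lrn)(x)
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)   # 6th conv (filter count assumed)
    x = layers.Lambda(lrn)(x)
    x = layers.MaxPooling2D(3, strides=2)(x)                          # (13-3)/2+1 = 6
    p8 = layers.Flatten(name="p8")(x)                                 # flattened pooling output p8
    fc1 = layers.Dense(4096, activation="relu", name="fc1")(p8)       # 1st fully connected layer (width assumed)
    fc2 = layers.Dense(4096, activation="relu", name="fc2")(fc1)      # 2nd fully connected layer (width assumed)
    outputs = layers.Dense(num_classes, activation="softmax")(fc2)
    return tf.keras.Model(inputs, outputs)

model = build_model()
model.summary()
```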

[Figure 8: Max pooling and mean pooling (input X, kernel W, output Y; 2 × 2 windows).]

[Figure 9: Operations in a fully connected layer (28 × 28 input, 20 × 24 × 24 convolutional layer, 20 × 12 × 12 pooling layer, 100 sigmoid neurons, 10-neuron softmax output layer).]


This paper adopts the method of cascade and performs feature fusion on the output features after pooling in the 6th layer with the 1st and the 2nd fully connected layers to make the features that are extracted by the network more diverse, expressive, and differential and to improve the classification performance of the network model.

The steps for extracting image features with the constructed CNN model are as follows (a code sketch follows the steps):

Step 1. Denote the pooling output result of the 6th layer as p8.

Step 2. Calculate the outputs of the 1st and 2nd fully connected layers according to the formula y_j^l = f(W^l y^{l−1} + b_j^l) and denote the results as FC1 and FC2, respectively. Here, l indexes the fully connected layer, y^{l−1} is the output of the layer before the fully connected layer, W^l is its weight matrix, and b_j^l denotes the bias.

Step 3. Select p8, FC1, and FC2 as three deep features of the image dataset and prepare for the subsequent feature fusion.
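Assuming the network is built as in the earlier sketch, the three deep features p8, FC1, and FC2 can be read out of the trained model with a small auxiliary Keras model. The layer names used here (p8, fc1, fc2) are the ones assigned in that sketch, not names taken from the paper, and the random batch of images is a stand-in for real data.

```python
import numpy as np
import tensorflow as tf

model = build_model()  # the trained 8-layer CNN from the earlier sketch

# Auxiliary model returning the flattened pooling output of the 6th convolutional
# layer (p8) and the outputs of the two fully connected layers (FC1, FC2).
feature_extractor = tf.keras.Model(
    inputs=model.input,
    outputs=[model.get_layer("p8").output,
             model.get_layer("fc1").output,
             model.get_layer("fc2").output])

images = np.random.rand(8, 227, 227, 3).astype("float32")  # stand-in batch of images
p8, fc1, fc2 = feature_extractor.predict(images, verbose=0)
print(p8.shape, fc1.shape, fc2.shape)  # three deep features prepared for the fusion step
```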

Principal component analysis is a multivariate statistical method that transforms a set of variables into several principal components [30]. These principal components can reflect most of the original information, and they are typically represented as linear combinations of the original variables [31]. To ensure that the information that is contained in these principal components is nonoverlapping, these principal components must be uncorrelated. Principal component analysis can effectively reduce the dimensions of data and minimize the mean square error between the extracted components and the original data. It can be used in feature extraction. The process of this algorithm is as follows:

(a) Let X = [X_1, X_2, \cdots, X_p]^T be a p-dimensional random vector, and let \mu = E(X) and \Sigma = D(X). The eigenvectors corresponding to the p eigenvalues \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p of \Sigma are t_1, t_2, \cdots, t_p, namely,

\Sigma t_i = \lambda_i t_i, \quad t_i^T t_i = 1, \quad t_i^T t_j = 0 \; (i \ne j;\ i, j = 1, 2, \cdots, p). \qquad (11)

Perform the following linear transformation:

\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} L_{11} & \cdots & L_{1p} \\ \vdots & \ddots & \vdots \\ L_{n1} & \cdots & L_{np} \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix} = \begin{bmatrix} L_1^T \\ L_2^T \\ \vdots \\ L_n^T \end{bmatrix} X \quad (n \le p). \qquad (12)

If Y = [Y_1, Y_2, \cdots, Y_n]^T is expected to be used to describe X = [X_1, X_2, \cdots, X_p]^T, then Y should reflect as much information of vector X as possible, namely, the larger the variance D(Y_i) = L_i^T \Sigma L_i of Y_i, the better the description. In addition, to express the original information as effectively as possible, Y_i and Y_j should not contain repeated content, namely, cov(Y_i, Y_j) = L_i^T \Sigma L_j = 0. It can also be proven that if L_i = t_i, then D(Y_i) attains its maximum value \lambda_i and Y_i and Y_j are orthogonal.

[Figure 10: Sample image data of Food-101 (classes such as apple_pie, bread_pudding, caesar_salad, donuts, hamburger, pizza, ramen, sashimi, sushi, and tacos).]


(b) Reconstruct samples from the score matrix

In practice, the covariance matrix of the population X is typically unknown and must be estimated from samples. Let X_1, X_2, \cdots, X_n be samples of the population X and let X_i = [X_{i1}, X_{i2}, \cdots, X_{ip}]^T. Then, the sample measurement matrix is

X = \begin{bmatrix} X_1^T \\ X_2^T \\ \vdots \\ X_n^T \end{bmatrix} = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix}. \qquad (13)

Each row of matrix X corresponds to a sample and each column to a variable. Then, the sample covariance matrix S and the correlation coefficient matrix R are expressed as

S = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^T = (S_{ij}), \qquad (14)

R = (R_{ij}), \quad R_{ij} = S_{ij} / \sqrt{S_{ii} S_{jj}}.

Define the score of the jth principal component of sample X_i as SCORE(i, j) = X_i^T t_j. It is expressed as follows in matrix form:

\mathrm{SCORE} = \begin{bmatrix} X_1^T \\ X_2^T \\ \vdots \\ X_n^T \end{bmatrix} \left[\, t_1, t_2, \cdots, t_p \,\right] = XT. \qquad (15)

Invert Formula (15) and reconstruct the original samples from the score matrix:

X = \mathrm{SCORE} \cdot T^{-1} = \mathrm{SCORE} \cdot T^{T}. \qquad (16)

Typically, principal component analysis only uses the first m principal components to approximate the original samples.
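As a minimal sketch of how the three extracted deep features can be reduced with PCA and then fused by cascading (concatenation), the code below uses scikit-learn's PCA on each feature matrix and concatenates the reduced features into one vector per sample. The number of retained components m, the use of scikit-learn, and the random stand-in features are assumptions made for illustration, not details given in the paper.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_reduce(features, m):
    """Project an (n_samples, n_features) matrix onto its first m principal components."""
    return PCA(n_components=m).fit_transform(features)

# p8, fc1, fc2: deep features extracted for n samples (random stand-ins here).
n = 200
p8  = np.random.rand(n, 9216)   # 6 * 6 * 256 flattened pooling output
fc1 = np.random.rand(n, 4096)
fc2 = np.random.rand(n, 4096)

m = 128  # number of principal components kept per feature (assumed)
fused = np.concatenate([pca_reduce(p8, m), pca_reduce(fc1, m), pca_reduce(fc2, m)], axis=1)
print(fused.shape)  # (n, 3 * m): cascaded feature vector used for classification
```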

[Figure 11: Sample image data of Places2 (classes such as alley, bar, barn, cafeteria, church, corridor, kitchen, library_indoor, office, and plaza).]

Table 1: Accuracies of various activation functions on Food-101 and Places2.

Activation function    Accuracy rate (%), Food-101    Accuracy rate (%), Places2
Sigmoid                67.62                          45.74
Tanh                   75.94                          53.38
ReLU                   81.25                          60.58
PReLU                  83.44                          64.36
TReLU                  85.16                          68.23


[Figure 12: Comparison of five activation functions on Food-101 (accuracy (%) vs. iteration; sigmoid, tanh, ReLU, PReLU, TReLU).]

[Figure 13: Comparison of five activation functions on Places2 (accuracy (%) vs. iteration; sigmoid, tanh, ReLU, PReLU, TReLU).]

Table 2: Training times of five activation functions on Food-101 and Places2.

Activation function    Time (h), Food-101    Time (h), Places2
Sigmoid                2.6                   3.5
Tanh                   2.3                   3.2
ReLU                   1.9                   2.5
PReLU                  2.1                   2.9
TReLU                  2.2                   3.0

Table 3: Classification accuracies on Food-101 and Places2 of four algorithms.

Method                  Accuracy rate (%), Food-101    Accuracy rate (%), Places2
NIN                     85.62                          66.90
DSN                     88.18                          69.53
DBN                     88.53                          70.34
Method of this paper    89.47                          71.56


5. Analysis of the Experimental Results

5.1. Experimental Environment and Testing Datasets. The experimental environment that is utilized in this paper includes the following: CPU: Intel(R) Core(TM) [email protected] GHz; GPU: NVIDIA GeForce GTX 1070; physical memory (RAM): 16.0 GB; and a PC with the deep learning framework TensorFlow.

To examine the classification performance of the proposed CNN model, experiments are conducted on two image datasets: Food-101 and Places2. Food-101 is an image dataset that contains images of food. It includes 101 classes of food (western cuisine), and each class has 1000 images, which are used to automatically recognize the class of gourmet food. Places2 is an image dataset of scenarios. It contains 10 million images from over 400 classes of scenarios, and it is used for visual cognition tasks with scenarios and environments as the application contents. Figures 10 and 11 show sample image data from Food-101 and Places2.

5.2. Accuracy Comparison of Activation Functions. In the study of image classification and recognition, activation functions are highly important for CNN models. Through the nonlinear mapping of an activation function, CNN can realize stronger feature representation performance for handling more complex classification problems. This paper uses the TReLU activation function to improve the classification performance of the CNN model.

To evaluate the performance of the TReLU activation function in boosting the classification performance, comparison experiments are conducted on Food-101 and Places2 with the CNN that is designed in this paper, using the TReLU activation function and other common activation functions.

[Figure 14: Accuracies on samples of the Food-101 dataset.]

[Figure 15: Accuracies on samples of the Food-101 dataset.]


Food-101 and Places2 include many classes and are of high classification difficulty; see the experimental results in Table 1.

According to the experimental results in Table 1, the unsaturated nonlinear activation functions (e.g., ReLU) realize lower error rates than the saturated nonlinear activation functions (e.g., sigmoid), thereby suggesting that activation functions that are similar to biological neurons improve the classification performance.

In terms of the classification accuracy, the saturated nonlinear activation functions, namely, the sigmoid function and the tanh function, are outperformed by the unsaturated nonlinear activation functions, namely, ReLU, PReLU, and TReLU; hence, the activation functions that approximate biological neurons can improve the classification accuracy. The TReLU activation function exhibits excellent classification performance on the complex datasets, namely, Food-101 and Places2, and it outperforms the other functions, which further proves that the TReLU activation function can improve the classification performance of the CNN model and yield excellent generalization performance. Figures 12 and 13 compare the classification performances of the CNN model under the five considered activation functions more vividly. The TReLU activation function not only realizes higher classification accuracy but also has higher convergence speed than those of the other functions.

The experimental results also include the training times that were required by the five activation functions on Food-101 and Places2, which are shown in Table 2.

According to the experimental results in Table 2, the TReLU activation function requires almost the same training time as the tanh function. This is acceptable. The key result is that the TReLU activation function can enhance the accuracy and further increase the convergence speed. The operation of the activation function is expressed as Formula (17):

x_j^l = f\left(\sum_{i \in N_j} W_{ij}^l * x_i^{l-1} + b_j^l\right), \qquad (17)

[Figure 16: Accuracies on samples of the Places2 dataset.]

[Figure 17: Accuracies on samples of the Places2 dataset.]


where x_j^l represents the jth feature map of the lth layer; W_ij^l represents the connection weight between the ith feature map in the (l − 1)th layer and the jth feature map in the lth layer; ∗ represents the convolution operation; b_j^l represents the bias; and N_j represents the set of input feature maps.
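As a small numerical illustration of Formula (17), the sketch below computes one output feature map x_j^l from several input maps with plain NumPy, using "valid" cross-correlation in place of the ∗ operation (the convention followed by most deep learning frameworks) and ReLU as the activation f; all shapes and values are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def conv_feature_map(inputs, kernels, bias):
    """Formula (17): x_j = f(sum_i W_ij * x_i + b_j) for a single output feature map j."""
    kh, kw = kernels[0].shape
    h, w = inputs[0].shape
    out = np.full((h - kh + 1, w - kw + 1), bias)
    for x_i, w_ij in zip(inputs, kernels):              # sum over the input feature maps i in N_j
        for r in range(out.shape[0]):
            for c in range(out.shape[1]):
                out[r, c] += np.sum(x_i[r:r + kh, c:c + kw] * w_ij)
    return relu(out)

inputs  = [np.random.rand(5, 5) for _ in range(3)]   # three input feature maps x_i^(l-1)
kernels = [np.random.rand(3, 3) for _ in range(3)]   # weights W_ij^l, one kernel per input map
print(conv_feature_map(inputs, kernels, bias=0.1).shape)  # (3, 3) output map x_j^l
```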

5.3. Analysis of the Experimental Results. To evaluate the classification performance of the CNN model that is designed in this paper, which is based on deep feature fusion, experiments have been conducted on two image datasets, namely, Food-101 and Places2, and the results are compared with those of other image classification methods. Table 3 lists the classification accuracies. The recognition accuracies for each category in the Food-101 and Places2 datasets are shown in Figures 14–17.

According to the experimental results in Table 3, the proposed method can effectively improve the classification performance of the network model, and its classification accuracy is higher than those of the other algorithms. Network in network (NIN) is the predecessor of Inception. It appends 1 × 1 convolution kernels after the convolutional layer and replaces the fully connected layer with a global average pooling layer to reduce the number of training parameters and to effectively avoid overfitting. DSN differs from other traditional deep learning frameworks.

[Figure 18: Accuracy and loss on Food-101 for the algorithm that is proposed in this paper: (a) training accuracy rate; (b) training loss rate.]


It contains a deep set of networks, each of which has its own hidden layer. DSN enables isolated training of each module, so the modules can be trained in parallel; hence, it has high efficiency. Supervised training realizes back-propagation in each module, instead of over the entire network. DBN is composed of a multilayer unsupervised restricted Boltzmann machine (RBM) network and a one-layer supervised back-propagation (BP) network, and its training includes pretraining and fine-tuning.

The proposed CNN model is based on deep feature fusion, facilitates training and optimization of the network model via dimension reduction, utilizes a local normalization operation to expedite the network training, and improves its classification performance. Moreover, it effectively fuses the deep features and enables the network to extract as much useful feature information as possible. Via this approach, the classification performance of the model is enhanced. Figures 18 and 19 plot the training convergence of the proposed algorithm on Food-101 and Places2 in terms of the training accuracy and loss curves.

It is clearly evident from Figure 18 that, when the network model that is proposed in this paper is trained on Food-101, the training loss stabilizes and reaches the convergence state by the 150th iteration. At this time, the accuracy is 89%. As indicated in Figure 19, when the network trains on Places2, its classification performance remains high, even though Places2 has higher classification complexity. By the 125th iteration, the network has reached the convergence state, and the accuracy is 69%.

Through the above experimental results, it can be observed that the proposed method can realize satisfactory image classification performance. In Food-101, the proposed model realizes an accuracy of 89.47%, which is 3.85%, 1.29%, and 0.94% higher than those of NIN, DSN, and DBN, respectively, and its classification and recognition performances are also satisfactory. In Places2, its accuracy is 71.56%, which is 4.66%, 2.03%, and 1.22% higher than those of NIN, DSN, and DBN, respectively, and it outperforms these methods on classification and recognition.

6. Conclusions

The arrival of the Internet of Things is accompanied by a large amount of multimedia data.

[Figure 19: Accuracy and loss on Places2 for the algorithm that is proposed in this paper: (a) training accuracy rate; (b) training loss rate.]


The key problems to be handled by image classification and recognition are to identify and classify the target objects that are contained in the image regions of interest and to make judgments. With the properties of local connection and shared weights, CNN has stronger robustness in its invariance to translation, rotation, and scaling of the input image data space and realizes stronger image classification and recognition performance. Facilitated by the cascade method, this paper has effectively fused the deep features of CNN, reduced the dimensions of the features using the PCA algorithm, and made the extracted features more typical and diverse to strengthen the classification performance. It has also introduced local normalization after every convolutional layer to accelerate the convergence. The experimental results demonstrate that the proposed algorithm stabilizes and expedites the network training, thereby leading to higher classification performance and accuracy.

Data Availability

The simulation experiment data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Key R&D Program of China (2018YFB1402600) and the National Science Foundation of China (Grant Nos. 61772190, 61672221, and 61702173).

References

[1] J. P. Cobeña Cevallos, J. M. Atiencia Villagomez, and I. S. Andryshchenko, "Convolutional neural network in the recognition of spatial images of sugarcane crops in the Troncal region of the coast of Ecuador," Procedia Computer Science, vol. 150, no. 10, pp. 757–763, 2019.

[2] Ş. Öztürk and B. Akdemir, "HIC-net: a deep convolutional neural network model for classification of histopathological breast images," Computers & Electrical Engineering, vol. 76, no. 6, pp. 299–310, 2019.

[3] H. T. Mustafa, J. Yang, and M. Zareapoor, "Multi-scale convolutional neural network for multi-focus image fusion," Image and Vision Computing, vol. 85, no. 5, pp. 26–35, 2019.

[4] Ş. Öztürk and B. Akdemir, "Effects of histopathological image pre-processing on convolutional neural networks," Procedia Computer Science, vol. 132, no. 7, pp. 396–403, 2018.

[5] N. Sharma, V. Jain, and A. Mishra, "An analysis of convolutional neural networks for image classification," Procedia Computer Science, vol. 132, no. 5, pp. 377–384, 2018.

[6] M. E. Paoletti, J. M. Haut, J. Plaza, and A. Plaza, "A new deep convolutional neural network for fast hyperspectral image classification," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 145, no. 11, pp. 120–147, 2018.

[7] C. Bai, L. Huang, X. Pan, J. Zheng, and S. Chen, "Optimization of deep convolutional neural network for large scale image retrieval," Neurocomputing, vol. 303, no. 16, pp. 60–67, 2018.

[8] F. Zhang, N. Cai, G. Cen, F. Li, and X. Chen, "Image super-resolution via a novel cascaded convolutional neural network framework," Signal Processing: Image Communication, vol. 63, no. 4, pp. 9–18, 2018.

[9] H. Liang, J. Zou, K. Zuo, and K. Muhammad Junaid, "An improved genetic algorithm optimization fuzzy controller applied to the wellhead back pressure control system," Mechanical Systems and Signal Processing, vol. 142, article 106708, 2020.

[10] Y. Duan, F. Liu, L. Jiao, P. Zhao, and L. Zhang, "SAR image segmentation based on convolutional-wavelet neural network and Markov random field," Pattern Recognition, vol. 64, no. 4, pp. 255–267, 2017.

[11] S. Zagoruyko and N. Komodakis, "Deep compare: a study on using convolutional neural networks to compare image patches," Computer Vision and Image Understanding, vol. 164, no. 11, pp. 38–55, 2017.

[12] S. Yu, S. Jia, and C. Xu, "Convolutional neural networks for hyperspectral image classification," Neurocomputing, vol. 219, no. 5, pp. 88–98, 2017.

[13] Z. Huang, X. Xu, J. Ni, H. Zhu, and C. Wang, "Multimodal representation learning for recommendation in Internet of Things," IEEE Internet of Things Journal, vol. 6, no. 6, pp. 10675–10685, 2019.

[14] U. Raghavendra, H. Fujita, S. V. Bhandary, A. Gudigar, and U. R. Acharya, "Deep convolution neural network for accurate diagnosis of glaucoma using digital fundus images," Information Sciences, vol. 441, no. 5, pp. 41–49, 2018.

[15] Z. Liu, B. Hu, B. Huang, L. Lang, H. Guo, and Y. Zhao, "Decision optimization of low-carbon dual-channel supply chain of auto parts based on smart city architecture," Complexity, vol. 2020, Article ID 2145951, 14 pages, 2020.

[16] A. Sharma, X. Liu, X. Yang, and D. Shi, "A patch-based convolutional neural network for remote sensing image classification," Neural Networks, vol. 95, no. 11, pp. 19–28, 2017.

[17] R. Donida Labati, A. Genovese, E. Muñoz, V. Piuri, and F. Scotti, "A novel pore extraction method for heterogeneous fingerprint images using convolutional neural networks," Pattern Recognition Letters, vol. 113, no. 10, pp. 58–66, 2018.

[18] H. Kandi, D. Mishra, and S. R. K. S. Gorthi, "Exploring the learning capabilities of convolutional neural networks for robust image watermarking," Computers & Security, vol. 65, no. 3, pp. 247–268, 2017.

[19] Y. Zhang, R. Zhu, Z. Chen, J. Gao, and D. Xia, "Evaluating and selecting features via information theoretic lower bounds of feature inner correlations for high-dimensional data," European Journal of Operational Research, 2020.

[20] E. R. S. de Rezende, G. C. S. Ruppert, A. Theóphilo, E. K. Tokuda, and T. Carvalho, "Exposing computer generated images by using deep convolutional neural networks," Signal Processing: Image Communication, vol. 66, no. 8, pp. 113–126, 2018.

[21] A. Haidar, R. J. S. Almubarak, R. Long, S. Antani, and S. R. Frazier, "Convolutional neural network based localized classification of uterine cervical cancer digital histology images," Procedia Computer Science, vol. 114, no. 3, pp. 281–287, 2017.

[22] C. Barat and C. Ducottet, "String representations and distances in deep convolutional neural networks for image classification," Pattern Recognition, vol. 54, no. 6, pp. 104–115, 2016.

[23] N. Kumar, R. Verma, and A. Sethi, "Convolutional neural networks for wavelet domain super resolution," Pattern Recognition Letters, vol. 90, no. 15, pp. 65–71, 2017.

[24] H. Liang, A. Xian, M. Min Mao, P. Ni, and H. Wu, "A research on remote fracturing monitoring and decision-making method supporting smart city," Sustainable Cities and Society, vol. 62, article 102414, 2020.

[25] P. Witoonchart and P. Chongstitvatana, "Application of structured support vector machine backpropagation to a convolutional neural network for human pose estimation," Neural Networks, vol. 92, no. 8, pp. 39–46, 2017.

[26] A. Jalali, R. Mallipeddi, and M. Lee, "Sensitive deep convolutional neural network for face recognition at large standoffs with small dataset," Expert Systems with Applications, vol. 87, no. 30, pp. 304–315, 2017.

[27] D. Tomè, F. Monti, L. Baroffio, L. Bondi, and S. Tubaro, "Deep convolutional neural networks for pedestrian detection," Signal Processing: Image Communication, vol. 47, no. 9, pp. 482–489, 2016.

[28] H. Choi and K. H. Jin, "Fast and robust segmentation of the striatum using deep convolutional neural networks," Journal of Neuroscience Methods, vol. 274, no. 1, pp. 146–153, 2016.

[29] H. Zheng, W. Guo, and N. Xiong, "A kernel-based compressive sensing approach for mobile data gathering in wireless sensor network systems," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 48, no. 12, pp. 2315–2327, 2018.

[30] H. Liang, D. Zou, Z. Li, M. J. Khan, and Y. Lu, "Dynamic evaluation of drilling leakage risk based on fuzzy theory and PSO-SVR algorithm," Future Generation Computer Systems, vol. 95, pp. 454–466, 2019.

[31] Y. Zhou, D. Zhang, and N. Xiong, "Post-cloud computing paradigms: a survey and comparison," Tsinghua Science and Technology, vol. 22, no. 6, pp. 714–732, 2017.
