Prediction of the future trend of e-commerce

17
IN DEGREE PROJECT TECHNOLOGY, FIRST CYCLE, 15 CREDITS , STOCKHOLM SWEDEN 2021 Prediction of the future trend of e-commerce FREJA ENGSTRÖM DISA NILSSON ROJAS KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Transcript of Prediction of the future trend of e-commerce

IN DEGREE PROJECT TECHNOLOGY,FIRST CYCLE, 15 CREDITS

, STOCKHOLM SWEDEN 2021

Prediction of the future trend of e-commerce

FREJA ENGSTRÖM

DISA NILSSON ROJAS

KTH ROYAL INSTITUTE OF TECHNOLOGYSCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

KTH BACHELOR THESIS REPORT. INDUSTRIAL MANAGEMENT AND ENGINEERING AND COMPUTER SCIENCE. JUNE 2021. 1

Prediction of the future trend of e-commerce inSweden

Prognostisering av trender inom e-handel i SverigeEngstrom, Freja & Nilsson Rojas, Disa

Abstract—In recent years more companies have invested inelectronic commerce as a result of more customers using theinternet as a tool for shopping. However, the basics of marketingstill apply to online stores, and thus companies need to conductmarket analyses of customers and the online market to be able tosuccessfully target customers online. In this report, we proposethe use of machine learning, a tool that has received a lot ofattention and positive affirmation for the ability to tackle a rangeof problems, to predict future trends of electronic commerce inSweden. More precise, to predict the future share of users ofelectronic commerce in general and for certain demographics.We will build three different models, polynomial regression, SVRand ARIMA. The findings from the constructed forecasts werethat there are differences between different demographics ofcustomers and between groups within a certain demographic.Furthermore, the result showed that the forecast was moreaccurate when modelling a certain demographic than the entirepopulation. Companies can thereby possibly use the models topredict the behaviour of certain smaller segments of the marketand use that in their marketing to attract these customers.

Abstract—Pa senare ar har manga foretag investerat ielektronisk handel, aven kallat e-handel, vilket ar ett re-sultat av att individer i samhallet i storre utstrackninganvander internet som ett redskap. Grunderna for mark-nadsforing galler fortfarande for webb-baserade butiker, ochdarmed behover foretag genomfora marknadsanalyser over po-tentiella kunder och internet-marknaden for att kunna lanserastarka marknadsforingskampanjer. I denna rapport foreslar vianvandning av maskininlarning, ett verktyg som har fatt mycketuppmarksamhet pa senaste tiden for dess formaga att hanteraolika problem kring data och for att prognostisera framtidatrender for e-handel i Sverige. Mer exakt kommer andelenanvandare av e-handel i framtiden prognostiseras, bade generelltoch for enskilda demografier. Vi kommer att implementeratre olika modeller, polynomisk regression, SVR och ARIMA.Resultaten fran de konstruerade prognoserna visar att det finnstydliga skillnader mellan olika demografier av kunder och mellangrupper inom en viss demografi. Dessutom visade resultatenatt prognoserna var mer exakta vid modellering av en vissdemografi an over hela befolkningen. Foretag kan darmedmojligtvis anvanda modellerna for att forutsaga beteendet hosvissa mindre segment av marknaden.

The internet has grown increasingly popular during the lastdecades, and technology is developing to be more convenientto use. With smartphones, computers, and constant connectionto the internet we can search for information, talk to peopleacross the world and buy goods and services whenever andwherever. As the number of internet users increases, so doesthe number of online shoppers and companies interest in elec-tronic commerce. However, all people do not feel comfortableon the internet and some are more reluctant to use the services

it provides. For companies to understand the customer andtheir behaviour while being able to specialize and personalizemarketing, companies need an understanding of the futuremarket of electronic commerce.

Index Terms—ARIMA, demographics, electronic commerce, e-commerce, machine learning, market analysis, polynomial regres-sion, segmentation, support vector regression.

I. INTRODUCTION

TODAY’S society is driven towards technological solu-tions for efficiency, effectiveness, simplicity etc. and the

internet provides more and more services for everyday life.Online shopping has therefore grown steadily over the lastcouple of years as more consumers discover the benefits ofonline transactions. The consumer is no longer bound by storeopening hours, can access the store from wherever, givenmore choices and information, and the possibility of easilycomparing products to mention some advantages [1].

Companies are not far behind in discovering the benefitsof electronic commerce to reach customers, also called e-

commerce. The trend among companies within the EU is thate-commerce is representing a larger part of the annual turnover[2]. While this may be the case for many companies, there arealso companies with little to no revenue from e-commerce.However, because of the emerging technologies and changingconsumer behaviour, the market is changing. Companies needto learn how to take advantage of the opportunities the internetcreates and the new ways of interacting with consumers,especially those companies not yet familiar with e-commerce.

In general, companies need to know the consumer andwhere to find them. Different groups in society will respond toa change from physical stores to online shopping in differentways. Furthermore, marketing strategies differ between phys-ical and online store [3], and as the online market becomesincreasingly competitive when more companies enter the mar-ket, knowledge of the consumer becomes critical for success[1]. Thus, companies benefit from predicting the shoppingbehaviour of the consumer. Information about whether theconsumer is using the internet for shopping can be used as thebasis for selecting communication channels and help improvemarketing campaigns and make them successful.

Last year’s pandemic affected the world in many ways, andamong other things, one consequence was a shift in consumersattitude towards online shopping. One effect of lock-downsand restrictions was that groups in society, which previous tothe pandemic were quite unfamiliar with online shopping and

KTH BACHELOR THESIS REPORT. INDUSTRIAL MANAGEMENT AND ENGINEERING AND COMPUTER SCIENCE. JUNE 2021. 2

the internet in general, had to get familiar with the internet andstart to explore the possible benefits of online stores and otherinternet services. One example is the senior citizens, whereinSweden one out of ten tried online shopping for the first timeduring the year 2020 [4]. Furthermore, Sweden experiencedan increase in e-commerce, as the annual sales revenue frome-commerce grew by 40% during 2020 [5].

A. Problem

One problem closely connected to commerce and e-commerce is for companies to find suitable communicationchannels to reach the consumer. The solution for this problemis often market analysis and segmentation of the market.

When creating and designing a product, companies haveone or more specific groups of consumers in mind. Thus, itbecomes essential to find those consumers and likewise, it isessential to know how to reach them. A market analysis can beconducted in various ways, and one approach is to attempt topredict the future shopping behaviour of consumers. By doingso, companies can better understand how and where to launchnew products for them to be successful with the consumer. Inthe case of this report, the behaviour subject to predictionis online shopping. That is, companies need to be able topredict the rate at which different groups of consumers shiftto online shopping, what groups already have transitioned toe-commerce and what groups will take more time to transition.

To make this prediction of consumer behaviour, the pro-posed approach consists of machine learning algorithms andregular regression. The reason behind implementing multiplemodels is the fact that prediction models are difficult toevaluate due to the lack of future data to compare to. Byimplementing multiple models it is possible to compare thepredictions and thereby evaluate the performance of eachmodel.

One of the main challenges with building an accurate modelis the limitations of the available data. In the case of this report,the data consist of a shorter time series, which makes the datasparse.

B. Purpose

The report will investigate mainly two aspects of consumerand e-commerce in Sweden: what general trends can beidentified and what differences between different demographicgroups can be seen in their approach towards online shopping.

The result of the investigation can then be used as a basisfor market analysis. Companies involved in e-commerce orattempting to enter the online market can use the findingsto increase their knowledge about what demographic groupsare potential consumers. The results could also be used toimprove the understanding of the consumer and how to targetthem.

1) Scientific questions:

• What differences in consumer behaviour are seen whencomparing different demographic groups (geographic lo-cation, education, age and work sector)?

• What future trends in e-commerce can be identified?

• What effects on online shopping can be seen from theperspective of the 2020 pandemic?

• What differences can be identified between differentforecasting models when applied to a limited time series?

2) Hypothesis: The hypothesis is that based on the givenhistoric data regarding e-commerce it is possible to makewell-based predictions that can be used as a basis for marketanalysis.

C. Stakeholders

The expected outcome of the report is multiple forecasts ofthe future trend of online shopping in Sweden in general andover different demographic groups.

This is of interest to all companies who conduct businesswhere the consumer is an end-customer, regardless of whetherthe company is operating from physical or online storestoday. Companies need to be where the customers are, andthe general trend among companies should be to meet thecustomers on the internet. Few companies are completelyunaffected by the transition towards everyday use of internetin the daily life. Lack of execution or strategies for using theinternet to reach customers will likely result in the companylosing significant market share. Thus, companies can benefitfrom the findings of this report to increase the understandingof the consumers, as well as by using the information aboutpossible trends when targeting customers.

D. Sustainability

The report and its findings have the potential to contributeto several of the UN:s goals of sustainable development.No. 9 ”Build resilient infrastructure, promote inclusive andsustainable industrialization and foster innovation”, no. 11”Make cities and human settlements inclusive, safe, resilientand sustainable” and no. 12 ”Ensure sustainable consumptionand production patterns” to mention some. [6]

All the above-mentioned goals can be related to sustainableindustry and sustainable consumption, which can be facilitatedby e-commerce. Consumers no longer have to partake inunsustainable ways of travel to reach stores, instead, they cango online. Therefore, even if the distribution of bought goodsis done via less sustainable ways, the net effect of greenhousegas emissions is reduced. Also, previous shopping areas, malls,large parking lots and other areas built for stores can be usedfor better purposes and the need for deforestation to build newsuch areas disappears.

II. BACKGROUND

A. Market Analysis

For a company to create successful marketing strategies,understanding the market in terms of the consumer,competitors, distributors and suppliers etc. is essential. Togain this understanding, there are many tools available(for example SWOT, PESTLE, Ansoff-matrix) that helpthe company analyse the internal and external environmentas well as the consumer. When the market analysis hasbeen conducted, the company can build on the discovered

KTH BACHELOR THESIS REPORT. INDUSTRIAL MANAGEMENT AND ENGINEERING AND COMPUTER SCIENCE. JUNE 2021. 3

background knowledge to adjust the business idea andformulate a business strategy, where the goal is to appealto the consumers to make them paying customers. Theconsumer is often identified by consumer analysis and market

segmentation. When the market segment is known, thecompany can create and launch efficient marketing campaignsspecific to the segment. [7]

1) Segmentation: The goal for companies is to offer a prod-uct to meet customer demands. However, the problem is thatthere are too many customers for one company to cater to eachof their individual needs. The solution is therefore to clustercustomers with similar demands and process the demands ofthe cluster instead, i.e. segmentation. After a conducted marketsegmentation, the company can use strategies to analyse thesegments in terms of growth, competitors and size to determinewhat or which segments to focus on. The above-mentionedfactors all influence the profitability of the segment and howto best target consumers.

Furthermore, when the target groups of customershave been identified via segmentation, the company canconduct further consumer analysis to gain an even betterunderstanding of the customer and segment. The choice ofcommunication and platform to target the consumer are keyfactors for successful marketing. Therefore, the companyneed to explore what is the best-suited platform and how tocommunicate the message of the product to the consumers. [7]

2) Forecasts: There are also certain difficulties connectedto forecasting in general which needs to be considered whencompanies attempt to predict the market or consumer be-haviour. First, a suitable period should be chosen, the longerthe time span of the forecast the more uncertain it gets. Second,limitations of data, time or resources also add constraints tothe forecast. If any of the mentioned variables are scarce, oneshould preferably use time series analysis, which models timeas the independent variable. Thus, the forecast is limited toshow correlations and trends. However, for the purpose of thereport, correlations and trends are just what is needed. Third,complex methods are not necessarily preferred over simplerones. Simple models are less sensitive to inaccuracies in thedata or the model compare to more complex ones. However,a model can also become too simple and overlook importantfactors or miss correlations. Thus, there is a trade-off betweensimple and complex models to be considered. [8]

B. Polynomial Regression

Regression is one of the most commonly used tools forforecasts and identifying correlations between variables. Themost fundamental approach is simple linear regression, whichcan model linear predictions and identify the relation betweeninput values, x, and the target values, y, as a straight line.The more general model is nonlinear or polynomial regression

which can model more complex relations. The model allowsone to fit an equation of the k:th order to the data:

Y = b0 + b1x1 + b2x22 + ...+ bkx

kk + " (1)

With b being the coefficients and " the random error whichassumed to be independent between the input values, x, andnormal distributed according to " ⇡ N(0,�2) [9].

By increasing the order of k, the equation can fit the trainingdata better. Finding the optimal degree of the polynomial canbe done by Forward Selection, where k is increased until theoptimal value is found. The coefficients can also be optimized,which makes it possible to get the equation that fits wellto the data. The optimization of the degree of polynomialand coefficients can be done in multiple ways, but the moststraightforward approach is to minimize the error measured asMean Square Error (MSE). [10] See equation (9) in AppendixB.

C. Support Vector Regression

Support Vector Regression (SVR) uses the idea behindSupport Vector Machine (SVM) and applies it to regressionproblems (see Appendix VIII-B for SVM). The model can alsobe used for the prediction of unseen data. To solve non-linearregression problems, the technique for mapping the non-lineardata with the kernel trick is used, just as for a regular SVM.Given data points (x1, y1)...(xn, yn), the goal of the model isto find the optimal function to map the input values, x, to thecorresponding target, y.

The main difference between SVR and SVM is the lossfunction used, which has to be changed to solve regressionproblems. There are a few different loss-functions such asLaplace-, Huber’s Gaussian- and "-sensitive loss-function.Among these, the "-sensitive loss function is the most com-monly adopted. [11] See equation (10) in Appendix B.

The loss function is equal to 0 only if the data point iswithin the pre-determined radius of the predicted line. Sincethe loss function is subject to minimization, the model willseek to avoid any data points outside the radius when findinga solution. Thus, the optimal solution will have the majorityof the data points within the radius. [11] See Figure 4 inAppendix A for an illustration of a simple SVR.

The generic function of SVR with the kernel function canbe written as f(x) = (w ⇤ �(x) + b). The most commonkernels are Gaussian RBF: �(x, xi) = exp(�||x�xi||2

2�2 ) andthe polynomial kernel: �(x, xi) = (xT

i ⇤ x � c)p. With �respectively p the parameter to tune. The minimization of theregression risk can thus be expressed as:

Rreg(f) =1

2|W |2 + C

nX

i=1

(⇣i � ⇣⇤i ) (2)

Subject to:

( qi � (w ⇤ �(xi))� b "+ ⇣i

(w ⇤ �(xi)) + b� qi "+ ⇣⇤i⇣⇤i , ⇣i � 0, for i = 1, ..., n

(3)

Where Rreg is the function to minimize, with the constraintsdefined above. The variables to tune, which is done by the user,are " (the radius or allowed deviation), C (slack or trainingerror) and � (kernel function). [12], [13]

KTH BACHELOR THESIS REPORT. INDUSTRIAL MANAGEMENT AND ENGINEERING AND COMPUTER SCIENCE. JUNE 2021. 4

D. ARIMA

Autoregressive Integrated Moving Average (ARIMA) is astatistical analysis model which uses time-series data to predictfuture trends. An ARIMA model is, by standard, characterizedby three parameters p, d and, q.p represents the order of the Auto Regressive term or the

lag order, also known as the number of lag observations in themodel. q is the size of the Moving Average window. The lastparameter is d, which is the number of times the observationsare differenced. When d = 0, the series is already stationary.[14]

In the Auto Regressive model (AR), the function is, aspreviously mentioned, only dependent on its lags and theequation looks as follows:

Yt = ↵+ �1 ⇤ Yt�1 + �2 ⇤ Yt�2

+...+ �p ⇤ Yt�p + "1(4)

Where ↵ is the intercept term and � is the coefficient of lagthat the model estimates.

Similarly, in the Moving Average model (MA), the functionis dependent only on the lagged forecast errors.

Yt = ↵+ "t + �1 ⇤ "t�1 + �2 ⇤ "t�2

+...+ �q ⇤ "t�q(5)

The error terms are the errors of the auto-regressive models(see above) of the respective tags. For example, the error "1is the error of the following equation:

Yt = �1 ⇤ Yt�1 + �2 ⇤ Yt�2

+...+ �0 ⇤ Y0 + "t(6)

So, the ARIMA model combines these two terms and differ-entiate the series to make it stationary, i.e. the equation canbe written as:

Yt = ↵+ �1 ⇤ Yt�1 + �2 ⇤ Yt�2 + ...

+�p ⇤ Yt�p ⇤ "t + �1 ⇤ "t�1

+�2 ⇤ "t�2 + ...+ �q ⇤ "t�q

(7)

To use the model, the hyperparameters need to be determined.The purpose of d is to make the time series stationary.However, there is also a risk of over-differentiating, whichwill affect both the parameters and the outcome. To find thecorrect value of the parameter it is important to identify theminimum difference to get a relatively stationary series, wherethe autocorrelation function (ACF) plot reaches zero quicklyand the mean roams around a set value. [14]

To determine the order of the parameter p (AR term) it isimportant to identify if the model needs it, and then determineit by using the partial autocorrelation function (PACF) plot.PACF finds the correlation of the residuals with the next lag,unlike ACF which finds the correlation of the present withpast lag. The value of p is then determined to be the lag valuewhere the PACF plot crosses the upper confidence level thefirst time. [15]

The order of q (MA term) is obtained from the ACT plot.The information is found where the plot crosses the upperconfidence level and its corresponding lag value. Generally, it

is preferable to set the parameters conservatively to avoid over-differentiation. To fix potential over- or under-differentiationit is possible to add either an additional MA or AR term. [15]

When the ARIMA model is built it can be favourable tooptimize it using out-of-time cross-validation, which is doneby looking back at previous data points and using them toforecast as many steps back that were taken. These forecastedpoints are then compared to the actual ones for that time. SeeFigure 5.

As the different parameters are selected for the model theprediction will look similar to Figure 6. Additional fine-tuningcan then be done to match the observed values to predictedvalues more precisely.

E. Data set

The data consists of the annual survey done by Internetstif-telsen, Svenskarna och Internet, which is a national surveyabout the use of the internet in Sweden. The data fromthe reports are aggregated by the possible responses to eachquestion. One of the questions in the survey concerns onlineshopping behaviour. That is, the data shows the share of usersof e-commerce in Sweden in a specific year.

The survey also provides multiple demographic variables,where the ones chosen for this report are geographic location,education, age and work sector. The data spans from the year2000 to 2020 and from 2015 to 2020.

F. Prediction of very short Time series

The dilemma of insufficient data is one of the most commonproblems encountered when attempting to model a forecastor build any model based on historic data. In theory, thenumber of data points needs to be greater than the number ofparameters of the model. However, the number of data pointsneeded is often a lot more than what the theory suggests. [16]

In the best scenario, there is enough data to both trainand test the model on unseen data. This is not possible withlimited data. One way to increase the available training datais by n-fold cross validation. Because the model’s purposeis forecasting, the training data should always be before thetest data when put on a timeline, which adds a restriction.One method of cross-validation which can be used and takesthe restriction into concern is rolling-origin-recalibration

evaluation.

1) Rolling-origin-recalibration evaluation: By taking asmall subsample of the available data as training data, the nextsample or subset can be the training subset’s test data. At thenext step, the test data is added to the training subset and thefollowing data point or subset is the new test data. This wayof dividing the data into training and test data can be donethroughout the available data. The accuracy of the model willbe the average computed accuracy at each step [17].

G. Related Work

Multiple studies have been done to examine SVR andthe models’ capabilities. The majority of studies have shown

KTH BACHELOR THESIS REPORT. INDUSTRIAL MANAGEMENT AND ENGINEERING AND COMPUTER SCIENCE. JUNE 2021. 5

positive results regarding forecasting models based on the al-gorithm, thus the general conclusion is that the model performssatisfactorily. SVR has been used to forecast different timeseries such as the stock market [11], [18], energy consumption[19] and other time series [20], [21].

As mentioned by U. Thissen et al. SVR has multipleadvantages making the model attractive to use. Especially themodel’s generalization capabilities combined with the abilityto always find a unique and sparse solution which also is theglobal solution [20]. The model can also be used for data withmultiple features which is shown in the study regarding energyconsumption [19]. Furthermore, the same article uses n-foldvalidation when training the model, which indicates that theapproach to handle limited data and hyperparameter selectionwith n-fold validation is suitable for SVR.

The 1998 study “The use of ARIMA models for reliabilityforecasting and analysis” comes to the conclusion whereHo and Xie claim that the model is very flexible and givestatistically accurate predictions, something which has beenaccepted since the model is widely used for forecasting today[22]. This is supported in the Babai et al. 2013 article wherethey forecasted the demand in a two-stage supply chain. Thepublication reaches the same conclusion that the model is bothefficient and accurate. This study is also more relevant seenfrom the perspective of this report, as it also investigates futuretrends [23].

In the 2016 article “Gold Price Forecasting Using ARIMAModel” Guha and Bandyopadhyay use ARIMA for predictingthe price of gold. This article is, as the one previouslymentioned, more in line with what this report will investigatewhich gives insight into both opportunities and limitations ofthe model. The report explains how the model is applied andthe limitations, one of which is that it might be more suitablefor predicting values in the near future, something that wasnot mentioned in earlier reports but is important to take intoconsideration. [24]

III. METHOD

The study consisted of data pre-processing, implementationand parameter optimization for each model, and lastly predic-tion and evaluation of the models. The models implementedwere polynomial regression, SVR and ARIMA.

The findings, that is the forecasts, of the different models,were evaluated in terms of being suitable as a part of a marketor consumer analysis.

A. Data Pre-processing

The data from Internetstiftelsen was transformed into aformat compatible with the models and modelled as a timeseries. x(t), t = 1, 2, ..., n with each value of x, year, havinga corresponding target value, y, percentage of users.

Further processing of the data was made for the differentrepresentations of the data, where each representation repre-sents a certain demographic.

B. Implementation

The models are built in Python using standard libraries suchas NumPy, Scikit-learn, and Matplotlib.

Each model was optimized by iteratively changing thevalues of the hyperparameters. For each model, MSE wascomputed and the optimal model was defined as one withthe lowest average MSE score. When the optimal model wasfound, the values of the hyperparameters were noted and themodel predicted the future e-commerce. This was repeated forthe different data sets.

The forecasts and computed MSE was compared to eachother as a method of evaluation since there is no way to eval-uate a forecast. To identify differences within demographics,multiple predictions were made. One for each demographicgroup.

Additionally, the accuracy of the models was computedto further evaluate how well the different approaches to theproblem fit the historic values and forecasted the future. Theaccuracy was calculated as:

PercentageError =Actual � Forecast

Actual⇥ 100 (8)

Some of the demographic groups had little data or missingdata points. This was dealt with by excluding the groups themodel did not have sufficient data for to make a forecast.

1) Polynomial Regression: By implementing forwardselection with the start at an equation of order p = 1(assuming the data to be non-linear), the degree of thepolynomial was increased iteratively. The maximal possibleorder of the polynomial is n � 1 where n is the number ofdata points. At each step, MSE was computed.

2) Support Vector Regression: The model used the "-sensitive loss function to compute the cost function and theGaussian RBF kernel function for modelling data to a higherdimension. The values of hyperparameters C, respectively theallowed deviation ", was optimized by systematically testingdifferent values and computing MSE.

3) ARIMA: Initially, the PACF and ACF plots were usedto delimit the possible parameters. This guideline was thenused to optimize the implementation of the model by testingdifferent values of the three hyperparameters. The model wasthen optimized by using out-of-time cross-validation.

The accuracy was computed as MSE over the last step.

C. Limitations

The data set could be considered small since it consistedof 20 data points at the most. To handle the problem withlimited data, rolling-origin-recalibration evaluation was used.The data was divided into smaller subsets with only a few datapoints in each subset. The general forecast had 4 subsets of5 points, while the different demographics had a test set thatconsisted of the 2 last data points and the prior data pointsmade up the training set. MSE was computed as the averageMSE.

KTH BACHELOR THESIS REPORT. INDUSTRIAL MANAGEMENT AND ENGINEERING AND COMPUTER SCIENCE. JUNE 2021. 6

The forecast of the future was set to the time horizonof two years ahead, 2021 and 2022. The choice of such ashort forecast was based on the high degree of uncertaintyconcerning the future. For a field like the internet, which ischaracterized by rapid change in technology, one can assumethe future of e-commerce to be more uncertain than other,historically more stable fields. Thereby a short time horizonwas chosen, to limit the amount of uncertainty.

Furthermore, the models did not consider the fact thatthe population percentage cannot exceed or equal 100%, andthereby gave predictions of values over 100%. These valuescould be achieved because of the simplicity of the models, butcannot occur in reality. For cases where the predicted valuewas over 100, we disregarded the actual value and insteadacknowledge the trend as strongly increasing and being closeto 100%. All values were however included in the graphicrepresentations.

IV. RESULT

Below we will highlight the most interesting findings. As thedata concerns the share of users of e-commerce, all predictedvalues will thus represent the share of users for a specificyear. For values in tables and graphs over the forecasts, seeAppendix IX.

General findings were that the general forecast for all threemodels showed a declining trend, see Table I below. Themodels also forecast the elderly and people living in thecountryside to increase their use of e-commerce over the nextfew years.

General Prediction 2021-2022

Year PolynomialRegression

SVR ARIMA

2021 81 83 842022 77 77 79

TABLE IOVERVIEW OF THE GENERAL PREDICTION DONE BY THE DIFFERENT

MODELS FOR YEAR 2021-2022

The removal of the data point of 2020 showed little impacton the forecasts. For the years 2021 and 2022, the change intraining data mostly affected the slope of the forecast. Thepredicted values for the coming years are quite similar withor without the last data point for all forecasts.

A. Polynomial Regression

The general forecast showed a declining trend over the nextyears, where fewer consumers will use online shopping. Theforecast decreases by 10 percentage points over the years2020-2022, measured as the difference between the actualvalue of 2020 and the forecast of 2022.

Furthermore, it is clear how the model is influenced byhistorical data and the degree of the polynomial. The generalmodel is of the second degree, and when visualizing theforecast the curves of a second-degree polynomial are visible(see Figure 1). Likewise, the majority of the demographicforecasts are of the first degree, something that also canbe seen by the straight lines which are the forecast. Some

forecasts also gave unrealistic results with values over 100%during the training phase, which can be seen as a limitationthat comes with the simplicity of the model.

Fig. 1. Forecast over all data points for the regression model. The 2:nd degreecurves are clearly visible.

Removing the data point of 2020 from the training datamade the slope of the prediction steeper, that is the trenddeclined faster. See Figure 7 and 8. However, the value for2022 only differed by 1% between the two forecasts (see TableIII in IX-A).

The majority of the forecasts over the demographics showeda stable or slightly increasing trend. Only one group, Finans

from the work demographic showed a strong declining trendwith a drop of about 10 percentage points.

MSE for the model differed highly between the data sets.The predictions over demographic groups had a lower MSEcompared to the general model. This is probably since the gen-eral model had more data points to measure. The percentageerror was also quite varied, see Table II in IX-A.

B. Support Vector Regression

The general forecast showed a downward trend which isquite steep. Over the years 2020-2022 the number of usersdecrease by 10 percentage points, measured as the differencebetween the actual value of 2020 and the forecast of 2022.

The result of the forecasts over the different demographicsis quite different from the general trend since they all havea more or less straight trend for the years 2021-2022. Someof the forecasts showed a slight increase or decrease over theperiod 2020-2021 but then a stable trend (see Figure 2).

Fig. 2. Forecast over the segment based on age. The future trends for allsubgroups are straight or slightly increasing.

In general, the model does not fit very well to training data,see Figure 13 and 14. The forecasts for the general modeldone during the training phase are not close to the actual

KTH BACHELOR THESIS REPORT. INDUSTRIAL MANAGEMENT AND ENGINEERING AND COMPUTER SCIENCE. JUNE 2021. 7

values, which also can be seen by the high MSE of the modelor by the large difference between predicted and true values(see Table V IX-B). However, MSE-score for the demographicgroups are much lower and some values can be consideredsatisfactory. Similarly, the percentage error of the model differsquite a lot when comparing the general forecast to the averagedemographic forecast.

Moreover, the removal of the data point of 2020 hadlittle influence on the forecast and gave results in line withthe results from the regression model. The forecasted valuesbefore 2020 were the same even after removing the data point.However, the predicted values after 2020 showed almost nodifference at all compared to the forecast with all data points.The value predicted for 2022 was the same and for 2021 therewas only a 1% difference between the predictions.

C. ARIMA

In general, the ARIMA model predicted a negative trendas seen in Figure 18. With the removal of the data point of2020, the forecast behaved similarly to the forecast with thedata point. However, the declining trend was not as steep, seeFigure 20.

In contrast, the results from the different demographicgroups differed, as some predictions increased while othersdecreased. As seen in Figure 3, the predicted values for peopleliving in the countryside were lower in the year 2021 comparedto 2020. This value is expected to rise the following year. Byobserving the same graph, the prediction for those located in acity shows the opposite, as the percentage is expected to firstrise in 2021 and then fall the following year.

Fig. 3. Forecast over the segment based on location of living. The forecastof those living on the countryside is seen increasing in 2022.

Another demographic group where the prediction is a per-centage increase is the group under education which have onlyfinished Grundskola. The model predicted higher percentagesfor both 2021 and 2022, which can be seen in Figure 23.

Most age groups showed a relatively stable forecast withvalues close to the ones from previous years. However, 76+showed a decrease in percentage for the year 2021, which thenrose again in 2022, similarly to the forecast for those livingin the countryside. See Figure 21.

In general, the model produced forecasts which fit wellfor the existing data points. Although, as seen in Table Vin IX-C, some demographics were less accurate as the MSE

was quite high relative to other values. For example, somesubdivisions in the sector demographic had a much higherMSE than others in the same group (see the MSE for Annan

vs Skola/Utbildning). Additionally, the percentage error of thegeneral forecast, as well as the different demographics, showedsmall errors from the historic data points. See table IX-C

V. DISCUSSION

The different models resulted in quite different forecasts,in some cases even contradictory of each other. The variedresults illustrate the uncertainty of prediction models and howthe forecast is dependent on the chosen model.

The investigation of the report and its results did offerpredictions for the future of e-commerce which can be usedas a basis for market analysis. However, as the three differentmodels also produced deviating results, it is quite difficultto do a market analysis based only on the produced results.Instead, it might be possible for companies to draw someconclusions regarding a possible behaviour but the resultsare not strong enough to base the market analysis on alone.Furthermore, the models all show a higher accuracy wheninvestigating a smaller group, which indicates that the modelsare better suited for forecasting specific groups’ behaviour.

A. Effects from the pandemic

Looking at the data set on which the models are built, itis clear that during the height of the pandemic in 2020 e-commerce dropped by several percentage points, somethingthat was quite unexpected given the circumstances. As localrestrictions meant that many people refrained from shoppingin physical stores and many malls and stores decreased theiropening hours, it was reasonable to think that consumers mighthave turned towards e-commerce. However, as previouslymentioned, this was not the case, which leads us to otherpossible explanations as to why e-commerce decreased in2020.

One natural explanation is the fact that many people losttheir jobs or were laid off short-term, which meant less incometo dispose of. Another reason is that the purchases of services(tickets, travels etc.) online has drastically decreased as eventshave been cancelled and travels were largely restricted bothnationally and globally, a product group that previously madeup a large portion of the total e-commerce. Furthermore,the data do not include the occurrence of online shoppingor monetary aspects which can result in the forecasts beingmisleading.

Thereby, the general prediction, which showed a decliningtrend, can be deceiving in the sense that e-commerce forcertain customer groups and certain product groups increased.Individuals already familiar with online shopping increasedthe number of purchases done via the internet, while othergroups of individuals decreased their overall purchases andaccordingly also decreased their online shopping.

B. Demographic groups

The different models sometimes resulted in contradictingforecasts for a certain demographic, which makes it hard

KTH BACHELOR THESIS REPORT. INDUSTRIAL MANAGEMENT AND ENGINEERING AND COMPUTER SCIENCE. JUNE 2021. 8

to draw solid conclusions about the future trend for thatdemographic. However, the forecasts also seemed to agree onthe trend of other demographics, for example, age.

We can see that the general finding for the elderly in Swedenis an increase in e-commerce since the three groups 56-65, 66-

75 and 76+ all showed increasing trends. Thus, these groupswill likely be more present on the internet in the future. Thisis probably the effect of the pandemic, where individualshave to get more comfortable with the different tools offeredby the internet due to offices encouraging working fromhome and restrictions for senior citizens regarding socializingwith others. Another explanation might be that the youngergeneration, consisting of people who already have embracedtechnology and internet solutions, in the near future will enterthe group of elderly in Sweden. Thereby, the groups’ generalhabits concerning the internet will change accordingly.

When dividing data after the demographic of the degreeof education, the majority of the models show an upwards-sloping trend for the group of Grundskola. Assuming themajority of people with a low level of education belong tothe part of the workforce with lower income, the jobs theyoccupy are the ones employers cut first when money becomestight for the company. Due to the previous slow growth inSweden of 2019 and the pandemic in 2020 many low-incomejobs disappeared resulting in the declining trend.

However, the predictions show this trend being turnedaround. Based on the estimates of these trends and history,when the economy recovers the jobs previously dismissed willbe reinstated. As a result of the decrease in unemployment,people have more income to dispose of and thereby money tospend on e-commerce.

Additionally, when investigating the groups of individualsliving in the countryside the models agree on an upwards trendfor the coming years, something that can be interesting toinvestigate further. The different models based on the worksector contradicted each other and thereby gave inconclusiveforecasts.

One general conclusion which can be drawn from the gath-ered demographic forecasts is that the MSE and percentageerror is lower compared to the general forecasts. This indicatesthat the models perform better and thus are more accuratewhen forecasting the behaviour of a smaller group.

C. The Models

The majority of forecasts modelled by ARIMA resulted inoscillating predictions and a low MSE. This can be seen asa result of using a complex model to predict a future witha lot of uncertainty in the form of unpredictable variables.On the other hand, the regression model was often built on alow degree of the polynomial and therefore resulted in modelswith good generalization capabilities. However, the nature ofthe low polynomial also made it possible for the forecasts topredict unrealistic values.

The optimal model of the three implemented to forecastconsumer behaviour in terms of marketing purposes is hardto determine. However, the findings from the report can offersome guidelines and features of the models to be consideredwhen choosing a model.

First, the complexity of the behaviour subject for predictiondecides what model is suitable. A more complex model willoptimize itself to historical data and past trends have a largeinfluence on the shape of predicted trends. While a simplemodel is more general, but faces the risk of being too generaland thus miss to model important relations.

Second, the data available has to be considered. The re-gression model is bound by the constraint of the degree ofthe polynomial, that is, there is a limit of the degree whichlimits the possible relations the model can portrait. ARIMAhas a minimum number of data points needed for trainingdata which adds a constraint to the data set being used astraining data. SVR, on the other hand, has not the sameconstraints concerning training data, but many possibilities ofhyperparameters to fine-tune which can be time-consuming.

Despite the model chosen as a tool for creating marketingstrategies, the results have to be used with caution andrationality as all models come with risks of not being trulyrepresentative and are flawed in some way.

D. Further Improvements and Research

The report investigated the share of users of e-commerceand one interesting aspect to further research would be theannual turnover of e-commerce. The turnover can be a goodcomplement to this report as the number of users alone cannotshow the whole picture of a market or segment. By having in-formation about both the share of potential consumers and theexpected turnover, companies can form a well-based marketanalysis.

For this report, the data used to build the models wasaggregated. One interesting aspect for further investigationwould be to have non-aggregated data, to create models formore specific groups instead of the more general ones usedin this report. As the study was performed on the market asa whole, with some larger demographic groups, the result canmostly be used to get a general picture of the market, andthe under-laying trends, for the next few years. For a specificcompany, it could be more rewarding to look into their specifictarget groups, if they are more delimited.

Generally, the lack of data is often a problem in thesereports, as the results might be lacking when there are limitedobservations to take into consideration. As this report didnot handle a large data set, it could be interesting to furtherresearch the future trend of e-commerce, using more datapoints and thus being able to predict a, possibly, more accurateforecast.

Moreover, while it is possible to conclude that the models,in general, perform better on a small group of individualsthe measurements are not weighted after the number of datapoints. Since the general models are based on more data,the lower accuracy can just be the result of having moredata points to measure. Or that the fewer data points ofthe demographics make it easier for the models to learn thepattern of the data and not get truly generalized. Thus, oneimprovement to be made and a research subject would beto gather more data from the demographics and measure thedifferences in accuracy compared to the general models.

KTH BACHELOR THESIS REPORT. INDUSTRIAL MANAGEMENT AND ENGINEERING AND COMPUTER SCIENCE. JUNE 2021. 9

Concerning the models, specifically the regression model,the forecasts could be unrealistic (predicting a value over100%). One improvement to be made to get more realisticvalues would be to put a constraint on the model, to approachthe value of 100 asymptotically.

VI. CONCLUSION

Based on the results of the models, the general predictionsshow a declining future trend of e-commerce with fewerindividuals turning to online stores. While the results showone thing, it is also important to take into account the increasein revenue e-commerce experienced during 2020. It wouldthereby not be wise to only look at the results of this study,since the interpretation of declining e-commerce might not bethe reality. Instead, other aspects should be incorporated intomarket analysis, such as the total sales revenue.

The report also shows some evidence of how the results canbe misleading and give a wrongful picture. When investigatingdifferent demographics the majority of the forecasts show theopposite trend from the general forecast, and therefore thegeneral model can be seen as too generalized and poorlyrepresentative. Furthermore, the results from the models showthat there are differences within a certain demographic groupin their approach to e-commerce, which has to be consideredby a firm but are missed in the general forecasts. While somegroups in society have reduced their online shopping, othersincreasingly used the internet for their purchases. For example,many elders made their first purchase online during 2020.During the same period e-commerce for people with Grund-

skola level of education dropped. This shows that differentgroups will react differently to the same events and thereforeit is important to investigate the target audience of a companyinstead of the general picture.

Thus, the conclusion is that while the models can show apossible forecast, more information is needed to conduct amarket analysis.

As for the use of these models to gain knowledge of themarket and use the information for marketing purposes, themodels come with different pros and cons. The decision ofthe model should be based on the available data and thetrade-off between simplicity and complexity. Furthermore,the models seem to be more accurate in their forecasts ofcertain demographics compared to the forecast including theentire population. The conclusion is thereby that the modelsimplemented in this report give more accurate forecasts whenmodelling a certain, smaller target group. Companies canthereby use the models to predict the behaviour of certainsegments of the market.

ACKNOWLEDGMENT

We are grateful for the opportunity to research and inves-tigate the area of e-commerce, which we find interesting andof relevance.

We thank Internetstiftelsen for providing the data from theirannual survey, Svenskarna och internet, and especially oursupervisor Cia Bohlin.

We would also like to thank our supervisors from KTH forthe support they have given us. As well as our peers, thankyou for the reviews.

AUTHOR CONTRIBUTIONS

Disa Nilsson Rojas currently a BSc student of Industrial

Engineering and Management with specialization in computerengineering at KTH. The author contributed mainly to thepolynomial regression and SVR models, as well as part I,IV-VI.

Freja Engstrom currently a BSc student in Industrial En-

gineering and Management with a specialization in computerengineering at KTH. The author contributed mainly to theARIMA model, as well as part I, IV-VI.

REFERENCES

[1] L. Zhou, L. Dai, and D. Zhang, “Online shopping acceptance model-a critical survey of consumer factors in online shopping,” Journal of

Electronic commerce research, vol. 8, no. 1, p. 41, 2007.[2] EuroStat. Share of enterprises’ turnover on e-commerce.

[Online]. Available: https://ec.europa.eu/eurostat/databrowser/view/tin00110/default/line?lang=en

[3] C. Katawetawaraks and C. Wang, “Online shopper behavior: Influencesof online shopping decision,” Asian journal of business research, vol. 1,no. 2, 2011.

[4] InternetStiftelsen, “Svenskarna och internet 2020,” Svenskarna och in-

ternet, pp. 44–55, 2020.[5] H. R. Postnord, Svensk digitalhandel, “E-barometern helarsrapport

2020,” E-barometern, p. 5, 2020. [Online]. Available: https://media.dhandel.se/wl/?id=x8VMpPpkiZRvhD0a75bKwplEPMsAl3gp

[6] U. D. of Economic and S. Affairs. The 17 goals. [Online]. Available:https://sdgs.un.org/goals

[7] L. Mossberg and M. Sundstrom, Marknadsforingsboken. Studentlitter-atur, 2011.

[8] A. Feldmann, “Forelasning 4 - prognoser in me1316,” January 2020.[9] A. Agarwal. Polynomial regression. [Online]. Available: https:

//towardsdatascience.com/polynomial-regression-bbe8b9d97491[10] Abhigyan. An introduction to support vector regression

(svr). [Online]. Available: https://medium.com/analytics-vidhya/understanding-polynomial-regression-5ac25b970e18

[11] C.-J. Lu, T.-S. Lee, and C.-C. Chiu, “Financial time series forecastingusing independent component analysis and support vector regression,”Decision Support Systems, vol. 47, no. 2, pp. 115–125, 2009.[Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167923609000323

[12] B.-J. Chen, M.-W. Chang et al., “Load forecasting using support vectormachines: A study on eunite competition 2001,” IEEE transactions on

power systems, vol. 19, no. 4, pp. 1821–1830, 2004.[13] M. Awad and R. Khanna, Support Vector Regression. Berkeley, CA:

Apress, 2015, pp. 67–80. [Online]. Available: https://doi.org/10.1007/978-1-4302-5990-9 4

[14] S. Prabhakaran. Arima model – complete guide to time series forecastingin python. [Online]. Available: https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/

[15] J. Salvi. Significance of acf and pacf plots in time seriesanalysis. [Online]. Available: https://towardsdatascience.com/significance-of-acf-and-pacf-plots-in-time-series-analysis-2fa11a5d10a8

[16] R. J. Hyndman and G. Athanasopoulos, Forecasting: Principles and

Practice. OTexts: Melbourne, Australia, 2018, ch. 12.7.[17] ——, Forecasting: Principles and Practice. OTexts: Melbourne,

Australia, 2018, ch. 3.4.[18] P. Meesad and R. I. Rasel, “Predicting stock market price using support

vector regression,” in 2013 International Conference on Informatics,

Electronics and Vision (ICIEV), 2013, pp. 1–6.[19] Z. Ma, C. Ye, and W. Ma, “Support vector regression for predicting

building energy consumption in southern china,” Energy Procedia,vol. 158, pp. 3433–3438, 2019, innovative Solutions for EnergyTransitions. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1876610219309762

KTH BACHELOR THESIS REPORT. INDUSTRIAL MANAGEMENT AND ENGINEERING AND COMPUTER SCIENCE. JUNE 2021. 10

[20] U. Thissen, R. van Brakel, A. de Weijer, W. Melssen, andL. Buydens, “Using support vector machines for time seriesprediction,” Chemometrics and Intelligent Laboratory Systems, vol. 69,no. 1, pp. 35–49, 2003. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0169743903001114

[21] Chun-Hsin Wu, Jan-Ming Ho, and D. T. Lee, “Travel-time predictionwith support vector regression,” IEEE Transactions on Intelligent Trans-

portation Systems, vol. 5, no. 4, pp. 276–281, 2004.[22] S. Ho and M. Xie, “The use of arima models for reliability forecasting

and analysis,” Computers Industrial Engineering, vol. 35, no. 1,pp. 213–216, 1998. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0360835298000667

[23] M. Babai, M. Ali, J. Boylan, and A. Syntetos, “Forecasting andinventory performance in a two-stage supply chain with arima(0,1,1)demand: Theory and empirical analysis,” International Journal of

Production Economics, vol. 143, no. 2, pp. 463–471, 2013, focusingon Inventories: Research and Applications. [Online]. Available:https://www.sciencedirect.com/science/article/pii/S0925527311003902

[24] B. Guha and G. Bandyopadhyay, “Gold price forecasting using arimamodel,” Journal of advance Management Journal, 03 2016.

[25] T. Sharp. An introduction to support vector regres-sion (svr). [Online]. Available: https://towardsdatascience.com/an-introduction-to-support-vector-regression-svr-a3ebc1672c2

[26] M. Awad and R. Khanna, Support Vector Machines for Classification.Berkeley, CA: Apress, 2015, pp. 39–66. [Online]. Available: https://doi.org/10.1007/978-1-4302-5990-9 3

KTH BACHELOR THESIS REPORT. INDUSTRIAL MANAGEMENT AND ENGINEERING AND COMPUTER SCIENCE. JUNE 2021. 11

VII. APPENDIX A

Fig. 4. Example of a simple SVR [25]

Fig. 5. ARIMA model with both historic data and forecast plotted [14]

Fig. 6. Example of forecast vs actual values [14]

VIII. APPENDIX B

A. Mean Square Error

E =1

n

nX

i=1

(predii � yi)2 (9)

B. Support Vector Machine

Support Vector Machine (SVM) is a classification modeland performs the task of classifying data points by findingthe optimal hyperplane, i.e decision boundary, in N-dimensionfor separating the data points. The goal is to find a decisionboundary with respect to two criteria: the model shouldgeneralize well on unseen data and thus classify all new datapoints correctly. As well as being the decision boundary thatmaximizes the margin (maximizes the distance between thedata points and the boundary). And while there might bemultiple solutions to a classification problem, there is onlyone solution that fulfils both criteria.

To find this optimal decision boundary, and to be able tosolve problems that are seemingly non-linearly separable whenvisualized, the SVM uses the kernel trick. That is, to map thedata onto a higher dimension. SVM uses the kernel trick be-cause data points that are non-separable in a lower dimensionwill become separable in a higher dimension. Thereby, the

model can always find an optimal decision boundary in somedimension [26].

To increase the generalization capabilities the SVM alsohas an additional parameter C or slack. The slack variableallows for some misclassification which can be good becausethe training data doesn’t always entirely represent theactual distribution of data. This introduces what is calleda soft margin, which is used in cases where some pointsneed to be misclassified to find an optimal solution, i.e.decision boundary. As C increases, the less tolerance formisclassification the model will have. If C instead is small,the more tolerant the model will be.

1) Loss function: L"(f(x), q) =(

|f(x)� q|� ", if |f(x)� q| � "

0, otherwise(10)

Where " represents the the radius around the optimal hyper-plane.

KTH BACHELOR THESIS REPORT. INDUSTRIAL MANAGEMENT AND ENGINEERING AND COMPUTER SCIENCE. JUNE 2021. 12

IX. APPENDIX CA. Polynomial Regression

Group MSE Error %General 173.031 80.34

Age 16-25 7.852 7.7826-35 10.865 9.0036-45 17.125 11.5046-55 41.202 18.8056-65 77.802 27.1566-75 105.813 36.5475+ 3.667 9.06

City Stad 24.222 14.46Landsbyggd 45.625 21.82

Education Grundskola 32.951 20.88Gymnasie 3.086 5.81Hogskola 1.063 2.70

Sector Industri 1.250 3.40Handel 204.5 46.74

Transport 10.25 10.13IT och Kommuikation 14.5 10.64

Finans 6.25 8.18Offentlig forvaltning 2.5 4.38

Skola/Utbildning 65 24.54Vard 0.25 1.14

TABLE IIMSE AND PERCENTAGE ERROR FOR THE REGRESSION MODEL’S

DIFFERENT FORECASTS

Year Prediction W/O 2020 Ground Truth2005 54 54 542006 64 64 642007 75 75 752008 99 99 772009 89 89 792010 94 94 812011 99 99 812012 104 104 842013 108 108 852014 83 83 852015 80 80 792016 77 77 902017 73 73 922018 67 67 922019 89 91 902020 60 83 872021 81 80 -2022 77 76 -

TABLE IIIGENERAL PREDICTION BY THE REGRESSION MODEL

Fig. 7. Forecast with all data points

Below are the different demographics modeled.

Fig. 8. Forecast without 2020

Fig. 9. Forecast over segments based on age

Fig. 10. Forecast over segments based on location of living

Fig. 11. Forecast over segments based on education

KTH BACHELOR THESIS REPORT. INDUSTRIAL MANAGEMENT AND ENGINEERING AND COMPUTER SCIENCE. JUNE 2021. 13

Fig. 12. Forecast over segments based on work sector

B. Support Vector Regression

Fig. 13. Forecast with all data points

Fig. 14. Forecast without 2020

Below are the different demographics modeled.

Fig. 15. Forecast over segments based on age

Fig. 16. Forecast over segments based on location of living

Fig. 17. Forecast over segments based on education

Fig. 18. Forecast over segments based on work sector

Group MSE Error %General 721.046 177.26

Age 16-25 1.124 0.0926-35 0.131 0.5536-45 0.212 0.1146-55 0.393 0.5956-65 2.042 1.5166-75 1.219 2.4875+ 15.708 18.21

City Stad 1.128 0.34Landsbyggd 0.538 0.66

Education Grundskola 9.760 8.80Gymnasie 12.346 8.41Hogskola 0.491 0.17

Sector Industri 1.616 3.43Handel 18.184 14.06

Transport 3.625 5.64IT och Kommuikation 4.447 6.35

Finans 8.191 17.53Offentlig forvaltning 12.553 10.89

Skola/Utbildning 0.630 2.25Vard 1.119 3.40

TABLE IVMSE AND PERCENTAGE ERROR FOR SVRS FORECASTS

KTH BACHELOR THESIS REPORT. INDUSTRIAL MANAGEMENT AND ENGINEERING AND COMPUTER SCIENCE. JUNE 2021. 14

Year Prediction W/O 2020 Ground Truth2005 38 38 542006 29 29 642007 27 27 752008 27 27 772009 27 27 792010 75 75 812011 67 67 812012 59 59 842013 52 52 852014 49 49 852015 83 83 792016 78 78 902017 73 73 922018 67 67 922019 63 63 902020 60 87 872021 83 82 -2022 77 77 -

TABLE VGENERAL PREDICTION BY SVR

C. ARIMA

Fig. 19. Forecast with all data points

Fig. 20. Forecast without 2020

Below are the different demographics modeled.

Fig. 21. Forecast over segments based on age

Fig. 22. Forecast over segments based on location of living

Fig. 23. Forecast over segments based on education

Fig. 24. Forecast over segments based on sector of work

Group MSE Error %General 4.4926 -1.6204

Age 16-25 0.1222 -0.806426-35 0.0995 -0.092636-45 0.3171 -0.039146-55 0.4962 -0.487356-65 0.6411 -0.827766-75 0.6167 -0.320475+ 0.4331 -1.2751

City City 0.2382 0.0173Countryside 0.1542 -0.0037

Education Grundskola 0.1293 -0.3130Gymnasie 0.1036 -0.1430Hogskola 0.0480 -0.0705

Sector Industri 0.0091 -0.0135Handel 0.0965 0.1527

Transport 0.0101 -0.0041Hotell/Restaurang 63.441

IT och Kommuikation 0.0 -2.6471e-11Finans 0.0264 0.0126

Offentlig forvaltning 0.0027 -0.0009Skola/Utbildning 0.0011 -0.0016

Vard 0.0 3.0683e-13Kultur 61.5211Annan 86.9955

TABLE VIMSE AND PERCENTAGE ERROR FOR ARIMA

KTH BACHELOR THESIS REPORT. INDUSTRIAL MANAGEMENT AND ENGINEERING AND COMPUTER SCIENCE. JUNE 2021. 15

Year Prediction W/O 2020 Ground Truth2000 - - 102001 - - 182002 - - 262003 - - 342004 - - 442005 - - 542006 76 75 642007 73 71 752008 77 77 772009 80 81 792010 85 85 812011 87 86 812012 87 83 842013 86 81 852014 84 80 852015 84 83 792016 84 83 902017 88 88 922018 89 89 922019 90 89 902020 88 87 872021 84 84 -2022 79 82 -

TABLE VIIGENERAL PREDICTION BY ARIMA

www.kth.se

TRITA-EECS-EX-2021:370