Università degli Studi di Cagliari, Facoltà di Scienze
Master's Degree in Computer Science (Corso di Laurea Magistrale in Informatica)

Price Probe - Price Forecasting using ARIMA on Amazon's Items

Supervisor: Prof. Diego Reforgiato Recupero
Candidates: Andrea Medda (student ID 65034), Alessio Pili (student ID 65040)

Academic year 2016-2017
Contents
1 Introduction
2 Related Work
3 Data Description
  3.1 Items
  3.2 Manufacturers
  3.3 Categories
  3.4 Categories Items
  3.5 Prices
  3.6 Reviews
  3.7 Trends
  3.8 Currencies
  3.9 Forecasts
4 Data Gathering
  4.1 Amazon
  4.2 Google Trends
  4.3 Currencies
5 Method
  5.1 ARIMA
  5.2 Algorithm
    5.2.1 Preprocessing
    5.2.2 Application
    5.2.3 Results
6 Evaluation
  6.1 Results
  6.2 Comments
7 Conclusion
Chapter 1
Introduction
E-shops have recently been gaining more and more popularity among users. E-shops such as Amazon, eBay, Tesco and others are huge companies that sell all kinds of products, owing their success to online purchasing, home delivery and low prices. In the past ten years one of them, namely Amazon, has grown so impressively that it has built its own delivery and data centers across the world in order to serve its users as well as possible. The main reasons for Amazon's success are the amount of different products it sells, along with its customer care, its advertising, the respect and trust gained among users over the past years, and its prices. The latter are influenced by several factors, some known and some unknown. An item's price can be influenced by inflation, by the amount of sales, by its popularity, by the popularity of its manufacturer, by the popularity of its category, by user reviews, by holiday periods and so on. Some of these factors are unknown, such as the number of items sold, since Amazon does not provide those data in its APIs (Application Programming Interfaces). On the other hand, some of the others are provided by its APIs or can be extracted from its pages using crawlers, scrapers and other techniques. Hence, the data used in this study are limited to those accessible via Amazon's Affiliate APIs and other external data gathered using the latter as a basis. Given that, it is interesting to study how these factors may influence the future price trend of an item and how much each of them impacts it. Since websites such as CamelCamelCamel earn money just by showing a product's trend over time, a (price, date) tuple history, it might be even more interesting to show a prediction of that item's price in the following one or two weeks. Our idea was born in this way. We wondered whether it is possible to perform a forecast on such non-open and heterogeneous data. Since our basic features (date, price) form a time series, we used ARIMA (Autoregressive Integrated Moving Average), which, by fine-tuning its parameters (p, d, q) and taking a training set (for instance 90% of the dataset as training set and the remaining 10% as test set), makes a forecast with a low Mean Absolute Percentage Error score and follows the real price trend. ARIMA depends on different inputs such as the ACF (AutoCorrelation Function) and PACF (Partial AutoCorrelation Function) and works well if the analyzed time series is stationary. If it is not stationary, preprocessing is necessary in order to calculate the proper inputs for the algorithm. Data types and formats (date formats especially) are very important. Our algorithm does not require daily time series entries to work properly. It also works well with several missing daily (price, date) tuples between entries. Of course, as the number of entries per item grows, the forecast becomes more precise. Similar studies on ARIMA and price forecasting have been done in [1, 2, 3, 4, 5, 6, 7], but we have not found any survey about price forecasting on e-shops, especially on Amazon and using exogenous features as the authors of [8] did in their work. Out of all the machine learning and statistical methods that can be used for financial time series forecasting highlighted by the authors in [9, 10], we chose ARIMA because it is well suited to our data types and allows us to work with exogenous variables. Our method can be applied to any kind of item; we just need (item's id, price, date) tuples as input. The rest of the required data is gathered using our suite of crawlers and tools. We are able to collect any sort of new external feature because it is very easy to extend our back-end functionalities by adding new models and REST (REpresentational State Transfer) handlers to it. Our RESTful back-end collects and stores all the data in a relational database, so it is easy to build queries to do a forecast on an item. Moreover, our crawlers and tools are all small, independent microservices which share a common configuration basis, so writing new ones to crawl a new external feature is pretty straightforward. All of the software is easily deployable using Docker Compose. More about Docker can be read in [11]. The rest of the paper is organized as follows. Chapter 2 presents some related work. Chapter 3 presents details of the data used in the experiment. Chapter 4 describes how the data have been gathered. Chapter 5 describes the methodology used. Chapter 6 describes how the results have been obtained, and chapter 7 discusses the experimental results.
Chapter 2
Related Work
"A hybrid ARIMA and support vector machines model in stock price forecasting" [1], "Day-ahead electricity price forecasting using the wavelet transform and ARIMA models" [2], "Application of ARIMA model for forecasting agricultural prices" [3], "Comparative Study of ARIMA Methods for Forecasting Time Series of the Mexican Stock Exchange" [5], "ARIMA forecasting of China's coal consumption, price and investment by 2030" [6] and "Forecasting Energy Consumption of Turkey by Arima Model" [7] contain considerations on the use of ARIMA in contexts that can be considered similar to ours: stock price forecasting, electricity price forecasting, agricultural price forecasting and other related economics problems. "Short-term cloud coverage prediction using the ARIMA time series model" [4] considers the use of ARIMA to forecast cloud coverage based on ground-based cloud images, taking into account the correlation of cloud coverage over continuous time. "Online big data-driven oil consumption forecasting with Google trends" [12] considers the use of ARIMA together with Google Trends, trying to improve forecasting accuracy with it. "Predictive Analysis of E-Commerce Products" [13] considers the use of an ARIMA model together with sentiment analysis to predict a product's success and failure ratio.

"Introduction to Time Series and Forecasting" [14] contains useful considerations on how one should approach time series forecasting, e.g. the importance, for forecasting, of removing trends or seasonality from the series considered. "Comparison of the ARMA, ARIMA, and the autoregressive artificial neural network models in forecasting the monthly inflow of Dez dam reservoir" [15] compares the ARIMA model with the ARMA one and shows how ARIMA has a lower error compared to the latter. "Time Series Analysis, Forecasting and Control" [16] contains the Box and Jenkins method that we used to calculate ARIMA's input parameters (p, d, q); this resource deeply explores and describes how to find these parameters, what they are and what their values mean. "Distribution of the Estimators for Autoregressive Time Series With a Unit Root" [17] explains the Dickey-Fuller test that we used to check whether or not a given time series is stationary. It also discusses the meaning of stationary series, when they can be considered good for statistical analysis and prediction and when they are not. "Likelihood Ratio Statistics for Autoregressive Time Series with a Unit Root" [18] describes the Augmented Dickey-Fuller test, talks more deeply about stationarity and describes how finding ARIMA's parameter d is bound to it. "Vader: A parsimonious rule-based model for sentiment analysis of social media text" [19] describes the VADER sentiment analysis method that we used to perform a polarized sentiment analysis on the contents of item reviews. "Testing autocorrelation and partial autocorrelation: Asymptotic methods versus resampling techniques" [20] describes the ACF, PACF and other techniques used in ARIMA. "Statsmodels: Econometric and statistical modeling with python" [21] describes the Python library that we used to find the PACF and ACF values of a time series in order to find, respectively, ARIMA's p and q parameters. "Mean Absolute Percentage Error for regression models" [22] contains considerations on using MAPE (Mean Absolute Percentage Error) as an evaluation criterion for regression models, along with MSE (Mean Squared Error).

"Multi-variable Echo State Network Optimized by Bayesian Regulation for Daily Peak Load Forecasting" [8] contains considerations on the use of exogenous variables in a prediction model. "Forecasting Base Metal Prices with Commodity Currencies" [23] describes a forecasting method that depends on commodity currencies. "The Competitive Landscape of Mobile Communications Industry in Canada: Predictive Analytic Modeling with Google Trends and Twitter" [24] describes a predictive method based on external variables such as Google Trends and Twitter. "Impact of social influence in e-commerce decision making" [25] highlights how the influence of an e-commerce site's users might help it make decisions. "Popularity effect in user-generated content: Evidence from online product reviews" [26] and "Social media brand community and consumer behavior: Quantifying the relative impact of user- and marketer-generated content" [27] discuss how user-generated content might influence other users' content. "The effects of positive and negative online customer reviews: do brand strength and category maturity matter?" [28] highlights how brand strength and category maturity might influence customer reviews.

"Using Docker: Developing and deploying software with containers" [11] describes Docker, the container engine we used in this study to deploy all of our services. "Python for data analysis: Data wrangling with Pandas, NumPy, and IPython" [29] describes Pandas and NumPy, two Python libraries that we used to manipulate raw data obtained from our database. "The Study of Big Data Analytics in E-Commerce" [30] describes how big data analysis can be used in the e-commerce field to improve business decisions. "Computational Intelligence and Financial Markets: A Survey and Future Directions" [9] and "Surveying stock market forecasting techniques Part II: Soft computing methods" [10] highlight how such predictive methods can be important in financial fields of application.
Chapter 3
Data Description
In this chapter we analyze the data used in the study. Our data are stored in different PostgreSQL tables. We chose PostgreSQL since we heavily rely on relational bindings between different data entities. We have the following tables:
• items 3.1 - stores items' main information
• manufacturers 3.2 - stores manufacturers' information
• categories 3.3 - stores item categories' information
• categories items 3.4 - N:N table for item-category relations
• prices 3.5 - stores prices information
• reviews 3.6 - stores items' reviews information
• trends 3.7 - stores Google Trends information on manufacturers
• currencies 3.8 - stores GBP/EUR (Great Britain Pound / Euro) and GBP/USD (Great Britain Pound / United States Dollar) conversion value information
• forecasts 3.9 - stores final forecasts information
3.1 Items
The Items Table stores different information about each Amazon item. Most of the item's columns are used in table relations and for UI (User Interface) purposes. Below, we describe each column:

• ID (text, Primary Key) - Item's ID; it is also the item's PID on Amazon. Example: B00LMBT0IO
• Manufacturer (text, Foreign Key) - Item's manufacturer. Example: Apple
• URL (text) - Item's URL on Amazon.
• Title (text) - Item's title on Amazon. Example: iPhone 6
• Image (text) - Item image's URL on Amazon.

The item's ID is the Primary Key of the table and it is used to join different tables to obtain the data set used as algorithm input. Items' data have been gathered with the technique described in chapter 4.1.
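The joins mentioned above can be sketched with pandas data frames standing in for query results; column names mirror the tables in this chapter, while the rows are purely illustrative:

```python
import pandas as pd

# Toy stand-ins for rows fetched from the prices, items and trends tables.
prices = pd.DataFrame({
    "item": ["B00LMBT0IO"] * 3,
    "date": ["2016-12-01", "2016-12-02", "2016-12-03"],
    "price": [100.99, 99.50, 101.20],
})
items = pd.DataFrame({"id": ["B00LMBT0IO"], "manufacturer": ["Apple"]})
trends = pd.DataFrame({
    "manufacturer": ["Apple"] * 3,
    "date": ["2016-12-01", "2016-12-02", "2016-12-03"],
    "value": [67.01, 68.40, 66.20],
})

# Join prices to items on the item ID, then attach the manufacturer's
# Google Trends value for each (manufacturer, date) pair.
dataset = (prices
           .merge(items, left_on="item", right_on="id")
           .merge(trends, on=["manufacturer", "date"]))
print(dataset[["item", "date", "price", "value"]])
```

The resulting frame is the kind of per-item (price, date, exogenous feature) data set that is fed to the forecasting algorithm.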
3.2 Manufacturers
The Manufacturers Table stores each manufacturer's name. Below, we describe it:

• Name (text, Primary Key) - Manufacturer's name. Example: Apple

The manufacturers table has been used to fetch Google Trends entries: for each manufacturer we also have its Google Trends history. We collected this information because we wanted to fetch more external data bound to manufacturers. Manufacturers' data have been gathered with the technique described in chapter 4.1.
3.3 Categories
The Categories Table stores each category's name. Below, we describe it:

• Name (text, Primary Key) - Category's name. Example: Cell Phones

The categories table contains the main and child categories of each item. Each item can have 1 to n categories and vice-versa. We collected this information because we wanted to check whether all of the items of a given category had a similar trend over time. Unluckily, ARIMA does not allow us to do so, so we have not used this data yet. Categories' data have been gathered with the technique described in chapter 4.1.
3.4 Categories Items
The Categories Items Table stores (item, category) tuples, since each item may have n categories and vice-versa. Below, we describe it:

• Item (text, Primary Key) - Item's ID. Example: B00LMBT0IO
• Category (text, Primary Key) - Category's name. Example: Cell Phones
3.5 Prices
The Prices Table stores all information about price entries. Below, we describe it:

• ID (uint) - Integer index of an item. Example: 1
• Item (text, Primary Key) - Item's ID. Example: B00LMBT0IO
• Price (double, Primary Key) - Price entry's price on a given date. Example: 100.99
• Date (text, Primary Key) - Price entry's date in text and YYYY-MM-DD format. Example: 2016-12-01
• Flag (bool) - True if the date is in a festivity range. Example: True

We inherited the prices data from an old project. The prices data contain ∼9 million different items and ∼90 million different prices that go from ∼2015 to ∼2016. All of the prices data are expressed as (item, price, date) tuples. Each one of these entries is unique and there is only one entry per day. Prices data may or may not be daily: some items have entries on consecutive days, some others have not. Flag is a boolean that is true if a given date falls in a holiday period and false otherwise. We considered as festivity prices all those with a date in the range that goes from YYYY-11-20 to YYYY-01-10. Price data have been preprocessed with the technique described in chapter 4.1.
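The festivity range above can be computed with a small helper (a sketch; the function name is ours, not from the thesis code). Note that the range crosses the year boundary, running from November 20 to January 10 of the following year:

```python
from datetime import date

def is_festivity(d: date) -> bool:
    """True if d falls in the YYYY-11-20 .. (YYYY+1)-01-10 holiday window."""
    if (d.month == 11 and d.day >= 20) or d.month == 12:
        return True
    return d.month == 1 and d.day <= 10

print(is_festivity(date(2016, 12, 1)))   # True
print(is_festivity(date(2016, 6, 15)))   # False
```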
3.6 Reviews
The Reviews Table stores all information about review entries. Below, we describe it:

• ID (uint, Primary Key) - Integer index of a review. Example: 1
• Item (text) - Item's ID. Example: B00LMBT0IO
• Content (text) - Review comment's text. Example: "Very nice product!"
• Date (text) - Review entry's date in text and YYYY-MM-DD format. Example: 2016-12-01
• Sentiment (double) - Polarized sentiment analysis score of the review's content. Example: 0.88
• Stars (double) - Vote expressed as stars in the (0.0 - 5.0) range. Example: 4.5

Each item may or may not have reviews. If an item has reviews, there can be 1 to n reviews per item on each day. A review's content is a comment posted by a user. Sentiment is a polarized sentiment analysis score computed on that content. We collected this information because we thought reviews might have influenced the price trend over time. The reviews scenario might be a bit tricky: for instance, more experienced users might influence future users' reviews on the same product, or a user review might also be influenced by the manufacturer's popularity and category maturity [28, 27]. More about the latter statement is described by the authors in [26]. A similar work using ARIMA with sentiment analysis has been done in [13] to understand whether or not a product would be successful. For instance, we imagined that many negative reviews in terms of Stars and Sentiment score could lead to a price decrease, since works like [25] highlight how an e-commerce's success might depend on the social influence of its users. We found out that this insight is only partially true: indeed, these scores influence the future trend, but not in a consistent way. Reviews data have been gathered and preprocessed with the technique described in chapter 4.1.
3.7 Trends
The Trends Table stores all information about trend entries. Below, we describe it:

• Manufacturer (text, Primary Key) - Manufacturer's name. Example: Apple
• Date (text, Primary Key) - Trend entry's date in text and YYYY-MM-DD format. Example: 2016-12-01
• Value (double, Primary Key) - Trend's value on a given date, in the (0.0 - 100.0) range. Example: 67.01

For each manufacturer we have Google Trends entries. Essentially, we have a manufacturer's popularity expressed as (value, date) tuples over time. We collected this data starting from manufacturers because we thought that a manufacturer's popularity over time might influence the future trend of a given item. A similar work has been conducted in [12], where Google Trends was used as a feature in ARIMA. We found out that this particular feature greatly influences the prediction. Google Trends outputs a csv in weekly format, although we rely on daily formats; hence, a preprocessing was necessary. This process is described in chapter 4.2.
3.8 Currencies
The Currencies Table stores all information about currencies. Below, we describe it:

• Name (text, Primary Key) - GBP/EURO or GBP/USD. Example: GBP/EURO
• Date (text, Primary Key) - Currency entry's date in text and YYYY-MM-DD format. Example: 2016-12-01
• Value (double, Primary Key) - Currency's value on a given date. Example: 1.12

Each entry has a (value, date) tuple which is the conversion value of GBP over EURO and GBP over USD. It has a daily format. We collected this information because we thought it would be a good external feature, like in the work from the authors of [23], even though experiments highlighted the opposite: for most items, indeed, adding it as an exogenous variable alters the prediction, increasing the error score. The original target was to fetch stock exchange data for each manufacturer, but we found out that most of our items had manufacturers that were not quoted on any stock exchange.
3.9 Forecasts
The Forecasts Table stores all information about forecasts. Below, we describe it:

• ID (uint) - Integer index of a forecast entry. Example: 1
• Name (text, Primary Key) - Comma-separated combination of features used for the forecast. Example: item,price,date
• Item (text, Primary Key) - Item's ID. Example: B00LMBT0IO
• Price (double, Primary Key) - Forecast price on a given date. Example: 11.12
• Date (text, Primary Key) - Forecast entry's date in text and YYYY-MM-DD format. Example: 2016-12-01
• Test Size (text, Primary Key) - Test size used for the forecast: 10%, 20% or 30% of the data set size. Example: 20%
• Score (double) - Mean Absolute Percentage Error score of the item's forecast, in percentage. Example: 2.23

This table contains all the final forecasts performed on an item. The Name column contains all the features used for a forecast, separated by commas; for instance, item,price,date could be a Name, since that combination could have been used to obtain a forecast. The algorithm outputs a series of (price, date) tuples covering a given range of days. These covered days are the same that have been used for the test set. Test Size is the size used for the test set; it also matches the number of predictions made. For instance, if we have 100 (price, date) entries and we decide to use a test size equal to 10%, we will have 90 entries used for the training phase and the remaining 10 entries used to test the prediction and make the forecast on these missing days. Score is the Mean Absolute Percentage Error score: the lower it is, the better the prediction. We will show how this data has been used in chapter 5.
Chapter 4
Data Gathering
This chapter describes how the additional data highlighted in the previous chapter 3 have been collected, processed and stored. The data are stored by a RESTful back-end written in Google's Golang, which is available at the following link: github.com/AndreaM16/go-storm. It listens for crawler requests, parses their request bodies and stores the results in our database. We will describe:
• Amazon 4.1 - covers all of Amazon's data collecting process
• Google Trends 4.2 - describes the Google Trends data collecting process
• Currencies 4.3 - describes how currencies data have been fetched
4.1 Amazon
Concerning Amazon's basic data, as stated in the previous chapter when talking about Prices 3.5, we inherited a MongoDB collection containing ∼9 million items and ∼90 million prices from a previous project. This collection had several issues. Prices were represented in different formats (text, double, uint, ...), so a pre-processing was necessary. Using different tools such as AWK and regular expressions, we were able to parse all of them to a double format. For simplicity, we also formatted all the date entries to the YYYY-MM-DD date format. After that, we uploaded the results to the PostgreSQL database described in the previous chapter 3. Additional data such as manufacturers and reviews have been gathered by writing a microservice that performs API calls to Amazon using the Amazon Affiliate Program API; it can be found at the following link: github.com/AndreaM16/go-amazon-itemlookup. These APIs expose XML bodies describing a given item. Such a body contains different parameters such as the item's manufacturer, categories, reviews page URL and so on. Unluckily, the APIs do not expose any kind of information about reviews but a URL that provides an HTML page containing paginated reviews. It was necessary to scrape this page to get each review's content, stars and date. We found out that for some items the APIs do not return any useful information: for instance, for some items it was impossible to fetch their manufacturer or reviews. The polarized sentiment analysis was performed on each review's content using the VADER sentiment analysis method highlighted by the authors of "Vader: A parsimonious rule-based model for sentiment analysis of social media text", referenced here as [19]. Concerning reviews, as stated in chapter 3.6, for each item we found 0 to n reviews. This is an issue because, most of the time, a merge between the two data frames (item, date, price) and (item, date, sentiment, stars) will have 0 or very few rows. Thus, in order to have a more consistent number of reviews, having at least one entry per day when merging with prices data, we did the following. We have written a small tool, available at the following link github.com/AndreaM16/review-analyzer, that generates a list Rt of unique daily reviews. If a certain day has more than one review, we average their Sentiment and Stars before inserting the resulting entry in Rt. This entry's Sentiment and Stars are calculated following the formulas:
sentiment_r = (1/n) Σ_{i=0}^{n−1} sentiment_{r_i}

stars_r = (1/n) Σ_{i=0}^{n−1} stars_{r_i}
where sentiment_r and stars_r are the averaged values used as the new Sentiment and Stars of the resulting review that will be inserted in the daily reviews list Rt. Rt is a list of reviews on consecutive days, so that date_{r_i} < date_{r_{i+1}} for every i, with n = |Rt| − 1. On top of that, we managed to fill the missing entries between two consecutive reviews r_0 and r_n belonging to Rt by averaging them; the Sentiment and Stars scores of each filled entry are calculated, respectively, as follows:
sentiment_{r_i} = sentiment_{r_0}                              if i = 0
                  (sentiment_{r_{i−1}} + sentiment_{r_n}) / 2  if i ∈ {1, ..., n−1}
                  sentiment_{r_n}                              if i = n

stars_{r_i} = stars_{r_0}                          if i = 0
              (stars_{r_{i−1}} + stars_{r_n}) / 2  if i ∈ {1, ..., n−1}
              stars_{r_n}                          if i = n
Each resulting review is inserted in a final list of reviews on consecutive days, Rf, where date_{r_i} < date_{r_{i+1}} for every i up to |Rf| − 1. In this way, we have a unique review for each date in the range that goes from the very first to the very last of the original input dates. This is done for every two consecutive reviews in Rt.
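The daily averaging and gap-filling rules above can be sketched in Python; this is a simplified re-implementation with illustrative data, not the review-analyzer code itself:

```python
from collections import defaultdict
from datetime import date, timedelta

# Toy input reviews as (date, sentiment, stars) tuples.
reviews = [
    (date(2016, 12, 1), 0.90, 5.0),
    (date(2016, 12, 1), 0.86, 4.0),   # same day: averaged together
    (date(2016, 12, 4), 0.40, 2.0),   # Dec 2 and Dec 3 are missing
]

# Step 1: build Rt, one averaged review per day.
by_day = defaultdict(list)
for d, sentiment, stars in reviews:
    by_day[d].append((sentiment, stars))
rt = sorted(
    (d, sum(s for s, _ in v) / len(v), sum(st for _, st in v) / len(v))
    for d, v in by_day.items()
)

# Step 2: build Rf, filling each gap day with the average of the two
# surrounding reviews' scores, kept constant across the gap.
rf = []
for (d0, s0, st0), (d1, s1, st1) in zip(rt, rt[1:]):
    rf.append((d0, s0, st0))
    gap_s, gap_st = (s0 + s1) / 2, (st0 + st1) / 2
    d = d0 + timedelta(days=1)
    while d < d1:
        rf.append((d, gap_s, gap_st))
        d += timedelta(days=1)
rf.append(rt[-1])

for d, sentiment, stars in rf:
    print(d, round(sentiment, 2), stars)
```

On this toy input, December 1 collapses to an averaged review (sentiment 0.88, stars 4.5), and December 2-3 are filled with the midpoint of the surrounding entries.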
4.2 Google Trends
Google Trends data represent the popularity of a manufacturer over time. Starting from manufacturers, we fetched their Google Trends history from 2015 to about half of 2016. To do so, we built a crawler using Selenium WebDriver + Python and Golang that, for a given manufacturer, looks for its trend, downloads its csv, parses it and posts it to our back-end. We had to use Selenium because Google Trends does not have open APIs to get a given text input's popularity. Moreover, the downloaded csv contains only weekly (value, date) entries, hence one entry per week. Since we need external data to be in daily format (as stated in the previous section 4.1), we had to process it and get an entry for each missing day. This has been achieved using the same approach used to fill missing reviews. Given two consecutive trend entries t_0 and t_n belonging to Tt, so that date_{t_i} < date_{t_{i+1}} for every i, we get the missing days' values with:
v_{t_i} = v_{t_0}                      if i = 0
          (v_{t_{i−1}} + v_{t_n}) / 2  if i ∈ {1, ..., n−1}
          v_{t_n}                      if i = n
Each resulting trend entry v_{t_i} is inserted in a final list of daily trend entries Tf, where date_{t_i} < date_{t_{i+1}} for every i up to |Tf| − 1. In this way, we have a unique trend entry for each date in the range that goes from the very first to the very last of the original input dates. This is done for every two consecutive trend entries in Tt. The result is posted to our back-end as the reputation over time of a given manufacturer. The Selenium + Python part is available at the following link: github.com/AndreaM16/yaggts-selenium. It simply takes a text as input, for instance Apple, and downloads a csv containing the weekly formatted entries. The Golang part, available at the following link github.com/AndreaM16/yaggts, calls the Python part passing an input text to it, waits for the csv to be downloaded, parses and processes it as stated above and posts the result to our back-end.
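The weekly-to-daily expansion can also be expressed compactly with pandas; this is a sketch, not the yaggts code, and the midpoint fill rule maps to averaging forward- and backward-filled values:

```python
import pandas as pd

# Toy weekly Google Trends values for one manufacturer.
weekly = pd.Series(
    [60.0, 70.0],
    index=pd.to_datetime(["2016-01-03", "2016-01-10"]),
)

# Expand to daily frequency; each day between two weekly points gets the
# average of the surrounding weekly values, matching the fill rule above.
daily = weekly.resample("D").asfreq()
daily = daily.fillna((daily.ffill() + daily.bfill()) / 2)
print(daily)
```

Here January 4 through 9 all receive (60 + 70) / 2 = 65, while the original weekly points keep their values.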
4.3 Currencies
We collected this data from Quandl.com. We obtained two different (date, value) series for GBP/EURO and GBP/USD in JSON format. We parsed them and posted the results to our back-end using the tool available at the following link: github.com/AndreaM16/all-hail-gbp.
Chapter 5
Method
This chapter describes the prediction algorithm used and how we customized it to fit our needs. We built a general-purpose algorithm that takes a set of features (basic and exogenous ones), fine-tunes ARIMA's parameters (p, d, q) keeping the combination that outputs the lowest MSE, applies ARIMA to all the possible combinations of those features and stores the best result in terms of lowest MAPE. So, for each item, we only keep the feature combination that gave us the lowest MAPE score. In order, we'll describe:
• ARIMA 5.1 - describes the algorithm and the parameters used for the experiment
• Algorithm 5.2 - describes how ARIMA's parameters have been calculated and how we applied the algorithm to our features
5.1 ARIMA
A time series is a sequence of measurements of the same variable(s) made over time. Usually the measurements are made at evenly spaced times, for example daily or monthly. An ARIMA(p, d, q) (Autoregressive Integrated Moving Average) model, in time series analysis, is a tool that, fitted to time series data, predicts future points in the series. ARIMA is well suited for this use case because it allows us to work not only with (price, date) tuples but also with exogenous features. For instance, we can see whether adding a new exogenous feature improves the overall accuracy or not. This can be useful to understand how the model reacts to different external features and how much they influence the precision of the forecast. Since in our study we found out that an external feature like the item's manufacturer popularity positively influences the forecast, this particular functionality of the algorithm is quite useful. More information on ARIMA can be found in [14, 16]. The model is a combination of three parts: AR, I, MA. These parts correspond to the (p, d, q) parameters of the algorithm. Below, we shortly explain each part:
• AutoRegressive model (AR):
An autoregressive model AR(p) specifies that the output variable depends linearly on its own previous values and on a stochastic term. The order of an autoregressive model is the number of immediately preceding values in the series that are used to predict the value at the present time. The notation AR(p) indicates an autoregressive model of order p. The AR(p) model is defined as

X_t = c + Σ_{i=1}^{p} φ_i X_{t−i} + ε_t

where φ_1, ..., φ_p are the parameters of the model, c is a constant and ε_t is white noise.
• Integrated (I):
Degree of differencing of order d of the series. Differencing in statistics is a transformation
applied to time series data in order to make it stationary. A stationary time series’ properties
do not depend on the time at which the series is observed. In order to difference the data, the
difference between consecutive observations is computed. Mathematically, this is shown as
y'_t = y_t − y_{t−1}
Differencing removes the changes in the level of a time series, eliminating trend and seasonality
and consequently stabilizing the mean of the time series.
• Moving Average model (MA):
In time series analysis, the moving-average MA(q) model is a common approach for modeling
univariate time series. The moving-average model specifies that the output variable depends
linearly on the current and various past values of a stochastic term. The notation MA(q)
refers to the moving-average model of order q. The MA(q) model is defined as:
$$X_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}$$
where $\mu$ is the mean of the series, $\theta_1, \dots, \theta_q$ are the parameters of the model and $\varepsilon_t$ is white
noise.
5.2 Algorithm
This section covers how the basic ARIMA parameters (p, d, q) have been calculated, how we
applied the method using our features, and how we obtained our best results. All the code
covering these steps is available at github.com/AndreaM16/price-probe-ml and is written using
Pandas and NumPy as explained by the authors in [29]. We'll describe:
• Preprocessing 5.2.1 - describes how ARIMA's parameters have been calculated, and explains how
we decide whether they are optimal for the time series
• Application 5.2.2 - describes how ARIMA has been applied, taking as input the parameters found
in the previous sub-section
• Results 5.2.3 - describes how the best results have been found by applying the previous sub-section
5.2.1 Preprocessing
These steps are mandatory to obtain a good prediction with a high accuracy. To start
this phase, it is necessary to have a properly formatted input time series. We used a time series
formed by (price, date) tuples with consecutive daily entries. We used the latter to find out whether a
given time series is stationary (d) and the size of the clusters of correlated (q) and partially correlated
entries (p). In short, if a time series is stationary, we do not need to perform any extra differencing
to obtain d, so it will be equal to 0; if the series is not stationary, some more calculations
are necessary to get it. p and q, in short, tell us how big the clusters of entries with a similar
behavior over time are. Parameter estimation can be done following the Box-Jenkins
method [16] for finding the best fit in a time series using ARIMA:
• Model identification: The first step is to check the stationarity of the series, because the
ARIMA model needs a stationary series to work properly. Based on the work in [17] and [18], we
used the augmented Dickey-Fuller test from the Python statsmodels package, highlighted by the
authors of Statsmodels: Econometric and statistical modeling with python, referenced here
[21]. A time series has stationarity if a shift in time doesn't cause a change in the shape of
the distribution; unit roots are one cause of non-stationarity.
Figure 5.1: Example of a stationary series.
A unit root is a stochastic trend in a time series. If a time series has a unit root, it shows a
systematic pattern that is unpredictable. The augmented Dickey-Fuller test is built with the
null hypothesis that there is a unit root. The null hypothesis can be rejected if the p-value of
the test result is less than 5%. Furthermore, we also checked that the Dickey-Fuller statistic
is more negative than the associated t-distribution critical value; the more negative the
statistic, the more strongly we can reject the hypothesis that there is a unit root. If the
test result shows that we can't reject the hypothesis, we have to difference the series and
repeat the test. Usually, having to difference a series more than twice means that the series is
not a good fit for ARIMA.
Figure 5.2: Example of a non-stationary series.
• Parameters estimation: In time series analysis, the Partial AutoCorrelation Function
(PACF) gives the partial correlation of a time series with its own lagged values, controlling
for the values of the time series at all shorter lags. It contrasts with the AutoCorrelation
Function (ACF), which does not control for other lags. These functions play an important
role in time series analysis, helping to identify the extent of the lag in an autoregressive model
(PACF) and in a moving average model (ACF). More information about p and q estimation
and their importance is highlighted by the authors in [20].
Figure 5.3: Example of an ACF plot. This plot shows spikes for lag values less than 4, based on a 95% confidence criterion, so we choose a q value of 4 as the upper bound when searching for the best (p, d, q) configuration.
The use of these functions was introduced as part of the Box-Jenkins approach to time series
modeling: by computing the partial autocorrelation function one can determine the appropriate
lag p in an ARIMA (p, d, q) model, and by computing the autocorrelation function one
can determine the appropriate lag q. The partial autocorrelation
of an AR(p) process becomes zero at lag p+1 and greater, so we examine the sample
partial autocorrelation function to see if there is evidence of a departure from zero. This is
usually determined by placing a 95% confidence interval on the sample partial autocorrelation
plot. If the software program does not generate the confidence band, it is approximately
±2/√N, with N denoting the sample size. The autocorrelation function of an MA(q) process
becomes zero at lag q+1 and greater, so we examine the sample autocorrelation function to
see where it essentially becomes zero. We do this by placing the 95% confidence interval for
the sample autocorrelation function on the sample autocorrelation plot.
Figure 5.4: Example of a PACF plot. This plot shows spikes for lag values less than 2, based on a 95% confidence criterion, so we choose a p value of 2 as the upper bound when searching for the best (p, d, q) configuration.
• Model evaluation: To determine the best ARIMA model for every Amazon item, we
always use these steps:
– check stationarity with the ADF test, which is also useful to find an appropriate d value
– find p and q based on the PACF and ACF. The results of the ACF and PACF give us an upper
bound for iterating the fitting of the model, keeping the best (p, d, q) combination based
on the lowest Mean Squared Error value, defined as:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{Y}_i - Y_i\right)^2$$
where $\hat{Y}$ is a vector of n predictions, and $Y$ is the vector of observed values of the
variable being predicted.
5.2.2 Application
This subsection describes how we applied ARIMA to our study. As stated in the previous chapters,
we have different features (in time series format) that can be considered exogenous variables
for ARIMA. Each one of them has the date field that is mandatory to merge the Data
Frames containing this information. Essentially, we merge the exogenous data with the
basic one on the date key. Basic Data is formed by (price, date) rows; the exogenous ones have
a different value but always have a date column, so an exogenous row looks like a (date, value)
entry, where value is a float or a list of floats. For instance, when merging Basic Data with Google
Trends Data we merge (date, price) with (date, popularity), obtaining (date,
popularity, price) rows. Our algorithm is very specialized since, for each item:
• ARIMA's parameters - Calculate (p, d, q) as stated in 5.2.1
• Data Retrieving - Retrieve the item's external data. For instance, if the item has Google Trends
entries for its Manufacturer, we retrieve them and add them to the current external features
• Model Fit - Fit a model for each combination of the item's available basic and external features.
This also depends on the test size used
• Results - Compare the results obtained in the previous steps and select only the best one in terms
of the lowest MAPE produced by the difference between the original trend and the forecast under
consideration
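The date-keyed merge described above is a standard pandas join; a small sketch with hypothetical column values:

```python
import pandas as pd

# Basic Data: (date, price) rows
basic = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03"]),
    "price": [19.99, 18.49, 18.49],
})

# Exogenous data: (date, value) rows, here Google Trends popularity
trends = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03"]),
    "popularity": [55, 61, 58],
})

# Inner-join on the shared date key:
# (date, price) + (date, popularity) -> (date, price, popularity) rows
merged = basic.merge(trends, on="date", how="inner")
print(list(merged.columns))  # ['date', 'price', 'popularity']
```

The merged frame then provides the endogenous column (price) and the exogenous columns for ARIMA.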
The latter points can be formalized as follows. Given an item i, we retrieve its available data
(features) $F_i$, which can be split into two subsets of features such that:
$$F_{B_i} \subseteq F_i, \qquad F_{E_i} \subseteq F_i$$
$$F_{B_i} \cap F_{E_i} = \{\text{date}\}, \qquad F_{B_i} \cup F_{E_i} = F_i$$
$$F_{B_i} = \{\text{date}, \text{price}\}, \qquad F_{E_i} \in \mathcal{P}\big((F_i \setminus F_{B_i}) \cup \{\text{date}\}\big)$$
We have a function P that takes $F_{B_i}$ and a subset $f_{E_i}$ of exogenous features as inputs and
returns a new set of features containing all the elements of $F_{B_i}$ plus that combination of
exogenous features (possibly $\emptyset$). The function C returns all the possible combinations of
features in $F_{E_i}$, and $F_{T_i}$ collects every resulting feature set:
$$C(F_{E_i}) = \{f_{E_i} : f_{E_i} \in \mathcal{P}(F_{E_i})\}$$
$$P(F_{B_i}, f_{E_i}) = F_{B_i} \cup f_{E_i}$$
$$F_{T_i} = \{P(F_{B_i}, f_{E_i}) : f_{E_i} \in C(F_{E_i})\}$$
Having obtained all the feature combinations $F_{T_i}$, we proceed to fit the ARIMA model with each
one of them, coupling each fitted model with its MAPE and saving these results as a map (Map[Fitted Model,
MAPE Score]) defined as $m[M_{T_i}, E_{T_i}]$. Given $O_i$, the real trend described by $F_{B_i}$, we can
define the following pseudo-golang code:
// stats is an example helper package
import "stats"

const originalTrend = Oi

func getARIMAForecastByFitModel(m FittedModel) Forecast {
    return stats.GetARIMAForecastFromFittedModel(m)
}

func getMape(f Forecast) Score {
    return stats.GetMapeScoreFromTrend(originalTrend, f)
}

func main() {
    featureCombinations := FTi
    fitModelScores := make(map[[]Feature]Score)
    for _, features := range featureCombinations {
        currFitModel := fit(features)
        currForecast := getARIMAForecastByFitModel(currFitModel)
        fitModelScores[features] = getMape(currForecast)
    }
}
So we'll end up having m, that is fitModelScores in the pseudo-code above, containing as key the
fitted model of a given combination of features and as value the MAPE Score, defined as follows:
$$\mathrm{MAPE} = \frac{100}{n}\sum_{j=1}^{n}\left|\frac{O_{t_j} - R_{t_j}}{O_{t_j}}\right|$$
Given a real price trend $O_t$ and a forecast $R_t$, it takes the difference at each pair of points with
the same date field, accessing them with an index j, sums all of these relative differences and
returns the percentage error of the average difference between all the points considered. More
about MAPE can be read in [22].
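The MAPE score above can be computed in a few lines; a sketch (the thesis implementation may differ):

```python
def mape(original, forecast):
    """Mean Absolute Percentage Error between two equally long,
    date-aligned series (real trend O and forecast R)."""
    assert len(original) == len(forecast)
    n = len(original)
    return 100.0 / n * sum(abs((o - r) / o) for o, r in zip(original, forecast))

# Errors of 10%, 10% and 0% average to a 6.67% MAPE
print(round(mape([100.0, 200.0, 400.0], [110.0, 180.0, 400.0]), 2))  # 6.67
```

Note that MAPE is undefined when a real price is 0, which cannot happen for Amazon prices.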
5.2.3 Results
In order to obtain the final result, we simply look for the entry in m having the lowest value:
the key such that m[key] is the minimum value is the best combination of features for the
item i.
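Picking the best combination is then a single min over the map; sketched with a plain dict and hypothetical scores:

```python
# Hypothetical MAPE scores per feature combination for one item
scores = {
    ("price", "date"): 2.77,
    ("price", "date", "trend"): 1.29,
    ("price", "date", "trend", "flag"): 0.43,
}

# The key with the minimum MAPE is the best feature combination for the item
best = min(scores, key=scores.get)
print(best)  # ('price', 'date', 'trend', 'flag')
```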
Chapter 6
Evaluation
In this section, we'll describe the results obtained and our considerations on them:
• Results 6.1
• Comments 6.2
6.1 Results
For experimental purposes we considered the 1000 items with the most price entries. We did this for two
reasons. The first one is that we are trying to emulate daily price tracking, so items with
more prices reflect this scenario more closely. The second reason is that the process is computationally
expensive and its run time depends on how many features are involved and on the p, d, q values. We
used three different test sizes: 10%, 20% and 30%. The test size also reflects the number of predictions
made. So, in a set of 100 prices, if we have a test set size of 10%, we will use 90 of them to
train the model, perform 10 forecasts on the same days excluded from the data set and
then calculate the MAPE with them. At the moment, we are excluding the Categories feature
because ARIMA is not well suited to performing clustered analysis on items having one or more common
categories. We also excluded currencies because we noticed that this feature negatively influenced
each prediction; it seems that price trends and currencies are not bound. Below we report a table
containing how many times a feature has been found and used for each item:
Table 6.1: Items and Features Numbers

Features Combination                               | N. of Items having these features | % over 1000 items
price, date                                        | 1000 | 100%
price, date, manufacturer                          | 360  | 36%
price, date, manufacturer, trend                   | 230  | 23%
price, date, sentiment                             | 552  | 55.2%
price, date, stars                                 | 552  | 55.2%
price, date, sentiment, stars                      | 552  | 55.2%
price, date, manufacturer, stars                   | 331  | 33.1%
price, date, manufacturer, sentiment               | 331  | 33.1%
price, date, manufacturer, sentiment, stars        | 331  | 33.1%
price, date, manufacturer, trend, stars            | 211  | 21.1%
price, date, manufacturer, trend, sentiment        | 211  | 21.1%
price, date, manufacturer, trend, sentiment, stars | 211  | 21.1%
Then, we report three tables containing, for each combination of features, the number of items on
which it is available and its average MAPE for each test size used:
Table 6.2: Test Size 10%

Features Combination                       | N. of items having this combination | Average MAPE
price, date                                | 1000 | 2.31%
price, date, flag                          | 1000 | 2.31%
price, date, trend                         | 330  | 1.98%
price, date, trend, flag                   | 330  | 1.98%
price, date, sentiment                     | 40   | 1.83%
price, date, stars                         | 40   | 1.83%
price, date, flag, sentiment               | 40   | 1.83%
price, date, flag, stars                   | 40   | 1.83%
price, date, flag, sentiment, stars        | 40   | 1.83%
price, date, sentiment, stars              | 40   | 1.83%
price, date, trend, sentiment              | 30   | 3.59%
price, date, trend, stars                  | 30   | 3.59%
price, date, trend, sentiment, stars       | 30   | 3.59%
price, date, trend, sentiment, flag        | 30   | 3.59%
price, date, trend, stars, flag            | 30   | 3.59%
price, date, trend, flag, sentiment, stars | 30   | 3.59%
Table 6.3: Test Size 20%

Features Combination                       | N. of items having this combination | Average MAPE
price, date                                | 1000 | 9.56%
price, date, flag                          | 1000 | 9.56%
price, date, trend                         | 330  | 5.69%
price, date, trend, flag                   | 330  | 5.69%
price, date, sentiment                     | 40   | 7.69%
price, date, stars                         | 40   | 7.69%
price, date, flag, sentiment               | 40   | 7.69%
price, date, flag, stars                   | 40   | 7.69%
price, date, flag, sentiment, stars        | 40   | 7.69%
price, date, sentiment, stars              | 40   | 7.69%
price, date, trend, sentiment              | 30   | 11.37%
price, date, trend, stars                  | 30   | 11.37%
price, date, trend, sentiment, stars       | 30   | 11.37%
price, date, trend, sentiment, flag        | 30   | 11.37%
price, date, trend, stars, flag            | 30   | 11.37%
price, date, trend, flag, sentiment, stars | 30   | 11.37%
Table 6.4: Test Size 30%

Features Combination                       | N. of items having this combination | Average MAPE
price, date                                | 1000 | 9.25%
price, date, flag                          | 1000 | 9.25%
price, date, trend                         | 330  | 6.64%
price, date, trend, flag                   | 330  | 6.64%
price, date, sentiment                     | 40   | 7.9%
price, date, stars                         | 40   | 7.9%
price, date, flag, sentiment               | 40   | 7.9%
price, date, flag, stars                   | 40   | 7.9%
price, date, flag, sentiment, stars        | 40   | 7.9%
price, date, sentiment, stars              | 40   | 7.9%
price, date, trend, sentiment              | 30   | 10.65%
price, date, trend, stars                  | 30   | 10.65%
price, date, trend, sentiment, stars       | 30   | 10.65%
price, date, trend, sentiment, flag        | 30   | 10.65%
price, date, trend, stars, flag            | 30   | 10.65%
price, date, trend, flag, sentiment, stars | 30   | 10.65%
Next, we report three tables containing, for each test size, on how many items a combination of
features has been the best, and its average MAPE score over that number of items:
Table 6.5: Test Size 10%

Features Combination     | N. of Items where this combination has been the best | Average MAPE of this combination
price, date              | 830 | 2.77%
price, date, trend       | 110 | 1.29%
price, date, flag        | 30  | 0%
price, date, trend, flag | 30  | 0.43%
Table 6.6: Test Size 20%

Features Combination     | N. of Items where this combination has been the best | Average MAPE of this combination
price, date              | 830 | 10.59%
price, date, trend       | 130 | 4.85%
price, date, flag        | 40  | 0%
Table 6.7: Test Size 30%

Features Combination     | N. of Items where this combination has been the best | Average MAPE of this combination
price, date              | 850 | 9.97%
price, date, trend       | 110 | 6.14%
price, date, flag        | 40  | 0%
In the tables above, we have 0% scores because those items have a constant price history (the
price value never changes over time).
Next, we report a table containing the averages of the tables above over their common feature
combinations:
Table 6.8: Average Best Scores on different Test Sizes

Features Combination | N. of Items where this combination has been the best | Average MAPE of this combination
price, date          | 837 | 7.77%
price, date, trend   | 116 | 4.1%
price, date, flag    | 33  | 0%
As highlighted in the tables above, in most cases the smaller the test size, the more accurate
the score. We noticed that the (price, date, trend) feature combination is the most
promising, having an overall lower Average MAPE. It is interesting to analyze, among all the items
having a Manufacturer, how many have Google Trends entries and, of the latter, how many have
(price, date, trend) as their best combination for each test size used:
Table 6.9: Manufacturer and Google Trends based predictions

Test Size | N. of items having a Manufacturer | N. of items having Google Trends entries | Times (price, date, trend) has been a best combination on them
10%       | 360 | 230 | 140
20%       | 360 | 230 | 130
30%       | 360 | 230 | 100
As stated in chapter 4, unfortunately it was impossible to retrieve Google Trends entries for some
Manufacturers; otherwise we would surely have had more of them. It is interesting to notice
that, over 230 items, the average number of items having this best feature combination
is 123. Therefore, on 53.47% of the items having both Manufacturer and Google Trends entries,
(price, date, trend) is the best configuration. This highlights that Popularity consistently
influences the prediction.
For completeness, we present a full example of the results obtained for a single item having trend,
stars and sentiment entries, for each test size. We describe three graphs, one for each test
size used, comparing the real price to the forecasted one on the same days. This particular item
has (date, price, trend) as its best feature combination:
Figure 6.1: This plot shows a simple time series of an item, with the price changing over time. Besides some spikes, the overall price stays between 260 and 280 £.
Figure 6.2: This plot shows the time series above compared with the forecast made with a test size of 10%. It is notable how close the forecast is to the real price.
6.1. Results 37
Figure 6.3: This plot shows the time series above compared with the forecast made with a test size of 20%. The forecast is still very close to the real price; we can see the forecast value decreasing in a similar way to the real one.
Figure 6.4: This plot shows the time series above compared with the forecast made with a test size of 30%. The forecast is not as close to the real price as with the test sizes above, because the algorithm learns from less data.
Below we report the same charts described above, normalized to a 0-100 range, and we compare
both the real trend and the forecast with their Google Trends history on the same days:
Figure 6.5: This plot shows the normalized price and forecast compared with the item's manufacturer popularity, with a test size of 10%. As we can see, the normalized values are very close to the trend ones.
Figure 6.6: This plot shows the normalized price and forecast compared with the item's manufacturer popularity, with a test size of 20%. As we can see, the normalized values eventually return close to the trend ones.
Figure 6.7: This plot shows the normalized price and forecast compared with the item's manufacturer popularity, with a test size of 30%. As we can see, the normalized values eventually return close to the trend ones, with a decreasing price over an increasing popularity.
6.2 Comments
In this section we comment on the results obtained in Section 6.1. In Table
6.1 it's clear that we had a consistent number of different feature combinations. Overall, we had
a consistent share of (price, date, stars) and (price, date, sentiment) combinations, at 55.2%. As stated in
chapters 3 and 4.1, sometimes the Amazon Affiliate APIs did not respond or did not have any information on
reviews, so having at least 55.2% of items with such features is a good percentage. The same goes
for Google Trends Data: we have 23% of items with these entries because, sometimes, the Amazon
Affiliate APIs did not respond or didn't have any information about an item's manufacturer and,
with no manufacturer, it is impossible to get any Google Trends history for that item. In Tables 6.2,
6.3 and 6.4 we show how the MAPE score changes based on the test size (10%, 20% and
30%) and on the feature combination used. In Table 6.2 we can see that the lowest average MAPE
score is bound to the flag feature. As stated in chapter 3, the flag is a boolean that is true if a given
date lies within a range of festivity days. In the other two related tables, 6.3 and 6.4, we can see how
the best result is bound to the trend feature. Also in 6.2 we can see that trend gives a very good
score. In these tables, we can see that reviews do not influence the trend in a consistent way, even
though they give a pretty good score. In Tables 6.5, 6.6 and 6.7, we highlighted the best feature
combinations in terms of the number of times each has been the best and its average MAPE score.
In the latter ones, we can notice that the basic configuration is the most used, but this is justified
by the fact that we did not have many Google Trends entries for these items. Indeed, we can see
how the trend feature positively influences the score in most of the cases where it is present for an
item. We can also notice that (price, date, flag) has a 0% score, but this is justified by the fact that
those items had a constant price history; they always had the same price over time. It also emerges
that the feature combinations containing review features have never been a best configuration. This
last evidence strengthens our previous statement that the trend is most likely not influenced by
reviews. In Table 6.8 we highlight that the average of the previously considered tables still shows
that the trend feature leads to a high accuracy. In Table 6.9 we show that for 64% of the items
having a manufacturer it was possible to retrieve their Google Trends history. We also show for
how many of the latter ones (price, date, trend) was the best combination; it is clearly a good
number of entries. Next, we analyze an item that had entries for all features. In Figure 6.1 we
draw a given item's price trend over time; it has a fairly non-linear pattern. In Figures 6.2, 6.3
and 6.4 we can see the real trend compared to the best forecast, made with a (price, date, trend)
feature combination. It's clear that, as the test size increases, the MAPE score also increases. In
Figures 6.5, 6.6 and 6.7 we compare the real trend, the prediction based on the test size used and
the Google Trends history over time. All the entries have been normalized to a range from 0 to 100,
since Google Trends entries are on that range. We can see that the two trends have a pretty similar
behavior and, on small test sizes, they are pretty close, while in the last one the price decreases as
the Google Trends score reaches its maximum value.
Chapter 7
Conclusion
The number of online sales is growing quickly, and customers do not have a clear idea of how
prices are influenced aside from sale periods. Having a way to predict such behavior would help
customers make better choices regarding which marketplace to use to purchase a certain product,
or in which period. We have shown how Price Probe predicts prices on Amazon, one of the biggest
E-Shop players, over a remarkable time span and with high precision. Using ARIMA
with proper external features and fine-tuned parameters leads to a high accuracy. Our method
highly depends on how the external features have been chosen and collected. As stated in the paper,
working with such closed data is very expensive. Given that, the results could surely be more accurate
with more resources, which could be used to crawl items so as to have a daily (price, date) tuple for
each one of them. Indeed, a daily tracking of items would improve our results, as highlighted
in the previous chapter. Without a shadow of a doubt, it would also be interesting to use more
external features about a product's popularity over time (for instance the popularity of a specific
item, e.g. the iPhone, on Twitter), since we highlighted how significantly Google Trends influences
our results; similar studies already exist, such as the one by the authors of [24]. The future
of purchases relies on online marketplaces, and Price Probe lays the first stone in predicting when
and where it is more profitable to purchase a product.
Bibliography
[1] Pai PF, Lin CS. A hybrid ARIMA and support vector machines model in stock price fore-
casting. Omega. 2005;33(6):497 – 505. Available from: http://www.sciencedirect.com/
science/article/pii/S0305048304001082.
[2] Conejo AJ, Plazas MA, Espinola R, Molina AB. Day-ahead electricity price forecasting us-
ing the wavelet transform and ARIMA models. IEEE Transactions on Power Systems. 2005
May;20(2):1035–1042.
[3] Jadhav V, Chinnappa Reddy BV, Gaddi GM. Application of ARIMA model for forecasting
agricultural prices. 2017 01;19:981–992.
[4] Wang Y, Wang C, Shi C, Xiao B. Short-term cloud coverage prediction using the ARIMA
time series model. Remote Sensing Letters. 2018;9(3):274–283.
[5] Rangel-Gonzalez JA, Frausto-Solis J, Javier Gonzalez-Barbosa J, Pazos-Rangel RA, Fraire-
Huacuja HJ. In: Castillo O, Melin P, Kacprzyk J, editors. Comparative Study of ARIMA
Methods for Forecasting Time Series of the Mexican Stock Exchange. Cham: Springer
International Publishing; 2018. p. 475–485. Available from: https://doi.org/10.1007/
978-3-319-71008-2_34.
[6] Jiang S, Yang C, Guo J, Ding Z. ARIMA forecasting of China's coal consumption, price and
investment by 2030. Energy Sources, Part B: Economics, Planning, and Policy. 2018;0(0):1–6.
[7] Ozturk S, Ozturk F, et al. Forecasting Energy Consumption of Turkey by Arima Model.
Journal of Asian Scientific Research. 2018;8(2):52–60.
[8] Niu D, Ji L, Xing M, Wang J. Multi-variable Echo State Network Optimized by Bayesian
Regulation for Daily Peak Load Forecasting. JNW. 2012;7:1790–1795.
[9] Cavalcante RC, Brasileiro RC, Souza VLF, Nobrega JP, Oliveira ALI. Computational In-
telligence and Financial Markets: A Survey and Future Directions. Expert Systems with
Applications. 2016;55:194 – 211. Available from: http://www.sciencedirect.com/science/
article/pii/S095741741630029X.
[10] Atsalakis GS, Valavanis KP. Surveying stock market forecasting techniques Part II: Soft com-
puting methods. Expert Systems with Applications. 2009;36(3, Part 2):5932 – 5941. Available
from: http://www.sciencedirect.com/science/article/pii/S0957417408004417.
[11] Mouat A. Using Docker: Developing and deploying software with containers. O'Reilly Media,
Inc.; 2015.
[12] Yu L, Zhao Y, Tang L, Yang Z. Online big data-driven oil consumption forecasting with
Google trends. International Journal of Forecasting. 2018;Available from: http://www.
sciencedirect.com/science/article/pii/S016920701730136X.
[13] Tuladhar JG, Gupta A, Shrestha S, Bania UM, Bhargavi K. Predictive Analysis of E-Commerce
Products. In: Bhalla S, Bhateja V, Chandavale AA, Hiwale AS, Satapathy SC, editors. Intel-
ligent Computing and Information and Communication. Singapore: Springer Singapore; 2018.
p. 279–289.
[14] Brockwell PJ, Davis RA. In: Introduction. Cham: Springer International Publishing; 2016. p.
1–37. Available from: https://doi.org/10.1007/978-3-319-29854-2_1.
[15] Valipour M, Banihabib ME, Behbahani SMR. Comparison of the ARMA, ARIMA, and the au-
toregressive artificial neural network models in forecasting the monthly inflow of Dez dam reser-
voir. Journal of Hydrology. 2013;476:433 – 441. Available from: http://www.sciencedirect.
com/science/article/pii/S002216941200981X.
[16] Box GEP, Jenkins G. Time Series Analysis, Forecasting and Control. Holden-Day, Incorpo-
rated; 1990.
[17] Dickey DA, Fuller WA. Distribution of the Estimators for Autoregressive Time Series With a
Unit Root. Journal of the American Statistical Association. 1979;74(366):427–431. Available
from: http://www.jstor.org/stable/2286348.
[18] Dickey DA, Fuller WA. Likelihood Ratio Statistics for Autoregressive Time Series with a Unit
Root. Econometrica. 1981;49(4):1057–1072. Available from: http://www.jstor.org/stable/
1912517.
[19] Gilbert CHE. Vader: A parsimonious rule-based model for sentiment analysis of social media
text. In: Eighth International Conference on Weblogs and Social Media (ICWSM-14). Available
at (20/04/16) http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf; 2014.
[20] Ke Z, Zhang ZJ. Testing autocorrelation and partial autocorrelation: Asymptotic methods
versus resampling techniques. British Journal of Mathematical and Statistical Psychology.
2018;71(1):96–116. Available from: http://dx.doi.org/10.1111/bmsp.12109.
[21] Seabold S, Perktold J. Statsmodels: Econometric and statistical modeling with python. In:
9th Python in Science Conference; 2010. .
[22] de Myttenaere A, Golden B, Grand BL, Rossi F. Mean Absolute Percentage Error for regression
models. Neurocomputing. 2016;192:38 – 48. Advances in artificial neural networks, machine
learning and computational intelligence. Available from: http://www.sciencedirect.com/
science/article/pii/S0925231216003325.
[23] Pincheira P, Hardy N. Forecasting Base Metal Prices with Commodity Currencies. University
Library of Munich, Germany; 2018. 83564. Available from: https://ideas.repec.org/p/
pra/mprapa/83564.html.
[24] Szczech M, Turetken O. In: Deokar AV, Gupta A, Iyer LS, Jones MC, editors. The Competi-
tive Landscape of Mobile Communications Industry in Canada: Predictive Analytic Modeling
with Google Trends and Twitter. Cham: Springer International Publishing; 2018. p. 143–162.
Available from: https://doi.org/10.1007/978-3-319-58097-5_11.
[25] Kim Y, Srivastava J. Impact of social influence in e-commerce decision making. In: Proceedings
of the ninth international conference on Electronic commerce. ACM; 2007. p. 293–302.
[26] Goes PB, Lin M, Au Yeung Cm. Popularity effect in user-generated content: Evidence from
online product reviews. Information Systems Research. 2014;25(2):222–238.
[27] Goh KY, Heng CS, Lin Z. Social media brand community and consumer behavior: Quantifying
the relative impact of user-and marketer-generated content. Information Systems Research.
2013;24(1):88–107.
[28] Ho-Dac NN, Carson SJ, Moore WL. The effects of positive and negative online customer re-
views: do brand strength and category maturity matter? Journal of Marketing. 2013;77(6):37–
53.
[29] McKinney W. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython.
O'Reilly Media, Inc.; 2012.
[30] Pavithra B, Dr Niranjanmurthy M, Kamal Shaker J, Martien Sylvester Mani F. The Study of
Big Data Analytics in E-Commerce. International Journal of Advanced Research in Computer
and Communication Engineering; 2016.