DECONSTRUCTING GOOGLE’S PENGUIN 2.0 | WHITEPAPER
34 Paradise Road | Richmond Upon Thames | United Kingdom | TW9 1SE
TEL// +44 (0) 844 264 2960
VISIT// mathsight.org
White Paper: Deconstructing Google’s Penguin 2.0
Produced by MathSight
This paper identifies how shifts in traffic sourced by Google’s search engine can be
related to the structural and content-based features of a company’s web pages.
We performed this analysis in order to extract potentially useful insights for the
following groups:
A. Marketing agencies
B. The wider online SEO community
C. Online businesses
D. Those with an interest in big data and analytics
Introduction
We decided to apply our machine learning led predictive SEO models to deconstruct
the Google Penguin 2.0 algorithm update, rolled out on 19th May 2013.
Although it is nearly impossible to reverse engineer a complete search engine algorithm
such as Google’s, it is possible to identify the likely causes of a change in ranking
behaviour when one occurs. We look for a step change in Google-sourced traffic — an
underlying increase or decrease that stands out from normal fluctuation — resulting from
an algorithm alteration, such as the recent Penguin 2.0 update.
What we did
Once the Google search traffic dataset for our chosen group of web domains had been
obtained from website analytics, de-seasonalised and filtered, the first step in the
reverse engineering process was to confirm that a change in traffic did indeed take
place. This was done using signal processing techniques, a best practice in the oil and
gas exploration industry, to detect the likely point of change in the noisy data.
Following this, we gathered a wide range of standard SEO features from the pages
(title character length, number of meta description words, readability and so on) within
the domain. Finally we applied a variety of statistical methods to identify those
features that were rewarded or penalised in terms of their Google search traffic after
the likely algorithm update time.
Our results showed, at statistical confidence levels of around 90-95%, that the main
areas within HTML that Google has probably targeted with this change were:
● Main body text
● Hyperlinks
● Anchor text (clickable text in the hyperlink)
● Meta description text
Methods
Data collection
Websites from eight business categories were used for the purposes of this study, in
order to create a well-rounded dataset:
● Online retailers, including the travel, gifts, mobile apps and jewellery sectors;
● Corporate B2B companies, including business awards, advertising and PR.
HTML file contents were first gathered by our in-house web crawler, which scanned
the sites in depth for structural and content-based ‘features’.
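MathSight’s crawler itself is proprietary; purely as an illustration (all names here are hypothetical), a minimal sketch of extracting a few such features from raw HTML with Python’s standard html.parser module might look like:

```python
from html.parser import HTMLParser

class FeatureParser(HTMLParser):
    """Collects a few simple structural and content features from one HTML page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta_description = ""
        self.hyperlink_count = 0
        self.body_words = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "a" and "href" in attrs:
            self.hyperlink_count += 1
        elif tag == "meta" and attrs.get("name", "").lower() == "description":
            self.meta_description = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        else:
            self.body_words += len(data.split())

def extract_features(html):
    """Return a small dict of the feature metrics discussed in this paper."""
    p = FeatureParser()
    p.feed(html)
    return {
        "title_chars": len(p.title),
        "meta_desc_words": len(p.meta_description.split()),
        "hyperlinks": p.hyperlink_count,
        "body_words": p.body_words,
    }
```

A real crawler would add the remaining features (syllable counts, rare-word ratios, header text and so on) on top of the same parse.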
Daily website analytics (page view) data was also imported for each domain above,
spanning a two-month period, from 11 April 2013 through to 11 June 2013. This
period afforded a reasonable window around the time that Google had announced the
‘Penguin 2.0’ algorithm update.
[Figure: whole site traffic for site E over the 3-month period 16 Apr - 25 Jun 2013. Source: Google / Medium: Organic / Metric: page visits.]
Cleansing and exploration of the data
The traffic data, in time-series form for a single domain, were first smoothed using a
moving average, and then seasonal variation was removed to reduce the effect of a
repeated site-usage pattern across the week (e.g. reduced visits at the weekend). This
view is more insightful than either the moving average or the raw traffic numbers alone, as
abrupt changes are clearly defined yet separated from any cyclical variation.
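The paper does not publish its exact smoothing procedure; assuming a plain list of daily visit counts, the two steps described (a 7-day moving average, then removal of a day-of-week effect) might be sketched as:

```python
import statistics

def weekly_moving_average(visits):
    """Centred 7-day moving average; edges fall back to the available window."""
    out = []
    for i in range(len(visits)):
        lo, hi = max(0, i - 3), min(len(visits), i + 4)
        out.append(statistics.mean(visits[lo:hi]))
    return out

def remove_day_effect(visits, first_weekday=0):
    """Subtract each weekday's mean deviation from the overall mean.

    first_weekday: weekday index (0 = Monday) of the first observation.
    """
    overall = statistics.mean(visits)
    by_day = {}
    for i, v in enumerate(visits):
        by_day.setdefault((first_weekday + i) % 7, []).append(v)
    effect = {d: statistics.mean(vs) - overall for d, vs in by_day.items()}
    return [v - effect[(first_weekday + i) % 7] for i, v in enumerate(visits)]
```

On a series with a regular weekend dip, the second step flattens the weekly cycle while leaving any genuine step change in place.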
[Figure: Google search traffic for site ‘E’ over time: daily organic (Google) page visits, weekly moving average, and the series with the day-of-week effect removed.]
Using this cleaned traffic data, a change point detection algorithm was deployed in
order to detect the most likely timing of a change in traffic levels over the period in
question. For each domain, this gave a probabilistic confirmation that a change had
indeed occurred at the period in question, rather than simply a series of fluctuations
due to ‘noise’ in the traffic data.
Using this method, three of our eight site categories (numbers 2, 5 and 7) were selected,
as each showed clear evidence (like the pattern in the upper graph) that a change in
daily visitor traffic had occurred.
The lower graph (for the eighth category in our list) shows that such a change is
unlikely to have occurred on 19th May; rather, it took place later, in early June.
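MathSight does not name its change point algorithm; one standard approach consistent with the likelihood curves shown is to profile the Gaussian log-likelihood of a single mean shift at every candidate split point, taking the maximum as the most likely change date. A minimal sketch of that assumed approach:

```python
import math

def change_point_loglik(series):
    """Profile log-likelihood of a single mean shift at each candidate split.

    For split t the series is modelled as Normal(mu1, sigma) before t and
    Normal(mu2, sigma) from t onwards; higher values = more likely change point.
    """
    n = len(series)
    logliks = {}
    for t in range(2, n - 1):
        left, right = series[:t], series[t:]
        mu1 = sum(left) / len(left)
        mu2 = sum(right) / len(right)
        rss = (sum((x - mu1) ** 2 for x in left)
               + sum((x - mu2) ** 2 for x in right))
        sigma2 = rss / n  # pooled variance under the two-mean model
        logliks[t] = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return logliks

def most_likely_change(series):
    ll = change_point_loglik(series)
    return max(ll, key=ll.get)
```

Plotting the returned log-likelihoods against the date axis reproduces the shape of the curves below: a sharp peak indicates a confident single change, a flat or late peak indicates none at the expected date.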
[Figures: likelihood of a single change in traffic level occurring, plotted by date, for two of the site categories.]
Simple statistical modelling
Following this confirmation that a change had indeed occurred, all the HTML pages of
the chosen domains were classified as either ‘winner’ or ‘loser’ pages according to
their mean traffic levels before and after the alleged algorithm update. The traffic values
were normalised, i.e. scaled so that the differences between ‘before’ and ‘after’
traffic levels were comparable across pages.
Then, the effect of HTML page features on the traffic difference was analysed using the
Analysis of Variance (ANOVA) method. This enabled us to see whether there was any
statistically significant relationship between feature metrics and daily search traffic
variation.
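The paper applies ANOVA without showing the computation; as a sketch, the one-way F statistic comparing a single feature between the ‘winner’ and ‘loser’ groups can be computed in pure Python (a large F relative to the F-distribution's critical value indicates a significant feature):

```python
def one_way_anova_f(*groups):
    """One-way ANOVA F statistic: between-group vs within-group variance."""
    k = len(groups)                       # number of groups (here: 2)
    n = sum(len(g) for g in groups)       # total number of pages
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n - k
    return (ss_between / df_between) / (ss_within / df_within)
```

For example, comparing a feature's values on winner pages against loser pages with clearly separated means yields a large F; identical groups yield F = 0.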
Results
The results below represent a selection of Penguin 2.0 case studies within the overall
data set.
Site A: An online luxury jewellery supplier
CUSTOMER QUERY: They wanted to understand why their daily traffic
jumped up on 19th May.
We found that the average visitor traffic before 19th May was 33.97 visits per day,
rising to 59.66 per day afterwards (an increase of around 76%). There was clear
statistical confirmation that a change in traffic took place (see the spike in the chart below).
[Figure: likelihood of a single change in traffic level occurring for site A, plotted by date.]
OUR ANSWER:
Firstly, Google’s algorithm seems to have become attentive to the nature of the
title tags in their HTML pages, as these appear to have affected the traffic level
after the change. The following aspects were found to be significant, shown in order of
importance:
1. The number of syllables per title
2. The number of ‘rare’ words (i.e. those not in the list of the 5,000 most commonly
used English words) present in the title
3. The title length, in characters (less significantly)
Secondly, the nature of the overall HTML body text had an impact, in this order:
1. The number of words and characters in the document
2. The ratio of ‘rare’ to commonly used words
These were rewarded in the following fashion:
[Chart: difference in mean feature values (0 to 0.50) for: total number of words; body total number of rare words; title total number of rare words; title total number of syllables.]
Thirdly, but notably less significantly, the following two features
had some influence:
1. The number of hyperlinks present
2. The meta description character length
Site B: A mobile application vendor
Site type: promotional and catalogue of products
CUSTOMER QUERY: The ecommerce team wanted to understand why their
visitor traffic fluctuated slightly around 19th May.
[Figure: Google search traffic over time: daily organic (Google) page visits, weekly moving average, and the series with the day-of-week effect removed.]
We found that the average daily visitor traffic before 19th May was 49,534.53; it initially
rose and then settled, overall having dropped slightly to 49,271.79 (a -0.53% change).
OUR ANSWER:
This site has much higher traffic volumes and many more pages, so the data extracted
were far richer than those obtained from site A. Nevertheless, similarities with site A
quickly became apparent: again there seemed to be a focus on the overall HTML page
body text content, meta descriptions and hyperlinks. That is to say:
● The total number of words; the number of syllables in those words; the ratio of
rare and extremely rare words to commonly used ones; the number of difficult words;
the number of sentences; and the ratio of unique to duplicate words. Indeed,
text readability (a combination of almost all the other word-related
features) emerged as slightly significant.
● The number of hyperlinks, and specifically those linking to files or HTML files.
● The meta description; here, the number of words and the ratio of unique to
duplicate words.
DECONSTRUCTING GOOGLE’S PENGUIN 2.0 | WHITEPAPER
8
However, the difference here was that there was no hint of title length or content
being significant. Rather, the focus extended to two other areas (listed in order
of importance):
● The length in characters and number of words in anchor text.
● The number of rare words in the headers.
Thanks to the larger dataset for this client, by comparing pages that ‘won’ after the
algorithm change with those that lost, we were able to observe that some features
were substantially rewarded as they increased in value, whilst others were
punished. See below:
[Chart: difference in mean feature values (-0.70 punished to 0.70 rewarded) for: number of hyperlinks with html; number of hyperlinks with file; number of hyperlinks; meta description total number of words; meta description number of unique words; meta description number of duplicate words; h3 total number of rare words; h1 total number of rare words; number of rare words; body total number of rare words; number of duplicate words; number of unique words; total number of words; total number of syllables; total number of 5K common words; total number of dictionary (very rare) words; average number of words in anchor text; average length of anchor text.]
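The whitepaper does not spell out how the “difference in mean feature values” bars were computed; a plausible sketch, assuming each feature is first standardised to zero mean and unit variance across all pages and then averaged within the winner and loser groups:

```python
import statistics

def mean_feature_differences(winners, losers):
    """Difference in mean standardised feature values (winners minus losers).

    winners, losers: lists of dicts mapping feature name -> raw value.
    A positive result marks a feature associated with 'rewarded' pages,
    a negative result one associated with 'punished' pages.
    """
    all_pages = winners + losers
    diffs = {}
    for feat in all_pages[0]:
        values = [p[feat] for p in all_pages]
        mu, sd = statistics.mean(values), statistics.pstdev(values)
        if sd == 0:            # constant feature carries no signal
            diffs[feat] = 0.0
            continue
        w = statistics.mean((p[feat] - mu) / sd for p in winners)
        l = statistics.mean((p[feat] - mu) / sd for p in losers)
        diffs[feat] = w - l
    return diffs
```

Standardising first keeps features on different scales (character counts vs ratios) comparable on a single chart axis.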
Site C: An online watch vendor
Site type: extensive online catalogue with product photographs
CUSTOMER QUERY: They wanted to understand a recent increase in daily traffic
sourced from Google search, which started on 19th May.
The site’s daily average traffic was 429.85 before this date and about 503.0 afterwards, a roughly 17% increase.
OUR ANSWER:
Again, a relatively rich set of data, from which certain significant features emerged. To
start with, those that had already appeared in one or both of the previous two sites
examined:
● Number of hyperlinks (and whether they link to a file)
● Anchor text (length and number of words, number of unique and duplicate
words, number of syllables)
● Body text (ratio of commonplace to rarer words, ratio of unique to duplicate
words, readability)
● Headers (and whether or not they contain rare words)
● To a certain extent, the meta description character length
Additionally, the following features were significant:
● Number of external CSS styles (cascading style sheets, which manage the appearance
of the website)
● Number of scripts
● Number of external and absolute internal links
In terms of rewarded and/or punished features: with this client, a large quantity of
anchor text (regardless of whether the words were duplicated) was heavily rewarded,
as was the reading difficulty of the body text and headers (i.e. the harder and more
complex the text, the better it fared with Google’s algorithm). This relates to the number of
rare words. The full breakdown is shown in the following graph.
[Chart: difference in mean feature values (-0.50 punished to 0.50 rewarded) for: max number of words in anchor text; maximum length of anchor text; anchor number of unique words; anchor total number of syllables; average length of anchor text; anchor total number of words; anchor number of duplicate words; h1 total number of rare words; number of h1 instances; Dale-Chall readability; ratio of rare words to total; number of rare words; number of unique 5K common words; total number of 5K common words; number of hyperlinks; number of hyperlinks with file; number of absolute internal links; number of external links; number of external CSS styles; number of scripts.]
Conclusions
Our analysis suggests a potentially significant, positive association between a
site’s Google-sourced visitor levels and the values of certain page features: traffic rose
as some feature values increased and fell as others did, based on two months’ worth of
traffic data (observed 10 days, one month and six weeks before and after the change).
Overall, there was large variation between the types of rewarded features across the
websites analysed. This suggests that any advice would be most effective if tailored to
the individual domain, or to a type of domain where domains can be grouped or
clustered into types. For example, with site B the presence of anchor text
was seemingly punished as a feature, whereas for site C it was heavily rewarded.
However, even with all the above variation, looking at the two larger and more popular
(traffic-wise) sites that clearly exhibited the effects of Penguin 2.0, we can draw the
following conclusions:
● In body text, rare words — those not in the list of the 5,000 most common English
words — are generally rewarded. It is therefore a good idea to raise the writing level
of the page copy (i.e. aim for higher Dale-Chall readability scores).
● Use of headings is rewarded; it is also advantageous to use less commonplace
words here.
● The number of hyperlinks present appears to have been rewarded, i.e. the
more hyperlinks, the greater the increase in traffic (in some cases), although
perhaps this is too vague to act upon. Among those hyperlinks, there
was no bias towards external or internal links.
● Finally, depending on the type of site, and based on our limited survey, the
presence and increased character length of meta descriptions, and the
increased quantity of words in anchor text, are now slightly more rewarded
than previously.
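The Dale-Chall score recommended above combines the share of ‘difficult’ words (those outside a fixed familiar-word list; the real list has roughly 3,000 entries, stood in for here by a tiny demo set) with average sentence length. A sketch of the standard formula:

```python
import re

def dale_chall(text, easy_words):
    """New Dale-Chall readability score: higher score = harder text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text.lower())
    pct_difficult = 100 * sum(w not in easy_words for w in words) / len(words)
    avg_sentence_len = len(words) / len(sentences)
    score = 0.1579 * pct_difficult + 0.0496 * avg_sentence_len
    if pct_difficult > 5:   # adjustment term in the standard formula
        score += 3.6365
    return score
```

Copy written with more words from outside the familiar list scores higher, which, per the findings above, is the direction Penguin 2.0 appeared to reward.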
In addition to the insights gained from deconstructing Penguin 2.0, we can now use the
models to evaluate the inbound link profiles of sites that may have been affected by
the latest algorithm update. For example, the models may be applied to a site’s inbound
link profile when deciding which links to disavow after a search engine manual penalty
notice (i.e. apparent abnormal inbound link activity), or after a subsequent loss of
traffic in response to a search engine’s algorithm update.
Future work
Our research into analysing search algorithm updates is continuing as our data
partnership community grows. Like the search engines, we are expanding the number of
features and searching for the “sweet spots” of site optimisation. The benefit of using
machine learning is that our data modelling of algorithm updates is dynamic, and
therefore stays up to date as search engines continually improve. Thus
MathSight’s models evolve with the search engines.
Further research papers will be released to provide insights into Penguin and Panda
going forward.
About MathSight
MathSight was launched in March 2013; it demystifies search engine algorithms
using machine learning and big data.
The platform analyses both the qualitative and stylistic aspects of content, web design,
and site architecture, their inter-relationships, traffic data and other key performance
indicators. This enables MathSight to determine the cause of changes in search engine
traffic, be it a change in the algorithm, or the SEO (onsite and offsite) of a client or
competitors. These insights are currently available for integration into bespoke and
best-in-class, enterprise-level SEO tools.
For more information visit: MathSight.org