Analyse your SEO Data with R and Kibana


June 10th, 2016

Vincent Terrasi


--

SEO Director - Groupe M6Web CuisineAZ, PasseportSanté, MeteoCity, …

--

Join the OVH adventure in July 2016

Blog : data-seo.com

Agenda

Mission : Build a Real-Time Log Analysis Tool

1. Using Screaming Frog to crawl a website

2. Using R for SEO Analysis

3. Using PaasLogs to centralize logs

4. Using Kibana to build fancy dashboards

5. Test !


“The world is full of obvious things which nobody by any chance ever observes.”

Sherlock Holmes

Real-Time Log Analysis Tool

• Screaming Frog

• Google Analytics

• R Crawler

• IIS Logs

• Apache Logs

• Nginx Logs

Using Screaming Frog

Screaming Frog : Export Data

Add your URL and click the Start button.

When the crawl is finished, click the Export button and save the XLSX file.

Screaming Frog : Data!

"Address"

"Content"

"Status Code"

"Status"

"Title 1"

"Title 1 Length"

"Title 1 Pixel Width"

"Title 2"

"Title 2 Length"

"Title 2 Pixel Width"

"Meta Description 1"

"Meta Description 1 Length“

"Meta Description 1 Pixel Width"

"Meta Keyword 1"

"Meta Keywords 1 Length"

"H1-1"

"H1-1 length"

"H2-1"

"H2-1 length"

"H2-2"

"H2-2 length"

"Meta Robots 1“

"Meta Refresh 1"

"Canonical Link Element 1"

"Size"

"Word Count"

"Level"

"Inlinks"

"Outlinks"

"External Outlinks"

"Hash"

"Response Time"

"Last Modified"

"Redirect URI“

"GA Sessions"

"GA % New Sessions"

"GA New Users"

"GA Bounce Rate"

"GA Page Views Per Sesssion"

"GA Avg Session Duration"

"GA Page Value"

"GA Goal Conversion Rate All"

"GA Goal Completions All"

"GA Goal Value All"

"Clicks"

"Impressions"

"CTR"

"Position"

"H1-2"

"H1-2 length"

Using R

Why R ?

Scriptable

Big Community

Mac / PC / Unix

Open Source

7500 packages


Documentation

WheRe ? How ?

https://www.cran.r-project.org/


Rgui / RStudio

Using R : Step 1

Export All URLs


"request“;"section“;"active“;

"speed“;"compliant“;"depth“;"inlinks"

Packages (see the setup sketch below) :

stringr

ggplot2

dplyr

readxl
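A minimal setup sketch, assuming none of the packages are installed yet (stringi is added because stri_detect_fixed() is used later in the section classifier):

# install once, then load for the session
install.packages(c("stringr", "stringi", "ggplot2", "dplyr", "readxl"))
library(stringr)  # string helpers
library(stringi)  # stri_detect_fixed(), used by the section classifier
library(ggplot2)  # charts
library(dplyr)    # select / mutate / filter / group_by
library(readxl)   # read the Screaming Frog XLSX export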

R Examples

Crawl via Screaming Frog

Classify URLs by: section, load time, number of inlinks

Detect active pages: at least 1 visit per month

Detect compliant pages: flag canonical not equal, meta noindex, or bad HTTP status code

Detect duplicate meta

R : read files

# Read the XLSX file
urls <- read_excel("internal_html_blog.xlsx",
                   sheet = 1,
                   col_names = TRUE,
                   skip = 1)

# Read the CSV file
urls <- read.csv2("internal_html_blog.csv", sep = ";", header = TRUE)
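A quick sanity check on the import before going further (a sketch):

# confirm dimensions, column types and a few URLs
dim(urls)
str(urls)
head(urls$Address)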

Detect Active Pages

#default

urls_select$Active <- FALSE

urls_select$Active[ which(urls_select$`GA Sessions` > 0) ] <- TRUE

#factor

urls_select$Active <- as.factor(urls_select$Active)

Classify URLs by Section

schemas <- read.csv("conf.csv", header = FALSE, col.names = "schema", stringsAsFactors = FALSE)

urls_select$Cat <- "no match"

# loop over the URL patterns, one per row of conf.csv
for (j in 1:nrow(schemas))
{
  urls_select$Cat[ which(stri_detect_fixed(urls_select$Address, schemas$schema[j])) ] <- schemas$schema[j]
}

conf.csv:

/agenda/sorties-cinema/
/agenda/parutions/
/agenda/evenements/
/agenda/programme-tv/
/encyclopedie/

Classify URLs By Load Time

urls_select$Speed <- NA
urls_select$Speed[ which(urls_select$`Response Time` < 0.501) ] <- "Fast"
urls_select$Speed[ which(urls_select$`Response Time` >= 0.501
                         & urls_select$`Response Time` < 1.001) ] <- "Medium"
urls_select$Speed[ which(urls_select$`Response Time` >= 1.001
                         & urls_select$`Response Time` < 2.001) ] <- "Slow"
urls_select$Speed[ which(urls_select$`Response Time` >= 2.001) ] <- "Slowest"
urls_select$Speed <- as.factor(urls_select$Speed)
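The same bucketing can be written more compactly with cut(); a sketch, assuming the same thresholds as above:

# alternative: one cut() call, returns a factor directly
urls_select$Speed <- cut(urls_select$`Response Time`,
                         breaks = c(-Inf, 0.501, 1.001, 2.001, Inf),
                         labels = c("Fast", "Medium", "Slow", "Slowest"),
                         right = FALSE)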

Classify URLs By Number of Inlinks

urls_select$`Group Inlinks` <- "URLs with No Follow Inlinks"
urls_select$`Group Inlinks`[ which(urls_select$Inlinks == 1) ] <- "URLs with 1 Follow Inlink"
urls_select$`Group Inlinks`[ which(urls_select$Inlinks > 1
                                   & urls_select$Inlinks < 6) ] <- "URLs with 2 to 5 Follow Inlinks"
urls_select$`Group Inlinks`[ which(urls_select$Inlinks >= 6
                                   & urls_select$Inlinks < 11) ] <- "URLs with 6 to 10 Follow Inlinks"
urls_select$`Group Inlinks`[ which(urls_select$Inlinks >= 11) ] <- "URLs with more than 10 Follow Inlinks"
urls_select$`Group Inlinks` <- as.factor(urls_select$`Group Inlinks`)

Detect Compliant Pages

# A page is compliant unless it has:
# - a canonical not equal to its address
# - a meta noindex
# - a bad HTTP status code

urls_select$Compliant <- TRUE

urls_select$Compliant[ which(urls_select$`Status Code` != 200
                             | urls_select$`Canonical Link Element 1` != urls_select$Address
                             | urls_select$Status != "OK"
                             | grepl("noindex", urls_select$`Meta Robots 1`)) ] <- FALSE

urls_select$Compliant <- as.factor(urls_select$Compliant)

Detect Duplicate Meta

urls_select$`Status Title` <- "Unique"
urls_select$`Status Title`[ which(urls_select$`Title 1 Length` == 0) ] <- "Not Set"

urls_select$`Status Description` <- "Unique"
urls_select$`Status Description`[ which(urls_select$`Meta Description 1 Length` == 0) ] <- "Not Set"

urls_select$`Status H1` <- "Unique"
urls_select$`Status H1`[ which(urls_select$`H1-1 length` == 0) ] <- "Not Set"

urls_select$`Status Title`[ which(duplicated(urls_select$`Title 1`)) ] <- "Duplicate"
urls_select$`Status Description`[ which(duplicated(urls_select$`Meta Description 1`)) ] <- "Duplicate"
urls_select$`Status H1`[ which(duplicated(urls_select$`H1-1`)) ] <- "Duplicate"

urls_select$`Status Title` <- as.factor(urls_select$`Status Title`)
urls_select$`Status Description` <- as.factor(urls_select$`Status Description`)
urls_select$`Status H1` <- as.factor(urls_select$`Status H1`)
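Note that duplicated() flags only the second and later occurrences, so the first copy of each duplicated title stays "Unique". A sketch, to be run before the as.factor() conversions, if you want every copy flagged:

# flag all copies of a duplicated title, not just the 2nd+ occurrence
dup_titles <- duplicated(urls_select$`Title 1`) |
              duplicated(urls_select$`Title 1`, fromLast = TRUE)
urls_select$`Status Title`[ which(dup_titles) ] <- "Duplicate"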

Generate CSV

urls_light <- select(urls_select, Address, Cat, Active, Speed, Compliant, Level, Inlinks) %>%
  mutate(Address = gsub("http://moniste.fr", "", Address))

colnames(urls_light) <- c("request", "section", "active", "speed", "compliant", "depth", "inlinks")

write.csv2(urls_light, "file.csv", row.names = FALSE)

Package dplyr : select and mutate

Edit colnames

Use write.csv2

R : ggplot2 command

DATA

Create the ggplot object and populate it with data (always a data frame):

ggplot(mydata, aes(x = section, y = count, fill = active))

LAYERS

Add layer(s):

+ geom_point()

FACET

Used for conditioning on variable(s):

+ facet_grid(~rescode)
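Putting the three parts together; a minimal runnable sketch built on a small hypothetical data frame:

# hypothetical data, just to make the grammar concrete
library(ggplot2)
mydata <- data.frame(
  section = c("blog", "blog", "shop", "shop"),
  count   = c(120, 30, 80, 10),
  active  = c(TRUE, FALSE, TRUE, FALSE),
  rescode = c(200, 301, 200, 301)
)
# data + layer + facet
ggplot(mydata, aes(x = section, y = count, fill = active)) +
  geom_bar(stat = "identity", position = "stack") +
  facet_grid(~ rescode)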

ggplot2 : Geometry

R Chart : Active Pages

urls_level_active <- group_by(urls_select, Level, Active) %>%
  summarise(count = n()) %>%
  filter(Level < 12)


p <- ggplot(urls_level_active, aes(x = Level, y = count, fill = Active)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values = c("#e5e500", "#4DBD33")) +
  labs(x = "Depth", y = "Crawled URLs")

# display
print(p)

# save to file
ggsave(file = "chart.png")

R Chart : GA Sessions

urls_cat_gasessions <- aggregate(urls_select$`GA Sessions`,
                                 by = list(Cat = urls_select$Cat, Compliant = urls_select$Compliant),
                                 FUN = sum, na.rm = TRUE)

colnames(urls_cat_gasessions) <- c("Category", "Compliant", "GA Sessions")

p <- ggplot(urls_cat_gasessions, aes(x = Category, y = `GA Sessions`, fill = Compliant)) +
  geom_bar(stat = "identity", position = "stack") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = "Section", y = "Sessions") +
  scale_fill_manual(values = c("#e5e500", "#4DBD33"))

# display
print(p)

# save to file
ggsave(file = "chart.png")

R Chart : Compliant

urls_cat_compliant_statuscode <- group_by(urls_select, Cat, Compliant, `Status Code`) %>%
  summarise(count = n()) %>%
  filter(`Status Code` %in% c(200, 301))

p <- ggplot(urls_cat_compliant_statuscode, aes(x = Cat, y = count, fill = Compliant)) +
  geom_bar(stat = "identity", position = "stack") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  facet_grid(`Status Code` ~ .) +
  labs(x = "Section", y = "Crawled URLs") +
  scale_fill_manual(values = c("#e5e500", "#4DBD33"))

R : SEO Cheat Sheet

Package dplyr (see the chaining sketch below)

select() picks a useful subset of columns.

mutate() adds new columns or replaces existing ones in a data frame.

filter() selects a subset of rows in a data frame.

Package ggplot2

aes(), geom_*

ggsave()

Package readxl

read_excel()

Base R (utils)

read.csv2(), write.csv2()
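The three dplyr verbs chain naturally with the pipe; a small sketch on the crawl data, using the column names from the examples above:

# keep fast, active pages, trim to three columns, normalise the URL case
library(dplyr)
fast_active <- urls_select %>%
  filter(Speed == "Fast", Active == "TRUE") %>%
  select(Address, Cat, Inlinks) %>%
  mutate(Address = tolower(Address))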

ELK

Architecture

Hard to monitor and optimize host server performance


Using PaasLogs


PaasLogs

164 nodes in the Elasticsearch cluster

180 connected machines

Between 100,000 and 300,000 logs processed per second

12 billion logs flowing through every day

211 billion documents stored

8 clicks and 3 copy/pastes to use it!

PaasLogs : Step 1

PaasLogs : Step 2


PaasLogs : Streams

The Streams are the recipients of your logs. When you send a log with the right stream token, it automatically arrives in your stream in an awesome piece of software named Graylog.

PaasLogs : Dashboards

The Dashboard is the global view of your logs. It is an efficient way to exploit your logs and to view global information such as metrics and trends about your data without being overwhelmed by log details.

PaasLogs : Aliases

The Aliases will allow you to access your data directly from Kibana or via an Elasticsearch query.

DON’T FORGET TO ENABLE KIBANA INDICES AND WRITE YOUR USER PASSWORD

PaasLogs : Inputs

The Inputs will allow you to ask OVH to host your own dedicated collector, such as Logstash or Flowgger.

PaasLogs : Network Configuration

PaasLogs : Logstash Plugins

OVHCOMMONAPACHELOG %{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion_num:float})?|%{DATA:rawrequest})" %{NUMBER:response_int:int} (?:%{NUMBER:bytes_int:int}|-)

OVHCOMBINEDAPACHELOG %{OVHCOMMONAPACHELOG} "%{NOTSPACE:referrer}" %{QS:agent}

PaasLogs : Logstash Config

if [type] == "apache" {

grok {

match => [ "message", "%{OVHCOMBINEDAPACHELOG}"]

patterns_dir => "/opt/logstash/patterns"

}

}

if [type] == "csv_infos" {

csv {

columns => ["request", "section","active", "speed",

"compliant","depth","inlinks"]

separator => ";"

}

}

How to send Logs to PaasLogs ?

Use Filebeat

Filebeat : Install

Install Filebeat:

curl -L -O https://download.elastic.co/beats/filebeat/filebeat_1.2.1_amd64.deb
sudo dpkg -i filebeat_1.2.1_amd64.deb

Or follow the official guide: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-installation.html

Filebeat : Edit filebeat.yml

Edit the configuration:

nano /etc/filebeat/filebeat.yml

Changes to make in filebeat.yml:

> set the path to your log files

> set the PaasLogs host: c002-5717e1b5d2ee5e00095cea38.in.laas.runabove.com

filebeat:
  prospectors:
    -
      paths:
        - /home/ubuntu/lib/apache2/log/access.log
      input_type: log
      fields_under_root: true
      document_type: apache
    -
      paths:
        - /home/ubuntu/workspace/csv/crawled-urls-filebeat-*.csv
      input_type: csv
      fields_under_root: true
      document_type: csv_infos
output:
  logstash:
    hosts: ["c002-5717e1b5d2ee5e00095cea38.in.laas.runabove.com:5044"]
    worker: 1
    tls:
      certificate_authorities: ["/home/ubuntu/workspace/certificat/key.crt"]

Filebeat : Start


Copy / Paste Key.crt

-----BEGIN CERTIFICATE-----
MIIDozCCAougAwIBAgIJALxR4fTZlzQMMA0GCSqGSIb3DQEBCwUAMGgxCzAJBgNVBAYTAkZSMQ
8wDQYDVQQIDAZGcmFuY2UxDjAMBgNVBAcMBVBhcmlzMQwwCgYDVQQKDANPVkgxCzAJBgNVB
AYTAkZSMR0wGwYDVQQDDBRpbi5sYWFzLnJ1bmFib3ZlLmNvbTAeFw0xNjAzMTAxNTEzMDNaFw0
xNzAzMTAxNTEzMDNaMGgxCzAJBgNVBAYTAkZSMQ8wDQYDVQQIDAZGcmFuY2UxDjAMBgNVBA
cMBVBhcmlzMQwwCgYDVQQKDANPVkgx
-----END CERTIFICATE-----

Start Filebeat:

sudo /etc/init.d/filebeat start
sudo /etc/init.d/filebeat stop

If something goes wrong, check the Filebeat log:

tail -f /var/log/filebeat.log

How to combine multiple sources ?

PaasLogs : Plugins ES


Description : Copies fields from previous log events in Elasticsearch to current events

if [type] == "apache" {

elasticsearch {

hosts => "laas.runabove.com"

index => "logsDataSEO" # alias

ssl => true

query => ‘ type:csv_infos AND request: "%{[request]}" ‘

fields => [["speed","speed"],["compliant","compliant"],

["section","section"],["active","active"],

["depth","depth"],["inlinks","inlinks"]]

}

}

# TIP : fields => [[src,dest],[src,dest]]

Using Kibana

Kibana : Install


Download Kibana 4.1:

• Download and unzip Kibana 4

• Extract your archive

• Open config/kibana.yml in an editor

• Set the elasticsearch.url to point at your Elasticsearch instance

• Run ./bin/kibana (or bin\kibana.bat on Windows)

• Point your browser at http://yourhost.com:5601

Kibana : Edit kibana.yml

Update kibana.yml

server.port: 8080

server.host: "0.0.0.0"

elasticsearch.url: "https://laas.runabove.com:9200"

elasticsearch.preserveHost: true

kibana.index: "ra-logs-33078"

kibana.defaultAppId: "discover"

elasticsearch.username: "ra-logs-33078"

elasticsearch.password: "rHftest6APlolNcc6"

Kibana : Line Chart


Number of active pages crawled by Google over a period of time

Kibana : Vertical Bar Chart


Kibana : Pie Chart

How to compare two periods ?

Kibana : Use Date Range


Final Architecture

Filebeat collects the logs (IIS, Apache, Nginx, HAProxy) and ships them to PaasLogs; Kibana sits on top for visualisation.

Two feeds: soft real-time for live logs, plus a batch import for old logs.

Test yourself

Use Screaming Frog Spider Tool

www.screamingfrog.co.uk

Learn R

www.datacamp.com

www.data-seo.com

www.moise-le-geek.fr/push-your-hands-in-the-r-introduction/

Test PaasLogs

www.runabove.com

Install Kibana

www.elastic.co/downloads/kibana

TODO List

- Create a GitHub repository with all the source code

- Add a Logstash plugin to do reverse DNS lookups

- Schedule a crawl from the command line

- Upload the Screaming Frog file to a web server

Thank you

Keep in touch June 10th, 2016

@vincentterrasi Vincent Terrasi