OnCrawl ElasticSearch Meetup France #12

Transcript of OnCrawl ElasticSearch Meetup France #12

Page 1: OnCrawl ElasticSearch Meetup France #12

Elasticsearch + Oncrawl = <3

A SaaS SEO Monitoring solution by

Presentation by Tanguy Moal (@tuxnco)

Meetup Elasticsearch Paris #12

2015/01/22

Page 2: OnCrawl ElasticSearch Meetup France #12

Oncrawl · Elasticsearch Meetup France #12 · 22/01/15

[tuxnco@hal]:/opt$ whoami

- age: 0x20
- kids: 0x02
- hobbies:
  - tech founder & cto at cogniteev
  - search, natural language processing, datamining
  - misc.
- history:
  - r&d engineer @ exalead
  - r&d engineer @ jobijoba

Page 3: OnCrawl ElasticSearch Meetup France #12

Presentation plan

Introduction to Oncrawl

Oncrawl technical overview

hadoop-elasticsearch within Oncrawl

Oncrawl API

Scaling Oncrawl infrastructure with Saltstack.

Conclusion / Questions

Page 4: OnCrawl ElasticSearch Meetup France #12

Introduction

Page 5: OnCrawl ElasticSearch Meetup France #12

Oncrawl: SEO Monitoring

- SEO Game has changed:
  - Websites are getting bigger, harder to maintain
  - Several indicators to monitor
  - SaaS to the rescue (Moz, Ranks, Majestic SEO, Botify, Deepcrawl, …)

Page 6: OnCrawl ElasticSearch Meetup France #12

Oncrawl: SEO Monitoring

- Analysis performed through crawl reports
- SEO monitoring follows 5 axes:
  - Performance
  - HTML quality
  - Inlinks
  - Outlinks
  - Content
- Interactive Analysis (URL explorer)
- Planned: crawl-over-crawl trend spotting

Page 7: OnCrawl ElasticSearch Meetup France #12

Oncrawl: Pricing

Page 8: OnCrawl ElasticSearch Meetup France #12

Oncrawl: technical overview

Page 9: OnCrawl ElasticSearch Meetup France #12

Oncrawl: application architecture

Page 10: OnCrawl ElasticSearch Meetup France #12

Boom.

Boom2.

Page 11: OnCrawl ElasticSearch Meetup France #12

Application scenario

- User has a plan and configured projects
- Plan grants privileges
  - Used to: allow project creation and triggering of crawls
- Each project may have associated crawls
- Each crawl contains a report

What data are involved in a crawl report?

Page 12: OnCrawl ElasticSearch Meetup France #12

Links

- Important piece in serious SEO campaigns
- Key fields:
  - origin, origin_domain, origin_depth
  - target, target_domain, target_depth
  - context:
    - position in origin page
    - anchor text
    - wraps significant tags (hn, img, …)
- Use cases:
  - list outlinks (resp. inlinks) of a given page
  - distinguish links used to go up (resp. down) the site’s tree
  - anchor text analysis, …
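
The first two use cases above could be sketched as follows. This is only an illustration under assumptions: the sample documents, the URLs and the choice of a plain `term` query on exactly-matched (not analyzed) fields are hypothetical, and the talk does not show how Oncrawl actually queries its link index.

```python
# Hypothetical link documents, using some of the key fields listed above.
LINKS = [
    {"origin": "http://example.com/", "origin_depth": 0,
     "target": "http://example.com/a", "target_depth": 1, "anchor": "Category A"},
    {"origin": "http://example.com/a", "origin_depth": 1,
     "target": "http://example.com/", "target_depth": 0, "anchor": "Home"},
    {"origin": "http://example.com/a", "origin_depth": 1,
     "target": "http://example.com/a/1", "target_depth": 2, "anchor": "Item 1"},
]


def outlinks_query(url):
    """Search body selecting the links that start from `url`."""
    return {"query": {"term": {"origin": url}}}


def inlinks_query(url):
    """Search body selecting the links that point to `url`."""
    return {"query": {"term": {"target": url}}}


def direction(link):
    """Classify a link as going up, down or sideways in the site's tree."""
    if link["target_depth"] < link["origin_depth"]:
        return "up"
    if link["target_depth"] > link["origin_depth"]:
        return "down"
    return "same level"


directions = [direction(link) for link in LINKS]
```

Storing both depths on the link document is what makes the up/down distinction a simple comparison, with no lookup of the two pages involved.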

Page 13: OnCrawl ElasticSearch Meetup France #12

Page model

- Key fields
  - url
  - domain
  - hash
  - fetch
    - date, size, time
    - HTTP headers
    - HTTP status code | ignored (robots.txt|settings)
  - parse
    - title, hn, metas
    - canonical
  - seo
    - depth, popularity, total inlinks
    - outlinks breakdown (internal vs external, follow vs nofollow)
    - word count, text to code ratio, duplicated fields, simhash
- Use cases
  - stats on size/fetch time/status code, by depth or for pages having any combination of criteria
  - find pages with highest similarity to a given one
  - find pages with duplicated properties (title, hn, …)
- The central piece of the puzzle. Wraps all metadata relating to a given URL
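
To illustrate the "highest similarity" use case, here is a minimal sketch assuming that similarity is measured as the Hamming distance between simhash fingerprints, which is the usual way simhashes are compared. The talk does not say how Oncrawl computes it, and the pages and 8-bit fingerprints below are hypothetical values chosen for readability.

```python
def hamming_distance(a, b):
    """Number of bits that differ between two integer fingerprints."""
    return bin(a ^ b).count("1")


def most_similar(target, pages, limit=3):
    """Return up to `limit` (distance, url) pairs, closest first.

    The target page itself is excluded from the results.
    """
    scored = [
        (hamming_distance(target["simhash"], page["simhash"]), page["url"])
        for page in pages
        if page["url"] != target["url"]
    ]
    return sorted(scored)[:limit]


# Hypothetical pages; real simhashes are typically 64 bits wide.
PAGES = [
    {"url": "/a", "simhash": 0b10110010},
    {"url": "/b", "simhash": 0b10110011},
    {"url": "/c", "simhash": 0b01001101},
]

closest = most_similar(PAGES[0], PAGES)
```

A small distance means near-duplicate content, so `/b` ranks ahead of `/c` for the page `/a`.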

Page 14: OnCrawl ElasticSearch Meetup France #12

Hadoop & Elasticsearch.

Page 15: OnCrawl ElasticSearch Meetup France #12

Elasticsearch for Hadoop

- references
  - overview http://www.elasticsearch.org/overview/hadoop/
  - online documentation http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/index.html
  - github
    - repo https://github.com/elasticsearch/elasticsearch-hadoop
    - author https://github.com/costin
- features
  - compatibility
  - simplicity
  - low footprint
  - flexible

Page 16: OnCrawl ElasticSearch Meetup France #12

Oncrawl: hadoop-elasticsearch

- Apache Nutch (v1.x) uses HDFS (v2.x supports several storages through Apache Gora -- including elasticsearch -- but…)
- Stacked different custom hadoop jobs to compute Oncrawl’s custom attributes (duplicates, …)
- What about Apache Nutch’s ESIndexer?
- hadoop-elasticsearch does the job pretty well
- Relies on job’s configuration:
  - es.resource(.read|.write)? : « index/type » (supports “late” type routing from fields in collected output, e.g. « my_index/{some_field} »)
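
The idea behind "late" type routing can be illustrated with a few lines of Python. This only mimics the behaviour described above and is not the connector's implementation (which is Java); the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def resolve_resource(pattern, document):
    """Fill `{field}` placeholders of an "index/type" pattern from a document."""
    return pattern.format_map(document)


def route(pattern, documents):
    """Group documents by the resource each one would be written to."""
    routed = defaultdict(list)
    for doc in documents:
        routed[resolve_resource(pattern, doc)].append(doc)
    return dict(routed)


# Hypothetical job output: the value of `some_field` decides the target type.
DOCS = [
    {"some_field": "page", "url": "/a"},
    {"some_field": "link", "origin": "/a", "target": "/b"},
    {"some_field": "page", "url": "/b"},
]

routed = route("my_index/{some_field}", DOCS)
```

With such a pattern, one job can emit several kinds of documents and each lands in its own type without a separate writer per type.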

Page 17: OnCrawl ElasticSearch Meetup France #12

Oncrawl: hadoop-elasticsearch

• Reading from elasticsearch
  – job.setInputFormat(EsInputFormat.class);
• Writing to elasticsearch
  – job.setOutputFormat(EsOutputFormat.class);
  – Map<Object, Object> value = new LinkedHashMap<Object, Object>();
  – collector.collect(key, WritableUtils.toWritable(value));

Read \ Write     HDFS       Elasticsearch
HDFS             builtin    yes
Elasticsearch    yes        yes

Page 18: OnCrawl ElasticSearch Meetup France #12

Elasticsearch & Python

Page 19: OnCrawl ElasticSearch Meetup France #12

Oncrawl API

• Python / Flask:
  – Lightweight
  – Easy to deploy / mirror
  – Clean syntax
• elasticsearch python client:
  – simple API
  – allows for fine tuning of the client (HTTP connection parameters, …)
• API’s mission: populate application’s report’s graphs

Page 20: OnCrawl ElasticSearch Meetup France #12

Oncrawl API

- Each graph on the app has a dedicated API endpoint
- Binds graph semantics to an elasticsearch query. Returns json data ready for the rendering (d3.js, …)
- Example: Summary of page load times
  - 4 buckets:
    - perfect (under 500ms)
    - medium (between 500ms and 1000ms)
    - slow (between 1000ms and 2000ms)
    - too slow (beyond 2000ms)
  - Expected output by plotting library: [figure]
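
The four load-time buckets map naturally onto an Elasticsearch range aggregation. The sketch below builds such a search body and reshapes a response for a chart. The field name `fetch_time`, the aggregation name `load_times`, the label/value output shape and the sample counts are all assumptions for illustration; the actual query and output format used by Oncrawl are not reproduced in this transcript.

```python
# Bucket boundaries in milliseconds; in a range aggregation,
# "from" is inclusive and "to" is exclusive.
LOAD_TIME_RANGES = [
    {"key": "perfect", "to": 500},
    {"key": "medium", "from": 500, "to": 1000},
    {"key": "slow", "from": 1000, "to": 2000},
    {"key": "too slow", "from": 2000},
]


def load_time_query(field="fetch_time"):
    """Build a search body with a range aggregation over page load times.

    `field` is a hypothetical field holding the fetch time in milliseconds.
    """
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "aggs": {
            "load_times": {"range": {"field": field, "ranges": LOAD_TIME_RANGES}}
        },
    }


def to_chart_data(response):
    """Flatten an aggregation response into label/value pairs for plotting."""
    buckets = response["aggregations"]["load_times"]["buckets"]
    return [{"label": b["key"], "value": b["doc_count"]} for b in buckets]


# Hand-written response in the shape returned for a range aggregation.
sample_response = {
    "aggregations": {"load_times": {"buckets": [
        {"key": "perfect", "to": 500, "doc_count": 120},
        {"key": "medium", "from": 500, "to": 1000, "doc_count": 45},
        {"key": "slow", "from": 1000, "to": 2000, "doc_count": 12},
        {"key": "too slow", "from": 2000, "doc_count": 3},
    ]}}
}

chart = to_chart_data(sample_response)
```

An endpoint dedicated to this graph would send `load_time_query()` to the crawl's index and return `to_chart_data(...)` as JSON.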

Page 21: OnCrawl ElasticSearch Meetup France #12

Oncrawl API

- Queries are easy to compose using python
- Write & test it in Marvel
- Integrate in Flask API

Page 22: OnCrawl ElasticSearch Meetup France #12

Elastic: Scale it

May I have the salt, please?

Page 23: OnCrawl ElasticSearch Meetup France #12

Oncrawl scalability constraints

- 1 index per crawl
- size of indices? S-M-L-XL
- sharding policy:
  - S: 1 shard
  - M: 3 shards
  - L: 5 shards
  - XL: 10 shards
- Hadoop cluster management
  - Provisioned for a given number of concurrent crawl cycles
  - HDFS grows with total clients
- Elasticsearch cluster management
  - Build: same provision as hadoop cluster
  - Storage / service:
    - provisioned for 3 months of subscription
  - Old indices:
    - close & snapshot
    - reopen on demand
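
The "one index per crawl" rule and the sharding policy could be combined as in the sketch below, which builds the name and the creation settings for a crawl's index. The index naming scheme and the replica count are assumptions made for the example; only the shard counts per size come from the slide.

```python
# Shard counts per crawl size, as listed above.
SHARDS_BY_SIZE = {"S": 1, "M": 3, "L": 5, "XL": 10}


def index_name(crawl_id):
    """Name of the index dedicated to one crawl (hypothetical naming scheme)."""
    return "crawl_{}".format(crawl_id)


def index_settings(size, replicas=1):
    """Index creation body for a crawl of the given size.

    `replicas` is an assumed default, not a value taken from the talk.
    """
    return {
        "settings": {
            "number_of_shards": SHARDS_BY_SIZE[size],
            "number_of_replicas": replicas,
        }
    }


name = index_name(42)
body = index_settings("L")
```

Because the number of shards is fixed when an index is created, the size class has to be chosen before the crawl's data is written.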

Page 24: OnCrawl ElasticSearch Meetup France #12

Saltstack

• Cluster with members having roles: master vs minions
• Each minion can be fully administered through the master
• Minions ask master for enrollment
• Administrator on master can either accept or decline minions
• Once minion is accepted, can be fully operated remotely

Page 25: OnCrawl ElasticSearch Meetup France #12

Saltstack

• A set of « recipes » defines what states are made of, and how to get there
• Recipes can use « jinja » templating so variable parts of configuration files can be rendered at deployment time
• Minions can have their role defined by several means:
  – grains defined on the minion
  – deployment specific rules, defined in « the pillar »
• Within Oncrawl, saltstack is used:
  – To maintain indices templates (config/templates/*json)
  – To maintain elasticsearch clusters, nodes and shards allocation (config/settings.yml)
  – To deploy the elasticsearch cluster, the hadoop cluster, staging and prod servers
• Deploy anything, anywhere (Droplets @ Digital Ocean, VMs @ Vultr, Instances @ AWS, dedicated servers @ OVH)
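
Rendering the variable parts of a configuration file at deployment time can be illustrated in plain Python. The sketch uses the standard library's `string.Template` as a stand-in for jinja, and plain dictionaries as stand-ins for the pillar and the grains; the cluster name, minion id and `roles` values are hypothetical, and the real Oncrawl recipes are not shown in the talk.

```python
from string import Template

# Fragment of an elasticsearch.yml file with the variable parts as placeholders.
ES_CONFIG_TEMPLATE = Template(
    "cluster.name: $cluster_name\n"
    "node.name: $node_name\n"
    "node.master: $is_master\n"
    "node.data: $is_data\n"
)


def render_config(pillar, grains):
    """Render the configuration of one node from deployment data.

    `pillar` holds deployment-wide values and `grains` holds values
    attached to the minion, including its roles.
    """
    roles = grains["roles"]
    return ES_CONFIG_TEMPLATE.substitute(
        cluster_name=pillar["cluster_name"],
        node_name=grains["id"],
        is_master=str("master" in roles).lower(),
        is_data=str("data" in roles).lower(),
    )


# Hypothetical deployment data for a data-only node.
config = render_config(
    {"cluster_name": "oncrawl-prod"},
    {"id": "es-01", "roles": ["data"]},
)
```

The same template then serves every node of the cluster; only the data attached to each minion changes.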

Page 26: OnCrawl ElasticSearch Meetup France #12

Thank you!

Follow us:
- @tuxnco (me)
- @cogniteev (company)
- @oncrawl (product)

Part of the gang

Any questions?