13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB...

27
Elastic Stack in A Day Milano – 16 Giugno 2016 REVOLUTIONIZE THE WAY PEOPLE GET JOBS WITH ELASTICSEARCH

Transcript of 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB...

Page 1: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

Elastic Stack in  A  DayMilano  – 16  Giugno  2016

REVOLUTIONIZE  THE  WAY  PEOPLE  GET  JOBS  WITH  ELASTICSEARCH

Page 2: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

COMPANY WEBSITEwww.jobrapido.com

ROLEHead of Technology @ Jobrapido

[email protected]

TWITTER@totovadacca

LINKEDINhttps://it.linkedin.com/in/salvatorevadacca

NAMESalvatore Vadacca

ABOUT ME

Page 3: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

WHO WE ARE

WEBSITES IN 58 COUNTRIESHead office Milan + office in Amsterdam

Jobrapido is the world's leading jobsearchengine that analyses and collects all job posts on the web, giving jobseekers all offers available, ordered for relevance

based on the search they’ve done

Analysis

AggregationResponse

UNIQUE VISITORS35 Mio Uvs / month

SUBSCRIBERS60+ Mio subs users (current stock)

PAGEVIEWS / CLICKS*280 Mio PVs / month & 130 Mio clicks / month

JOBS20+ Mio jobs at any given time

VISITORS1.0 BN visits / year

PEOPLE100+

Page 4: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

OUR ELASTIC NUMBERS (MAIN CLUSTER)

DOCUMENTS>200 M

DATA1TB

AVG SEARCH RATE20K/sec

AVG INDEX RATE400/sec

MEMORY>1TB

NODES25 data, 3 masters, 65 client

CURRENT VERSION1.7.3

Page 5: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

OUR SEARCH PAGE

Page 6: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

MOBILE APP

MY  SEARCHESMY  JOBS MENU

CNT  SELECTIONSIGN  UP

SIGN  IN

Page 7: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

WHERE WE ARE

Page 8: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

THE NEED FOR A NEW SEARCH ENGINE

• Result  sorting  limited  to  CPC  and  publish  date• Debug  and  troubleshooting  nearly  impossible• Exact  match  was  the  only  option  • Slow  reindex time  (up  to  10  days)• Custom  and  inaccurate  language  analysis• No  high  availability• Hard  to  scale

Page 9: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

MULTI-LANGUAGE SUPPORT

ITALIAN

SPAN

ISH

FREN

CH

ENGLISH

POLISH

HU

NG

ARI

AN

JAPA

NES

E

GERMAN

ROMANIAN

PORTUGUESE

SWEDISH

RUSSIAN

DANISH

DU

TCH

CZE

CH

TURK

ISH

KORE

AN

CHINESE

Page 10: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

JOB INDICES

• One  index  per  country• Two  aliases  with  filter  (organic  vs.  sponsored  jobs)• Each  index  implements  country  and  language-­‐specific  analysers

• A  country  may  support  more  than  one  language• e.g.,  Canada,  Switzerland,  etc.

IT

CH FR

UK

DEUSIT.JOBRAPIDO.COM

CH.JOBRAPIDO.COM

DE.JOBRAPIDO.COM

UK.JOBRAPIDO.COM

FR.JOBRAPIDO.COM

US.JOBRAPIDO.COM

Page 11: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

THE ANATOMY OF AN ANALYZER

• Strip  HTML• Tokenization• Lowercase• Stopwords

1. _german_,  _french,  _english_,  …  (built-­‐in)2. Language-­‐specific  (file)3. Country-­‐specific  (file)

• Stemming1. light_german,  light_french,  english,  …  (built-­‐in)2. Language-­‐specific  exceptions  (file)3. Country-­‐specific  exceptions  (file)

• Language-­‐specific  filters  (e.g.,  elision,  possessive)  • Synonyms• Shingles

Page 12: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

MULTI-LANGUAGE PROPERTIES

BODY

GERMAN

STANDARD

ENGLISH

FRENCH

ITALIAN

SHINGLE

HTML STRIP

CHAR FILTER

STANDARD LOWERCASE, GERMAN STOPWORDS, GERMAN STEMMER

TOKENIZER FILTER

HTML STRIP STANDARD STANDARD, LOWERCASE, GERMAN STOPWORDS

HTML STRIP STANDARD POSSESSIVE, LOWERCASE, ENGLISH STOPWORDS, ENGLISH STEMMER

HTML STRIP STANDARD ELISION, LOWERCASE, FRENCH STOPWORDS, FRENCH STEMMER

HTML STRIP STANDARD ELISION, LOWERCASE, ITALIAN STOPWORDS, ITALIAN STEMMER

HTML STRIP STANDARD LOWERCASE, GERMAN STOPWORDS, GERMAN STEMMER, SHINGLE

Page 13: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

MULTI-LANGUAGE PROPERTIES"query": {

"filtered": {"query": {

"bool": {"must": {

"multi_match": {"query": "product manager","fields": [

"body^3", //german"body.standard^3”,"body.english^2”,"body.french^1”,"body.italian^1”

],"type": "most_fields","operator": "AND”

}}

}},"filter": { … }

}}

Application-side configurationsallow us to define search fieldsand their individual boost

We constantly run A/B-test toimprove matching rate and tunerelevance

Page 14: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

SITEMAPS: PERCOLATORS AND AGGREGATIONS

Page 15: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

SITEMAP BY JOB TITLES

• Industry  is  an  information  you  cannot  easily  find  in  structured  documents• Only  few  websites  explicitly  show  job  titles  and  industry

• What  if  we  build  a  taxonomy  of  job  titles/industry  represented  by  queries?• That  would  allow  enriching  documents  at  index  time  by  means  of  percolators

Page 16: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

JOURNEY OF A JOB DOCUMENT

CRAWLER CRAWLED JOB DOCUMENT

ELASTICSEARCH

.PERCOLATOR

JOBS

ENRICHED JOB DOCUMENT

SITEMAP

CRAWL

WEBSITE

Page 17: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

PERCOLATOR EXAMPLE"query": {

"filtered": {"query": {

"bool": {"must": {

"multi_match": {"query": "Account Director","fields": [

"headline^2","headline.standard^1","body^1","body.standard^1","company_name^1”

],"type": "most_fields","operator": "AND”

}}

}},"filter": {

"bool": {…

}}

},"jobtitle": "Account Director","sector": "Sales"

}

Percolator is a standard query(multi match in search fields)

Jobtitle and sector are attachedto the query and indexedtogether with the document(nested)

Page 18: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

PROS AND CONS

• Live  document  enrichment  (+)• Job  classification  based  on  keywords  (+)• Aggregate  by  industry  and  sub-­‐aggregate  by  location  (+)

• Slower  reindex  time  (-­‐)• Reindex  all  10x  slower

• Aggregations  are  heavy  (-­‐)• Caching  required• Inaccurate  since  the  population  is  dynamic

• Try  to  be  consistent  with  your  queries• e.g.,  percolators  do  not  support  min_score,  whereas  queries  do

Page 19: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

WHAT’S NEXT

• Sitemaps  change  frequently• Job  import  and  lifecycle  cause  link  churn

• Sitemaps  are  heavy• Tons  of  jobtitle  and  locations

• Google  periodically  crawls  sitemaps• Google  allows  pushing  sitemap  changes• We  do  not  want  to  push  unstable  changes

Page 20: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

SITEMAP CHANGES

S1 S2 SXS3 SYΔ1 Δ2 ΔX ΔY

MEMORY

DAY N

DAY 1 CHANGES DAY N CHANGES

DAY 1

ELASTICSEARCH

WEEK 1

WEEKLY ALIAS

CHANGES ANALYZER

AGGREGATE

CHANGES SITEMAPPUSHINDEX

Page 21: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

LOCATIONS UNITED KINGDOM

SCOTLAND ENGLAND WALES NORTHERN IRELAND

NORTH EAST ENGLAND NORTH WEST ENGLAND SOUTH EAST ENGLAND SOUTH WEST ENGLAND… …

BERKSHIRE SURREY LONDON KENT… …

NORTH EAST LONDON NORTH LONDON WEST LONDON SOUTH EAST LONDON… …

LONDON (GREENWICH) LONDON (LEWISHAM) LONDON (BROMLEY) LONDON (SOUTHWARK)… …

LONDON (CHARLTON) LONDON (WOOLWICH) LONDON (ELTHAM) LONDON (MAZE HILL)… …

/UNITED KINGDOM/ENGLAND/SOUTH EAST ENGLAND/LONDON/SOUTH EAST LONDON/LONDON (GREENWICH)/LONDON (WOOLWICH)/UNITED KINGDOM/ENGLAND/SOUTH EAST ENGLAND/LONDON/SOUTH EAST LONDON/LONDON (GREENWICH)/LONDON (CHARLTON)

Page 22: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

LOCATION MAPPING

LOCATION

CANONICAL NAME

CANONICAL PATH

GEO COORDINATES

LOCATION DEPTH

ORGANIC PATH

SEARCH PATH

SPECIAL PATH

SYNONYMS

WEAK SYNONYMS

LOCATION

LONDON

/UNITED KINGDOM/ENGLAND/SOUTH EAST ENGLAND/LONDON

POINT (-0.130714000141, 51.498555)

3

/UNITED KINGDOM/ENGLAND/SOUTH EAST ENGLAND/LONDON

{…}

/LONDON

LONDRES, LONDRA, GREATER LONDON, SE1 1PP, EC2A 4JU, …

[]

Page 23: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

LOCATION SEARCH AND INDEXING

WHERE LOCATIONS

CRAWLED LOCATION

CRAWLED TEXT

LOCATIONS JOB

QUERY

SYNONYMS +

WEAK SYNONYMS

SEARCH PATH+

GEO COORDINATES

CANONICAL PATH+

GEO COORDINATES

PATH HIERARCHY ANALYZER

LOCATION DOMAIN (e.g., LONDON) PATH DOMAIN (e.g., /UNITED KINGDOM/ENGLAND/SOUTH EAST ENGLAND/LONDON)

SEARCHSEARCH

INDEX

Page 24: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

LOCATION SEARCH"or": {

"filters": [{

"terms": {"location": [

"/hartsville, sc","/united states/southern united states/south atlantic/south carolina/darlington county, sc/hartsville, sc”

],"_cache": false

}},{

"geo_shape": {"geo_coordinates": {

"indexed_shape": {"id": "443498","type": "location","index": "us_geo_shapes","path": "geo_coordinates"

},"relation": "within”

},"_cache": false

}}],"_cache": true,"_cache_key": "443498"

}

We search locations by path andcoordinates

Caching is performed only on theor filter (sub-filters always dependon it and we may avoid caching)

Cache keys allow saving memory

Page 25: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

CONCLUSIONS

• Jobrapido  covers  58  countries  and  18  languages• Percolations  and  aggregations  allow  for  document  enrichment  and  dynamic  sitemap  creation• Pipeline  aggregations  ease  push  of  significant  sitemap  changes  to  Google• Path  hierarchies  to  cleverly  represent  location  structure• Index  and  search  like  a  pro  J

• Documentation:  https://www.elastic.co/guide/index.html• Training:  thanks  Luca  and  KarelJ• Book:  Elasticsearch – The  Definitive  Guide• Support:  Jobrapido  is  a  proud  platinum  customer  (production  and  development  advice)  –thanks  Antonio  J

Page 26: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

THANK  YOU

http://corporate.jobrapido.com

Page 27: 13 Jobrapido-elastic in a day 2016...OUR ELASTIC NUMBERS (MAIN CLUSTER) DOCUMENTS >200 M DATA 1TB AVG SEARCH RATE 20K/sec AVG INDEX RATE 400/sec MEMORY >1TB NODES 25 data, …

Q&A