Tom …€¢ 40+&different&web&services&(search,news,email,media,…) • Lukas&Putna, Tomas&...

36
Tomas Komenda, Lukas Putna, Miroslav Kvasnica, Seznam.cz. Solr: How to index billion phrases from MySQL and HBase


Page 1:

Tomas Komenda, Lukas Putna, Miroslav Kvasnica, Seznam.cz

Solr: How to index billion phrases from MySQL and HBase

Page 2:

Who we are

• PPC ads, AdWords competitor in CZ
• Web portal, search engine in the Czech Republic
• 40+ different web services (search, news, email, media, …)
• Lukas Putna, Tomas Komenda, Miroslav Kvasnica
• Senior developers, team leaders, trainers
• MySQL, HBase, Hadoop, Impala, Solr, Hive

Page 3:

What Sklik.cz is

• Advertising data + daily statistics
• Provides real-time searching, aggregation, filtering and analytics

Advertising hierarchy

Account
  Campaign, Campaign, …
    Group, Group, …
      Keywords, Ads, Retargeting, Placements, …
        queries, urls, …

Page 4:

Sklik.cz and data

• Advertising data + daily statistics
• Provides real-time searching, aggregation, filtering and analytics

Account example
• 10M keywords (phrases, each has a list of queries)
• 120 statistical values per keyword per day

With date filter of 1 year
• 42 billion of values
• 3 aggregated sum rows
• 10 GB

Full-text searching has to return results for a sub-term of one or more characters (such a term can be a prefix, infix, or postfix) => billions of combinations per account
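To get a feel for why sub-term search inflates the index, here is a minimal Python sketch. It is not from the slides; the function names and the sample phrase are illustrative assumptions. It enumerates every prefix, infix, and postfix of one phrase and compares the count with the n(n+1)/2 upper bound.

```python
from __future__ import annotations


def sub_terms(phrase: str, min_len: int = 1) -> set[str]:
    """Return every distinct contiguous substring (prefix, infix, postfix) of phrase."""
    n = len(phrase)
    return {
        phrase[start:end]
        for start in range(n)
        for end in range(start + min_len, n + 1)
    }


def max_sub_terms(length: int) -> int:
    """Upper bound on the number of substrings of a string with `length` characters."""
    return length * (length + 1) // 2


if __name__ == "__main__":
    phrase = "bulldog"  # illustrative phrase, not taken from real Sklik data
    print(len(sub_terms(phrase)), "distinct sub-terms; upper bound", max_sub_terms(len(phrase)))
```

The count grows quadratically with phrase length, so millions of keywords per account quickly produce a very large number of indexable terms.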

Page 5:

Sklik data (database) ecosystem


Page 6:

Full-text search technologies

Elasticsearch, Apache Solr, Sphinx, SRCH2

Page 7:

We used Sphinx

• Free open source search server written in C++, lightweight and powerful, SQL friendly
• Sphinx can be used as a stand-alone server or as a storage engine (SphinxSE, for MySQL and its forks)
• We used Sphinx at a time when automatic scaling was not well supported
• One Sphinx instance per database shard; the application has to decide which instance to use
• Fast searching, easy configuration
• Our data and requirements (index complexity) grew fast => eventually it was no longer possible to index the data
• We chose Solr because of our Hadoop ecosystem and existing HBase indexers

Page 8:

We considered: Apache Solr and Elasticsearch

• Both are open-source, high-performance, full-featured text search tools (engines or even databases)
• Both have a distributed version
• Both are built on Apache Lucene (and extend it)
• Both are very popular all over the world
• Elasticsearch is probably better known and more popular in the Czech Republic

Page 9:

Apache Solr

Brief Introduction

Page 10:

Apache Solr – introduction

• Solr is an open source enterprise search server
• Solr was created by Yonik Seeley in 2004
• Current version is 5.5
• Uses the Lucene library and extends it
• Provides HTTP interface (XML, JSON, CSV, binary)
• Since 2012, Solr has had a distributed version, SolrCloud (Hadoop integration)

Page 11:

Apache Solr – key features

• Advanced full-text search capabilities
• Optimized for high volume web traffic
• Batch full and delta indexing, near real-time updating (streaming via Apache Flume/Kafka, soft commits)
• Adaptable with XML configuration
• Extensible plugin architecture
• Linearly scalable, auto index replication (Hadoop integration)
• Comprehensive web administration interface, statistics, …
• A lot of specialized queries: faceted search, ordering, grouping, pseudo-join, spatial search, functions, …

Page 12:

Apache Solr - architecture

Source: Jan Hoydahl, "Migrating Fast to Solr" presentation, published on Mar 5, 2010, http://www.slideshare.net/janhoy/migrating-fast-to-solr

Page 13:

Apache Solr – architecture from data flow point of view

[Diagram: data flow through Solr. Sources: MySQL, HBase, Flume. Import/Update: JSON, XML, CSV, … Indexing: analyzer, tokenizer, filter; index writer. Index. Searching: query parser and analyzer; index searcher.]

Page 14:

Apache Solr – architecture from data flow point of view

[Diagram repeated from page 13: data flow through Solr from MySQL, HBase, and Flume via import and indexing to the index, and from the index to searching.]

Page 15:

Apache Solr – Data model and hierarchy

[Diagram: Solr data model and hierarchy for indexing and querying.]

Solr instance (solr.xml)
  Core/Index, Core/Index, Core/Index (solrconfig.xml, schema.xml)
    Documents
      Field, Field, Field

Page 16:

Apache Solr – schema.xml and fields definition

<schema name="Seznam Sklik Campaigns" version="1.5">
  ...
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
  <fieldType name="bool" class="solr.BoolField" sortMissingLast="true" />
  <fields>
    <field name="_version_" type="long" indexed="true" stored="true"/>
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="name" type="text" indexed="true" stored="true" required="true" />
    <field name="userId" type="int" indexed="true" stored="false" required="true" />
    <field name="nameSimple" type="simpleText" indexed="true" stored="false" required="false"/>
    <copyField source="name" dest="nameSimple" />
  </fields>
  <uniqueKey>id</uniqueKey>
  ...
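To make the indexed/stored distinction concrete, the following Python sketch parses a closed-off copy of the field definitions above and reports which fields are searchable and which are returned in results. The helper names are illustrative assumptions, and the snippet is a trimmed, well-formed stand-in for the real schema.xml, not the full file.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide's field definitions, closed so that it parses on its own.
SCHEMA_SNIPPET = """
<schema name="Seznam Sklik Campaigns" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="name" type="text" indexed="true" stored="true" required="true"/>
    <field name="userId" type="int" indexed="true" stored="false" required="true"/>
    <field name="nameSimple" type="simpleText" indexed="true" stored="false" required="false"/>
    <copyField source="name" dest="nameSimple"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>
"""


def summarize_fields(schema_xml):
    """Map each field name to its type and its indexed/stored flags."""
    root = ET.fromstring(schema_xml.strip())
    return {
        field.get("name"): {
            "type": field.get("type"),
            "searchable": field.get("indexed") == "true",  # can appear in queries
            "returned": field.get("stored") == "true",     # original value comes back in results
        }
        for field in root.iter("field")
    }


def copy_targets(schema_xml):
    """Map each copyField source to its destination field."""
    root = ET.fromstring(schema_xml.strip())
    return {cf.get("source"): cf.get("dest") for cf in root.iter("copyField")}


if __name__ == "__main__":
    for name, info in summarize_fields(SCHEMA_SNIPPET).items():
        print(name, info)
    print("copyField:", copy_targets(SCHEMA_SNIPPET))
```

In this schema, nameSimple is indexed but not stored: it exists only to be searched, and it is filled from name through the copyField rule.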

Page 17:

Apache Solr – Core overview via Admin

Page 18:

Apache Solr – Core overview via Admin

Page 19:

Apache Solr – architecture from data flow point of view

[Diagram repeated from page 13: data flow through Solr from MySQL, HBase, and Flume via import and indexing to the index, and from the index to searching.]

Page 20:

Apache Solr – Indexing and updating

• Configuration in solrconfig.xml, schema.xml and data-config.xml
• Request handlers, update handlers, update processor chain, Data Import Handler
• Index operations: add, delete, optimize, commit, rollback, …
• Atomic updates, auto commit, soft and hard commit, transaction log for recovery scenarios
• Near real-time indexing, batch (full and delta) indexing

[Diagram: indexing paths into the Lucene index. Update handlers receive XML, CSV, JSON (and PDF, Word, ...) over HTTP POST. The Data Import Handler pulls from a database (MySQL) or an RSS feed and applies simple transformations. Each handler passes documents through its own update processor chain before they reach the index.]
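The push path through an update handler can be sketched in Python as follows. This is a hedged illustration rather than the authors' tooling: the host, collection name, and document values are hypothetical, and the request is only built, not sent. It assumes Solr's JSON update endpoint at /update with the softCommit or commit request parameter.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_update_request(base_url, collection, docs, soft_commit=True):
    """Build (but do not send) an HTTP POST that adds documents to a collection.

    soft_commit=True asks for a soft commit (near real-time visibility);
    soft_commit=False asks for a hard commit instead.
    """
    params = urlencode({"softCommit": "true"} if soft_commit else {"commit": "true"})
    url = f"{base_url}/solr/{collection}/update?{params}"
    body = json.dumps(docs).encode("utf-8")
    return Request(url, data=body, headers={"Content-Type": "application/json"}, method="POST")


if __name__ == "__main__":
    # Hypothetical host and documents, for illustration only.
    docs = [{"id": "1!100", "name": "black bulldogs", "userId": 1}]
    request = build_update_request("http://localhost:8983", "test_collection", docs)
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would send it; that step is left out so the sketch runs without a Solr server.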

Page 21:

Apache Solr – data-config.xml and MySQL

...
<dataSource name="node1" type="JdbcDataSource" driver="com.mysql.jdbc.Driver" batchSize="-1"
            url="jdbc:mysql://skdb012.ng.seznam.cz/sklik_node" user="sklik_ro" password="…"/>
…
<dataSource name="node12" type="JdbcDataSource" driver="com.mysql.jdbc.Driver" batchSize="-1"
            url="jdbc:mysql://skdb053.ng.seznam.cz/sklik_node" user="sklik_ro" password="…"/>
…
<document name="keywords">
  <entity name="keyword" dataSource="node1" query="
    SELECT CONCAT_WS('!', c.user_id, k.id) AS id,
           CAST(kl.name AS CHAR(255)) AS name,
           cu.url,
           g.id AS groupId,
           c.id AS campaignId,
           c.user_id AS userId
    FROM keyword k
    JOIN `group` g ON g.id = k.group_id
    JOIN campaign c ON c.id = g.campaign_id
    JOIN user u ON u.id = c.user_id
    JOIN sklik_common.keyword_lexicon kl ON k.keyword_lexicon_id = kl.id
    LEFT JOIN sklik_common.v_url cu ON k.url_id = cu.id
    WHERE u.serviced = 0
      AND ('${dih.request.user_ids}' = '' OR c.user_id IN (${dih.request.user_ids}))
      AND ('${dih.request.from}' = '' OR k.id >= '${dih.request.from}')
      AND ('${dih.request.to}' = '' OR k.id &lt; '${dih.request.to}')
      AND ('${dih.request.from_timestamp}' = '' OR k.index_date >= '${dih.request.from_timestamp}')
  " />
...
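The ${dih.request.*} placeholders in the query are filled from the parameters of the import request. A minimal Python sketch of building such a request URL is shown below. The host, core name, and the /dataimport handler path are assumptions (the path depends on how the handler is registered in solrconfig.xml); the parameter names user_ids, from, to, and from_timestamp are the ones the query above reads.

```python
from urllib.parse import urlencode


def build_dih_url(base_url, core, user_ids=(), id_from="", id_to="", from_timestamp=""):
    """Build a Data Import Handler full-import URL with the custom request parameters.

    Empty strings are sent as-is, because the SQL compares each parameter with ''.
    """
    params = {
        "command": "full-import",
        "clean": "false",   # keep documents that are already in the index
        "commit": "true",
        "user_ids": ",".join(str(uid) for uid in user_ids),
        "from": id_from,
        "to": id_to,
        "from_timestamp": from_timestamp,
    }
    return f"{base_url}/solr/{core}/dataimport?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host, core, and ID range, for illustration only.
    print(build_dih_url("http://localhost:8983", "keywords",
                        user_ids=[43932, 43933], id_from="1", id_to="100000"))
```

Splitting a large import into key ranges with from and to is one way to run several smaller imports instead of a single long one.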

Page 22:

Apache Solr – data-config.xml via Admin

Page 23:

Apache Solr – Import via Admin

Page 24:

Apache Solr – schema.xml and indexing

...
<types>
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
  <fieldType name="simpleText" class="solr.TextField" sortMissingLast="true">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory" mapping="foldToASCII.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="cz.seznam.sklik.solrconf.ModerateNGramFilterFactory" minGramSize="2" maxGramSize="512"/>
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
...
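The index-time chain for simpleText can be approximated in a few lines of Python. This is only a sketch under stated assumptions: ModerateNGramFilterFactory is Seznam's own class and its exact behavior is not shown in the slides, so a plain n-gram generator stands in for it, and Unicode decomposition stands in for the ASCII-folding mapping file.

```python
import unicodedata


def fold_to_ascii(text):
    """Strip diacritics, roughly what an ASCII-folding char filter does."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ngrams(token, min_gram=2, max_gram=512):
    """All substrings of token with length between min_gram and max_gram."""
    grams = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


def analyze_for_index(text):
    """Fold to ASCII, lowercase, split on whitespace, then expand each token to n-grams."""
    tokens = fold_to_ascii(text).lower().split()
    return [gram for token in tokens for gram in ngrams(token)]


if __name__ == "__main__":
    print(analyze_for_index("Černý pes"))
```

With minGramSize="2", single-character sub-terms are not indexed by this chain, which keeps the term count lower than indexing every substring.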

Page 25:

Apache Solr – Import stats via Admin

Page 26:

Apache Solr – architecture from data flow point of view

[Diagram repeated from page 13: data flow through Solr from MySQL, HBase, and Flume via import and indexing to the index, and from the index to searching.]

Page 27:

Apache Solr – schema.xml and searching

...
<types>
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
  <fieldType name="simpleText" class="solr.TextField" sortMissingLast="true">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory" mapping="foldToASCII.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="cz.seznam.sklik.solrconf.ModerateNGramFilterFactory" minGramSize="2" maxGramSize="512"/>
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
...
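The query analyzer has no n-gram filter, so a query term is looked up as a whole against the grams written at index time. The Python sketch below illustrates that asymmetry with a tiny in-memory inverted index. It is an assumption-laden toy, not Solr's implementation: the helper names are invented, all document IDs except 43932!99192 are hypothetical, and multi-term queries are combined with AND purely for simplicity.

```python
from collections import defaultdict


def index_grams(token, min_gram=2):
    """Index-time expansion: every substring of the token with at least min_gram characters."""
    return {
        token[start:end]
        for start in range(len(token))
        for end in range(start + min_gram, len(token) + 1)
    }


def build_index(docs):
    """Inverted index: gram -> set of document IDs whose name contains that gram."""
    index = defaultdict(set)
    for doc_id, name in docs.items():
        for token in name.lower().split():
            for gram in index_grams(token):
                index[gram].add(doc_id)
    return index


def search(index, query):
    """Query-time analysis is only lowercase + whitespace split; each term is an exact lookup."""
    hits = None
    for term in query.lower().split():
        matches = index.get(term, set())
        hits = matches if hits is None else hits & matches  # AND semantics (assumption)
    return hits or set()


if __name__ == "__main__":
    docs = {"43932!99192": "Black bulldogs", "1!2": "Hot dog stand", "1!3": "Cat food"}
    index = build_index(docs)
    print(search(index, "dog"))
```

Because the infix "dog" was stored as a gram of "bulldogs", a plain term lookup finds it without a wildcard query.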

Page 28:

Apache Solr – querying

http://sksolr2.ng.seznam.cz:8983/solr/test_collection/select?q=name:*dog?&start=10&rows=5&wt=xml

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">510</int>
    <lst name="params">
      <str name="q">name:*dog*</str>
      <str name="indent">true</str>
      <str name="start">10</str>
      <str name="rows">1</str>
      <str name="wt">xml</str>
      <str name="_">1458818597860</str>
    </lst>
  </lst>
  <result name="response" numFound="82" start="10" maxScore="1.0">
    <doc>
      <str name="name">Black bulldogs</str>
      <str name="id">43932!99192</str>
      <long name="_version_">1529083809145815041</long>
    </doc>
  </result>
</response>
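A request like the one above can be assembled and its XML response read with the Python standard library. The sketch below is illustrative only: the function names and the localhost URL are assumptions, and the sample response is a trimmed copy of the slide's output rather than a live result.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Trimmed copy of the response shown on the slide.
SAMPLE_RESPONSE = """
<response>
  <result name="response" numFound="82" start="10" maxScore="1.0">
    <doc>
      <str name="name">Black bulldogs</str>
      <str name="id">43932!99192</str>
      <long name="_version_">1529083809145815041</long>
    </doc>
  </result>
</response>
"""


def build_select_url(base_url, collection, query, start=0, rows=10, wt="xml"):
    """Build a /select URL; urlencode takes care of escaping ':' and '*' in the query."""
    params = urlencode({"q": query, "start": start, "rows": rows, "wt": wt})
    return f"{base_url}/solr/{collection}/select?{params}"


def parse_xml_response(xml_text):
    """Return (numFound, list of documents as name -> text dictionaries)."""
    root = ET.fromstring(xml_text.strip())
    result = root.find("result")
    docs = [{child.get("name"): child.text for child in doc} for doc in result.findall("doc")]
    return int(result.get("numFound")), docs


if __name__ == "__main__":
    print(build_select_url("http://localhost:8983", "test_collection", "name:*dog*", start=10, rows=1))
    print(parse_xml_response(SAMPLE_RESPONSE))
```

Note that numFound counts all matches, while start and rows only select the page of documents that is returned.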

Page 29:

Apache Solr – querying

../select?q=keyword:hire~0.7&fq=avgCpc:[600+TO+*]+OR+competition:[0.6+TO+*]^2&sort=sum(count,competition)+desc&start=5&wt=json&indent=true

{
  "responseHeader": {
    "status": 0,
    "QTime": 1493,
    "params": {
      "q": "keyword:hire~0.7",
      "indent": "true",
      "start": "5",
      "fq": "avgCpc:[600 TO *] OR competition:[0.6 TO *]^7",
      "sort": "sum(count, competition) desc",
      "wt": "json"}},
  "response": {"numFound": 21, "start": 5, "docs": [
    {
      "query": "fire",
      "count": 108,
      "competition": 0.62176,
      "avgCpc": 589.0909,
      "months": ["2015-03", "2015-04", "2015-05", ……

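The fuzzy query with a filter query and a function sort can be put together and its JSON response read as in the Python sketch below. The function names are illustrative assumptions, and the sample response is a hypothetical, shortened stand-in built from the few values visible on the slide (the truncated parts are not reconstructed).

```python
import json
from urllib.parse import parse_qs, urlencode

# Hypothetical, shortened response: one document with the values visible on the slide.
SAMPLE_RESPONSE = json.dumps({
    "response": {
        "numFound": 21,
        "start": 5,
        "docs": [{"query": "fire", "count": 108, "competition": 0.62176, "avgCpc": 589.0909}],
    }
})


def build_params(q, filters=(), sort=None, start=0, wt="json"):
    """Encode q plus any number of fq filters and an optional sort expression."""
    pairs = [("q", q), ("start", start), ("wt", wt)]
    pairs += [("fq", fq) for fq in filters]  # fq may be repeated; each one narrows the result set
    if sort:
        pairs.append(("sort", sort))
    return urlencode(pairs)


def top_queries(response_text, limit=3):
    """Return (query, count) pairs for the first documents in a JSON response."""
    docs = json.loads(response_text)["response"]["docs"]
    return [(doc["query"], doc["count"]) for doc in docs[:limit]]


if __name__ == "__main__":
    params = build_params(
        "keyword:hire~0.7",
        filters=["avgCpc:[600 TO *] OR competition:[0.6 TO *]"],
        sort="sum(count,competition) desc",
        start=5,
    )
    print(params)
    print(top_queries(SAMPLE_RESPONSE))
```

The fuzzy operator (~) is why a search for "hire" can return the query "fire": the two terms differ by a single character.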
Page 30:

Apache Solr – querying via Admin


Page 31:

Apache Solr – querying stats via Admin


Page 32:

SolrCloud

Brief Introduction

Page 33:

SolrCloud – introduction and architecture

• Distributed, automatic index replication, linearly scalable
• Hadoop and HDFS integration
• "Roughly" a CP system (with good availability), fault tolerant (HA, no single point of failure)
• Documents are routed by hashing the document ID to an integer (or by custom hashing); each shard covers a hash range
• All nodes in the cluster perform indexing and execute queries; there is no master node
• Terminology: ZooKeeper, node, collection, replication factor, shard, replica, leader
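Hash-range routing can be illustrated with a short Python sketch. It is a simplified model, not Solr's router: Solr hashes IDs with MurmurHash3, while CRC32 from the standard library is used here as a stand-in, and the function names are assumptions.

```python
import zlib


def shard_ranges(num_shards, bits=32):
    """Split the hash space [0, 2**bits) into num_shards contiguous ranges."""
    space = 1 << bits
    step = space // num_shards
    ranges = []
    for i in range(num_shards):
        low = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        high = space - 1 if i == num_shards - 1 else (i + 1) * step - 1
        ranges.append((low, high))
    return ranges


def route(doc_id, ranges):
    """Return the index of the shard whose hash range contains the document's hash."""
    doc_hash = zlib.crc32(doc_id.encode("utf-8"))  # unsigned 32-bit value; stand-in for MurmurHash3
    for shard, (low, high) in enumerate(ranges):
        if low <= doc_hash <= high:
            return shard
    raise ValueError("hash outside all shard ranges")


if __name__ == "__main__":
    ranges = shard_ranges(2)
    for doc_id in ("43932!99192", "43932!99193"):
        print(doc_id, "-> shard", route(doc_id, ranges))
```

Because the shard is a pure function of the ID, any node can work out where a document lives without asking a master node.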

[Diagram: example SolrCloud cluster. Four nodes (Java VMs running the Solr web app in Jetty, on ports 8984 and 8985) are spread over Server 1 and Server 2. The collection is split into shard1 and shard2 (sharding), and each shard has a leader and replicas (replication). Also shown: ZooKeeper (leader election), a balancer, and HDFS under each node.]

Page 34:

SolrCloud – cloud via Admin


Page 35:

SolrCloud – in Seznam.cz

• Two clusters (24 and 8 machines, with 4 backup machines), terabytes of indexes. We use Solr:
  • as a full-text search tool for filters on our client's website
  • as a keyword proposal tool (with stats) that supports creating and tuning customers' advertising
  • as storage for queries and their stats (publicly accessible via website and API) for our search engine
• We are generally satisfied: we are still fighting with optimal data scaling and query performance, but indexing and availability are very good

Page 36:

That is all! Questions?

Thank you for listening!

Thursday 12:50 PM @ Ballroom F: MySQL and Impala ecosystem