Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain


Transcript of Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

Page 1: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain
Page 2: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

Building a Large Scale SEO/SEM Application with Apache Solr
Rahul Jain
Freelance Big-data/Search Consultant
@rahuldausa | [email protected]

Page 3: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

About Me…
•  Freelance Big-data/Search Consultant based out of Hyderabad, India
•  Provide consulting services and solutions for Solr, Elasticsearch, and other Big data solutions (Apache Hadoop and Spark)
•  Organizer of two Meetup groups in Hyderabad
   •  Hyderabad Apache Solr/Lucene
   •  Big Data Hyderabad

Page 4: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

What I am going to talk about
Sharing our experience of working on search in this application…

•  The issues we faced and the lessons learned
•  How we do database import and batch indexing
•  Techniques to scale and improve search latency
•  The system architecture
•  Some tips for tuning Solr
•  Q/A

Page 5: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

What does the Application do
§  Keyword research and competitor analysis tool for SEO (Search Engine Optimization) and SEM (Search Engine Marketing) professionals
§  End users search for a keyword or a domain, and get all the insights about it.
§  Aggregates data for the top 50 results of Google and Bing across 3 countries for 80 million+ keywords.
§  Provides key metrics like keywords, CPM (cost per mille), CPC (cost per click), competitors' details, etc.

[Diagram: data pipeline. Data Collection (web crawling, ad networks, APIs, databases) feeds into Data Processing & Aggregation.]

*All trademarks and logos belong to their respective owners.

Page 6: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

Technology Stack

Page 7: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

High level Architecture

[Diagram: Internet traffic enters through a Load Balancer (HAProxy) to a Web Server Farm running a PHP app (Nginx), which calls an App Server (Tomcat). The App Server talks to a Cache Cluster (Redis) with a Managed Cache layer, a Database (MySQL), and a Search Head in front of multiple Apache Solr servers. A custom Cluster Manager built on ZooKeeper coordinates the cluster. On each request the App Server does an IDs lookup in the cache and fetches the cluster mapping to decide which month's cluster to query.]

Search Head:
•  Is a Solr server which does not contain any data.
•  Makes a distributed search request and aggregates the search results.
•  Also works as a load balancer for search queries.
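The slides do not show the search head in code. As an illustration, a distributed request of this shape can be expressed with SolrJ via the standard shards parameter; the host names, core names, and query field below are hypothetical:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchHeadQuery {
    public static void main(String[] args) throws Exception {
        // the "search head" is a plain Solr core holding no data of its own
        try (HttpSolrClient searchHead =
                 new HttpSolrClient.Builder("http://searchhead:8983/solr/keywords").build()) {
            SolrQuery q = new SolrQuery("keyword:insurance");
            // fan the request out to the data-bearing cores; in this architecture
            // the core list would come from the ZooKeeper-based Cluster Manager
            q.set("shards", "solr1:8983/solr/keywords_s1,solr2:8983/solr/keywords_s2");
            QueryResponse rsp = searchHead.query(q);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}
```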

Page 8: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

Search - Key challenges
§  After processing, we have ~40 billion records every month in the MySQL database, including:
   §  80+ million keywords
   §  110+ million domains
   §  1 billion+ URLs
§  Multiple keywords for a single URL, and vice versa
§  Multiple tables, varying in size from 50 million to 12 billion records
§  Search is a core functionality, so all records (selected fields) must be indexed in Solr
§  Page load time (including all 24 widgets, max) < 1.5 sec (critical)
§  But… we need to load this data only once every month for all countries, so we can do batch indexing; and as this data never changes, we can apply caching.

Page 9: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

Making Data Import and Batch Indexing Faster

Page 10: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

Data Import from MySQL to Solr
•  Solr's DataImportHandler is awesome but quickly becomes pretty slow for large volumes
•  We wrote our own custom data importer that reads (pulls) documents from the database and pushes them asynchronously into Solr.

[Diagram: the custom Data Importer sits between the database and several Solr servers. The source table has an indexed primary/unique key (ID) with rows 1, 2, …, 5000, then a non-sequential jump to 6000, and so on up to row n.]

Rather than using the database's "limit" function, the importer queries by range of IDs (the primary key). It fetches 10 batches at a time from the MySQL database, each having 2k records, and each call is stateless:

"select * from table t where ID >= 1 and ID <= 2000"
"select * from table t where ID >= 2001 and ID <= 4000"
………

The importer combines these database batches into a bigger batch (10k documents) and flushes it to the selected Solr servers asynchronously, in a round-robin fashion.

Downside:
•  We must have a primary key, and creating its index in the database can be slow.
•  This approach requires a larger number of calls if the IDs are not sequential (as with the jump from 5000 to 6000 above).
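A minimal sketch of this range-based importer in Java, assuming SolrJ's ConcurrentUpdateSolrClient for the asynchronous pushes. The table, columns, JDBC URL, and batch sizes echo the slide but are otherwise hypothetical, and the real importer round-robins over several Solr servers rather than one:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class RangeImporter {
    static final int DB_BATCH = 2_000;     // rows per SQL range query
    static final int SOLR_BATCH = 10_000;  // docs per flush to Solr

    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                 "jdbc:mysql://dbhost/seo", "user", "pass");
             // buffers adds and ships them to Solr on background threads
             ConcurrentUpdateSolrClient solr = new ConcurrentUpdateSolrClient
                 .Builder("http://solr1:8983/solr/keywords")
                 .withQueueSize(SOLR_BATCH)
                 .withThreadCount(4)
                 .build()) {

            List<SolrInputDocument> buffer = new ArrayList<>(SOLR_BATCH);
            long max = maxId(db);
            for (long lo = 1; lo <= max; lo += DB_BATCH) {
                // range scan on the indexed primary key instead of LIMIT/OFFSET,
                // so MySQL never re-scans rows it has already returned
                try (PreparedStatement ps = db.prepareStatement(
                         "select id, keyword, cpc from keywords where id >= ? and id <= ?")) {
                    ps.setLong(1, lo);
                    ps.setLong(2, lo + DB_BATCH - 1);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", rs.getLong("id"));
                            doc.addField("keyword", rs.getString("keyword"));
                            doc.addField("cpc", rs.getDouble("cpc"));
                            buffer.add(doc);
                        }
                    }
                }
                if (buffer.size() >= SOLR_BATCH) {
                    solr.add(buffer);  // queued; flushed asynchronously
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) solr.add(buffer);
            solr.commit();  // single hard commit at the end of the batch
        }
    }

    static long maxId(Connection db) throws SQLException {
        try (Statement st = db.createStatement();
             ResultSet rs = st.executeQuery("select max(id) from keywords")) {
            rs.next();
            return rs.getLong(1);
        }
    }
}
```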

Page 11: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

Batch Indexing

Page 12: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

Indexing All Tables into a Single Big Index
•  All tables in the same index, distributed over multiple Solr cores and Solr servers (Java processes)
•  Commit on every 120 million records or every 15 minutes, whichever comes earlier
•  Disabled soft-commit and updates (overwrite=false), as each call to addDocument calls updateDocument under the hood
•  But still… indexing was slow (due to being sequential over all the tables) and we had to stop it after 2 days.
•  Search was also awfully slow (on the order of minutes)

[Screenshot: a single big index spread over a bunch of shards, ~100; the query shown is served from cache, after warm-up.]
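For the overwrite=false point: in SolrJ it can be passed as a request parameter, which skips the delete-by-uniqueKey step that normally precedes every add. A minimal sketch, with hypothetical core and field names:

```java
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class BatchAddNoOverwrite {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://solr1:8983/solr/keywords").build()) {
            UpdateRequest req = new UpdateRequest();
            // overwrite=false skips the uniqueKey lookup/delete that each add
            // normally performs; safe here because batch-loaded records are new
            req.setParam("overwrite", "false");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "12345");
            doc.addField("keyword", "car insurance");
            req.add(doc);
            req.process(solr);
            // no soft commits during the load; hard commit on the cadence above
            solr.commit();
        }
    }
}
```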

Page 13: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

Creating a Distributed Index for each table

How many shards?
•  Each table has a varying number of records, from 50 million to 12 billion
•  If we choose 100 million per shard (core), it means that for 12 billion we need to query 120 shards: awfully slow.
•  On the other side, if we choose 500 million/shard, a table with 500 million records will have only 1 shard: high latency, high memory usage (heap), and no distributed search*.
•  Hybrid approach: determine the number of shards based on the max number of records in the table (see the table and the sketch below).
•  Did benchmarking to find the sweet spot for max documents (records) per shard with the most optimal search latency
•  Commit at the end for each table.

Records/Max Shards table:
   Max number of records in table      Max number of shards (cores) allowed
   < 100 million                       1
   100-300 million                     2
   < 500 million                       4
   < 1 billion                         6
   1-5 billion                         8
   > 5 billion                         16

*Distributed search improves latency but may not always be faster, as search latency is limited by the time taken by the last shard to respond.
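A small sketch of this hybrid sharding rule in Java; the thresholds are the ones from the slide's table:

```java
public final class ShardPolicy {
    // maps a table's record count to the max shards (cores) allowed
    public static int maxShards(long records) {
        if (records < 100_000_000L)    return 1;
        if (records <= 300_000_000L)   return 2;
        if (records < 500_000_000L)    return 4;
        if (records < 1_000_000_000L)  return 6;
        if (records <= 5_000_000_000L) return 8;
        return 16;
    }

    public static void main(String[] args) {
        System.out.println(maxShards(12_000_000_000L)); // 12 billion -> 16 shards
    }
}
```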

 

Page 14: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

It worked fine but one day suddenly….

java.lang.OutOfMemoryError: Java heap space

•  All Solr servers crashed.
•  We restarted them, but they kept crashing randomly every other day
•  Took a heap dump and realized that it was due to the field cache
•  Found a solution: doc values, and we have never had this issue again to date.

Page 15: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

Doc Values (Life Saver)
•  Disk-based field data, a.k.a. doc values
•  Document-to-value mapping built at index time
•  Stores field values on disk (or in memory) in a column-stride fashion
•  Better compression than the field cache
•  Replacement for the field cache (though not completely)
•  Quite suitable for custom scoring, sorting, and faceting
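Doc values are switched on per field in Solr's schema.xml. A minimal illustration with a hypothetical field; the type name is from the stock Solr 4.x example schema:

```xml
<!-- docValues="true" builds the column-stride, disk-resident structure at
     index time, so sorting/faceting on this field no longer fills the
     in-heap FieldCache -->
<field name="cpc" type="tfloat" indexed="true" stored="true" docValues="true"/>
```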

References:
•  Old article (but good): http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/
•  https://cwiki.apache.org/confluence/display/solr/DocValues
•  http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/doc-values.html

Page 16: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

Scaling and Making Search Faster…

Page 17: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

Partitioning
•  3-level partitioning: by month, country, and table name
•  Each month has its own cluster and a Cluster Manager.
•  Latency and throughput are a tradeoff; you can't have both at the same time.

[Diagram: requests come from the Internet through a Load Balancer to the web server farm and app servers. The app servers fetch the cluster mapping and make a request to a per-country search head (US, UK, AU) with the respective Solr cores for that country, month, and table. Each month has its own cluster of Solr nodes (e.g., Cluster 1 for January, Cluster 2 for February), each governed by an active/passive pair of Cluster Managers kept in real-time sync, under a Master Cluster Manager. One user loads 24 UI widgets, which issue 24 Ajax requests and fan out into 41 search requests.]
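The slides do not show the Cluster Manager internals. Since it is described as custom and built on ZooKeeper, one plausible shape is a znode per (month, country, table) whose payload lists the Solr cores to query; the sketch below is illustrative only, and the paths and payload format are invented:

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.ZooKeeper;

public class ClusterMappingReader {
    public static void main(String[] args) throws Exception {
        // the watcher fires when the mapping changes, e.g. when a staging
        // cluster is promoted to production after optimization
        ZooKeeper zk = new ZooKeeper("zk1:2181", 15_000, event -> { });
        // hypothetical payload: "solr1:8983/solr/keywords_s1,solr2:8983/solr/keywords_s2"
        byte[] data = zk.getData("/clusters/2014-01/US/keywords", true, null);
        String cores = new String(data, StandardCharsets.UTF_8);
        System.out.println("cores for Jan/US/keywords: " + cores);
        zk.close();
    }
}
```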

Page 18: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

Index Optimization Strategy
•  Running optimization on ~200+ Solr cores is very, very time consuming
•  Solr cores with a bigger index size (~70GB) have 2500+ segments, due to the higher merge factor used while indexing.
•  Can't be run in parallel on all cores of a single machine, as it is heavily dependent on CPU and disk IO
•  Optimizing large segments down to a very small number is very time consuming and can take up to 3x the index size on disk
•  On the other side, a small number of segments improves performance drastically, so we need to strike a balance.

*As per our observation for our data, the optimization process takes ~42-46 seconds per GB, and we need to do it for 4.6TB (including all boxes), the total Solr index size for a single month.

[Diagram: an Optimizer fetches the cluster mapping (the list of all cores) from the Staging Cluster Manager and works through the Solr nodes. Once optimization and cache warm-up are done, it pushes the cluster mapping to the Production Cluster Manager, making all indices live.]

Optimizing a Solr core down to a very small number of segments takes a huge amount of time, so we do it iteratively (see the sketch below).
Algorithm: choose at most 3 cores per machine to optimize in parallel, starting with the smallest index. For each core, use the index size, number of docs, and current segment count to determine the max segments allowed, and reduce the segment count to 0.90x of its current value in each run.
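A minimal SolrJ sketch of that iterative loop, assuming a hypothetical core URL and segment counts; SolrClient.optimize accepts a target maxSegments, and shrinking by ~10% per pass keeps each merge's temporary disk overhead bounded:

```java
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class IterativeOptimizer {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient core =
                 new HttpSolrClient.Builder("http://solr1:8983/solr/keywords_s1").build()) {
            int segments = 2500;  // current segment count, e.g. from core stats
            int target = 16;      // max segments allowed for this core's size
            while (segments > target) {
                // merge down to 90% of the current count per pass, rather than
                // forcing a single huge merge in one shot
                segments = Math.max(target, (int) (segments * 0.90));
                core.optimize(true, true, segments);
            }
        }
    }
}
```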

Page 19: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

Finally, after optimization and cache warm-up…

A shard looks like this.

[Screenshot: a shard's segment view, showing the max segments after optimization.]

Page 20: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

External Caching
•  In distributed search, for a repeated query request, all Solr servers need to be hit, even though the result is served from Solr's cache. This increases search latency while lowering throughput.
•  Solution: cache the most frequently accessed query results in the app layer (LRU-based eviction)
•  We use Redis for caching
•  All complex aggregation queries' results, once fetched from the multiple Solr servers, are served from the cache on subsequent requests.

Why Redis…
•  Advanced in-memory key-value store
•  Insanely fast
•  Response time on the order of 5-10ms
•  Provides cache behavior (set, get) with advanced data structures like hashes, lists, sets, sorted sets, bitmaps, etc.
•  http://redis.io/
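A minimal sketch of such an app-layer cache, assuming the Jedis client; the key scheme, TTL, and host are hypothetical, and LRU eviction itself would be configured on the Redis server (e.g. maxmemory-policy allkeys-lru):

```java
import redis.clients.jedis.Jedis;

public class QueryResultCache {
    private final Jedis redis = new Jedis("cachehost", 6379);

    public String search(String cacheKey) {
        String cached = redis.get(cacheKey);   // ~5-10ms on a hit
        if (cached != null) return cached;
        // miss: run the expensive distributed query, then cache the result
        String result = runDistributedSolrQuery(cacheKey);
        redis.setex(cacheKey, 24 * 3600, result);
        return result;
    }

    private String runDistributedSolrQuery(String cacheKey) {
        return "...aggregated response from the Solr search head...";
    }
}
```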

Page 21: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

Hardware
•  We use bare-metal, dedicated servers for Solr, for the reasons below:
   1. Performance gain (with virtual servers, performance dropped by ~18-20%)
   2. Better value of computing power per $ spent
•  2.6GHz, 32 core (4x8 core), 384GB RAM, 6TB SAS 15k (RAID10)
•  2.6GHz, 16 core (2x8 core), 192GB RAM, 4TB SAS 15k (RAID10)
•  Since the index size is 4.6TB/month, we want to cache more data in the disk cache, hence the bigger RAM.

SSD vs SAS
1. SSD: peak indexing rate (MySQL to Solr): 330k docs/sec (each doc ~100-125 bytes)
2. SAS 15k: 182k docs/sec (dropped by ~45%)
3. SAS 15k is quite a bit cheaper than SSD for bigger hard disks.
4. We are using SAS 15k as it is cost effective, but we have plans to move to SSD in the future.

Page 22: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

Conclusion: Key takeaways
General:
•  Understand the characteristics of the data and partition it well.
Cache:
§  Spend time analyzing cache usage, and tune the caches; a cache hit is 10x-50x faster.
§  Always use a filter query (fq) wherever possible, as it improves performance through the filter cache (see the sketch below).
GC:
§  Keep your JVM heap size at a lower value (proportional to the machine's RAM), leaving enough RAM for the kernel, as a bigger heap will lead to frequent GC. A 4GB to 8GB heap allocation is quite a good range, but we use 12GB/16GB.
§  Keep an eye on garbage collection (GC) logs, especially on full GCs.
Tuning params:
§  Don't use soft commit if you don't need it, especially in batch loading.
§  Always explore tuning Solr for high performance, e.g. ramBufferSizeMB, mergeFactor, and HttpShardHandler's various configurations.
§  Use hashes in Redis to minimize memory usage.
Read the whole experience for more detail: http://rahuldausa.wordpress.com/2014/05/16/real-time-search-on-40-billion-records-month-with-solr/
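To illustrate the fq advice, a minimal SolrJ sketch with hypothetical field names; the stable, repeated constraints go into filter queries so their cached bitsets in the filter cache are shared across requests:

```java
import org.apache.solr.client.solrj.SolrQuery;

public class FilterQueryExample {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("keyword:insurance"); // the scored part
        // country and month repeat across many requests; as fq they are
        // cached once in the filter cache and reused, not re-scored
        q.addFilterQuery("country:US");
        q.addFilterQuery("month:2014-01");
        q.setRows(10);
        System.out.println(q); // prints the encoded q/fq/rows parameters
    }
}
```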

Page 23: Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain

Thank you!
Twitter: @rahuldausa
[email protected]

http://www.linkedin.com/in/rahuldausa