Faceting optimizations for Solr


OCTOBER 13-16, 2015 • AUSTIN, TX

Faceting optimizations for Solr
Toke Eskildsen, Search Engineer / Solr Hacker
State and University Library, Denmark
@TokeEskildsen / te@statsbiblioteket.dk

3/55

Overview

Web scale at the State and University Library, Denmark

Field faceting 101
Optimizations: Reuse / Tracking / Caching / Alternative counters

4/55

Web scale for a small web

Denmark: consolidated circa 10th century, 5.6 million people

Danish Net Archive (http://netarkivet.dk): established 2005, 20 billion items / 590TB+ raw data

5/55

Indexing 20 billion web items / 590TB into Solr

Solr index size is 1/9th of the raw data = 70TB
Each shard holds 200M documents / 900GB
Shards are built chronologically by a dedicated machine
Projected 80 shards
Current build time per shard: 4 days
Total build time: 20 CPU-core years

So far only 7.4 billion documents / 27TB in index

6/55

Searching a 7.4 billion documents / 27TB Solr index

SolrCloud with 2 machines, each having:
16 HT-cores, 256GB RAM, 25 * 930GB SSD
25 shards @ 900GB
1 Solr/shard/SSD, Xmx=8g, Solr 4.10
Disk cache 100GB or < 1% of index size

7/55

8/55

String faceting 101 (single shard)

counter = new int[ordinals]
for docID: result.getDocIDs()
  for ordinal: getOrdinals(docID)
    counter[ordinal]++

for ordinal = 0 ; ordinal < counter.length ; ordinal++
  priorityQueue.add(ordinal, counter[ordinal])

for entry: priorityQueue
  result.add(resolveTerm(entry.ordinal), entry.count)

ord:     0  1  2  3     4  5  6  7  8
term:    A  B  C  D     E  F  G  H  I
counter: 0  3  0  1006  1  1  0  0  3
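
As a rough illustration (a sketch, not Solr's actual implementation), the counting loop above could look like this in Java. The hitOrdinals array stands in for getOrdinals(docID); resolving the surviving ordinals back to terms and sorting them are left out:

import java.util.PriorityQueue;

// Sketch of single-shard String faceting over term ordinals.
// hitOrdinals[i] holds the term ordinals of the i'th document in the result set,
// ordinals is the number of unique terms in the field.
static int[] topOrdinals(int[][] hitOrdinals, int ordinals, int topX) {
    int[] counter = new int[ordinals];              // one int per unique term
    for (int[] docOrds : hitOrdinals) {
        for (int ord : docOrds) {
            counter[ord]++;                         // tally every reference in the hits
        }
    }
    // Min-heap of size topX: visit every counter, keep only the X largest
    PriorityQueue<int[]> top = new PriorityQueue<>(
        (a, b) -> Integer.compare(a[1], b[1]));
    for (int ord = 0; ord < counter.length; ord++) {
        top.offer(new int[]{ord, counter[ord]});
        if (top.size() > topX) {
            top.poll();                             // drop the currently smallest entry
        }
    }
    // Ordinals of the top-X terms (unsorted); resolveTerm would map them to terms
    return top.stream().mapToInt(e -> e[0]).toArray();
}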

9/55

Test setup 1 (easy start)

Solr setup: 16 HT-cores, 256GB RAM, SSD
Single shard, 250M documents / 900GB

URL field: single String value, 200M unique terms

3 concurrent “users”, random search terms

10/55

Vanilla Solr, single shard, 250M documents, 200M values, 3 users

11/55

Allocating and dereferencing 800MB arrays

12/55

Reuse the counter

counter = new int[ordinals]
for docID: result.getDocIDs()
  for ordinal: getOrdinals(docID)
    counter[ordinal]++

for ordinal = 0 ; ordinal < counter.length ; ordinal++
  priorityQueue.add(ordinal, counter[ordinal])

<counter is no longer referenced and will be garbage collected at some point>

13/55

Reuse the counter

counter = pool.getCounter()
for docID: result.getDocIDs()
  for ordinal: getOrdinals(docID)
    counter[ordinal]++

for ordinal = 0 ; ordinal < counter.length ; ordinal++
  priorityQueue.add(ordinal, counter[ordinal])

pool.release(counter)

Note: The JSON Facet API in Solr 5 already supports reuse of counters
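
A minimal sketch of such a pool (my illustration, not the sparse-faceting or JSON Facet API code): hand out cleared int[] arrays and clear them again on release, so the 800MB allocation and the later garbage collection are taken off the query path:

import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of a counter pool: reuse int[ordinals] arrays between facet calls
// instead of allocating (and later garbage collecting) 800MB per request.
class CounterPool {
    private final int ordinals;
    private final ConcurrentLinkedQueue<int[]> free = new ConcurrentLinkedQueue<>();

    CounterPool(int ordinals) { this.ordinals = ordinals; }

    int[] getCounter() {
        int[] counter = free.poll();
        return counter != null ? counter : new int[ordinals];  // allocate only when the pool is empty
    }

    void release(int[] counter) {
        java.util.Arrays.fill(counter, 0);  // clear before returning it to the pool
        free.offer(counter);
    }
}

In this sketch the clearing happens on the releasing thread; doing it in a background thread hides the cost of zeroing 800MB from the next query.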

14/55

Using and clearing 800MB arrays

15/55

Reusing counters vs. not doing so

16/55

Reusing counters, now with readable visualization

17/55

Reusing counters, now with readable visualization

Why does it always take more than 500ms?

18/55

Iteration is not free

counter = pool.getCounter()
for docID: result.getDocIDs()
  for ordinal: getOrdinals(docID)
    counter[ordinal]++

for ordinal = 0 ; ordinal < counter.length ; ordinal++
  priorityQueue.add(ordinal, counter[ordinal])

pool.release(counter)

200M unique terms = 800MB (200M ints * 4 bytes), and the second loop visits every counter, even when only a few of them were updated.

19/55

Tracking updated counters

ord:     0 1 2 3 4 5 6 7 8
counter: 0 0 0 0 0 0 0 0 0
tracker: (empty)

20/55

Tracking updated counters

counter[3]++

ord:     0 1 2 3 4 5 6 7 8
counter: 0 0 0 1 0 0 0 0 0
tracker: 3

21/55

Tracking updated counters

counter[3]++ counter[1]++

ord:     0 1 2 3 4 5 6 7 8
counter: 0 1 0 1 0 0 0 0 0
tracker: 3 1

22/55

Tracking updated counters

counter[3]++ counter[1]++ counter[1]++ counter[1]++

ord:     0 1 2 3 4 5 6 7 8
counter: 0 3 0 1 0 0 0 0 0
tracker: 3 1

23/55

Tracking updated counters

counter[3]++ counter[1]++ counter[1]++ counter[1]++ counter[8]++ counter[8]++ counter[4]++ counter[8]++ counter[5]++ counter[1]++ counter[1]++ … counter[1]++

ord:     0 1 2 3    4 5 6 7 8
counter: 0 3 0 1006 1 1 0 0 3
tracker: 3 1 8 4 5

24/55

Tracking updated counters

counter = pool.getCounter()
for docID: result.getDocIDs()
  for ordinal: getOrdinals(docID)
    if counter[ordinal]++ == 0 && tracked < maxTracked
      tracker[tracked++] = ordinal

if tracked < maxTracked
  for i = 0 ; i < tracked ; i++
    priorityQueue.add(tracker[i], counter[tracker[i]])
else
  for ordinal = 0 ; ordinal < counter.length ; ordinal++
    priorityQueue.add(ordinal, counter[ordinal])

ord:     0 1 2 3    4 5 6 7 8
counter: 0 3 0 1006 1 1 0 0 3
tracker: 3 1 8 4 5
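
The pseudocode above, as a hedged Java sketch (helper shapes are my assumptions, not the actual implementation): the first maxTracked ordinals that go from 0 to 1 are remembered, and if the tracker did not overflow, only those ordinals are fed to the priority queue instead of all 200M counters:

// Sketch of tracked counting: remember which ordinals were touched so the
// priority-queue phase can skip the (mostly zero) remainder of an 800MB counter.
static void countWithTracker(int[][] hitOrdinals, int[] counter, int[] tracker,
                             java.util.function.BiConsumer<Integer, Integer> priorityQueueAdd) {
    int tracked = 0;
    final int maxTracked = tracker.length;
    for (int[] docOrds : hitOrdinals) {
        for (int ord : docOrds) {
            if (counter[ord]++ == 0 && tracked < maxTracked) {
                tracker[tracked++] = ord;           // first hit for this ordinal: remember it
            }
        }
    }
    if (tracked < maxTracked) {                     // sparse result: visit only the touched ordinals
        for (int i = 0; i < tracked; i++) {
            priorityQueueAdd.accept(tracker[i], counter[tracker[i]]);
        }
    } else {                                        // tracker overflowed: fall back to the full scan
        for (int ord = 0; ord < counter.length; ord++) {
            priorityQueueAdd.accept(ord, counter[ord]);
        }
    }
}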

25/55

Tracking updated counters

26/55

Distributed faceting

Phase 1) All shards perform faceting. The Merger calculates the top-X terms.
Phase 2) The term counts are requested from the shards that did not return them in phase 1. The Merger calculates the final counts for the top-X terms.

for term: fineCountRequest.getTerms()
  result.add(term, searcher.numDocs(query(field:term), base.getDocIDs()))
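
In Solr terms, phase 2 above runs one intersection count per requested term against the base doc set. A hedged sketch using SolrIndexSearcher.numDocs (signatures as I recall them from Solr 4.10, so treat the details as assumptions):

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.search.DocSet;
import org.apache.solr.search.SolrIndexSearcher;

// Sketch of the phase-2 fine count: one intersection count per requested term.
// This is the slow path that the cached-counter optimization later avoids.
static NamedList<Integer> fineCount(SolrIndexSearcher searcher, DocSet baseDocs,
                                    String field, Iterable<String> terms) throws IOException {
    NamedList<Integer> counts = new NamedList<>();
    for (String term : terms) {
        // Intersect the base result set with the documents matching field:term
        counts.add(term, searcher.numDocs(new TermQuery(new Term(field, term)), baseDocs));
    }
    return counts;
}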

27/55

Test setup 2 (more shards, smaller field)

Solr setup: 16 HT-cores, 256GB RAM, SSD
9 shards @ 250M documents / 900GB

domain field: single String value, 1.1M unique terms per shard

1 concurrent “user”, random search terms

28/55

Pit of Pain™ (or maybe “Horrible Hill”?)

29/55

Fine counting can be slow

Phase 1: standard faceting

Phase 2:
for term: fineCountRequest.getTerms()
  result.add(term, searcher.numDocs(query(field:term), base.getDocIDs()))

30/55

Alternative fine counting

counter = pool.getCounter()
for docID: result.getDocIDs()
  for ordinal: getOrdinals(docID)
    counter.increment(ordinal)

for term: fineCountRequest.getTerms()
  result.add(term, counter.get(getOrdinal(term)))

The counting loop is the same as in phase 1, which yields:

ord:     0 1 2 3    4 5 6 7 8
counter: 0 3 0 1006 1 1 0 0 3

31/55

Using cached counters from phase 1 in phase 2

counter = pool.getCounter(key)

for term: query.getTerms()
  result.add(term, counter.get(getOrdinal(term)))

pool.release(counter)
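
A hedged sketch of the keyed pool idea (names and the LRU policy are my assumptions): phase 1 stores the counter it just filled under a key derived from the query and field, and the phase-2 fine-count request for the same query looks it up instead of re-counting or running per-term intersections:

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: a tiny LRU cache of filled counters, keyed by e.g. (query, field).
// Phase 1 puts the counter it just filled; phase 2 gets it by the same key.
class CounterCache {
    private final LinkedHashMap<String, int[]> cache;

    CounterCache(final int maxEntries) {
        // Access-ordered LinkedHashMap doubling as a small LRU cache
        this.cache = new LinkedHashMap<String, int[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, int[]> eldest) {
                return size() > maxEntries;   // evict the least recently used counter
            }
        };
    }

    synchronized void put(String key, int[] counter) { cache.put(key, counter); }

    synchronized int[] get(String key) { return cache.get(key); }  // null if absent or evicted
}

If the counter has been evicted before phase 2 arrives, the code would have to fall back to ordinary fine counting.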

32/55

Pit of Pain™ practically eliminated

33/55

Pit of Pain™ practically eliminated

Stick figure CC BY-NC 2.5 Randall Munroe xkcd.com

34/55

Test setup 3 (more shards, more fields)

Solr setup: 16 HT-cores, 256GB RAM, SSD
23 shards @ 250M documents / 900GB

Faceting on 6 fields:
url: ~200M unique terms / shard
domain & host: ~1M unique terms each / shard
type, suffix, year: < 1000 unique terms / shard

35/55

1 machine, 7 billion documents / 23TB total index, 6 facet fields

36/55

High-cardinality can mean different things

Single shard / 250,000,000 docs / 900GB

Field   References     Max docs/term  Unique terms
domain  250,000,000    3,000,000      1,100,000
url     250,000,000    56,000         200,000,000
links   5,800,000,000  5,000,000      610,000,000

2440 MB / counter (the links field alone: 610,000,000 unique terms * 4 bytes/int)

37/55

Remember: 1 machine = 25 shards

25 shards / 7 billion / 23TB

Field   References       Max docs/term  Unique terms
domain  7,000,000,000    3,000,000      ~25,000,000
url     7,000,000,000    56,000         ~5,000,000,000
links   125,000,000,000  5,000,000      ~15,000,000,000

60 GB / facet call

38/55

Different distributions: domain 1.1M, url 200M, links 600M unique terms

[Figure: term distributions per field: high max vs. low max, very long tail vs. short tail]

39/55

Theoretical lower limit per counter: log2(max_count)

[Figure: example counters with max=1, max=3, max=7, max=63, max=2047]
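
As a worked example: a counter that never exceeds max_count needs ceil(log2(max_count + 1)) bits, so the maxima shown above need 1, 2, 3, 6 and 11 bits instead of the 32 bits of a plain int. In Java:

// Bits needed to represent counts in [0, maxCount]
static int bitsRequired(long maxCount) {
    return maxCount == 0 ? 1 : 64 - Long.numberOfLeadingZeros(maxCount);
}
// bitsRequired(1) = 1, bitsRequired(3) = 2, bitsRequired(7) = 3,
// bitsRequired(63) = 6, bitsRequired(2047) = 11  (vs. 32 bits for a plain int)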

40/55

int[ordinals] vs. PackedInts(ordinals, maxBPV)

Field   int[ordinals]  PackedInts(ordinals, maxBPV)
domain  4 MB           3 MB (72%)
url     780 MB         420 MB (53%)
links   2350 MB        1760 MB (75%)
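
Lucene's PackedInts can serve as such a reduced-width counter array. A hedged sketch of the idea (based on the Lucene 4.10 API; maxBPV, the maximum bits per value, derived here from the field's max docs/term):

import org.apache.lucene.util.packed.PackedInts;

// Sketch: allocate a packed counter sized to the field's maximum docs/term
// instead of a full 32 bits per unique term.
static PackedInts.Mutable createCounter(int uniqueTerms, long maxDocsPerTerm) {
    int bitsPerValue = PackedInts.bitsRequired(maxDocsPerTerm);   // e.g. 22 bits for ~3M
    return PackedInts.getMutable(uniqueTerms, bitsPerValue, PackedInts.COMPACT);
}

// Incrementing becomes a get plus a set rather than a single counter[ordinal]++:
static void increment(PackedInts.Mutable counter, int ordinal) {
    counter.set(ordinal, counter.get(ordinal) + 1);
}

The trade-off is that each increment works on bit-packed storage rather than a plain array slot.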

41/55

n-plane-z counters

[Figure: counter bits split across planes a, b, c and d: the platonic ideal vs. the harsh reality]

42/55

[Figure: values L being added to n-plane-z counters spread across planes a-d]

L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000011
L: 3 ≣ 000101
L: 4 ≣ 000111
L: 5 ≣ 001001
L: 6 ≣ 001011
L: 7 ≣ 001101
...
L: 12 ≣ 010111
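
Reading the bit patterns above, the encoded value for a non-zero count L is 2*L - 1: the lowest bit (plane a) flags that the counter is non-zero, and the remaining planes hold L - 1. A one-line sketch reproducing the patterns shown (my reading of the figure, not the actual n-plane-z code):

// Reproduces the slide's bit patterns: 1 -> 000001, 2 -> 000011, 3 -> 000101, ... 12 -> 010111
static long encode(long count) {
    return count == 0 ? 0 : ((count - 1) << 1) | 1;   // low bit = non-zero flag, upper bits = count - 1
}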

47/55

Comparison of counter structures

Field   int[ordinals]  PackedInts(ordinals, maxBPV)  n-plane-z
domain  4 MB           3 MB (72%)                    1 MB (30%)
url     780 MB         420 MB (53%)                  66 MB (8%)
links   2350 MB        1760 MB (75%)                 311 MB (13%)

48/55

Speed comparison

49/55

I could go on about

Threaded counting
Heuristic faceting
Fine count skipping
Counter capping
Monotonically increasing tracker for n-plane-z
Regexp filtering

50/55

What about huge result sets?

Rare for explorative term-based searches
Common for batch extractions
Threading works poorly as #shards > #CPUs
But how bad is it really?

51/55

Really bad! 8 minutes

52/55

Heuristic faceting

Use sampling to guess the top-X terms (sketched below)
Re-use the existing tracked counters
1:1000 sampling seems usable for the links field, which has 5 billion references per shard
Fine-count the guessed terms
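
A hedged sketch of the sampling step (a simplification, not the actual heuristic implementation): only every samplingRate'th document in the result set updates the counters, and the resulting approximate top-X terms are then fine-counted exactly:

// Sketch of 1:1000 heuristic sampling: only every samplingRate'th hit updates
// the counters; the guessed top-X terms are fine-counted afterwards.
static void sampleCount(int[][] hitOrdinals, int[] counter, int samplingRate) {
    for (int i = 0; i < hitOrdinals.length; i += samplingRate) {   // e.g. samplingRate = 1000
        for (int ord : hitOrdinals[i]) {
            counter[ord]++;
        }
    }
}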

53/55

Over-provisioning helps validity

54/55

10 seconds < 8 minutes

55/55

Never enough time, but talk to me about

Threaded counting
Monotonically increasing tracker for n-plane-z
Regexp filtering
Fine count skipping
Counter capping

56/55

Extra info

The techniques presented can be tested with sparse faceting, available as a plug-in replacement WAR for Solr 4.10 at https://tokee.github.io/lucene-solr/. A version for Solr 5 will eventually be implemented, but the timeframe is unknown.

There are no current plans for incorporating the full feature set into the official Solr distribution. The suggested approach for incorporation is to split it into multiple independent or semi-independent features, starting with those applicable to most people, such as the distributed faceting fine-count optimization.

In-depth descriptions and performance tests of the different features can be found at https://sbdevel.wordpress.com.

57/55

18M documents / 50GB, facet on 5 fields (2*10M values, 3*smaller)

58/55

6 billion docs / 20TB, 25 shards, single machine, facet on 6 fields (1*4000M, 2*20M, 3*smaller)

59/55

7 billion docs / 23TB, 25 shards, single machine, facet on 5 fields (2*20M, 3*smaller)