Solr 3.1 and Beyond

Yonik Seeley

Lucid Imagination

yonik@lucidimagination.com

October 8, 2010

Agenda

Goal: Introduce new features you can try and use now in Solr development versions 3.1 or 4.0

• Relevancy (Extended Dismax Parser)
• Spatial/Geo Search
• Search Result Grouping / Field Collapsing
• Faceting (Pivot, Range, Per-segment)
• Scalability (Solr Cloud)
• Odds & Ends
• Q&A

Solr 3.1? What happened to 1.5?

• Lucene/Solr merged (March 2010)
  • Single set of committers
  • Single dev mailing list (dev@lucene.apache.org)
  • Single shared Subversion trunk
  • Keep separate downloads, user mailing lists
  • Other former Lucene subprojects spun off (Nutch, Tika, Mahout, etc.)
• Development trunk is now always the next major release (currently 4.0)
• branch_3x will be the base for all 3.x releases
• Branch together, release together, share version numbers

RELEVANCE

Extended Dismax Parser

• Superset of dismax
  &defType=edismax&q=foo&qf=body
• Fixes edge cases where dismax could still throw exceptions
  OR AND NOT - "
• Full Lucene syntax support
  • Tries Lucene syntax first
  • Smart escaping is done if there are syntax errors
• Optionally supports treating "and"/"or" as AND/OR in Lucene syntax
• Fielded queries (e.g. myfield:foo) even in degraded mode
• uf parameter controls which field names may be directly specified in "q"

Extended Dismax Parser (continued)

• boost parameter for multiplicative boost-by-function
• Pure negative query clauses
  Example: solr OR (-solr)
• Enhanced term proximity boosting
  • pf2=myfield results in term bigrams in sloppy phrase queries
    myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
• Enhanced stopword handling
  • Stopwords omitted in main query, but added in optional proximity boosting part
    Example: q=solr is awesome & qf=myfield & pf2=myfield -> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
  • Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
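The pf2 and stopword examples on this slide can be reproduced with a few lines of Python. This is only an illustration of the rewriting shown above, not the parser's actual implementation; the function names and the explicit stopword set are assumptions made for the sketch.

```python
def bigram_phrases(field, terms):
    """Return one quoted two-term phrase clause per adjacent pair of terms."""
    return [f'{field}:"{a} {b}"' for a, b in zip(terms, terms[1:])]


def rewrite(q, field, stopwords):
    """Mimic the slide's example: stopwords are dropped from the mandatory
    clause but kept when forming the optional proximity (pf2) clauses."""
    terms = q.split()
    main_terms = [t for t in terms if t not in stopwords]
    main = f"+{field}:({' '.join(main_terms)})"
    boost = " ".join(bigram_phrases(field, terms))
    return f"{main} ({boost})"


print(bigram_phrases("myfield", ["aa", "bb", "cc"]))
print(rewrite("solr is awesome", "myfield", {"is"}))
```

Keeping the stopword in the bigrams is what lets "solr is awesome" as a phrase score higher than documents that merely contain "solr" and "awesome" somewhere.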

SPATIAL SEARCH

Spatial Search

Step 1: Index some locations!

<field name="name">The Alpine Shop</field>
<field name="store">44.013617,-73.168264</field>

Step 2: Decide where you are

&pt=44.0153371,-73.16734&d=1&sfield=store

Step 3: Profit!

Spatial Filter: &fq={!geofilt}

Bounding Box: &fq={!bbox}

Distance Function: &sort=geodist() asc
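To get a feel for what the filter and distance function compute, here is a minimal great-circle (haversine) sketch using the store location and the pt value from the slide. It assumes d is given in kilometers and uses a mean Earth radius of 6371 km; both are assumptions of this sketch, and the helper names are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def parse_point(s):
    """Parse a 'lat,lon' string like the ones used for the store field and pt."""
    lat, lon = s.split(",")
    return float(lat), float(lon)


store = parse_point("44.013617,-73.168264")
pt = parse_point("44.0153371,-73.16734")
d = 1.0  # radius from &d=1, assumed to be kilometers

dist = haversine_km(*pt, *store)
within = dist <= d
print(f"{dist:.3f} km, within d: {within}")
```

With these coordinates the store is roughly 200 m from pt, so it would pass a 1 km filter.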

RESULT GROUPING / FIELD COLLAPSING

Field Collapsing Definition

Field collapsing
• Limit the number of results per category
• "category" normally defined by unique values in a field

Uses
• Web Search: collapse by web site
• Email threads: collapse by thread id
• Ecommerce/retail: show the top 5 items for each store category (music, movies, etc.)
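As a rough mental model of the definition above, collapsing can be sketched as a single pass over an already ranked result list, keeping at most a fixed number of documents per field value. The function name, the sample documents, and the "site" field are hypothetical; this is not how Solr implements it internally.

```python
from collections import OrderedDict


def collapse(ranked_docs, field, limit):
    """Keep at most `limit` docs per distinct value of `field`.

    `ranked_docs` must already be sorted best-first; groups come out in the
    order of their best document.
    """
    groups = OrderedDict()
    for doc in ranked_docs:
        bucket = groups.setdefault(doc[field], [])
        if len(bucket) < limit:
            bucket.append(doc)
    return groups


# Hypothetical ranked web-search results, collapsed by site.
results = [
    {"id": 1, "site": "a.example"},
    {"id": 2, "site": "a.example"},
    {"id": 3, "site": "b.example"},
    {"id": 4, "site": "a.example"},
    {"id": 5, "site": "b.example"},
]
collapsed = collapse(results, "site", limit=2)
for site, docs in collapsed.items():
    print(site, [d["id"] for d in docs])
```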

Field Collapsing by Site

Field Collapse on Product Type / Result Grouping by Category

Group by Field

http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact

"grouped":{
  "manu_exact":{
    "matches":3,
    "groups":[{
        "groupValue":"Belkin",
        "doclist":{"numFound":2,"start":0,"docs":[
            {
              "id":"IW-02",
              "name":"iPod & iPod Mini USB 2.0 Cable"}]
        }},
      {
        "groupValue":"Apple Computer Inc.",
        "doclist":{"numFound":1,"start":0,"docs":[
            {
              "id":"MA147LL/A",
              "name":"Apple 60 GB iPod with Video Playback Black"}]
        }}]}}}
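A client might walk a grouped response along these lines. The JSON below is the slide's fragment wrapped in an outer object so that it parses; the summarize_groups helper is a hypothetical name for this sketch.

```python
import json

# The response fragment from the slide, wrapped in braces to make it valid JSON.
response_text = """
{"grouped": {"manu_exact": {"matches": 3, "groups": [
  {"groupValue": "Belkin",
   "doclist": {"numFound": 2, "start": 0, "docs": [
     {"id": "IW-02", "name": "iPod & iPod Mini USB 2.0 Cable"}]}},
  {"groupValue": "Apple Computer Inc.",
   "doclist": {"numFound": 1, "start": 0, "docs": [
     {"id": "MA147LL/A", "name": "Apple 60 GB iPod with Video Playback Black"}]}}
]}}}
"""


def summarize_groups(response, field):
    """Return (group value, total docs in group, ids of returned docs) per group."""
    grouped = response["grouped"][field]
    return [
        (g["groupValue"], g["doclist"]["numFound"], [d["id"] for d in g["doclist"]["docs"]])
        for g in grouped["groups"]
    ]


summary = summarize_groups(json.loads(response_text), "manu_exact")
for value, total, ids in summary:
    print(value, total, ids)
```

Note that the Belkin group reports numFound 2 but returns a single document, which matches a group.limit of 1.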

Group by Query

http://...&group=true&group.query=price:[0 TO 99.99]&group.query=price:[100 TO *]&group.limit=5

"grouped":{
  "price:[0 TO 99.99]":{
    "matches":3,
    "doclist":{"numFound":2,"start":0,"docs":[
        {
          "id":"IW-02",
          "name":"iPod & iPod Mini USB 2.0 Cable"},
        {
          "id":"F8V7067-APL-KIT",
          "name":"Belkin Mobile Power Cord for iPod"}]
    }},
  "price:[100 TO *]":{
    "matches":3,
    "doclist":{"numFound":1,"start":0,"docs":[
        {
          "id":"MA147LL/A",
          "name":"Apple 60 GB iPod with Video Playback Black"}]
    }}}}
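Because group.query is repeated, and its values contain spaces, brackets and "*", it is easiest to let a library do the URL encoding. A small sketch with the standard library follows; the host, the /select path and q=ipod are assumptions, since the slide elides the start of the URL.

```python
from urllib.parse import parse_qs, urlencode

# A list of pairs (rather than a dict) allows the same parameter to repeat.
params = [
    ("q", "ipod"),  # assumed query; the slide's URL prefix is elided
    ("group", "true"),
    ("group.query", "price:[0 TO 99.99]"),
    ("group.query", "price:[100 TO *]"),
    ("group.limit", "5"),
]
query_string = urlencode(params)
url = "http://localhost:8983/solr/select?" + query_string  # assumed endpoint
print(url)

# Decoding the query string again recovers both group.query values in order.
decoded = parse_qs(query_string)
print(decoded["group.query"])
```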

Grouping Params

parameter                       | meaning                                                           | default
group.field=<field>             | Like facet.field: group by unique field values                    |
group.query=<query>             | Like facet.query: top docs that also match                        |
group.function=<function query> | Group by unique values produced by the function query             |
group.limit=<n>                 | How many docs per group                                           | 1
group.sort=<sort spec>          | How to sort documents within a group                              | Same as "sort" param
rows=<n>                        | How many groups to return                                         | 10
sort=<sort spec>                | How to sort the groups relative to each other (based on top doc)  |

FACETING

Pivot Faceting

Other names that could have made sense: Grid Faceting, Cross-Product Faceting, Matrix Faceting

Syntax: facet.pivot=field1,field2,field3,…

                  | #docs | #docs w/ inStock:true | #docs w/ inStock:false
cat:electronics   | 14    | 10                    | 4
cat:memory        | 3     | 3                     | 0
cat:connector     | 2     | 0                     | 2
cat:graphics card | 2     | 0                     | 2
cat:hard drive    | 2     | 2                     | 0

facet.pivot=cat,inStock
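Conceptually, a two-field pivot is a nested count: one total per value of the first field, broken down by values of the second. A minimal sketch over a handful of made-up documents (the docs and the pivot_counts name are hypothetical, and "cat" is treated as multi-valued):

```python
from collections import Counter, defaultdict


def pivot_counts(docs, outer, inner):
    """Count docs per outer value, and per (outer, inner) pair beneath it."""
    totals = Counter()
    nested = defaultdict(Counter)
    for doc in docs:
        for value in doc.get(outer, []):  # outer field may hold several values
            totals[value] += 1
            nested[value][doc[inner]] += 1
    return totals, nested


# Hypothetical documents, not the index behind the slide's table.
docs = [
    {"cat": ["electronics", "memory"], "inStock": True},
    {"cat": ["electronics", "connector"], "inStock": False},
    {"cat": ["electronics"], "inStock": True},
]
totals, nested = pivot_counts(docs, "cat", "inStock")
for value, total in totals.most_common():
    print(value, total, nested[value][True], nested[value][False])
```

Each printed row has the same shape as a row of the table above: total, in-stock count, out-of-stock count.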

Pivot Faceting

http://...&facet=true&facet.pivot=cat,popularity

"facet_counts":{
  "facet_pivot":{
    "cat,popularity":[{
        "field":"cat",
        "value":"electronics",
        "count":14,
        "pivot":[{
            "field":"popularity",
            "value":"6",
            "count":5},
          {
            "field":"popularity",
            "value":"7",
            "count":4},
          {
            "field":"popularity",
            "value":"1",
            "count":2}]},
      {
        "field":"cat",
        "value":"memory",
        "count":3,
        "pivot":[]},
      […]

"count":14 under cat: 14 docs w/ cat==electronics

"count":5 under popularity 6: 5 docs w/ cat==electronics && popularity==6

Range Faceting

• Like Date faceting, but more generic

http://...&facet=true&facet.range=price&facet.range.start=0&facet.range.end=500&facet.range.gap=50

"facet_counts":{
  "facet_ranges":{
    "price":{
      "counts":{
        "0.0":5,
        "50.0":2,
        "100.0":0,
        "150.0":2,
        "200.0":0,
        "250.0":1,
        "300.0":2,
        "350.0":2,
        "400.0":0,
        "450.0":1},
      "gap":50.0,
      "start":0.0,
      "end":500.0}}}}
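The bucketing behind a range facet can be sketched in a few lines: walk from start to end in steps of gap, then drop each value into the bucket whose lower bound it falls on or above. The prices are made up, and treating each bucket as lower-inclusive, upper-exclusive is an assumption of this sketch.

```python
def range_facet(values, start, end, gap):
    """Count values into [lower, lower + gap) buckets between start and end."""
    counts = {}
    lower = start
    while lower < end:
        counts[str(float(lower))] = 0  # keys formatted like the response: "0.0", "50.0", ...
        lower += gap
    for v in values:
        if start <= v < end:
            bucket = start + gap * int((v - start) // gap)
            counts[str(float(bucket))] += 1
    return counts


# Hypothetical prices; 649.99 falls outside the 0..500 range and is not counted.
prices = [11.5, 19.95, 74.99, 179.99, 399.0, 479.95, 649.99]
counts = range_facet(prices, start=0, end=500, gap=50)
print(counts)
```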

Existing single-valued faceting algorithm

[Figure] q=Juggernaut&facet=true&facet.field=hero

Lucene FieldCache Entry (StringIndex) for the "hero" field
  lookup: the string values
    (null), batman, flash, spiderman, superman, wolverine
  order: for each doc, an index into the lookup array
    5 3 5 1 4 5 2 1

Documents matching the base query "Juggernaut"
    0 2 7

For each matching doc: lookup its order value, increment the accumulator
    0 1 0 0 0 2

Priority queue
    Batman, 3
    flash, 5
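The diagram's flow (look up each matching doc's ordinal, increment an accumulator, then pull the top counts through a priority queue) can be sketched as follows. The arrays are read off the slide's diagram; the function name and the use of heapq are choices made for this sketch, not the actual Solr code.

```python
import heapq

# Values as they appear in the slide's diagram.
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
order = [5, 3, 5, 1, 4, 5, 2, 1]  # per document: index into lookup
base_docs = [0, 2, 7]             # documents matching the base query


def facet_counts(order, lookup, doc_ids, limit):
    """Count facet values for the given docs and return the top `limit`."""
    accumulator = [0] * len(lookup)
    for doc in doc_ids:
        accumulator[order[doc]] += 1
    # Skip slot 0, which stands for documents without a value.
    candidates = (
        (count, term)
        for count, term in zip(accumulator[1:], lookup[1:])
        if count > 0
    )
    return [(term, count) for count, term in heapq.nlargest(limit, candidates)]


top = facet_counts(order, lookup, base_docs, limit=10)
print(top)
```

The counting loop touches only integers; strings are looked up once per distinct value at the end, which is why the approach is fast on a static index.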

Per-segment single-valued algorithm

[Figure: the base DocSet is split across four segments (Segment1 to Segment4), each with its own FieldCache entry. Threads 1 to 4 each do the lookup and increment into a per-segment accumulator (accumulator1 to accumulator4). A FieldCache + accumulator merger (priority queue) combines them into the final priority queue (Batman, 3; flash, 5).]

Per-segment faceting

• Enable with facet.method=fcs
• Controllable multi-threading
  facet.field={!threads=4}myfield
• Disadvantages
  • Larger memory use (FieldCaches + accumulators)
  • Slower (extra FieldCache merge step needed)
• Advantages
  • Rebuilds FieldCache entries only for new segments (NRT friendly)
  • Multi-threaded
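The trade-off is easier to see in code: each segment has its own lookup/order arrays, so ordinals are only meaningful within a segment and the per-segment accumulators must be merged by term string afterwards. A simplified sketch, with made-up segment data and a thread pool standing in for the facet threads:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_segment(segment, doc_ids):
    """Fill one accumulator for one segment (ordinals are segment-local)."""
    accumulator = [0] * len(segment["lookup"])
    for doc in doc_ids:
        accumulator[segment["order"][doc]] += 1
    return accumulator


def per_segment_facets(segments, docs_per_segment, threads):
    """Count each segment on its own worker, then merge counts by term string."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        accumulators = list(pool.map(count_segment, segments, docs_per_segment))
    merged = Counter()
    for segment, accumulator in zip(segments, accumulators):
        for term, count in zip(segment["lookup"], accumulator):
            if term is not None and count:
                merged[term] += count
    return merged.most_common()


# Hypothetical two-segment index; slot 0 of each lookup means "no value".
segments = [
    {"lookup": [None, "batman", "flash"], "order": [1, 2, 2, 0]},
    {"lookup": [None, "flash", "superman"], "order": [2, 1, 1]},
]
docs_per_segment = [[0, 1, 2], [0, 1]]
result = per_segment_facets(segments, docs_per_segment, threads=2)
print(result)
```

The merge step is the extra cost listed under disadvantages; the benefit is that a new segment only needs its own small arrays built.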

Per-segment faceting performance comparison

Test index: 10M documents, 18 segments, single-valued field

A: Base DocSet=100 docs, facet.field on a field with 100,000 unique terms

Time for request*      | facet.method=fc | facet.method=fcs
static index           | 3 ms            | 244 ms
quickly changing index | 1388 ms         | 267 ms

B: Base DocSet=1,000,000 docs, facet.field on a field with 100 unique terms

Time for request*      | facet.method=fc | facet.method=fcs
static index           | 26 ms           | 34 ms
quickly changing index | 741 ms          | 94 ms

*Complete request time, measured externally

Faceting Performance Improvements

For facet.method=enum, speed up initial population of the filterCache (i.e. first time facet): from 30% to 32x improvement

Optimized facet.method=fc for multi-valued fields and large facet.limit – up to 3x faster

Optimized deep facet paging – up to 10x faster with really large facet.offsets

Less memory consumed by field cache entries

SCALABILITY

SolrCloud

• First steps toward simplifying cluster management
• Integrates ZooKeeper
  • Central configuration (schema.xml, solrconfig.xml, etc.)
  • Tracks live nodes + shards of collections
• Removes need for external load balancers
  shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
• Can specify logical shard ids
  shards=NY_shard,NJ_shard
• Clients don't need to know shards at all:
  http://localhost:8983/solr/collection1/select?distrib=true
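In the shards value above, commas separate shards and "|" separates equivalent replicas of one shard. A sketch of how a caller could interpret it follows; picking a replica at random is an assumed policy for illustration, not a statement of how Solr chooses.

```python
import random


def parse_shards(shards):
    """Split a shards value into a list of shards, each a list of replicas."""
    return [[r.strip() for r in shard.split("|")] for shard in shards.split(",")]


def pick_replicas(shards, rng=random):
    """Choose one replica per shard at random (an assumed selection policy)."""
    return [rng.choice(replicas) for replicas in parse_shards(shards)]


shards = "localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr"
print(parse_shards(shards))
print(pick_replicas(shards))
```

Every query still reaches exactly one replica of every shard, which is what makes an external load balancer unnecessary.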

SolrCloud : The Future

• Eliminate all single points of failure
• Remove Master/Searcher distinction
  • Enables near real-time search in a highly scalable environment
• High Availability for Writes
  • Eventual consistency model (like Amazon Dynamo, Cassandra)
• Elastic
  • Simply add/subtract servers, cluster will rebalance automatically
  • By default, Solr will handle document partitioning

ODDS & ENDS

Auto-Suggest

• Many people currently use terms component
  • Can be slow for a large corpus
• New auto-suggest builds off SpellCheck component
  • Compact memory-based trie for really fast completions
  • Based on a field in the main index, or on a dictionary file

http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult

"spellcheck":{
  "suggestions":[
    "ult",{
      "numFound":1,
      "startOffset":0,
      "endOffset":3,
      "suggestion":["ultrasharp"]},
    "collation","ultrasharp"]}}
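The idea behind trie-based completion can be shown with a plain dictionary-backed trie. This is a teaching sketch only: it is not compact, and it makes no claim about the data structure Solr's suggester actually uses. The word list is made up apart from "ultrasharp".

```python
class Trie:
    """A simple character trie supporting prefix completion."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def complete(self, prefix):
        """Return all inserted words starting with `prefix`, sorted."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)


trie = Trie()
for word in ["ultrasharp", "ultra", "solr"]:
    trie.insert(word)
print(trie.complete("ult"))
```

Lookup cost depends on the prefix length and the number of completions, not on the size of the corpus, which is the advantage over scanning terms.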

Index with JSON

$ URL=http://localhost:8983/solr/update/json
$ curl $URL -H 'Content-type:application/json' -d '
{"add": {
  "doc": {
    "id" : "978-0641723445",
    "cat" : ["book","hardcover"],
    "title" : "The Lightning Thief",
    "author" : "Rick Riordan",
    "series_t" : "Percy Jackson and the Olympians",
    "sequence_i" : 1,
    "genre_s" : "fantasy",
    "inStock" : true,
    "price" : 12.50,
    "pages_i" : 384
  }
}}'
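A rough Python equivalent of the curl command, using only the standard library. The sketch builds the request without sending it, since sending needs a running Solr instance; build_add_request is a hypothetical helper name and the document is shortened.

```python
import json
import urllib.request


def build_add_request(url, doc):
    """Wrap `doc` in the {"add": {"doc": ...}} envelope and prepare a POST."""
    body = json.dumps({"add": {"doc": doc}}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-type": "application/json"}
    )


doc = {
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],
    "title": "The Lightning Thief",
    "inStock": True,
    "price": 12.50,
}
request = build_add_request("http://localhost:8983/solr/update/json", doc)
print(request.get_method(), request.full_url)

# Sending requires a running Solr instance:
# urllib.request.urlopen(request)
```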

Query Results in CSV

http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv

name,price,cat,popularity

iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1

Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1

Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10

• Can handle multi-valued fields (see "cat" field in example)
• Completely compatible with the CSV update handler (can round-trip)
• Results are streamed: good for dumping entire parts of the index
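On the client side, the CSV output above can be read with a standard CSV parser, splitting the quoted multi-valued column afterwards. The sketch assumes that individual values contain no commas or escapes of their own; parse_results is a hypothetical helper name.

```python
import csv
import io

# The response body from the slide.
csv_text = '''name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
'''


def parse_results(text, multi_valued=("cat",)):
    """Parse CSV rows into dicts, splitting multi-valued columns into lists."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for field in multi_valued:
            row[field] = row[field].split(",")
        rows.append(row)
    return rows


rows = parse_results(csv_text)
for row in rows:
    print(row["name"], row["cat"])
```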

http://localhost:8983/solr/browse

Q&A