Solr 6 Feature Preview

Yonik Seeley
3/09/2016

My Background

• Creator of Solr
• Cloudera Engineer
• LucidWorks Co-Founder
• Lucene/Solr committer, PMC member
• Apache Software Foundation member
• M.S. in Computer Science, Stanford

Solr 6

• Happy Birthday Solr!
• 10 Years at the Apache Software Foundation as of 1/2016
• Release branch has been cut
• ETA before April
• Java 8+ only

Streaming Expressions

Solr Streaming Expressions

• Generic platform for distributed computation
• The basis for implementing distributed SQL
• Works across entire result sets (or subsets)
  • normal search operations are designed for fast top-N operations
• Map-reduce like "shuffle" partitions result sets for greater scalability
• Worker nodes can be allocated from a collection for parallelism

Tuple Streams

• A streaming expression compiles/parses to a tuple stream
  • direct mapping from a streaming expression function->tuple_stream
• Stream Sources – produce a tuple stream
• Stream Decorators – operate on tuple streams
• Designed to include streams from non-Solr systems

search() expression

$ curl http://localhost:8983/solr/techproducts/stream -d 'expr=search(techproducts, q="*:*", fl="id,price,score", sort="id asc")'

{"result-set":{"docs":[{"score":1.0,"id":"0579B002","price":179.99},{"score":1.0,"id":"100-435805","price":649.99},{"score":1.0,"id":"3007WFP","price":2199.0},{"score":1.0,"id":"VDBDB1A16"},{"score":1.0,"id":"VS1GB400C3","price":74.99},{"EOF":true,"RESPONSE_TIME":6}]}}

resulting tuple stream
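The curl call above can also be issued from a client program. The following is a minimal Python sketch, not an official client: the helper names `build_stream_request` and `read_tuples` are made up for illustration, and the sample response is shortened from the output shown above. It assumes only what the slide shows: the expression is posted as an `expr` form parameter, and the stream ends with a tuple carrying `"EOF": true`.

```python
import json
from urllib.parse import urlencode


def build_stream_request(base_url, collection, expr):
    """Return (url, form body) for posting a streaming expression to /stream."""
    url = f"{base_url}/{collection}/stream"
    body = urlencode({"expr": expr}).encode("utf-8")
    return url, body


def read_tuples(response_text):
    """Return the tuples of a /stream response, stopping at the EOF marker tuple."""
    docs = json.loads(response_text)["result-set"]["docs"]
    tuples = []
    for doc in docs:
        if doc.get("EOF"):
            break
        tuples.append(doc)
    return tuples


# Shortened copy of the response shown on the slide.
SAMPLE_RESPONSE = (
    '{"result-set":{"docs":['
    '{"score":1.0,"id":"0579B002","price":179.99},'
    '{"score":1.0,"id":"VDBDB1A16"},'
    '{"EOF":true,"RESPONSE_TIME":6}]}}'
)

url, body = build_stream_request(
    "http://localhost:8983/solr",
    "techproducts",
    'search(techproducts, q="*:*", fl="id,price,score", sort="id asc")',
)
tuples = read_tuples(SAMPLE_RESPONSE)
```

The pair `(url, body)` could be passed to `urllib.request.urlopen(url, data=body)` against a running Solr node; the sketch stops short of that so it can be read without a server.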

Search Tuple Stream

[Diagram: tuple streams from the shard replicas (Shards 1 to 3) flow into a worker, which emits the resulting tuple stream. Caption: /stream worker executing the "search" expression]

• search() is a stream source
• SolrCloud aware (CloudSolrStream java class)
• Fully streaming (no big buffers)
• Worker node doesn't need to be a Solr node

search expression args

search( // parses to CloudSolrStream java class
  techproducts, // name of the collection to search
  zkHost="localhost:9983", // (opt) zookeeper address of collection to search
  qt="/select", // (opt) the request handler to use (/export is also available)
  rows=1000000, // (opt) number of rows to retrieve
  q=*:*, // query to match returned documents
  fl="id,price,score", // which fields to return
  sort="id asc, price desc", // how to sort the results
  aliases="id=myid,price=myprice" // (opt) renames output fields
)

reduce() streaming expression

• Groups tuples by common field values
• Emits one group-head per group
• Each group-head contains list of tuples
• "by" parameter must match up with "sort" parameter
• Any partitioning should be done on same group field.

reduce(
  search(collection1, qt="/export", q="*:*", fl="id,manu,price", sort="manu asc, price desc"),
  by="manu",
  group(sort="price desc", n=100))

stream operation
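The grouping behaviour described above can be sketched in plain Python. This is an illustration of the idea, not Solr's implementation: the function name `reduce_stream` and the `"group"` key holding the member tuples are assumptions made for the example. It relies on the input already being sorted by the `by` field, which is why "by" has to match "sort".

```python
from itertools import groupby


def reduce_stream(tuples, by, sort_key, descending=True, n=100):
    """Yield one group-head per run of equal `by` values.

    `tuples` must already be sorted by `by`; `n` must be at least 1.
    """
    for _, run in groupby(tuples, key=lambda t: t[by]):
        # Keep at most n members of the group, ordered by the group sort.
        members = sorted(run, key=lambda t: t[sort_key], reverse=descending)[:n]
        head = dict(members[0])      # the group-head is the first member
        head["group"] = members      # hypothetical key holding the group's tuples
        yield head


# Hypothetical tuples, sorted by "manu asc, price desc" like the search above.
sample = [
    {"id": "a", "manu": "apple", "price": 399.0},
    {"id": "b", "manu": "belkin", "price": 19.95},
    {"id": "c", "manu": "belkin", "price": 11.5},
]
heads = list(reduce_stream(sample, by="manu", sort_key="price", n=100))
```

Because `groupby` only merges consecutive equal keys, feeding it an unsorted stream would split one manufacturer into several groups, which is the failure mode the "by must match sort" bullet warns about.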

rollup() expression

• Groups tuples by common field values
• Emits rollup value along with metrics
• Closest equivalent to faceting

rollup(
  search(collection1, qt="/export", q="*:*", fl="id,manu,price", sort="manu asc"),
  over="manu",
  count(*),
  max(price))

metrics

{"result-set":{"docs":[{"manu":"apple","count(*)":1.0},{"manu":"asus","count(*)":1.0},{"manu":"ati","count(*)":1.0},{"manu":"belkin","count(*)":2.0},{"manu":"canon","count(*)":2.0},{"manu":"corsair","count(*)":3.0},[...]
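A rollup over a sorted stream can be computed in a single pass with constant memory per group. The sketch below shows that idea in Python; it is not Solr's code, the function name `rollup` and the sample tuples are hypothetical, and it returns integer counts where the Solr output above shows floats.

```python
from itertools import groupby


def rollup(tuples, over, max_field):
    """Yield one tuple per run of equal `over` values with count(*) and max(field).

    `tuples` must already be sorted by `over`.
    """
    for value, run in groupby(tuples, key=lambda t: t[over]):
        count = 0
        maximum = None
        for t in run:
            count += 1
            v = t.get(max_field)  # tuples may lack the field entirely
            if v is not None and (maximum is None or v > maximum):
                maximum = v
        yield {over: value, "count(*)": count, f"max({max_field})": maximum}


# Hypothetical tuples sorted by "manu asc".
sample = [
    {"id": "a", "manu": "apple", "price": 399.0},
    {"id": "b", "manu": "belkin", "price": 11.5},
    {"id": "c", "manu": "belkin", "price": 19.95},
    {"id": "d", "manu": "canon"},
]
rolled = list(rollup(sample, over="manu", max_field="price"))
```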

facet() expression

• Like search+rollup, but pushes down computation to JSON Facet API

facet(
  techproducts,
  q="*:*",
  buckets="manu",
  bucketSorts="count(*) desc",
  bucketSizeLimit=1000,
  count(*),
  sum(price),
  max(popularity))

{"result-set":{"docs":[{"avg(price)":129.99, "max(popularity)":7.0,"manu":"corsair","count(*)":3},{"avg(price)":15.72,"max(popularity)":1.0,"manu":"belkin","count(*)":2},{"avg(price)":254.97,"max(popularity)":7.0,"manu":"canon","count(*)":2},{"avg(price)":399.0,"max(popularity)":10.0,"manu":"apple","count(*)":1},{"avg(price)":479.95,"max(popularity)":7.0,"manu":"asus","count(*)":1},{"avg(price)":649.98,"max(popularity)":7.0,"manu":"ati","count(*)":1},{"avg(price)":0.0,"max(popularity)":"NaN","manu":"boa","count(*)":1},[...]

Parallel Tuple Stream

[Diagram: shard replicas (Shards 1 to 3) stream tuples to two worker partitions (Worker Partition 1 and Worker Partition 2); a final worker combines their output into a single tuple stream.]

Streaming Expressions – parallel

• Wraps a stream and sends to N worker nodes
• The first parameter is the collection to use for the intermediate worker nodes
• partitionKeys must be provided to underlying workers
  • usually makes sense to partition by what you are grouping on
• inner and outer sorts should match

parallel(collection1,
  rollup(
    search(techproducts, q="*:*", fl="id,manu,price", sort="manu asc", partitionKeys="manu"),
    over="manu"),
  workers=2,
  zkHost="localhost:9983",
  sort="manu asc")
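Why partitioning by the grouping field matters can be seen in a small Python sketch. It is only an illustration: `zlib.crc32` stands in for whatever hash Solr actually uses, and the function names and sample tuples are hypothetical. Each key lands on exactly one worker, so every worker sees complete groups, and because the inner and outer sorts match, the worker outputs can be merged without re-sorting.

```python
import heapq
import zlib


def partition(tuples, key, workers):
    """Split tuples across `workers` lists by a hash of the partition key."""
    parts = [[] for _ in range(workers)]
    for t in tuples:
        # crc32 is a stand-in hash; it is deterministic across runs.
        idx = zlib.crc32(str(t[key]).encode("utf-8")) % workers
        parts[idx].append(t)
    return parts


def merge_sorted(streams, key):
    """Merge already-sorted worker outputs into one sorted stream."""
    return list(heapq.merge(*streams, key=lambda t: t[key]))


# Hypothetical tuples, sorted by "manu asc".
sample = [
    {"id": "a", "manu": "apple"},
    {"id": "b", "manu": "belkin"},
    {"id": "c", "manu": "belkin"},
    {"id": "d", "manu": "canon"},
    {"id": "e", "manu": "corsair"},
]
parts = partition(sample, key="manu", workers=2)
merged = merge_sorted(parts, key="manu")
```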

Joins!

innerJoin(
  search(people, q=*:*, fl="personId,name", sort="personId asc"),
  search(pets, q=type:cat, fl="personId,petName", sort="personId asc"),
  on="personId")

Other joins: leftOuterJoin, hashJoin, outerHashJoin
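Both inputs of the innerJoin above are sorted on the join field, which is what allows a join to proceed by walking the two streams in step. The following Python sketch shows a classic sort-merge join under that assumption. It is not taken from Solr: it works on lists rather than true streams, and the function name and sample data are hypothetical.

```python
def inner_join(left, right, on):
    """Sort-merge join of two lists of dicts, both sorted ascending by `on`."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][on], right[j][on]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-hand tuples sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][on] == lk:
                j_end += 1
            # Pair every left-hand tuple with that key against the run.
            while i < len(left) and left[i][on] == lk:
                for r in right[j:j_end]:
                    out.append({**r, **left[i]})
                i += 1
            j = j_end
    return out


# Hypothetical data, each list sorted by personId.
people = [
    {"personId": 1, "name": "Ann"},
    {"personId": 2, "name": "Bob"},
    {"personId": 3, "name": "Cy"},
]
pets = [
    {"personId": 1, "petName": "Tom"},
    {"personId": 1, "petName": "Felix"},
    {"personId": 3, "petName": "Kit"},
]
joined = inner_join(people, pets, on="personId")
```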

More decorators

• complement – emits tuples from A which do not exist in B
• intersect – emits tuples from A which do exist in B
• merge
• top – reorders the stream and returns the top N tuples
• unique – emits only the first tuple for each value
• select – select, rename, or give default values to fields in a tuple
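The semantics of three of these decorators can be written down in a few lines of Python. This is a set-based sketch of what they emit, not of how Solr streams them; the function names mirror the decorators, while the parameter names and sample data are hypothetical.

```python
def complement(a, b, on):
    """Tuples from a whose `on` value does not appear in b."""
    keys_b = {t[on] for t in b}
    return [t for t in a if t[on] not in keys_b]


def intersect(a, b, on):
    """Tuples from a whose `on` value does appear in b."""
    keys_b = {t[on] for t in b}
    return [t for t in a if t[on] in keys_b]


def unique(tuples, over):
    """First tuple of each run of equal `over` values (input sorted by `over`)."""
    out = []
    previous = object()  # sentinel that equals no field value
    for t in tuples:
        if t[over] != previous:
            out.append(t)
            previous = t[over]
    return out


# Hypothetical streams A and B.
stream_a = [{"id": 1}, {"id": 2}, {"id": 2}, {"id": 3}]
stream_b = [{"id": 2}, {"id": 4}]
only_a = complement(stream_a, stream_b, on="id")
both = intersect(stream_a, stream_b, on="id")
firsts = unique(stream_a, over="id")
```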

Interesting streams

• update stream – indexes input into another SolrCloud collection!
• daemon stream – blocks until more data is available from underlying stream
• topic stream – a publish/subscribe messaging service
  • checkpoints are persisted in a Solr collection
  • resubmit to get new stuff
  • combine with daemon stream to automatically get continuous updates over time
  • further combine with update stream to push all matches to another collection

topic(checkpointCollection, dataCollection, id="topicA", q="solr rocks", checkpointEvery="1000")
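The "resubmit to get new stuff" behaviour comes down to remembering how far the previous read got. The Python sketch below illustrates that idea only. The class `TopicSketch`, the integer `version` field, and the in-memory checkpoint are all assumptions for the example; the slide says only that real checkpoints are persisted in a Solr collection.

```python
class TopicSketch:
    """Toy publish/subscribe reader that returns each matching document once."""

    def __init__(self, matches):
        self.matches = matches   # predicate standing in for the q= parameter
        self.checkpoint = -1     # highest version already seen (kept in memory here)

    def read(self, docs):
        """Return matching docs newer than the checkpoint, then advance it."""
        fresh = [d for d in docs if d["version"] > self.checkpoint]
        if fresh:
            self.checkpoint = max(d["version"] for d in fresh)
        return [d for d in fresh if self.matches(d)]


# Hypothetical documents with a monotonically increasing version number.
docs = [
    {"id": "1", "text": "solr rocks", "version": 1},
    {"id": "2", "text": "something else", "version": 2},
]
topic = TopicSketch(lambda d: "solr rocks" in d["text"])
first = topic.read(docs)
second = topic.read(docs)    # resubmitting with no new data yields nothing
docs.append({"id": "3", "text": "solr rocks again", "version": 3})
third = topic.read(docs)
```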

jdbc() expression stream

join with other data sources!

innerJoin( // example from JDBCStreamTest
  select(
    search(collection1, fl="personId_i,rating_f", q="rating_f:*", sort="personId_i asc"),
    personId_i as personId,
    rating_f as rating
  ),
  select(
    jdbc(connection="jdbc:hsqldb:mem:.",
         sql="select PEOPLE.ID as PERSONID, PEOPLE.NAME, COUNTRIES.COUNTRY_NAME from PEOPLE inner join COUNTRIES on PEOPLE.COUNTRY_CODE = COUNTRIES.CODE order by PEOPLE.ID",
         sort="ID asc",
         get_column_name=true),
    ID as personId,
    NAME as personName,
    COUNTRY_NAME as country
  ),
  on="personId")

Parallel SQL

/sql Handler

• /sql handler is there by default on all solr nodes
• Translates SQL -> parallel streaming expressions
• SQL tables map to SolrCloud collections
• Query planner / optimizer
• Currently uses Presto parser
• May switch to Apache Calcite?

Simplest SQL Example

$ curl http://localhost:8983/solr/techproducts/sql -d "stmt=select id from techproducts"

{"result-set":{"docs":[{"id":"EN7800GTX/2DHTV/256M"},{"id":"100-435805"},{"id":"UTF8TEST"},{"id":"SOLR1000"},{"id":"9885A004"},[...]

tables map to collections

SQL handler HTTP parameters

curl http://localhost:8983/solr/techproducts/sql -d '
  &stmt=<sql_statement>
  &numWorkers=4 // currently used by GROUP BY and DISTINCT (via parallel stream)
  &workerCollection=collection1 // where to create intermediate workers
  &workerZkhost=localhost:9983 // cluster (zookeeper ensemble) address
  &aggregationMode=map_reduce | facet'
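A client can assemble these parameters programmatically. The Python sketch below builds the form body for the /sql handler using only the parameter names listed above; the function name `build_sql_request` and its keyword arguments are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode

AGGREGATION_MODES = {"map_reduce", "facet"}


def build_sql_request(base_url, collection, stmt, num_workers=None,
                      worker_collection=None, worker_zkhost=None,
                      aggregation_mode=None):
    """Return (url, form body) for a request to the /sql handler."""
    if aggregation_mode is not None and aggregation_mode not in AGGREGATION_MODES:
        raise ValueError(f"aggregationMode must be one of {sorted(AGGREGATION_MODES)}")
    params = {"stmt": stmt}
    optional = {
        "numWorkers": num_workers,
        "workerCollection": worker_collection,
        "workerZkhost": worker_zkhost,
        "aggregationMode": aggregation_mode,
    }
    # Only send the optional parameters the caller actually set.
    params.update({k: v for k, v in optional.items() if v is not None})
    return f"{base_url}/{collection}/sql", urlencode(params)


sql_url, sql_body = build_sql_request(
    "http://localhost:8983/solr",
    "techproducts",
    "select id from techproducts",
    num_workers=4,
    aggregation_mode="facet",
)
```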

The WHERE clause

• WHERE clauses are all pushed down to the search layer

select id
  where popularity=10 // simple match on numeric field "popularity"
  where popularity='[5 TO 10]' // solr range query (note the quotes)
  where name='hard drive' // phrase query on the "name" field
  where name='((memory retail) AND popularity:[5 TO 10])' // arbitrary solr query
  where name='(memory retail)' AND popularity='[5 TO 10]' // boolean logic
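One way to picture the push-down is as a mapping from each SQL predicate to a Solr query clause. The Python sketch below is a guess at that mapping based only on the comments in the examples above; the function `to_solr_clause` and the exact output strings are assumptions, not the handler's actual translation.

```python
def to_solr_clause(field, value):
    """Map one SQL equality predicate to a Solr-style query clause (assumed mapping)."""
    if isinstance(value, (int, float)):
        return f"{field}:{value}"        # simple match on a numeric field
    if value.startswith(("[", "{", "(")):
        return f"{field}:{value}"        # range or parenthesized query passes through
    return f'{field}:"{value}"'          # any other quoted string becomes a phrase


clauses = [
    to_solr_clause("popularity", 10),
    to_solr_clause("popularity", "[5 TO 10]"),
    to_solr_clause("name", "hard drive"),
    to_solr_clause("name", "(memory retail)"),
]
# Boolean logic between predicates is kept as-is.
combined = " AND ".join(clauses[3:] + clauses[1:2])
```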

Ordering and Limiting

select id,score from techproducts
  where text='(memory hard drive)'
  ORDER BY popularity desc // default order is score desc for limited queries
  LIMIT 100

• Limited queries use /select handler
• Unlimited queries use /export handler
  • fields selected need to be docValues
  • fields in "order by" need to be docValues
  • no "score" field allowed

More SQL examples

select distinct fieldA as fa, fieldB as fb from tableA order by fa desc, fb desc

// simple stats
select count(fieldA) as count, sum(fieldB) as sum from tableA where fieldC = 'Hello'

select fieldA, fieldB, count(*), sum(fieldC), avg(fieldY)
  from tableA
  where fieldC = 'term1 term2'
  group by fieldA, fieldB
  having ((sum(fieldC) > 1000) AND (avg(fieldY) <= 10))
  order by sum(fieldC) asc

Solr JDBC Driver

Solr JDBC driver works with Zeppelin

More Solr6 Features

Graph Query

• Basic (non-distributed) graph traversal query
• Follows nodes to edges, optionally filtering during traversal
• Currently only a "filter" query (produces a set of documents)
• Parameters: from, to, traversalFilter, returnRoot, returnOnlyLeaf, maxDepth
• This example query matches “Philip J. Fry” and all of his ancestors:
  fq={!graph from=parent_id to=id}id:"Philip J. Fry"
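The result the example describes, a root document plus all of its ancestors, amounts to a breadth-first walk over parent links. The Python sketch below shows that walk on an in-memory list of documents. It illustrates the outcome, not how Solr implements the graph query or interprets `from`/`to`; the function name, the `max_depth` argument, and the family data are hypothetical.

```python
from collections import deque


def with_ancestors(docs, root_id, max_depth=None):
    """Return the ids of the root document and every ancestor reachable via parent_id."""
    by_id = {d["id"]: d for d in docs}
    seen = {root_id}
    queue = deque([(root_id, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not traverse past the depth limit
        for parent in by_id.get(node, {}).get("parent_id", []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return seen


# Hypothetical documents; parent_id is multi-valued.
family = [
    {"id": "Philip J. Fry", "parent_id": ["parent A", "parent B"]},
    {"id": "parent A", "parent_id": ["grandparent A"]},
    {"id": "parent B", "parent_id": []},
    {"id": "grandparent A", "parent_id": []},
    {"id": "unrelated", "parent_id": []},
]
everyone = with_ancestors(family, "Philip J. Fry")
parents_only = with_ancestors(family, "Philip J. Fry", max_depth=1)
```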

Scoring changes

• For docCount (i.e. idf) in scoring, use the number of documents with that field rather than the number of documents in the whole index (maxDoc).
  • can add documents of a different type and not disturb/skew scoring
• BM25 scoring by default
  • tweakable on a per-fieldType basis ("k1" and "b" factors)
  • classic tf-idf still available
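The effect of both changes can be seen in the textbook BM25 formula. The Python sketch below uses the commonly cited form of BM25 with the defaults k1=1.2 and b=0.75; it ignores implementation details such as how Lucene encodes document length, and the function names and numbers are chosen for illustration only.

```python
import math


def idf(doc_freq, doc_count):
    """Inverse document frequency; doc_count is the number of docs that have the field."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25(tf, doc_freq, doc_count, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Score of one term in one document under textbook BM25."""
    # b controls how strongly document length is normalized; k1 controls tf saturation.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf(doc_freq, doc_count) * tf * (k1 + 1) / (tf + norm)


# A term found in 10 docs: 100 docs have the field, but the whole index holds 1000 docs.
idf_field_docs = idf(10, 100)
idf_whole_index = idf(10, 1000)
```

Using the whole-index count inflates idf when many documents lack the field, which is the skew the docCount change avoids.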

Cross DC Replication

Thank you

yonik@cloudera.com
