Solr 6 Feature Preview

Post on 11-Feb-2017


© Cloudera, Inc. All rights reserved.

Solr 6 Feature Preview

Yonik Seeley, 3/09/2016


My Background

• Creator of Solr
• Cloudera Engineer
• LucidWorks Co-Founder
• Lucene/Solr committer, PMC member
• Apache Software Foundation member
• M.S. in Computer Science, Stanford


Solr 6

• Happy Birthday Solr!
• 10 Years at the Apache Software Foundation as of 1/2016

• Release branch has been cut
• ETA before April
• Java 8+ only


Streaming Expressions


Solr Streaming Expressions

• Generic platform for distributed computation
• The basis for implementing distributed SQL
• Works across entire result sets (or subsets)
  • normal search operations are designed for fast top-N operations
• Map-reduce like "shuffle" partitions result sets for greater scalability
• Worker nodes can be allocated from a collection for parallelism


Tuple Streams

• A streaming expression compiles/parses to a tuple stream
  • direct mapping from a streaming expression function -> tuple_stream
• Stream Sources – produce a tuple stream
• Stream Decorators – operate on tuple streams
• Designed to include streams from non-Solr systems


search() expression

$ curl http://localhost:8983/solr/techproducts/stream -d 'expr=search(techproducts, q="*:*", fl="id,price,score", sort="id asc")'

{"result-set":{"docs":[{"score":1.0,"id":"0579B002","price":179.99},{"score":1.0,"id":"100-435805","price":649.99},{"score":1.0,"id":"3007WFP","price":2199.0},{"score":1.0,"id":"VDBDB1A16"},{"score":1.0,"id":"VS1GB400C3","price":74.99},{"EOF":true,"RESPONSE_TIME":6}]}}

resulting tuple stream
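The request and response above can also be driven from a small client. The following is a minimal Python sketch, assuming a Solr node at the base URL used in the slide; the function names `parse_tuples` and `run_expression` are made up for illustration, and the helper reads the whole response into memory instead of streaming it.

```python
import json
import urllib.parse
import urllib.request


def parse_tuples(payload):
    """Yield the tuples of a /stream JSON response, stopping at the EOF marker."""
    for doc in json.loads(payload)["result-set"]["docs"]:
        if doc.get("EOF"):
            return
        yield doc


def run_expression(base_url, collection, expr):
    """POST a streaming expression to <base_url>/<collection>/stream (needs a live Solr)."""
    body = urllib.parse.urlencode({"expr": expr}).encode("utf-8")
    url = f"{base_url}/{collection}/stream"
    with urllib.request.urlopen(url, data=body) as resp:
        return list(parse_tuples(resp.read().decode("utf-8")))


# Example call against a running node (not executed here):
# run_expression("http://localhost:8983/solr", "techproducts",
#                'search(techproducts, q="*:*", fl="id,price,score", sort="id asc")')
```

The final `{"EOF":true,...}` entry is a terminator and not a document, which is why the parser drops it.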


Search Tuple Stream

[Figure: shard replicas (shards 1 to 3) each send a tuple stream to a worker; the /stream worker executes the "search" expression.]

• search() is a stream source
• SolrCloud aware (CloudSolrStream java class)
• Fully streaming (no big buffers)
• Worker node doesn't need to be a Solr node


search expression args

search(                              // parses to CloudSolrStream java class
  techproducts,                      // name of the collection to search
  zkHost="localhost:9983",           // (opt) zookeeper address of collection to search
  qt="/select",                      // (opt) the request handler to use (/export is also available)
  rows=1000000,                      // (opt) number of rows to retrieve
  q=*:*,                             // query to match returned documents
  fl="id,price,score",               // which fields to return
  sort="id asc, price desc",         // how to sort the results
  aliases="id=myid,price=myprice"    // (opt) renames output fields
)


reduce() streaming expression

• Groups tuples by common field values
• Emits one group-head per group
• Each group-head contains list of tuples
• "by" parameter must match up with "sort" parameter
• Any partitioning should be done on same group field.

reduce(
  search(collection1, qt="/export", q="*:*", fl="id,manu,price", sort="manu asc, price desc"),
  by="manu",
  group(sort="price desc", n=100)
)

stream operation
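Why "by" has to line up with "sort" is easier to see in a small simulation. The Python sketch below is not Solr code: `reduce_by` is a hypothetical helper, and the `"group"` key on the emitted head is an assumption about the output shape. It groups only adjacent tuples, which is all a streaming reducer can do without buffering.

```python
from itertools import groupby
from operator import itemgetter


def reduce_by(sorted_tuples, by, group_sort, n):
    """Emit one group head per run of equal `by` values in an already sorted stream."""
    for _, run in groupby(sorted_tuples, key=itemgetter(by)):
        # Keep at most n tuples per group, ordered by group_sort descending.
        members = sorted(run, key=itemgetter(group_sort), reverse=True)[:n]
        head = dict(members[0])   # copy so the head can carry the member list
        head["group"] = members
        yield head
```

If the input is not sorted on the grouping field, the same value shows up in several runs and produces several heads, which is the failure the slide's rule guards against.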


rollup() expression

• Groups tuples by common field values
• Emits rollup value along with metrics
• Closest equivalent to faceting

rollup(
  search(collection1, qt="/export", q="*:*", fl="id,manu,price", sort="manu asc"),
  over="manu",
  count(*),
  max(price)
)

metrics

{"result-set":{"docs":[{"manu":"apple","count(*)":1.0},{"manu":"asus","count(*)":1.0},{"manu":"ati","count(*)":1.0},{"manu":"belkin","count(*)":2.0},{"manu":"canon","count(*)":2.0},{"manu":"corsair","count(*)":3.0},[...]
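A rollup over a sorted stream needs only one accumulator at a time. The following Python sketch illustrates that idea under stated assumptions: `rollup` here is a hypothetical stand-in, it supports only `count(*)` and one `max(...)` metric, and it reports counts as integers.

```python
def rollup(sorted_tuples, over, max_field):
    """Single pass: emit count(*) and max(max_field) per run of equal `over` values."""
    current = None  # accumulator for the group being read
    for t in sorted_tuples:
        if current is not None and t[over] != current[over]:
            yield current           # the run ended, so the group is complete
            current = None
        if current is None:
            current = {over: t[over], "count(*)": 0}
        current["count(*)"] += 1
        value = t.get(max_field)    # tuples may lack the field entirely
        if value is not None:
            key = f"max({max_field})"
            current[key] = max(value, current.get(key, value))
    if current is not None:
        yield current
```

Memory use stays constant per group because nothing but the running metrics is retained.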


facet() expression

• Like search+rollup, but pushes down computation to JSON Facet API

facet(
  techproducts,
  q="*:*",
  buckets="manu",
  bucketSorts="count(*) desc",
  bucketSizeLimit=1000,
  count(*),
  avg(price),
  max(popularity)
)

{"result-set":{"docs":[{"avg(price)":129.99, "max(popularity)":7.0,"manu":"corsair","count(*)":3},{"avg(price)":15.72,"max(popularity)":1.0,"manu":"belkin","count(*)":2},{"avg(price)":254.97,"max(popularity)":7.0,"manu":"canon","count(*)":2},{"avg(price)":399.0,"max(popularity)":10.0,"manu":"apple","count(*)":1},{"avg(price)":479.95,"max(popularity)":7.0,"manu":"asus","count(*)":1},{"avg(price)":649.98,"max(popularity)":7.0,"manu":"ati","count(*)":1},{"avg(price)":0.0,"max(popularity)":"NaN","manu":"boa","count(*)":1},[...]
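For comparison, a hand-written JSON Facet API request computing the metrics seen in the output above (`avg(price)`, `max(popularity)`) might look roughly like this. It is a sketch only: the exact request that facet() generates internally is not shown in the slides, and `facet_request` plus the output names `avg_price` and `max_popularity` are illustrative assumptions.

```python
import json


def facet_request(field, limit, metrics):
    """Build a JSON Facet API terms facet with nested aggregations."""
    return {
        field: {
            "type": "terms",
            "field": field,
            "limit": limit,
            "sort": "count desc",
            "facet": dict(metrics),  # output name -> aggregation function
        }
    }


request = facet_request(
    "manu", 1000, {"avg_price": "avg(price)", "max_popularity": "max(popularity)"}
)
# Query parameters for a /select request; rows=0 because only buckets are wanted.
params = {"q": "*:*", "rows": 0, "json.facet": json.dumps(request)}
```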


Parallel Tuple Stream

[Figure: shard replicas feed two partitioned workers (Worker Partition 1 and Worker Partition 2), whose output goes to a final worker that emits a single tuple stream.]


Streaming Expressions – parallel

• Wraps a stream and sends to N worker nodes
• The first parameter is the collection to use for the intermediate worker nodes
• partitionKeys must be provided to underlying workers
  • usually makes sense to partition by what you are grouping on
• inner and outer sorts should match

parallel(
  collection1,
  rollup(
    search(techproducts, q="*:*", fl="id,manu,price", sort="manu asc", partitionKeys="manu"),
    over="manu"
  ),
  workers=2,
  zkHost="localhost:9983",
  sort="manu asc"
)
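The partition-then-merge pattern can be imitated in a few lines of Python. This is a sketch under assumptions, not Solr's implementation: `zlib.crc32` stands in for whatever hash Solr applies to partitionKeys, the workers run sequentially, and all function names are hypothetical. It shows why partitioning on the grouping field keeps each group on one worker, and why matching sorts let the outer stream merge results cheaply.

```python
import heapq
import zlib
from itertools import groupby
from operator import itemgetter


def partition(tuples, key, workers):
    """Route each tuple to a worker bucket by hashing its partition key."""
    buckets = [[] for _ in range(workers)]
    for t in tuples:
        slot = zlib.crc32(str(t[key]).encode("utf-8")) % workers
        buckets[slot].append(t)   # input order is kept, so buckets stay sorted
    return buckets


def count_per_key(sorted_tuples, key):
    """A tiny rollup: count(*) per run of equal key values."""
    for value, run in groupby(sorted_tuples, key=itemgetter(key)):
        yield {key: value, "count(*)": sum(1 for _ in run)}


def parallel_rollup(sorted_tuples, key, workers):
    """Roll up each partition independently, then merge the sorted partial streams."""
    partial = [count_per_key(b, key) for b in partition(sorted_tuples, key, workers)]
    return list(heapq.merge(*partial, key=itemgetter(key)))
```

Because every tuple with the same key lands in the same bucket, the merged result matches what a single worker would have produced.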


Joins!

innerJoin(
  search(people, q=*:*, fl="personId,name", sort="personId asc"),
  search(pets, q=type:cat, fl="personId,petName", sort="personId asc"),
  on="personId"
)

Also available: leftOuterJoin, hashJoin, outerHashJoin
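Both searches in the innerJoin example sort on the join field, which is what a merge join needs. The Python sketch below shows that technique; it is an illustration rather than Solr's code, `inner_join` is a hypothetical name, and for brevity it materializes both inputs as lists instead of reading them incrementally.

```python
def inner_join(left, right, on):
    """Merge-join two inputs that are both sorted ascending on `on`."""
    left, right = list(left), list(right)
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][on], right[j][on]
        if lk < rk:
            i += 1                      # no partner on the right, skip
        elif lk > rk:
            j += 1                      # no partner on the left, skip
        else:
            j_end = j                   # find the run of right tuples with this key
            while j_end < len(right) and right[j_end][on] == lk:
                j_end += 1
            while i < len(left) and left[i][on] == lk:
                for r in right[j:j_end]:
                    yield {**left[i], **r}
                i += 1
            j = j_end
```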


More decorators

• complement – emits tuples from A which do not exist in B
• intersect – emits tuples from A which do exist in B
• merge
• top – reorders the stream and returns the top N tuples
• unique – emits only the first tuple for each value
• select – select, rename, or give default values to fields in a tuple
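Two of the decorators listed above are simple enough to mimic directly. The Python sketch below assumes `unique` receives a stream already sorted on the field it deduplicates, and it uses a bounded heap for `top`; both function names mirror the decorators but are plain illustrations, not Solr APIs.

```python
import heapq
from operator import itemgetter


def unique(sorted_tuples, over):
    """Emit only the first tuple of each run of equal `over` values."""
    last = object()                 # sentinel that equals no field value
    for t in sorted_tuples:
        if t[over] != last:
            last = t[over]
            yield t


def top(tuples, n, field, descending=True):
    """Reorder the stream and return the n best tuples by `field`."""
    pick = heapq.nlargest if descending else heapq.nsmallest
    return pick(n, tuples, key=itemgetter(field))
```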


Interesting streams

• update stream – indexes input into another SolrCloud collection!
• daemon stream – blocks until more data is available from underlying stream
• topic stream – a publish/subscribe messaging service
  • checkpoints are persisted in a Solr collection
  • resubmit to get new stuff
  • combine with daemon stream to automatically get continuous updates over time
  • further combine with update stream to push all matches to another collection

topic(checkpointCollection, dataCollection, id="topicA", q="solr rocks", checkpointEvery="1000")


jdbc() expression stream

join with other data sources!

innerJoin( // example from JDBCStreamTest
  select(
    search(collection1, fl="personId_i,rating_f", q="rating_f:*", sort="personId_i asc"),
    personId_i as personId,
    rating_f as rating
  ),
  select(
    jdbc(connection="jdbc:hsqldb:mem:.",
         sql="select PEOPLE.ID as PERSONID, PEOPLE.NAME, COUNTRIES.COUNTRY_NAME from PEOPLE inner join COUNTRIES on PEOPLE.COUNTRY_CODE = COUNTRIES.CODE order by PEOPLE.ID",
         sort="ID asc",
         get_column_name=true),
    ID as personId,
    NAME as personName,
    COUNTRY_NAME as country
  ),
  on="personId"
)


Parallel SQL


/sql Handler

• /sql handler is there by default on all Solr nodes
• Translates SQL -> parallel streaming expressions
• SQL tables map to SolrCloud collections
• Query planner / optimizer
• Currently uses Presto parser
• May switch to Apache Calcite?


Simplest SQL Example

$ curl http://localhost:8983/solr/techproducts/sql -d "stmt=select id from techproducts"

{"result-set":{"docs":[{"id":"EN7800GTX/2DHTV/256M"},{"id":"100-435805"},{"id":"UTF8TEST"},{"id":"SOLR1000"},{"id":"9885A004"},[...]

tables map to collections


SQL handler HTTP parameters

curl http://localhost:8983/solr/techproducts/sql -d '
  &stmt=<sql_statement>
  &numWorkers=4                      // currently used by GROUP BY and DISTINCT (via parallel stream)
  &workerCollection=collection1      // where to create intermediate workers
  &workerZkhost=localhost:9983       // cluster (zookeeper ensemble) address
  &aggregationMode=map_reduce | facet
'
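The annotated command above is not directly pasteable because of the inline comments, so a client would normally assemble the form body itself. The Python sketch below does that with the standard library; `sql_request_body` and its keyword names are hypothetical, while the parameter names it emits are the ones listed on the slide.

```python
import urllib.parse


def sql_request_body(stmt, num_workers=None, worker_collection=None,
                     worker_zkhost=None, aggregation_mode=None):
    """Build the URL-encoded POST body for the /sql handler."""
    if aggregation_mode not in (None, "map_reduce", "facet"):
        raise ValueError("aggregationMode must be map_reduce or facet")
    params = {"stmt": stmt}
    optional = {
        "numWorkers": num_workers,
        "workerCollection": worker_collection,
        "workerZkhost": worker_zkhost,
        "aggregationMode": aggregation_mode,
    }
    # Only send the optional parameters that were actually set.
    params.update({k: v for k, v in optional.items() if v is not None})
    return urllib.parse.urlencode(params)
```

The returned string can be passed as the value of curl's `-d` option or as the body of any HTTP POST to the handler.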


The WHERE clause

• WHERE clauses are all pushed down to the search layer

select id
  where popularity=10                                          // simple match on numeric field "popularity"
  where popularity='[5 TO 10]'                                 // solr range query (note the quotes)
  where name='hard drive'                                      // phrase query on the "name" field
  where name='((memory retail) AND popularity:[5 TO 10])'      // arbitrary solr query
  where name='(memory retail)' AND popularity='[5 TO 10]'      // boolean logic


Ordering and Limiting

select id,score from techproducts
  where text='(memory hard drive)'
  ORDER BY popularity desc   // default order is score desc for limited queries
  LIMIT 100

• Limited queries use /select handler
• Unlimited queries use /export handler
  • fields selected need to be docValues
  • fields in "order by" need to be docValues
  • no "score" field allowed
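The rules in this list amount to a small decision procedure. A Python sketch of it follows, assuming the docValues and "score" restrictions apply to unlimited (/export) queries as the slide's nesting suggests; `choose_handler` is a hypothetical helper, not part of Solr.

```python
def choose_handler(select_fields, order_by_fields, doc_values_fields, limit=None):
    """Pick the request handler a SQL query would use, enforcing the /export rules."""
    if limit is not None:
        return "/select"            # limited queries may use score and any stored field
    used = set(select_fields) | set(order_by_fields)
    if "score" in used:
        raise ValueError('unlimited queries cannot use the "score" field')
    missing = used - set(doc_values_fields)
    if missing:
        raise ValueError(f"fields need docValues: {sorted(missing)}")
    return "/export"
```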


More SQL examples

select distinct fieldA as fa, fieldB as fb
  from tableA
  order by fa desc, fb desc

// simple stats
select count(fieldA) as count, sum(fieldB) as sum
  from tableA
  where fieldC = 'Hello'

select fieldA, fieldB, count(*), sum(fieldC), avg(fieldY)
  from tableA
  where fieldC = 'term1 term2'
  group by fieldA, fieldB
  having ((sum(fieldC) > 1000) AND (avg(fieldY) <= 10))
  order by sum(fieldC) asc


Solr JDBC Driver


Solr JDBC driver works with Zeppelin


More Solr6 Features


Graph Query

• Basic (non-distributed) graph traversal query
• Follows nodes to edges, optionally filtering during traversal
• Currently only a "filter" query (produces a set of documents)
• Parameters: from, to, traversalFilter, returnRoot, returnOnlyLeaf, maxDepth
• This example query matches "Philip J. Fry" and all of his ancestors:

fq={!graph from=parent_id to=id}id:"Philip J. Fry"
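The traversal can be pictured as a breadth-first walk: take the `from` values of the current documents and look up the documents whose `to` field matches them. The Python sketch below simulates that over in-memory dicts. It is not Solr code; it assumes `parent_id` is multi-valued, that a `maxDepth` of -1 means unlimited, and the sample ids other than "Philip J. Fry" are invented placeholders.

```python
def graph_query(docs, root_ids, from_field="parent_id", to_field="id", max_depth=-1):
    """Return ids of the root docs plus everything reached via from_field -> to_field."""
    index = {}
    for doc in docs:
        index.setdefault(doc[to_field], []).append(doc)
    matched = {}
    frontier = [doc for doc in docs if doc["id"] in root_ids]
    depth = 0
    while frontier:
        for doc in frontier:
            matched[doc["id"]] = doc
        if 0 <= max_depth <= depth:
            break                   # depth limit reached, stop expanding
        edge_values = {v for doc in frontier for v in doc.get(from_field, [])}
        frontier = [
            hit
            for value in edge_values
            for hit in index.get(value, [])
            if hit["id"] not in matched   # also protects against cycles
        ]
        depth += 1
    return set(matched)
```

With `from=parent_id to=id`, each hop moves from a person to that person's parents, so repeated hops collect ancestors and never siblings or descendants.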


Scoring changes

• For docCount (i.e. idf) in scoring, use the number of documents with that field rather than the number of documents in the whole index (maxDoc).
  • can add documents of a different type and not disturb/skew scoring
• BM25 scoring by default
  • tweakable on a per-fieldType basis ("k1" and "b" factors)
  • classic tf-idf still available
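To make the "k1" and "b" factors and the docCount change concrete, here is the textbook BM25 term score in Python. Treat it as a sketch: the defaults k1=1.2 and b=0.75 are the commonly cited ones, the function names are illustrative, and a real index stores document lengths in a lossy encoding, so actual scores will differ slightly.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """idf using doc_count = number of documents that have the field."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score(freq, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Score of one term in one document."""
    # b controls how strongly document length normalizes the term frequency;
    # k1 controls how quickly repeated occurrences stop adding to the score.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return bm25_idf(doc_count, doc_freq) * freq * (k1 + 1) / (freq + norm)
```

Since `doc_count` counts only documents that contain the field, indexing documents of another type that lack the field leaves the idf, and therefore the scores, unchanged.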


Cross DC Replication


Thank you

yonik@cloudera.com