Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks

OCTOBER 11-14, 2016 BOSTON, MA

Transcript of Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks

Page 1: October 11-14, 2016 • Boston, MA

Page 2: Your Big Data Stack is Too Big!

Timothy Potter
Apache Lucene/Solr Committer & PMC Member, Lucidworks

Page 3: How we got here and where we’re going …

• Giving away 2 books ~ tweet including: #FusionBigData #LuceneSolrRev

• A quick trip down memory lane …

Cassandra, Pig, Hive, HCatalog, HDFS, Mahout, Sqoop, Oozie, Storm, and of course Solr!

• Big Data integration trap

• Lucidworks Fusion provides a viable alternative that emphasizes fast access, agility, and automation

Page 4: A few patterns emerge …

• Begins with need for better relevancy ~ automatically

• More and more mission-critical data lives in Fusion

• Much of big data is unstructured, making search the ideal exploration technology ~ people grok search

• Speed is addictive!

• But integrating these two is a non-trivial problem to solve -> Fusion FTW!

• fusion-spark-bootcamp: http://bit.ly/2dZfBhk

Page 5: Data Ingest

• Connectors! Lots of them …

• Pipelines … because data ingest is messy

• JavaScript when you must!

• SparkSQL too! Replace DIH with SparkSQL JDBC datasource: 31K docs / sec on a small Spark cluster

gist.github.com/kiranchitturi/0be62fc13e4ec7f9ae5def53180ed181

• Spark Streaming to Solr too

Page 6: Time-based Partitioning

• Docs partitioned into time-based collections in Solr

• New time partitions created on-the-fly when needed; older partitions should age out automatically

• Need a document router to index docs in the correct collection based on timestamp (doesn’t use aliases)

• Need a query router to read the appropriate collections based on query time range

• Deeper analytics on larger historical time ranges achieved using Spark by joining Solr with archived files stored in HDFS

• Check out the eventsim lab in the bootcamp
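The document router, query router, and age-out rules above can be sketched in a few lines of Python. This is a minimal illustration, not Fusion's implementation: the `logs` prefix, the one-collection-per-UTC-day naming scheme, and the function names are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import List

# Hypothetical naming scheme: one collection per UTC day, e.g. "logs_2016_10_11".
PREFIX = "logs"

def collection_for(ts: datetime) -> str:
    """Document router: pick the collection for a doc's timestamp."""
    day = ts.astimezone(timezone.utc)
    return f"{PREFIX}_{day:%Y_%m_%d}"

def collections_for_range(start: datetime, end: datetime) -> List[str]:
    """Query router: list every daily collection overlapping [start, end]."""
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    names = []
    while day <= last:
        names.append(f"{PREFIX}_{day:%Y_%m_%d}")
        day += timedelta(days=1)
    return names

def expired(names: List[str], now: datetime, keep_days: int) -> List[str]:
    """Age-out: collections that fall before the retention window."""
    cutoff = collection_for(now - timedelta(days=keep_days))
    # Zero-padded date suffixes sort lexicographically in time order.
    return sorted(n for n in names if n < cutoff)
```

A query for a three-day range would fan out to three collections, while each incoming doc lands in exactly one. A real router would also need to create missing collections on the fly, which this sketch leaves out.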

Page 7: Common access patterns with big data

• Big data systems have grown complex trying to satisfy a variety of access patterns

• Fast primary key lookups / atomic updates (Solr, HBase, Cassandra, …)

• Low-latency ranked retrieval and facet-driven discovery (Solr, Elastic, DataStax, …)

• Large, distributed table scans (Spark, M/R, Pig, Cassandra, Hive, Impala, …)

• Graph traversal (GraphX, Giraph, Neo4j, …)

Page 8: Solr Streaming Inside

• Relies on docValues (column-oriented data structure) and /export handler

• Extreme read performance (8-10x faster than queries using cursorMark)

• Facet or map/reduce style aggregation modes

• Tiered architecture

• SQL interface tier

• Worker tier (scale a pool of worker “nodes” independently of the data collection)

• Data tier (Solr collection)
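To make the /export handler concrete, here is a small Python sketch that builds an export request URL. The handler streams a fully sorted result set, so it needs both a field list (`fl`) and a `sort`, and the fields involved must have docValues enabled. The base URL and the helper name `export_url` are assumptions for illustration only.

```python
from urllib.parse import urlencode

def export_url(base, collection, q, fields, sort):
    """Build a Solr /export request URL (sketch; `base` is a hypothetical host)."""
    # /export streams the whole sorted result set, so fl and sort are mandatory.
    if not fields or not sort:
        raise ValueError("/export needs both fl and sort")
    params = {"q": q, "fl": ",".join(fields), "sort": sort}
    return f"{base}/{collection}/export?{urlencode(params)}"

url = export_url(
    "http://localhost:8983/solr",   # assumed local Solr node
    "movielens",
    "*:*",
    ["user_id_i", "rating_i"],
    "user_id_i asc",
)
```

The same sorted stream is what the worker tier consumes when a streaming expression is run in parallel.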

Page 9: Fusion Signals for Relevance

• Simple DSL for aggregating user interactions with search results, quite useful for boosting & recommendations

• Scale using Spark

• Take user activity and feed it back into the search engine to improve relevancy using Fusion query pipelines

• Integrated with Lucidworks View to capture user activity

• Custom logic via JavaScript … don’t get bogged down in the weeds of Spark
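The idea of aggregating user interactions and feeding them back as boosts can be illustrated with a plain-Python sketch. It does not use Fusion's aggregation DSL or Spark; the signal types, the weights, and the function names are hypothetical and chosen only to show the shape of the computation.

```python
from collections import defaultdict

# Hypothetical weights per signal type; these are not Fusion defaults.
WEIGHTS = {"click": 1.0, "cart": 3.0, "purchase": 5.0}

def aggregate(signals):
    """Sum weighted signals per (query, doc_id) pair."""
    totals = defaultdict(float)
    for s in signals:
        totals[(s["query"], s["doc_id"])] += WEIGHTS.get(s["type"], 0.0)
    return dict(totals)

def boosts_for(query, totals, top_n=3):
    """Return the top-weighted docs for a query, strongest first."""
    scored = [(doc, w) for (q, doc), w in totals.items() if q == query]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]

signals = [
    {"query": "ipad", "doc_id": "d1", "type": "click"},
    {"query": "ipad", "doc_id": "d1", "type": "click"},
    {"query": "ipad", "doc_id": "d2", "type": "purchase"},
    {"query": "ipad", "doc_id": "d2", "type": "click"},
    {"query": "case", "doc_id": "d3", "type": "click"},
]
totals = aggregate(signals)
```

A query pipeline could then look up `boosts_for("ipad", totals)` at query time and boost those documents; at scale the aggregation step is the part that would run on Spark.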

Page 10: Self-service Analytics

• Can’t overstate the importance of SQL in big data

• Shortage of data scientists and engineers, abundance of SQL-savvy business analysts

• JDBC-compliant tools abound!

• De-normalization is inconvenient

• Apache Zeppelin for exploring data in Solr and other data sources

Page 11: Best of Both Worlds: Spark SQL and Solr SQL

• Spark SQL provides an amazing query plan optimizer with SQL:2003 support

• BUT … Spark SQL can’t compete with Solr performance for queries that can be expressed in Solr

• Push down aggregations into the engine!

• spark-solr tries to detect when sub-queries can be pushed down into Solr

• movielens lab in fusion-spark-bootcamp

https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html

Page 12: Fusion Catalog API

• REST API for CRUD on data assets: views, tables, UDFs, etc

• Full-text search for business analysts to find data sets of interest

• Tool for SMEs to share complex data sets as simple views

• Authn & Authz via Fusion security

• Seamless integration with SparkSQL, streaming expressions, parallel SQL, and JDBC

parallel(workers,
    hashJoin(
        search(movielens, q=*:*,
               fl="user_id_i,movie_id_i,rating_i",
               sort="movie_id_i asc",
               partitionKeys="movie_id_i"),
        hashed=search(movielens_movies, q=*:*,
                      fl="movie_id_i,title_s,genre_s",
                      sort="movie_id_i asc",
                      partitionKeys="movie_id_i"),
        on="movie_id_i"
    ),
    workers="4",
    sort="movie_id_i asc"
)
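For readers new to streaming expressions, the join in the expression above can be mimicked in plain Python. A hash join reads the `hashed` stream fully into an in-memory index keyed by the join field, then streams the other side past it. The sketch below is an illustration of that technique, not Solr's code; the sample rows are made up, and letting the streamed row win on duplicate field names is an assumption.

```python
from collections import defaultdict

def hash_join(left, hashed, on):
    """Join two iterables of dicts on field `on` using an in-memory hash index."""
    # Build phase: load the (smaller) hashed side into memory.
    index = defaultdict(list)
    for row in hashed:
        index[row[on]].append(row)
    # Probe phase: stream the left side and emit merged rows for each match.
    for row in left:
        for match in index.get(row[on], []):
            yield {**match, **row}  # assumption: streamed row wins on conflicts

# Made-up sample data using the field names from the expression.
ratings = [
    {"user_id_i": 1, "movie_id_i": 10, "rating_i": 4},
    {"user_id_i": 2, "movie_id_i": 11, "rating_i": 5},
    {"user_id_i": 3, "movie_id_i": 99, "rating_i": 1},  # no matching movie
]
movies = [
    {"movie_id_i": 10, "title_s": "Heat", "genre_s": "Crime"},
    {"movie_id_i": 11, "title_s": "Alien", "genre_s": "Sci-Fi"},
]
joined = list(hash_join(ratings, movies, "movie_id_i"))
```

Because only the hashed side is held in memory, it makes sense to hash the smaller collection (movies) and stream the larger one (ratings), as the expression does.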

Page 13: Custom Script Jobs

• Not limited by our built-in toolset

• Develop a custom Spark script in Scala and then upload it to Fusion to be scheduled and run on a Spark cluster

• Focus more on solving business problems vs. ops / job mgmt

• See apachelogs example in the fusion-spark-bootcamp

sessionize using window function and then compute aggregations for each session
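The sessionize-then-aggregate pattern can be shown without Spark. The Python sketch below is a stand-in for a window function over each user's time-ordered events: a new session starts whenever the gap from the previous event exceeds a timeout. The 30-minute timeout, the event shape, and the function names are assumptions; the bootcamp's apachelogs example is a Scala script and may differ.

```python
from itertools import groupby
from operator import itemgetter

GAP = 30 * 60  # hypothetical session timeout, in seconds

def sessionize(events):
    """events: iterable of (user, epoch_seconds). Returns (user, session_no, ts) rows."""
    out = []
    # Sorting by (user, ts) plays the role of PARTITION BY user ORDER BY ts.
    for user, rows in groupby(sorted(events), key=itemgetter(0)):
        session, prev = 0, None
        for _, ts in rows:
            # Comparing with the previous event is the equivalent of lag().
            if prev is not None and ts - prev > GAP:
                session += 1
            out.append((user, session, ts))
            prev = ts
    return out

def session_stats(labeled):
    """Aggregate hits and duration (seconds) per (user, session_no)."""
    stats = {}
    for key, rows in groupby(labeled, key=lambda r: (r[0], r[1])):
        times = [ts for _, _, ts in rows]
        stats[key] = {"hits": len(times), "duration": times[-1] - times[0]}
    return stats

events = [("a", 0), ("a", 600), ("a", 4000), ("b", 100)]
stats = session_stats(sessionize(events))
```

Here user `a` gets two sessions because the jump from 600 to 4000 seconds exceeds the timeout, and the per-session aggregates follow from the labels.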

Page 14: Data science in a box

• REPL with hooks into Solr for quickly exploring unstructured data sets

• Jake's RecSys recipe for building recommender systems

• Full access to Lucene text analyzers when building ML pipelines

• See mlsvm & ml20news labs in the fusion-spark-bootcamp

• searchhub.lucidworks.com

see slides from Grant’s talk about SearchHub

Page 15: Machine Learning in Index & Query Pipelines

• Query intent

• Document classification

• Recommendations

• Design / evaluate / refine models in Spark ML pipelines or MLlib and then publish to Fusion to generate predictions from query / index pipelines

Page 16: Example: Sentiment Classifier during Indexing

[Screenshot callouts]

• ID of model stored in Fusion’s blob store

• Field to store model prediction in each document during indexing

Page 17: You could do this yourself …

• It’s too easy to fall back into the trap of thinking that hard work getting cool technologies working together equates to business value.

• Get back to focusing on solving business problems ~ increased ROI, faster

• Fusion gives you a clear buy vs. build choice

Page 18: Fusion Architecture

[Diagram: Fusion Architecture. Labels: Millions of Users; REST; Proxy; Security woven throughout; Connectors, Pipes, Recs, Signals, NLP, Metrics, Sched., Blobs, Admin; Spark (Cluster Mgr., Workers); Solr (Shards); HDFS; Optional; Billions of Docs; ZooKeeper (ZK 1 … ZK N): Shared Config Mgmt, Leader Election, Load Balancing]

Page 19: Thanks! Q & A

• Try Fusion: https://lucidworks.com/products/fusion/download/

• spark-solr: http://bit.ly/1Ub12GU

• fusion-spark-bootcamp: http://bit.ly/2dZfBhk

• 40% off Manning books coupon code: ctwlucsoltw