Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks

October 11-14, 2016 • Boston, MA

Timothy Potter, Apache Lucene/Solr Committer & PMC Member, Lucidworks

How we got here and where we’re going …

• Giving away 2 books ~ tweet including: #FusionBigData #LuceneSolrRev

• A quick trip down memory lane …

Cassandra, Pig, Hive, HCatalog, HDFS, Mahout, Sqoop, Oozie, Storm, and of course Solr!

• Big Data integration trap

• Lucidworks Fusion provides a viable alternative that emphasizes fast access, agility, and automation

A few patterns emerge …

• Begins with need for better relevancy ~ automatically

• More and more mission-critical data lives in Fusion

• Much of big data is unstructured, making search the ideal exploration technology ~ people grok search

• Speed is addictive!

• But integrating search with big data systems is a non-trivial problem to solve -> Fusion FTW!

• fusion-spark-bootcamp: http://bit.ly/2dZfBhk


Data Ingest

• Connectors! Lots of them …

• Pipelines … because data ingest is messy

• JavaScript when you must!

• SparkSQL too! Replace DIH with SparkSQL JDBC datasource: 31K docs / sec on a small Spark cluster

gist.github.com/kiranchitturi/0be62fc13e4ec7f9ae5def53180ed181

• Spark Streaming to Solr too

Time-based Partitioning

• Docs partitioned into time-based collections in Solr

• New time partitions created on-the-fly when needed; older partitions should age out automatically

• Need a document router to index docs in the correct collection based on timestamp (doesn’t use aliases)

• Need a query router to read the appropriate collections based on query time range

• Deeper analytics on larger historical time ranges achieved using Spark by joining Solr with archived files stored in HDFS

• Check out the eventsim lab in the bootcamp
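The document router, query router, and age-out behavior listed above can be sketched in a few lines. This is a minimal illustration, not Fusion's implementation: it assumes daily partitions, a hypothetical naming scheme of `<base>_YYYY_MM_DD`, and a hypothetical `timestamp` field on each doc.

```python
from datetime import datetime, timedelta, timezone


def partition_name(base, ts):
    """Name of the daily collection for a timestamp (hypothetical naming scheme)."""
    return f"{base}_{ts.astimezone(timezone.utc):%Y_%m_%d}"


def route_document(base, doc):
    """Document router: choose the target collection from the doc's timestamp."""
    return partition_name(base, doc["timestamp"])


def route_query(base, start, end):
    """Query router: list every daily collection that overlaps [start, end]."""
    day = start.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0)
    names = []
    while day <= end:
        names.append(partition_name(base, day))
        day += timedelta(days=1)
    return names


def expired(names, base, now, keep_days):
    """Age-out: collections older than keep_days (names sort chronologically)."""
    cutoff = partition_name(base, now - timedelta(days=keep_days))
    return sorted(n for n in names if n < cutoff)
```

Because the date suffix is zero-padded, plain string comparison is enough to decide which partitions have aged out; a real deployment would also need to create missing collections on the fly before indexing into them.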

Common access patterns with big data

• Big data systems have grown complex trying to satisfy a variety of access patterns

• Fast primary key lookups / atomic updates (Solr, HBase, Cassandra, …)

• Low-latency ranked retrieval and facet-driven discovery (Solr, Elastic, DataStax, …)

• Large, distributed table scans (Spark, M/R, Pig, Cassandra, Hive, Impala, …)

• Graph traversal (GraphX, Giraph, Neo4j, …)

Solr Streaming Inside

• Relies on docValues (column-oriented data structure) and /export handler

• Extreme read performance (8-10x faster than queries using cursorMark)

• Facet or map/reduce style aggregation modes

• Tiered architecture

• SQL interface tier

• Worker tier (scale a pool of worker “nodes” independently of the data collection)

• Data tier (Solr collection)
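One way to picture the map/reduce style aggregation mode: because the data tier can export tuples already sorted by the grouping key, a worker can aggregate in a single streaming pass while holding only one group in memory. The sketch below is a conceptual stand-in in plain Python, not Solr code; the `rollup` helper and the field names are illustrative assumptions.

```python
from itertools import groupby
from operator import itemgetter


def rollup(sorted_tuples, over, metric_field):
    """Stream count and sum per group; input must already be sorted by `over`."""
    for key, group in groupby(sorted_tuples, key=itemgetter(over)):
        count, total = 0, 0
        for t in group:
            count += 1
            total += t[metric_field]
        yield {over: key, "count": count, "sum": total}


# Made-up sample tuples, sorted by genre_s as an exported stream would be.
ratings = [
    {"genre_s": "comedy", "rating_i": 4},
    {"genre_s": "comedy", "rating_i": 2},
    {"genre_s": "drama", "rating_i": 5},
]
totals = list(rollup(ratings, over="genre_s", metric_field="rating_i"))
```

If the input were not sorted, the same group could appear more than once in the output, which is why the sort on the export side matters.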

Fusion Signals for Relevance

• Simple DSL for aggregating user interactions with search results, quite useful for boosting & recommendations

• Scale using Spark

• Take user activity and feed it back into the search engine to improve relevancy using Fusion query pipelines

• Integrated with Lucidworks View to capture user activity

• Custom logic via JavaScript … don’t get bogged down in the weeds of Spark
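The feedback loop above boils down to rolling raw user events up into per-query, per-document weights that a query pipeline can use for boosting. The following is a rough sketch of that aggregation in plain Python, not Fusion's DSL: the signal fields (`query`, `doc_id`, `type`) and the per-type weights are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical weights per signal type; these are not Fusion defaults.
TYPE_WEIGHTS = {"click": 1.0, "add_to_cart": 3.0, "purchase": 5.0}


def aggregate_signals(signals):
    """Sum weighted events per (normalized query, doc_id) pair."""
    weights = defaultdict(float)
    for s in signals:
        key = (s["query"].strip().lower(), s["doc_id"])
        weights[key] += TYPE_WEIGHTS.get(s["type"], 0.0)
    return dict(weights)


def boosts_for(query, aggregated, top_n=3):
    """Docs to boost for a query, strongest aggregated weight first."""
    q = query.strip().lower()
    scored = [(doc, w) for (aq, doc), w in aggregated.items() if aq == q]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]
```

At query time the returned pairs could be turned into boost clauses on the document IDs; at scale the same aggregation would run as a Spark job rather than in a single process.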

Self-service Analytics

• Can’t overstate the importance of SQL in big data

• Shortage of data scientists and engineers, abundance of SQL-savvy business analysts

• JDBC-compliant tools abound!

• De-normalization is inconvenient

• Apache Zeppelin for exploring data in Solr and other data sources

Best of Both Worlds: Spark SQL and Solr SQL

• Spark SQL provides an amazing query plan optimizer with SQL:2003 support

• BUT … Spark SQL can’t compete with Solr performance for queries that can be expressed in Solr

• Push aggregations down into the engine!

• spark-solr tries to detect when sub-queries can be pushed down into Solr

• movielens lab in fusion-spark-bootcamp

https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html

Fusion Catalog API

• REST API for CRUD on data assets: views, tables, UDFs, etc

• Full-text search for business analysts to find data sets of interest

• Tool for SMEs to share complex data sets as simple views

• Authn & Authz via Fusion security

• Seamless integration with SparkSQL, streaming expressions, parallel SQL, and JDBC

parallel(workers,
  hashJoin(
    search(movielens, q=*:*, fl="user_id_i,movie_id_i,rating_i", sort="movie_id_i asc", partitionKeys="movie_id_i"),
    hashed=search(movielens_movies, q=*:*, fl="movie_id_i,title_s,genre_s", sort="movie_id_i asc", partitionKeys="movie_id_i"),
    on="movie_id_i"
  ),
  workers="4",
  sort="movie_id_i asc"
)
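To make the streaming expression above easier to read, here is a conceptual sketch in plain Python of what it asks for: both streams are partitioned on `movie_id_i` so matching tuples land on the same worker, each worker builds an in-memory hash table from the `hashed` stream and probes it with the ratings stream, and the results are merged in sorted order. This illustrates the idea only and is not how Solr implements it; the helper names and the sample rows are made up.

```python
from collections import defaultdict


def partition(tuples, key, workers):
    """Assign each tuple to a worker by hashing its partition key."""
    parts = [[] for _ in range(workers)]
    for t in tuples:
        parts[hash(t[key]) % workers].append(t)
    return parts


def hash_join(left, hashed, on):
    """Build a hash table from `hashed`, then probe it with each left tuple."""
    table = defaultdict(list)
    for h in hashed:
        table[h[on]].append(h)
    for l in left:
        for h in table.get(l[on], ()):
            yield {**h, **l}


def parallel_hash_join(left, hashed, on, workers=4):
    """Join partition by partition, then merge sorted on the join key."""
    joined = []
    for lp, hp in zip(partition(left, on, workers),
                      partition(hashed, on, workers)):
        joined.extend(hash_join(lp, hp, on))
    return sorted(joined, key=lambda t: t[on])


# Made-up sample rows using the field names from the expression.
ratings = [
    {"user_id_i": 1, "movie_id_i": 10, "rating_i": 5},
    {"user_id_i": 2, "movie_id_i": 11, "rating_i": 3},
    {"user_id_i": 3, "movie_id_i": 10, "rating_i": 4},
]
movies = [
    {"movie_id_i": 10, "title_s": "Movie A", "genre_s": "comedy"},
    {"movie_id_i": 11, "title_s": "Movie B", "genre_s": "drama"},
]
rows = parallel_hash_join(ratings, movies, on="movie_id_i")
```

Partitioning both sides on the join key is what lets each worker join independently: no worker ever needs a movie row that was sent elsewhere.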

Custom Script Jobs

• Not limited by our built-in toolset

• Develop a custom Spark script in Scala and then upload it to Fusion to be scheduled and run on a Spark cluster

• Focus more on solving business problems vs. ops / job mgmt

• See apachelogs example in the fusion-spark-bootcamp

sessionize using window function and then compute aggregations for each session
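The sessionize-then-aggregate logic from the caption above can be sketched without Spark. This plain-Python version mimics what a lag-style window function over each client's time-ordered events would do; the 30-minute inactivity gap and the `(client_ip, epoch_secs)` event shape are assumptions, and the bootcamp's apachelogs example may differ.

```python
from itertools import groupby

SESSION_GAP_SECS = 30 * 60  # assumed inactivity timeout between sessions


def sessionize(events):
    """events: (client_ip, epoch_secs) pairs -> (client_ip, session_id, epoch_secs)."""
    out = []
    # Sorting by (ip, timestamp) plays the role of PARTITION BY ip ORDER BY ts.
    for ip, group in groupby(sorted(events), key=lambda e: e[0]):
        session_id, prev_ts = 0, None
        for _, ts in group:
            # A gap larger than the timeout starts a new session.
            if prev_ts is not None and ts - prev_ts > SESSION_GAP_SECS:
                session_id += 1
            out.append((ip, session_id, ts))
            prev_ts = ts
    return out


def session_stats(sessionized):
    """Aggregate hits and duration for each (client_ip, session_id)."""
    spans = {}
    for ip, sid, ts in sessionized:
        s = spans.setdefault((ip, sid), {"hits": 0, "start": ts, "end": ts})
        s["hits"] += 1
        s["end"] = max(s["end"], ts)
    return {k: {"hits": v["hits"], "duration_secs": v["end"] - v["start"]}
            for k, v in spans.items()}
```

For example, three requests from one client at 0, 600, and 4000 seconds yield two sessions: the first with two hits over 600 seconds, the second with a single hit.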

Data science in a box

• REPL with hooks into Solr for quickly exploring unstructured data sets

• Jake's RecSys recipe for building recommender systems

• Full access to Lucene text analyzers when building ML pipelines

• See mlsvm & ml20news labs in the fusion-spark-bootcamp

• searchhub.lucidworks.com: see slides from Grant’s talk about SearchHub

Machine Learning in Index & Query Pipelines

• Query intent

• Document classification

• Recommendations

• Design / evaluate / refine models in Spark ML pipelines or MLlib and then publish to Fusion to generate predictions from query / index pipelines

ID of model stored in Fusion’s blob store

Field to store model prediction in each document during indexing

Example: Sentiment Classifier during Indexing
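As a rough illustration of the idea, an index-time ML stage takes a model, an input field, and a field in which to store the prediction, then enriches each document before it is indexed. The sketch below is plain Python with a toy word-list model standing in for a trained classifier; the stage factory, field names, and word lists are all hypothetical and do not reflect Fusion's stage API.

```python
# Toy lexicon standing in for a trained model (illustration only).
POSITIVE = {"great", "love", "fast", "excellent"}
NEGATIVE = {"slow", "broken", "hate", "terrible"}


def toy_sentiment_model(text):
    """Label text by counting positive and negative words."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def make_ml_stage(model, input_field, prediction_field):
    """Build a stage that writes the model's prediction into each doc."""
    def stage(doc):
        enriched = dict(doc)  # leave the incoming doc untouched
        enriched[prediction_field] = model(doc.get(input_field, ""))
        return enriched
    return stage


stage = make_ml_stage(toy_sentiment_model, "body_t", "sentiment_s")
indexed = stage({"id": "1", "body_t": "Search is fast and I love it"})
```

Swapping in a real model only changes the `model` argument; the stage itself stays a thin wrapper that maps one field to a prediction field.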

You could do this yourself …

• It’s too easy to fall back into the trap of thinking that the hard work of getting cool technologies working together equates to business value.

• Get back to focusing on solving business problems ~ increased ROI, faster

• Fusion gives you a clear buy vs. build choice

[Figure: Fusion Architecture. Diagram labels: Millions of Users; Billions of Docs; REST; Proxy; Security woven throughout; Recs; Signals; Pipes; Metrics; NLP; Sched.; Blobs; Admin; Connectors; Spark (Worker, Worker, Cluster Mgr.); Solr (Shards, Shards); HDFS; Optional; ZooKeeper (ZK 1 … ZK N: Shared Config Mgmt, Leader Election, Load Balancing)]

Thanks! Q & A

• Try Fusion: https://lucidworks.com/products/fusion/download/

• spark-solr: http://bit.ly/1Ub12GU

• fusion-spark-bootcamp: http://bit.ly/2dZfBhk

• 40% off Manning books coupon code: ctwlucsoltw
