Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
October 11-14, 2016 • Boston, MA
Your Big Data Stack is Too Big! Timothy Potter
Apache Lucene/Solr Committer & PMC Member Lucidworks
How we got here and where we’re going …
• Giving away 2 books ~ tweet including: #FusionBigData #LuceneSolrRev
• A quick trip down memory lane …
Cassandra, Pig, Hive, HCatalog, HDFS, Mahout, Sqoop, Oozie, Storm, and of course Solr!
• Big Data integration trap
• Lucidworks Fusion provides a viable alternative that emphasizes fast access, agility, and automation
A few patterns emerge …
• Begins with need for better relevancy ~ automatically
• More and more mission-critical data lives in Fusion
• Much of big data is unstructured, making search the ideal exploration technology ~ people grok search
• Speed is addictive!
• But integrating search with big data is a non-trivial problem to solve -> Fusion FTW!
• fusion-spark-bootcamp: http://bit.ly/2dZfBhk
Data Ingest
• Connectors! Lots of them …
• Pipelines … because data ingest is messy
• JavaScript when you must!
• SparkSQL too! Replace DIH (DataImportHandler) with a SparkSQL JDBC data source: 31K docs / sec on a small Spark cluster
gist.github.com/kiranchitturi/0be62fc13e4ec7f9ae5def53180ed181
• Spark Streaming to Solr too
Time-based Partitioning
• Docs partitioned into time-based collections in Solr
• New time partitions created on-the-fly when needed; older partitions should age out automatically
• Need a document router to index docs in the correct collection based on timestamp (doesn’t use aliases)
• Need a query router to read the appropriate collections based on query time range
• Deeper analytics on larger historical time ranges achieved using Spark by joining Solr with archived files stored in HDFS
• Check out the eventsim lab in the bootcamp
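The document router, query router, and age-out steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Fusion's implementation: the `logs` prefix, the daily partition size, and the `prefix_YYYY_MM_DD` naming scheme are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List

PREFIX = "logs"  # hypothetical base collection name


def collection_for(ts: datetime, prefix: str = PREFIX) -> str:
    """Document router: map a timestamp to its daily partition name."""
    return f"{prefix}_{ts.astimezone(timezone.utc):%Y_%m_%d}"


def collections_for_range(start: datetime, end: datetime,
                          prefix: str = PREFIX) -> List[str]:
    """Query router: list every daily partition overlapping [start, end]."""
    if end < start:
        raise ValueError("end must not precede start")
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    names = []
    while day <= last:
        names.append(f"{prefix}_{day:%Y_%m_%d}")
        day += timedelta(days=1)
    return names


def partitions_to_drop(existing: List[str], now: datetime, keep_days: int,
                       prefix: str = PREFIX) -> List[str]:
    """Age-out: return partitions whose day is older than keep_days."""
    cutoff = now.astimezone(timezone.utc).date() - timedelta(days=keep_days)
    old = []
    for name in existing:
        day = datetime.strptime(name[len(prefix) + 1:], "%Y_%m_%d").date()
        if day < cutoff:
            old.append(name)
    return old
```

A query spanning midnight UTC resolves to two collections, so the query router only touches the partitions that can contain matching documents.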
Common access patterns with big data
• Big data systems have grown complex trying to satisfy a variety of access patterns
• Fast primary key lookups / atomic updates (Solr, HBase, Cassandra, …)
• Low-latency ranked retrieval and facet-driven discovery (Solr, Elastic, DataStax, …)
• Large, distributed table scans (Spark, M/R, Pig, Cassandra, Hive, Impala, …)
• Graph traversal (GraphX, Giraph, Neo4j, …)
Solr Streaming Inside
• Relies on docValues (column-oriented data structure) and /export handler
• Extreme read performance (8-10x faster than queries using cursorMark)
• Facet or map/reduce style aggregation modes
• Tiered architecture
• SQL interface tier
• Worker tier (scale a pool of worker “nodes” independently of the data collection)
• Data tier (Solr collection)
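To make the map/reduce style aggregation mode more concrete, here is a rough Python sketch of a rollup over a stream of tuples that is already sorted on the grouping field, which is the property a sorted export makes available. It is not Solr's code; the function name and the output keys are assumptions chosen for illustration.

```python
from itertools import groupby


def rollup(sorted_tuples, over, field):
    """Aggregate adjacent tuples that share the same `over` value.

    Because the input is sorted on `over`, only one group is held in
    memory at a time, so the stream can be arbitrarily long.
    """
    for key, group in groupby(sorted_tuples, key=lambda t: t[over]):
        count, total = 0, 0.0
        for t in group:
            count += 1
            total += t[field]
        yield {over: key, "count": count, "sum": total, "avg": total / count}


# Hypothetical tuples, sorted by genre_s as a sorted export would deliver them.
tuples = [
    {"genre_s": "Comedy", "rating_i": 4},
    {"genre_s": "Comedy", "rating_i": 2},
    {"genre_s": "Drama", "rating_i": 5},
]
results = list(rollup(tuples, over="genre_s", field="rating_i"))
```

The design point is that sorting in the data tier lets the worker tier aggregate in a single pass without building a hash table of every group.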
Fusion Signals for Relevance
• Simple DSL for aggregating user interactions with search results, quite useful for boosting & recommendations
• Scale using Spark
• Take user activity and feed it back into the search engine to improve relevancy using Fusion query pipelines
• Integrated with Lucidworks View to capture user activity
• Custom logic via JavaScript … don’t get bogged down in the weeds of Spark
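The idea of aggregating user interactions into boosts can be sketched as follows. This is not Fusion's aggregation DSL: the signal fields (`query`, `doc_id`, `type`, `ts`), the per-type weights, and the 30-day half-life are hypothetical choices made for illustration.

```python
from collections import defaultdict


def aggregate_signals(signals, now, half_life_days=30.0, weights=None):
    """Sum time-decayed weights per (query, doc_id) pair.

    signals: iterable of dicts with query, doc_id, type, ts (epoch seconds).
    """
    weights = weights or {"click": 1.0, "cart": 5.0}  # assumed weights
    totals = defaultdict(float)
    for s in signals:
        age_days = max(0.0, (now - s["ts"]) / 86400.0)
        decay = 0.5 ** (age_days / half_life_days)  # halves every half-life
        totals[(s["query"], s["doc_id"])] += weights.get(s["type"], 0.0) * decay
    return dict(totals)


def top_boosts(totals, query, n=3):
    """Return the n highest-weighted doc IDs for a query, best first."""
    scored = [(doc, w) for (q, doc), w in totals.items() if q == query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]
```

A query pipeline could then look up `top_boosts` for the incoming query and boost those documents, which is the feedback loop the bullets describe.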
Self-service Analytics
• Can’t overstate the importance of SQL in big data
• Shortage of data scientists and engineers, abundance of SQL-savvy business analysts
• JDBC-compliant tools abound!
• De-normalization is inconvenient
• Apache Zeppelin for exploring data in Solr and other data sources
Best of Both Worlds: Spark SQL and Solr SQL
• Spark SQL provides an amazing query plan optimizer with SQL2003 support
• BUT … Spark SQL can’t compete with Solr performance for queries that can be expressed in Solr
• Push-down aggregations into the engine!
• spark-solr tries to detect when sub-queries can be pushed down into Solr
• movielens lab in fusion-spark-bootcamp
https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Fusion Catalog API
• REST API for CRUD on data assets: views, tables, UDFs, etc
• Full-text search for business analysts to find data sets of interest
• Tool for SMEs to share complex data sets as simple views
• Authn & Authz via Fusion security
• Seamless integration with SparkSQL, streaming expressions, parallel SQL, and JDBC
parallel(workers,
  hashJoin(
    search(movielens, q=*:*, fl="user_id_i,movie_id_i,rating_i", sort="movie_id_i asc", partitionKeys="movie_id_i"),
    hashed=search(movielens_movies, q=*:*, fl="movie_id_i,title_s,genre_s", sort="movie_id_i asc", partitionKeys="movie_id_i"),
    on="movie_id_i"
  ),
  workers="4",
  sort="movie_id_i asc"
)
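For readers new to streaming expressions, the semantics of the expression above can be approximated in plain Python: the `hashed` side is loaded into an in-memory index, the other side is streamed past it, and partitioning both sides on the join key lets each worker join its share independently. This is a conceptual sketch only, not how Solr implements it; the helper names are hypothetical.

```python
from collections import defaultdict


def hash_join(stream, hashed, on):
    """Join tuples from `stream` against an in-memory index of `hashed`."""
    index = defaultdict(list)
    for row in hashed:
        index[row[on]].append(row)
    for row in stream:
        for match in index.get(row[on], []):
            yield {**match, **row}  # fields from the streamed tuple win


def partition(rows, key, workers):
    """Assign each tuple to a worker by hashing the partition key."""
    parts = [[] for _ in range(workers)]
    for row in rows:
        parts[hash(row[key]) % workers].append(row)
    return parts


ratings = [
    {"user_id_i": 1, "movie_id_i": 10, "rating_i": 4},
    {"user_id_i": 2, "movie_id_i": 11, "rating_i": 5},
    {"user_id_i": 3, "movie_id_i": 99, "rating_i": 1},  # no matching movie
]
movies = [
    {"movie_id_i": 10, "title_s": "A", "genre_s": "Drama"},
    {"movie_id_i": 11, "title_s": "B", "genre_s": "Comedy"},
]
joined = list(hash_join(ratings, movies, on="movie_id_i"))
```

Because both sides are partitioned on `movie_id_i`, every pair of matching tuples lands on the same worker, so the per-worker joins together produce the same rows as the single join.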
Custom Script Jobs
• Not limited by our built-in toolset
• Develop a custom Spark script in Scala and then upload it to Fusion to be scheduled and run on a Spark cluster
• Focus more on solving business problems vs. ops / job mgmt
• See apachelogs example in the fusion-spark-bootcamp
Sessionize using a window function, then compute aggregations for each session
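The bootcamp lab does this with Spark window functions. The same logic can be sketched in plain Python to show what sessionizing means: order each client's requests by time, start a new session whenever the gap to the previous request exceeds a threshold, then aggregate per session. The field names (`ip`, `ts`) and the 30-minute gap are assumptions, not values taken from the lab.

```python
from itertools import groupby
from operator import itemgetter


def sessionize(events, gap_seconds=1800):
    """Tag each event with a session_id; a gap over gap_seconds starts a new one."""
    out = []
    ordered = sorted(events, key=itemgetter("ip", "ts"))
    for ip, group in groupby(ordered, key=itemgetter("ip")):
        session_no, prev_ts = 0, None
        for e in group:
            if prev_ts is not None and e["ts"] - prev_ts > gap_seconds:
                session_no += 1
            prev_ts = e["ts"]
            out.append({**e, "session_id": f"{ip}-{session_no}"})
    return out


def session_stats(sessionized):
    """Compute hit count and duration (seconds) for each session."""
    stats = {}
    for e in sessionized:
        s = stats.setdefault(
            e["session_id"], {"hits": 0, "start": e["ts"], "end": e["ts"]}
        )
        s["hits"] += 1
        s["start"] = min(s["start"], e["ts"])
        s["end"] = max(s["end"], e["ts"])
    return {sid: {**s, "duration": s["end"] - s["start"]} for sid, s in stats.items()}
```

In Spark the per-client ordering and "previous timestamp" lookup would typically come from a window partitioned by client and ordered by time, which is what makes the job distribute well.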
Data science in a box
• REPL with hooks into Solr for quickly exploring unstructured data sets
• Jake's RecSys recipe for building recommender systems
• Full access to Lucene text analyzers when building ML pipelines
• See mlsvm & ml20news labs in the fusion-spark-bootcamp
• searchhub.lucidworks.com
see slides from Grant’s talk about SearchHub
Machine Learning in Index & Query Pipelines
• Query intent
• Document classification
• Recommendations
• Design / evaluate / refine models in Spark ML pipelines or MLlib and then publish to Fusion to generate predictions from query / index pipelines
ID of model stored in Fusion’s blob store
Field to store model prediction in each document during indexing
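The two settings called out above (a model ID and a prediction field) suggest the shape of an ML pipeline stage. The following is a toy Python sketch, not Fusion's stage API: `BLOB_STORE`, `KeywordSentimentModel`, `ml_stage`, and the field names are all hypothetical stand-ins, and the keyword model merely takes the place of a model trained in Spark.

```python
BLOB_STORE = {}  # hypothetical stand-in for a model store keyed by model ID


class KeywordSentimentModel:
    """Toy classifier standing in for a trained model."""

    def __init__(self, positive, negative):
        self.positive, self.negative = set(positive), set(negative)

    def predict(self, text):
        tokens = text.lower().split()
        score = sum(t in self.positive for t in tokens) - sum(
            t in self.negative for t in tokens
        )
        if score > 0:
            return "positive"
        return "negative" if score < 0 else "neutral"


def ml_stage(doc, model_id, input_field, prediction_field):
    """Look up the model by ID and store its prediction on a copy of the doc."""
    model = BLOB_STORE[model_id]
    return {**doc, prediction_field: model.predict(doc.get(input_field, ""))}


BLOB_STORE["sentiment-v1"] = KeywordSentimentModel(["great"], ["awful"])
indexed = ml_stage({"id": "1", "body_t": "Great talk"},
                   "sentiment-v1", "body_t", "sentiment_s")
```

Keeping the model behind an ID means a refined model can be published under the same ID without changing the pipeline configuration.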
Example: Sentiment Classifier during Indexing
You could do this yourself …
• It’s too easy to fall back into the trap of thinking that the hard work of getting cool technologies working together equates to business value.
• Get back to focusing on solving business problems ~ increased ROI, faster
• Fusion gives you a clear buy vs. build choice
[Figure: Fusion Architecture. Labels shown: Millions of Users; REST; Proxy; Security woven throughout; Connectors, Pipes, Signals, Recs, NLP, Sched., Blobs, Metrics, Admin; Spark (Cluster Mgr., Workers); Solr (Shards); HDFS; Optional; Zookeeper (ZK 1 … ZK N): Shared Config Mgmt, Leader Election, Load Balancing; Billions of Docs]
Thanks! Q & A
• Try Fusion: https://lucidworks.com/products/fusion/download/
• spark-solr: http://bit.ly/1Ub12GU
• fusion-spark-bootcamp: http://bit.ly/2dZfBhk
• 40% off Manning books coupon code: ctwlucsoltw