Hugfr SPARK & RIAK -20160114_hug_france

Posted on 22-Jan-2018



SPARK & RIAK: INTRODUCTION TO THE SPARK-RIAK-CONNECTOR

LATERALTHOUGHTS

Me, Myself & I

Associate at LateralThoughts.com

Scala, Java, Python Developer

Data Engineer @ Axa & Carrefour

Apache Spark Trainer with Databricks


And the Other One …

Director Sales @ Basho Technologies

(Basho make Riak)

Formerly at MySQL France

Co-Founder MariaDB

Funny Accent

Quick Introduction …

2011: Creators of Riak

Riak KV: NoSQL key-value database

Riak S2: Large Object Storage

2015: New Products

Basho Data Platform: Integrated NoSQL databases, caching, in-memory analytics, and search

Riak TS: NoSQL Time Series database

120+ employees

Global Offices: Seattle (HQ), Washington DC, London, Paris, Tokyo

300+ Enterprise customers, 1/3 of the Fortune 50

PRIORITIZED NEEDS

High Availability - Critical Data

High Scale – Heavy Reads & Writes

Geo Locality – Multiple Data Centers

Operational Simplicity – Resources Don’t Scale as Clusters

Data Accuracy – Write Conflict Options

RIAK S2 USE CASES

Large Object Store

Content Distribution

Web & Cloud Services

Active Archives

RIAK KV USE CASES

User Data

Session Data

Profile Data

Real-time Data

Log Data

RIAK TS USE CASES

IoT/Devices

Financial/Economic

Scientific Observations

Log Data

The Evolution of NoSQL

Unstructured Data Platforms

Multi-Model Solutions

Point Solutions

Basho Data Platform …

ABOUT SPARK & RIAK

Spark & Riak

Disclaimer: the following presentation uses:

Spark v1.5.2

Spark-Riak-Connector v1.1.0

Pre-Requisites

To use the Spark Riak Connector, as of now, you need to build it yourself:

Clone https://github.com/basho/spark-riak-connector

`git checkout v1.1.0`

`mvn clean install`
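Put together, the build steps above might look like this in a shell session (assumes git, Maven 3, and a JDK are installed):

```shell
# Fetch the connector sources and switch to the v1.1.0 tag.
git clone https://github.com/basho/spark-riak-connector
cd spark-riak-connector
git checkout v1.1.0

# Build and install the artifact into the local Maven repository,
# so your Spark project can depend on it.
mvn clean install
```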

Bootstrapped project

Reading from Riak

Connect to a Riak KV Cluster from Spark

Query it:

Full Scan

Using Keys

Using secondary indexes (2i)

Connecting to Riak

Loading data from Riak

riakBucket[V](bucketName: String): RiakRDD[V]

riakBucket[V](bucketName: String, bucketType: String): RiakRDD[V]

riakBucket[K, V](bucketName: String, convert: (Location, RiakObject) => (K, V)): RiakRDD[(K, V)]

On your Spark Context, you can use:

add a query, otherwise…

Find all:

Find by key(s):

Implicits that will give you the riak* methods
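A minimal sketch tying these pieces together: configure the connection, import the implicits, then read with a full scan, by keys, or by secondary index. The bucket name, host, and index name below are illustrative; the `riakBucket`, `queryAll`, `queryBucketKeys`, and `query2iRange` methods are assumed from connector v1.1.0.

```scala
import org.apache.spark.{SparkConf, SparkContext}
// Brings the riak* methods onto the SparkContext via implicits.
import com.basho.riak.spark._

val conf = new SparkConf()
  .setAppName("riak-read-demo")
  // Host and protobuf port of a Riak KV node; adjust to your cluster.
  .set("spark.riak.connection.host", "127.0.0.1:8087")
val sc = new SparkContext(conf)

// Full scan: no query added, so the whole bucket is read.
val all = sc.riakBucket[String]("my-bucket").queryAll()

// By key(s): only the listed keys are fetched.
val some = sc.riakBucket[String]("my-bucket").queryBucketKeys("key-1", "key-2")

// By secondary index (2i): range query on a numeric index.
val ranged = sc.riakBucket[String]("my-bucket").query2iRange("creationNo", 1L, 100L)
```

Note that without a query, a full scan touches every node in the cluster, so prefer keys or 2i queries for anything beyond small buckets.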

Reading from Riak

Using case classes

Using Secondary Indexes
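As a sketch, reading directly into a case class could look like this, given a SparkContext `sc` with the connector implicits imported. The `UserData` class and the `creationNo` index are hypothetical, and deserialization of stored JSON values into the case class is assumed connector behavior.

```scala
import com.basho.riak.spark._

// Hypothetical domain class; the connector maps stored JSON fields onto it.
case class UserData(timestamp: String, user_id: Int)

// Load bucket values as UserData instances, restricted by a 2i range query.
val users = sc.riakBucket[UserData]("users")
  .query2iRange("creationNo", 1L, 100L)
```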

Basic I/O

Mapping Objects - Buckets

Adding fields during save
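Writing back is symmetrical to reading; a minimal sketch, again given a SparkContext `sc` with the implicits imported (the bucket name is illustrative, and `saveToRiak` is assumed to be the connector's write method in v1.1.0):

```scala
import com.basho.riak.spark._

// Save simple key/value pairs into a bucket; each pair becomes
// one Riak object keyed by the first element.
val rdd = sc.parallelize(Seq("one" -> 1, "two" -> 2))
rdd.saveToRiak("numbers")
```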

Spark Riak Connector - Roadmap

Better Integration with Riak TS

Enhanced DataFrames - based on Riak TS Schema APIs

Server-side aggregations and grouping - using TS SQL commands

Speed

Data Locality (partition RDDs according to replication in the cluster) - launch Spark executors on the same nodes where the data resides.

Better mapping from vnodes to Spark workers using coverage plan

Better support for Riak data types (CRDT) and Search queries

Today this requires using the Java Riak client APIs

Spark Streaming

Provide example and sample integration with Apache Kafka

Improve reliability using Riak for checkpoints and WAL

Add examples and documentation for Python support


Thank you

@ogirardot

o.girardot@lateral-thoughts.com

https://github.com/ogirardot/spark-riak-example

https://speakerdeck.com/ogirardot/spark-and-riak-introduction-to-the-spark-riak-connector

@mcarney23

michael.carney@basho.com

fr.basho.com