Fighting Fraud in Medicare with Apache Spark
Miklos Christine, Solutions Architect | [email protected], @Miklos_C
About Me: Miklos Christine
Solutions Architect @ Databricks
• Assist customers in architecting big data platforms
• Help customers understand big data best practices

Previously:
• Systems Engineer @ Cloudera
  • Supported customers running some of the largest clusters in the world
• Software Engineer @ Cisco
Databricks, the company behind Apache Spark
• Founded by the creators of Apache Spark in 2013
• Share of Spark code contributed by Databricks in 2014: 75%
• Created Databricks on top of Spark to make big data simple.
Next Generation Big Data Processing Engine
• Started as a research project at UC Berkeley in 2009
• 600,000 lines of code (75% Scala)
• Last release: Spark 1.6 (December 2015)
• Next release: Spark 2.0
• Open source license (Apache 2.0)
• Built by 1000+ developers from 200+ companies
Apache Spark Engine
• Spark Core, with libraries on top: Spark SQL, Spark Streaming, SparkML/MLlib, GraphFrames/GraphX
• Unified engine across diverse workloads & environments
• Scale-out and fault tolerant
• Python, Java, Scala, and R APIs
• Standard libraries
History of Spark APIs

RDD (2011)
• Distributed collection of JVM objects
• Functional operators (map, filter, etc.)

DataFrame (2013)
• Distributed collection of Row objects
• Expression-based operations and UDFs
• Logical plans and optimizer
• Fast/efficient internal representations

Dataset (2015)
• Internally rows, externally JVM objects
• Almost the “best of both worlds”: type safe + fast
• But slower than DataFrames, and not as good for interactive analysis, especially in Python
Apache Spark 2.0 API

Dataset (2016)
• DataFrame = Dataset[Row]
• DataFrame is the untyped API: convenient for interactive analysis, faster
• Dataset is the typed API: optimized for data engineering, fast
Benefit of the Logical Plan: Performance Parity Across Languages
(Chart comparing DataFrame vs. RDD query performance across language APIs: DataFrame queries perform comparably from every language, while RDD performance varies by language.)
Machine Learning with Apache Spark

Why do Machine Learning?
• Machine learning uses computers and algorithms to recognize patterns in data
• Businesses have to adapt faster to change
• Data-driven decisions need to be made quickly and accurately
• Customers expect faster responses
From Descriptive to Predictive to Prescriptive

Data Science Time

Iterate on Your Models
Spark ML

Why Spark ML
Provide general-purpose ML algorithms on top of Spark:
• Let Spark handle the distribution of data and queries; scalability
• Leverage its improvements (e.g. DataFrames, Datasets, Tungsten)

Advantages of MLlib’s design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility

SparkML Pipelines provide:
• Integration with DataFrames
• Familiar API based on scikit-learn
• Easy workflow inspection
• Simple parameter tuning
Databricks & SparkML
• Use DataFrames to directly access data (SQL, raw files)
• Extract, transform and load data using an elastic cluster
• Create the model using all of the data
• Iterate many times on the model
• Deploy the same model to production using the same code
• Repeat
Advantages of Spark ML
• Data can be accessed directly using the Spark Data Sources API (no more endless hours copying data between systems)
• Data scientists can use all of the data rather than subsamples, taking advantage of the law of large numbers to improve model accuracy
• Data scientists can scale compute with the data size and model complexity
• Data scientists can iterate more often, giving them the opportunity to create better models and to test and release more frequently
SparkML - Tips
• Understand Spark partitions
  • Use the Parquet file format and compact small files
  • coalesce() / repartition()
• Leverage existing functions and UDFs
• Leverage DataFrames and SparkML
• Iterative algorithms: more cores mean faster processing
What’s New in Spark 2.0

Spark 2.0 - SparkML
• The RDD-based MLlib API is deprecated and in maintenance mode
• New algorithm support: Bisecting K-Means clustering, Gaussian Mixture Model, MaxAbsScaler feature transformer
• PySpark updates: LDA, Gaussian Mixture Model, Generalized Linear Regression
• Model persistence across languages
Spark Demo
Thanks!
Sign Up For Databricks Community Edition! https://databricks.com/try-databricks
Learning More About MLlib
Guides & examples:
• Example workflow using ML Pipelines (Python)
• Power plant data analysis workflow (Scala)
• The above 2 links are part of the Databricks Guide, which contains many more examples and references.
References:
• Apache Spark MLlib User Guide: contains code snippets for almost all algorithms, as well as links to API documentation.
• Meng et al. “MLlib: Machine Learning in Apache Spark.” 2015. http://arxiv.org/abs/1505.06807 (academic paper)