Fighting Fraud in Medicare with Apache Spark
Miklos Christine, Solutions Architect | [email protected], @Miklos_C
About Me: Miklos Christine
Solutions Architect @ Databricks
• Assist customers in architecting big data platforms
• Help customers understand big data best practices

Previously:
• Systems Engineer @ Cloudera
  • Supported customers running some of the largest clusters in the world
• Software Engineer @ Cisco
Databricks, the company behind Apache Spark
• Founded by the creators of Apache Spark in 2013
• Share of Spark code contributed by Databricks in 2014: 75%
• Created Databricks on top of Spark to make big data simple.
Next Generation Big Data Processing Engine
• Started as a research project at UC Berkeley in 2009
• 600,000 lines of code (75% Scala)
• Last release: Spark 1.6 (December 2015)
• Next release: Spark 2.0
• Open source license (Apache 2.0)
• Built by 1000+ developers from 200+ companies
Apache Spark Engine
• Spark Core, with libraries on top: Spark SQL, Spark Streaming, SparkML/MLlib, GraphFrames/GraphX
• Unified engine across diverse workloads & environments
• Scale-out and fault tolerant
• Python, Java, Scala, and R APIs
• Standard libraries
History of Spark APIs

RDD (2011)
• Distributed collection of JVM objects
• Functional operators (map, filter, etc.)

DataFrame (2013)
• Distributed collection of Row objects
• Expression-based operations and UDFs
• Logical plans and optimizer
• Fast/efficient internal representations

Dataset (2015)
• Internally rows, externally JVM objects
• Almost the “best of both worlds”: type safe + fast
• But slower than DataFrames, and not as good for interactive analysis, especially in Python
Apache Spark 2.0 API

Dataset (2016)
• DataFrame = Dataset[Row]
• DataFrame is the untyped API: convenient for interactive analysis, faster
• Dataset is the typed API: optimized for data engineering, fast
Benefit of the Logical Plan: Performance Parity Across Languages
(Chart comparing DataFrame vs. RDD query performance across language APIs: DataFrame queries perform comparably from every language, while RDD performance varies by language.)
Machine Learning with Apache Spark

Why do Machine Learning?
• Machine learning uses computers and algorithms to recognize patterns in data
• Businesses have to adapt faster to change
• Data-driven decisions need to be made quickly and accurately
• Customers expect faster responses
From Descriptive to Predictive to Prescriptive

Data Science Time

Iterate on Your Models
Spark ML

Why Spark ML
Provide general-purpose ML algorithms on top of Spark:
• Let Spark handle the distribution of data and queries; scalability
• Leverage its improvements (e.g. DataFrames, Datasets, Tungsten)

Advantages of MLlib’s design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility

SparkML Pipelines provide:
• Integration with DataFrames
• Familiar API based on scikit-learn
• Easy workflow inspection
• Simple parameter tuning
Databricks & SparkML
• Use DataFrames to directly access data (SQL, raw files)
• Extract, transform and load data using an elastic cluster
• Create the model using all of the data
• Iterate many times on the model
• Deploy the same model to production using the same code
• Repeat
Advantages of Spark ML
• Data can be accessed directly using the Spark Data Sources API (no more endless hours copying data between systems)
• Data scientists can use all of the data rather than subsamples, taking advantage of the law of large numbers to improve model accuracy
• Data scientists can scale compute with the data size and model complexity
• Data scientists can iterate more often, giving them the opportunity to create better models and to test and release more frequently
SparkML - Tips
• Understand Spark partitions
  • Use the Parquet file format and compact small files
  • coalesce() / repartition()
• Leverage existing functions and UDFs
• Leverage DataFrames and SparkML
• Iterative algorithms: more cores mean faster processing
What’s New in Spark 2.0

Spark 2.0 - SparkML
• The RDD-based MLlib API is deprecated and in maintenance mode
• New algorithm support: Bisecting K-Means clustering, Gaussian Mixture Model, MaxAbsScaler feature transformer
• PySpark updates: LDA, Gaussian Mixture Model, Generalized Linear Regression
• Model persistence across languages
Spark Demo
Thanks!
Sign Up For Databricks Community Edition! https://databricks.com/try-databricks
Learning More About MLlib
Guides & examples:
• Example workflow using ML Pipelines (Python)
• Power plant data analysis workflow (Scala)
• The above 2 links are part of the Databricks Guide, which contains many more examples and references.
References:
• Apache Spark MLlib User Guide: contains code snippets for almost all algorithms, as well as links to API documentation.
• Meng et al. “MLlib: Machine Learning in Apache Spark.” 2015. http://arxiv.org/abs/1505.06807 (academic paper)