Posted on 21-Apr-2017
Data Profiling and Pipeline Processing with Spark – A Journey
Suren Nathan, Synchronoss
(Q3’2014 revenue)
Who am I
• Sr. Director, Big Data Platform and Analytics Framework at Synchronoss
• CTO at Razorsight (acquired by Synchronoss)
• Worked in analytics and decision support systems for more than 15 years
• Passionate about solving business problems leveraging the latest technology
Synchronoss provides Personal Cloud and Activation Platforms to Tier One operators, MSOs, and enterprises around the globe
Mobile Content Transfer
Personal Cloud
Device Activation
Cloud Account Provisioning
Onboarded Welcome
Synchronoss Integrated Cloud Products
Online and Device Activation | Back-up, Sync and Share | Activation Cloud | Internet of Things | Integrated Life
Synchronoss Connects Operators to their Customers
Big Data @ Synchronoss
Sample numbers at one Tier 1 customer:
• 30M registered users
• 14M monthly active users
• 8M daily active users
• Up to 215TB of ingest per day
• 62PB of content stored
• 50 billion user content files
• Ingest of 1PB per week
• 4+ star rated apps
What do we do?
• Big Data Analytics Platform Group
• Implement a scalable big data technology platform to help deliver consistent analytics
• Platform deployed in private cloud and AWS
Data Pipeline Process
Ingest Data
Profile Data
Parse Data
Transform Data
Enrich Data
Aggregate Data
Perform Analysis
Load Index Store
Feed EDW
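The stages above chain naturally, with each stage's output feeding the next. A minimal single-machine sketch in plain Python (stage names follow the slide; the bodies are illustrative, and in the actual platform each stage runs as a distributed Spark job):

```python
# Illustrative pipeline: each stage is a plain function applied in order.
# In the real platform each stage runs as a distributed Spark job.
def ingest():
    return ['5,"a"', '7,"b"', 'bad-row']          # raw lines from a source feed

def profile(rows):
    return [r for r in rows if "," in r]          # profiling drops malformed rows

def parse(rows):
    return [r.split(",") for r in rows]           # delimited parsing

def transform(recs):
    return [(int(n), s.strip('"')) for n, s in recs]

def enrich(recs):
    return [(n, s, n * 2) for n, s in recs]       # add a derived field

def aggregate(recs):
    return sum(n for n, _, _ in recs)

result = aggregate(enrich(transform(parse(profile(ingest())))))
print(result)  # 12
```

The malformed third row is filtered at the profiling stage, which is exactly the point of profiling before parsing: downstream stages only see records that match the expected shape.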
Our Data Pipeline Journey
Data Pipeline – V1
Source Data → Staging ETL → EDW ETL → EDW (process-centric ETLs)
• Multiple custom ETLs separated from the data layer
• SMP architecture, not distributed
• Long-running batch workloads
• Contention and bottlenecks with increased data volume
• No support for unstructured data
• Cannot retain historical data
$$$ | >1 year | Inflexible
Data Pipeline – V2
Source Data → Staging ETL → EDW ETL → EDW (process-centric ETLs)
• ETLs closer to the data
• High performance, but expensive
• Batch workloads, with reduced latencies
• Unable to handle unstructured data
• Storage costs prohibitive
$$$$ | 6+ months | Still inflexible
MPP Appliance
Data Pipeline – V3 Option Skipped
Source Data
• Did not foresee a huge improvement
• Batch workloads only
• Slow performance with MapReduce
• Lack of resources and a skills gap
• Lack of consistency
• Too many tools
$$ | 1+ year | Risks
Data Pipeline – V4
Source Data
• ETLs closer to the data
• Batch and stream workloads
• Superior performance
• Abstracted features via framework
• Components and standards
• Multiple language support
• Simplified design
$ | <1 month | Highly flexible
Data Profiling
"Put all the data in the lake, man."
What's in these data sets?
"More data is better. Work with the population and not a sample."
-- Data Scientists
Why Data Profiling?
• Find out what is in the data
• Get metrics on data quality
• Assess the risk involved in creating business rules
• Discover metadata, including value patterns and distributions
• Understand data challenges early to avoid delays and cost overruns
• Improve the ability to search the data
Business Challenge
• Analysts spend 80-90% of their time in data munging
• Current approaches require multiple manual touch points and processes
• Lost opportunity due to lengthy project time frames
Typical Scenario
• Data size is too large to view using Excel or Notepad
• Data has to be loaded into a database for profiling
• Cannot load into a database unless the data fields are known
• File formats are not right and specifications are incorrect
• Distribution, space, multiple touch points, moving files here and there
CRAZY: too many dependencies, wasted time
What do we need?
Speed, Agility & Automation
Data Profiler Requirements
• Profile data from the data lake
• Validate and preview data
• Review statistics
• Create metadata
• Create downstream schema
Spark to the rescue
Check the Types
Check the Values
Calculate metrics
Generate MetaData
[Diagram: data files loaded as RDDs and profiled column by column (C1, C2, C3, C4 … Cn)]
• Dynamically built execution graph (map -> map)
• Built-in transformations (unique, get first, etc.)
• In-memory execution provides speed
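The per-column steps above (check the types, check the values, calculate metrics, generate metadata) can be sketched in plain Python. The function names here are illustrative, not the product's API; in the pipeline this logic runs as Spark transformations over RDD columns:

```python
from collections import Counter

def infer_type(value):
    """Classify a raw string field as int, float, or string."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            caster(value)
            return name
        except ValueError:
            pass
    return "string"

def profile_column(values):
    """Profile one column: dominant type, distinct count, missing count."""
    non_missing = [v for v in values if v not in ("", None)]
    type_counts = Counter(infer_type(v) for v in non_missing)
    return {
        "inferred_type": type_counts.most_common(1)[0][0] if type_counts else "unknown",
        "distinct": len(set(non_missing)),
        "missing": len(values) - len(non_missing),
    }

rows = [["1", "a"], ["2", "b"], ["", "a"]]
columns = list(zip(*rows))  # transpose row-oriented records into columns
print([profile_column(list(c)) for c in columns])
```

The Spark version distributes the same logic: each column becomes an RDD, type checks are a `map`, and the counts reduce across partitions.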
Execution Flow and Software Stack
[Diagram: numbered execution flow (steps 1-8) among the Data Profiler UI, the Spark Data Profiler, the metadata repository, and the data lake location for the data]
Software stack:
• Hardware infrastructure level: MapR cluster
• OS/file system level: MapR FS (M7), NFS
• System application level: Spark, Spark Monitoring UI, web server
• Razorsight application level: Spark Data Profiler, Data Profiler UI, metadata repository
Univariate Statistics
Outputs for numeric values: count of missing values, count of non-missing values, mean, variance, standard deviation, minimum, maximum, range, mode, median, Q1 value, Q3 value, interquartile range, skewness, kurtosis
Outputs for non-numeric values: histograms, count of missing values, count of non-missing values
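The numeric outputs above can be computed with Python's standard `statistics` module; this is a single-machine sketch of the definitions, while the profiler itself computes them in Spark:

```python
import statistics as st

def univariate(values):
    """Compute the numeric univariate outputs for one column."""
    n = len(values)
    mean = st.mean(values)
    sd = st.pstdev(values)                        # population standard deviation
    q1, median, q3 = st.quantiles(values, n=4)    # quartile cut points
    # Standardized central moments give skewness and excess kurtosis.
    skew = sum((x - mean) ** 3 for x in values) / (n * sd ** 3)
    kurt = sum((x - mean) ** 4 for x in values) / (n * sd ** 4) - 3
    return {
        "mean": mean, "variance": sd ** 2, "stddev": sd,
        "min": min(values), "max": max(values),
        "range": max(values) - min(values),
        "mode": st.mode(values), "median": median,
        "q1": q1, "q3": q3, "iqr": q3 - q1,
        "skewness": skew, "kurtosis": kurt,
    }

print(univariate([1, 2, 2, 3, 4, 10]))
```

Everything except the quartiles reduces to sums and extrema, which is why these metrics distribute cleanly over partitioned data.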
Data Profiler Web Application
Meta Data and DDL
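As a sketch of turning profiled metadata into downstream DDL (the function name, type mapping, and table name are hypothetical, not the product's actual schema generator):

```python
def to_ddl(table, columns):
    """Render profiled column metadata as a CREATE TABLE statement.

    `columns` maps column name -> inferred type ("int", "float", "string").
    """
    sql_types = {"int": "BIGINT", "float": "DOUBLE", "string": "VARCHAR(255)"}
    cols = ",\n  ".join(f"{name} {sql_types[t]}" for name, t in columns.items())
    return f"CREATE TABLE {table} (\n  {cols}\n);"

print(to_ddl("usage_events", {"user_id": "int", "bytes": "float", "device": "string"}))
```

This is the step that lets clean, typed data flow to downstream stores without a human hand-writing the schema.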
Advantages
• Source data in the data lake
• All profiling done in the data lake
• No manual movement of data
• Profile a sample or the full data set
• Integrate creation of metadata for transformation and enrichment
• Send clean data to downstream processes
Results
• Improved data analysis time from weeks to hours
• Average improvement of data pipeline process: 80%
• Identified data quality issues well ahead of time
• Empowered business analysts to perform the work
Framework Layers
• Data Ingestion: SFTP | NDM | network path | social media stream | email
• Data Lake: secure repository; structured | unstructured | batch | streaming
• Data Preparation: data health | cleansing | pruning | transformation | univariate analysis
• Data Analytics: descriptive | predictive | bivariate | multivariate
• Data Services: RESTful | SOA
• Data Visualization: dashboards | ad hoc queries | KPIs | alerts
These map onto five layers: Layer 1 Infrastructure, Layer 2 Data Management, Layer 3 Modeling, Layer 4 Integration, Layer 5 Business Insight and Actions.
Framework Components
• Ingestion: multiple source channels, batch/real time, data validation, compression/encryption
• Profiling: data health check, summary statistics, scrubbing/cleansing, metadata creation
• Parsing: fixed width, delimited, mapping
• Transformation: enrichment, truncation, imputation, aggregation
• Integration: batch, RESTful, database
• Web Portal: metadata configuration, tracking, alerts, dashboard
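Of the transformation components above, imputation is the easiest to illustrate. A minimal mean-imputation sketch for one numeric column (plain Python; the framework applies the equivalent over distributed data):

```python
import statistics as st

def impute_mean(values, missing=None):
    """Replace missing entries in a numeric column with the column mean."""
    present = [v for v in values if v is not missing]
    fill = st.mean(present)
    return [fill if v is missing else v for v in values]

print(impute_mean([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```

Mean imputation is only one policy; the same shape works for median or mode fills, which the profiling stage already computes.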
Framework Architecture
[Diagram: the XDF Web UI and an orchestration layer coordinate the processing components (Data Partitioner, Data Parser & Transformer, Data Aggregator, Data Profiler, Data Prep Engine, Bivariate Engine, SQL Engine, Elastic Search Loader, DB Loader, Data Reconciliation) over the data storage layer (Synchronoss Data Lake, Elastic Search, MySQL meta-data repository); data ingestion arrives from external data sources via Data Beacon; control flow and data flow are drawn separately]
Framework Technology Stack
[Diagram: three clusters, all on Unix/Linux]
• MapR cluster: MapR FS (M7), NFS, Hadoop, Apache Spark, Sqoop
• UI/control cluster: Tomcat, ActiveMQ, Spring Integration, Oozie, Apache Drill, HUE, Angular, REST
• ElasticSearch cluster: ElasticSearch engine, NFS
Levels shown: hardware infrastructure, OS/file system, system application
What's Next?
• Bivariate Analysis
• Multicollinearity
Outputs for numeric values (by target value for each variable): record count, row count percent, average, variance, standard deviation, skewness, kurtosis, minimum, maximum
Correlation outputs: Pearson's correlation coefficient, Spearman's correlation coefficient, covariance, variable clustering, regression coefficients, dendrogram, hierarchical clustering (HCA), correlation matrix, variance inflation factor (VIF)
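Pearson's and Spearman's coefficients from the correlation outputs above can be sketched in plain Python. Spearman is computed here as Pearson on the ranks, ignoring tie handling for brevity; the planned engine would compute these over distributed column pairs:

```python
import statistics as st

def pearson(xs, ys):
    """Pearson's correlation coefficient between two numeric columns."""
    mx, my = st.mean(xs), st.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (st.pstdev(xs) * st.pstdev(ys) * len(xs))

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson on the ranks (no tie handling)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=vals.__getitem__)
        r = [0] * len(vals)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    return pearson(ranks(xs), ranks(ys))

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))    # perfectly linear relationship
print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))  # monotone but non-linear
</antml_ignore>```

The second example shows why both matter: a quadratic relationship weakens Pearson but leaves rank correlation at its maximum.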
Lessons
• Let business value drive technology adoption
• Plan incremental updates
• Pay attention to hidden costs
• Simplify
• Implement framework-based development
• Leverage existing skill sets to scale
Simplify
THANK YOU.
Suren.nathan@Synchronoss.com