Open Data Science on Hadoop in the Enterprise
Transcript of Open Data Science on Hadoop in the Enterprise
© 2016 Continuum Analytics - Confidential & Proprietary
Open Data Science on Hadoop in the Enterprise
From Sandbox to Production
Peter Wang, CTO & Co-founder
Overview
• Open Data Science and Anaconda
• Architecture & challenges of real-world Hadoop
• Anaconda for Open Data Science in the Enterprise
Open Data Science is…
an inclusive movement that makes open source tools of data science – data, analytics, & computation – easily work together as a connected ecosystem
Open Data Science means…
Availability | Innovation | Interoperability | Transparency
For everyone in the data science team
OPEN DATA SCIENCE IS THE FOUNDATION TO MODERNIZATION
Data Science is not just Machine Learning…
Distributed Systems
Business Intelligence
Machine Learning / Statistics
Web
Scientific Computing / HPC
Data Science is Interdisciplinary…
• Distributed Systems: Hadoop, Spark
• Business Intelligence: Data warehouse, querying, reporting
• Machine Learning / Statistics: Classification, deep learning, regression, PCA
• Web: Web crawling, scraping, 3rd party data & API providers, predictive services & APIs
• Scientific Computing / HPC: GPUs, multi-cores
Open Source Communities Create Powerful Technology for Data Science
Distributed Systems
Business Intelligence
Web
Scientific Computing / HPC
Machine Learning / Statistics
Projects shown: Numba, dask, xlwings, Airflow, Blaze
Python is the common language
Distributed Systems
Business Intelligence
Web
Scientific Computing / HPC
Machine Learning / Statistics
Projects shown: Numba, dask, xlwings, Airflow, Blaze
Python’s Not the Only One…
Distributed Systems
Business Intelligence
Web
Scientific Computing / HPC
SQL
Machine Learning / Statistics
But it’s also a Great Glue Language
Distributed Systems
Business Intelligence
Machine Learning / Statistics
Web
Scientific Computing / HPC
SQL
Anaconda is the Open Data Science Platform bringing technology together…
Distributed Systems
Business Intelligence
Web
Scientific Computing / HPC
Machine Learning / Statistics
Projects shown: Numba, dask, xlwings, Airflow, Blaze
Open Data Science: Vibrant and Growing Community
• Python Community: 30M+
• Packages in Anaconda: 720+
• R Community: 16M+
• Spark Python Usage: 60%+
• Anaconda Downloads: 8M+
Open Data Science Platform
ACCELERATE. CONNECT. EMPOWER.
ACCELERATE Time-to-Value
• INNOVATE faster through managed agile experimentation
• MOVE from analysis to deployment immediately
• DELIVER powerful results backed by a high-performance open data science platform
CONNECT Data, Analytics & Compute
• LEVERAGE innovative open source analytics to extract value from data
• MAXIMIZE your computational power to easily analyze all data
• CONNECT and integrate all your data sources for predictive models
EMPOWER Data Science Teams
• ITERATE quickly to create powerful analysis and predictive models
• COLLABORATE and share with your data science team
• PUBLISH interactive results to the business
Common Architectures of Real-world Hadoop Environments
Major Components
Hadoop Infrastructure:
• Hadoop Manager
• HDFS NameNode, DataNodes
• Hive, Impala servers
• YARN Resource Manager
• Spark: History server, Gateway server, compute nodes
DW / Analytics Env:
• SQL DB
• ETL systems
• Data Marts
Data Science Sandbox:
• Notebook server
• Big memory nodes
• GPU nodes
Anaconda Scale System Architecture
[Diagram labels: Hadoop / Spark (& existing DW); App 1; App 2; HTTP API; Legacy ETL; Data marts; XLS, CSV; Viz servers]
source: Master Data Management and Data Governance, 2e
source: Master Data Management and Data Governance, 2e
Data Science “Sandbox”
Common Problems
• Data Science Sandbox is on an isolated network, outside of the “GRC reservation”
  • Provides freedom to data scientists
  • Protects production ETL, DW, event processing
  • … but moving anything from Sandbox to Production is a huge pain
• Multiple orgs / LOBs interface with the Data Science team in the mixed sandbox environment
• Compliance, audit, & risk control?
Contrasting Concerns
Data
• Exploration: Fast, unfettered access; ease of introducing new, varied, messy datasets; reproducibility
• Production: Strict, governed access; well-defined schema; provenance & auditability
Compute Infrastructure
• Exploration: High performance; low latency, interactive; individualized & specialized
• Production: Scalable, high-availability; manageable at scale; cost amortization over many machines and users
Organization
• Exploration: Individual high-achievers with lots of context & capability; agile, able to quickly learn new skills and approaches
• Production: Sustain operations at lowest possible cost; robustness against unintended change
Core Challenges
• Data Exploration generates insight & is required to respond to business challenges
• Production data processing & analytics have different operational concerns
• Over-engineering for either leads to structural deficiencies
• Modern & future needs will require more agile exploration
The Core Challenge of Open Data Science in the Enterprise
Conway’s Law
The design of any piece of software reflects the communications structure of the organization that produced it.
Peter’s Corollary to Conway’s Law
The architecture of any business data system evolves to reflect the budget structure of the IT groups that maintain it.
… not strategic or operational needs
… not ensuring future analytical agility
… not optimizing for rapid insights
“Don’t Starve the Unicorns”
• How businesses are used to buying can actually push power away from exploratory data science capabilities
• Information systems have ossified into “software & hardware”, which is fine for straightforward data processing
• Not suited for human-in-the-loop production of inference, insight, knowledge
Data Science != Software Development
• VERY common misconception
• Python is probably the most misunderstood language
• There are “tribes” and ecosystems in Python: web dev, scipy, pydata, embedded, scripting, 3D graphics, etc.
• But businesses tend to pigeonhole it:
  • IT/software/data engineering view: competes with Java, C#, Ruby…
  • Analytics, stats, data science view: competes with R, SAS, Matlab, SPSS, BI systems
Python done right can be a powerful, unifying force across the business.
Anaconda for Open Data Science in Hadoop
Modern Data Science Teams Love ANACONDA
Data Scientist
• Hadoop / Spark
• Programming Languages
• Analytic Libraries
• IDE
• Notebooks
• Visualization
Biz Analyst
• Spreadsheets
• Visualization
• Notebooks
• Analytic Development Environment
Data Engineer
• Database / Data Warehouse
• ETL
Developer
• Programming Languages
• Analytic Libraries
• IDE
• Notebooks
• Visualization
DevOps
• Database / Data Warehouse
• Middleware
• Programming Languages
Anaconda Powers Teams
EXPLORE & ANALYZE
• Explore & prepare data
• Build, test, validate data science models with Python & R
• Build simulations & optimizations
• View data lineage & reuse transformations
• Leverage & explore metadata
COLLABORATE & PUBLISH
• Create & share data science notebooks with interactive visualizations
• Identify reusable data science assets easily
• Authorize access to data science projects
• Manage & control data science asset versions
DEPLOY & OPERATE
• Build & share data science packages & environments
• Launch & provision distributed environments
Write Once, Deploy Anywhere
OPEN DATA SCIENCE: Explore & Analyze | Collaborate & Publish | Deploy & Operate
• Servers: Linux, Windows, OSX
• GPUs & High End Workstations: Linux & Windows; NVIDIA, AMD, X86/ARM
• Clusters: YARN, Mesos, MPI, Power8, LSF, Sun Grid Engine
• NoSQL: MongoDB, Cassandra / DataStax
• Hadoop: Cloudera, Hortonworks, Apache Hadoop & Spark
• Files: Microsoft Excel, Trifacta, Import.io
• DW & SQL: Any SQL DB, any SQL DW, Impala
Anaconda Architectures
ON-PREMISE | PRIVATE CLOUD | ANACONDA CLOUD
Anaconda Repository
If an Anaconda repo can function as a gateway
Multi-Step Process
– Mirror packages from Anaconda’s public Repository to a ‘Gateway’ Repo
– Testers (with authorization to access Gateway) evaluate new packages.
– Approved packages are mirrored to the Production Repo Server
– Standard End users now have access to updated, approved packages.
[Diagram: Public Anaconda Repository (Cloud) → Mirror → Gateway (Test) Repo Server → Mirror → Production Repo Server. “Tester” End User: has access to Gateway Repo. End User: has access to Prod Repo. Authentication: Active Directory / LDAP (optional).]
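The multi-step process on this slide can be sketched as a small simulation. This is only an illustration of the gateway-then-production flow, not Anaconda Repository's actual interface: the `Repo` class, the `mirror` function, and the package versions are all hypothetical names and values invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Toy stand-in for a package repository: package name -> set of versions."""
    name: str
    packages: dict = field(default_factory=dict)

    def add(self, package, version):
        self.packages.setdefault(package, set()).add(version)

    def has(self, package, version):
        return version in self.packages.get(package, set())


def mirror(source, target, approve=lambda package, version: True):
    """Copy each (package, version) that target lacks and approve() accepts.

    Returns the sorted list of pairs that were newly copied.
    """
    copied = []
    for package, versions in source.packages.items():
        for version in versions:
            if not target.has(package, version) and approve(package, version):
                target.add(package, version)
                copied.append((package, version))
    return sorted(copied)


# Hypothetical repository contents, for illustration only.
public = Repo("public")
public.add("numpy", "1.11.0")
public.add("pandas", "0.18.1")
public.add("ipython", "4.2.0")

gateway = Repo("gateway")
production = Repo("production")

# Step 1: mirror the public repository into the gateway repo.
mirror(public, gateway)

# Step 2: testers evaluate packages on the gateway and record approvals.
approved = {("numpy", "1.11.0"), ("ipython", "4.2.0")}

# Step 3: only approved packages are mirrored on to production.
promoted = mirror(gateway, production, approve=lambda p, v: (p, v) in approved)
print(promoted)
```

In this sketch the unapproved package stays on the gateway and never reaches production, which is the property the two-repository layout is meant to give.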
Anaconda Repository: Air-Gapped Install
On-site Package Repo and Sharing platform
– Mirror public repository of packages
– Analysts consume packages from local repo
– Analysts upload and share notebooks & pre-configured computing environments
– Developers create, deploy & share custom packages
Analyst 1:
    conda install numpy ipython
    conda update ipython
    conda create -n env1 ipython pandas
Analyst 2:
    conda env upload environment.yml project1
    anaconda notebook upload project1.ipynb
Developer:
    conda build project2
    anaconda upload project2.bz2
[Diagram: Public Anaconda Repository (Cloud) outside the Firewall; Internal Anaconda Repository (pre-loaded from disk) inside it; Authentication: Active Directory / LDAP (optional).]
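The slide shares an `environment.yml` but does not show what is in it. As a rough sketch, the snippet below writes out the text of a minimal conda environment file equivalent to `conda create -n env1 ipython pandas`. The helper name `render_environment` is made up for this example, and a real file would often also pin versions and list channels.

```python
def render_environment(name, dependencies):
    """Return the text of a minimal conda environment.yml file."""
    lines = [f"name: {name}", "dependencies:"]
    lines.extend(f"  - {dep}" for dep in dependencies)
    return "\n".join(lines) + "\n"


# Same environment name and packages as the `conda create` command on the slide.
text = render_environment("env1", ["ipython", "pandas"])
print(text)
```

Writing `text` to a file named `environment.yml` gives something that `conda env create -f environment.yml` can consume.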
Anaconda Scale: Cluster Management
[Diagram labels: Client Machine; Internal Anaconda Repository; Package Control; Cluster: Head Node, Worker Nodes, Edge Node; Cluster Provisioning; Job Submission; State Management; Job Control]
Anaconda Enterprise Notebook Computing
Workflow:
– The analyst logs into the Enterprise Notebook server, authenticating against LDAP/AD
– Based on the project they select, they are redirected to the appropriate project node
– All notebooks / Python code run on project nodes; any needed packages are pulled down from your local repository
[Diagram labels: Anaconda Enterprise Notebook Server; Gateway & Project Nodes, running IPython kernels; Internal Anaconda Repository; Package Control; Authentication; Computation; Web Interface; Active Directory / LDAP (optional); User 1, User 2, User 3]
Integrated Environment
[Diagram labels: Client Machine; Internal Anaconda Repository; Package Control; Cluster: Head Node, Hadoop Worker Nodes, Edge Node; Cluster Provisioning; Job Submission; State Management; Anaconda Enterprise Notebook Server; Authentication; Web Interface; Computation; Teradata; User 1, User 2, User 3; Analyst 1, Analyst 2; Developer]
Ports shown:
• LDAP: TCP 389/636
• HTTP: TCP 8080
• HTTP: TCP 5002
• SSH: TCP 22
• SALT: TCP 4505, 4506
• HTTP/HTTPS: TCP 80/44, TCP 8080
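When wiring the components together across firewalls, it helps to confirm that the listed TCP ports are reachable. The sketch below is one way to do that with the Python standard library. The port numbers come from the slide, but the slide text does not say which host serves which port, so the mapping from service to host is left to the reader and the hostnames in the usage note are placeholders. The HTTP/HTTPS entry is left out because its port list looks truncated in the transcript.

```python
import socket

# Port numbers as listed on the slide. Which machine serves each one is not
# stated in the transcript, so hosts are supplied by the caller.
SLIDE_PORTS = {
    "LDAP": [389, 636],
    "SSH": [22],
    "Salt": [4505, 4506],
    "HTTP": [8080, 5002],
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(hosts, ports=SLIDE_PORTS, timeout=1.0):
    """Check every port of every service named in hosts.

    hosts maps a service name (a key of ports) to the hostname to probe.
    Returns a dict mapping (service, port) to True or False.
    """
    results = {}
    for service, host in hosts.items():
        for port in ports.get(service, []):
            results[(service, port)] = port_open(host, port, timeout)
    return results
```

A call such as `check_services({"LDAP": "ldap.example.internal", "Salt": "head-node.example.internal"})` (placeholder hostnames) returns one boolean per port, which can be logged or asserted on before a deployment step.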
Anaconda Accelerates Adoption of Open Data Science for Enterprises
Across all Data, Operating Systems, & Hardware Platforms
Explore & Visualize complex data easily
Harness Open Source Python & R Analytics
Write Once, Deploy Anywhere for Scalable High Performance
Data Engineering Simplified for All Data
Collaborate with Your Team anywhere in the World
Integrate Data from Anywhere