PyData: The Next Generation
Transcript of PyData: The Next Generation
1 © Cloudera, Inc. All rights reserved.
PyData: The Next Generation
Wes McKinney @wesmckinn
Data Day Texas 2015 #ddtx15
PyData: Everything’s awesome…or is it?
Wes McKinney @wesmckinn
Data Day Texas 2015 #ddtx15
Me
• Data systems, tools, Python guru at Cloudera
• Formerly Founder/CEO of DataPad (visual analytics startup)
• Created pandas in 2008, lead developer until 2013
• Python for Data Analysis, published 10/2012
• O’Reilly’s best-selling data book of 2014
• Pythonista since 2007
What’s this about?
• Hopes and fears for the community and ecosystem
• Why do I care?
  • Python is fun!
  • Leverage
  • Accessibility for newbies
  • Community: smart, nice, humble people
Python at Cloudera
• Want Cloudera platform users to be successful with Python
• Spark/PySpark part of the Enterprise Data Hub / CDH
• Actively investing in Python tooling
• (p.s. we’re hiring?)
• (p.p.s. we have an Austin office now!)
Historical perspective and background
• 20 years of fast numerical computing in Python (Numeric 1995)
• 10 years of NumPy
• PyData becomes a thing in 2012
• Python as a data language goes mainstream
  • Job descriptions tell all
• Shift in larger Python community from web towards data
  • PyCon 2015 committee reported substantial growth in data-related submissions!
How’d this happen?
• Data, data everywhere
• Science! scikit-learn, statsmodels, and friends
• Comprehensive data wrangling tools and in-memory analytics/reporting (pandas)
• IPython Notebook
• Learning resources (books, conferences, blogs, etc.)
• Python environment/library management that “just works”
Put a Python (interface) on it! Something no one got fired for, ever.
Meanwhile…
• Hadoop and Big Data go mainstream in 2009 onward
  • First Hadoop World: Fall 2009
  • First Strata conference: Spring 2011
• Lots of smart engineers in fast-growing businesses with massive analytics / ETL problems
  • Solutions built, frameworks developed, companies founded
  • Python was generally not a central part of those solutions
• A lot of our nice things weren’t much help for data munging and counting at scale (more on this later)
We’re lucky to have lots of nice things
• What a language!
• IPython: interactive computing and collaboration
• Libraries to solve nearly any (non-big data) problem
• Trustworthy (medium) data wrangling, statistics, machine learning
• HPC / GPU / parallel computing frameworks
• FFI tools
• … and much more
So, what kind of big data?
• Big multidimensional arrays / linear algebra
• Big tables (structured data)
• Big text data (unstructured data)
• Personally, I am mostly interested in big tables
What kind of big data problems?
• ETL / Data Wrangling
  • Python has been used here for years with Hadoop Streaming
• BI / Analytics (“things you can do in SQL”)
• Advanced Analytics / Machine Learning
Some ways we are #winning
• Python seen as a viable alternative to SAS/MATLAB/proprietary software without nearly as much arguing
• Huge uptake in the financial sector
• Many current and upcoming generations of data scientists learning Python as a first language
• Python in HPC / scientific computing
Some ways we are not #winning
• Python still doesn’t have a great “big data story”
• Little venture capital trickling down to Python projects
• Data structures and programming APIs lagging modern realities
  • Weak support for emerging data formats
• Many companies with Python big data successes have not open-sourced their work
Python in big data workflows in practice
[Diagram. Component labels: HDFS, Hadoop-MR, Spark, SQL, DSLs, pandas, viz tools, ML / stats. Axis labels: “Big Data, Many Machines” to “Small/Medium Data, One Machine”; “More counting / ETL” to “More insights / reporting”.]
Big data storage formats
• JSON and CSV are not a good way to warehouse data
• Apache Avro
  • Compact binary data serialization format
  • RPC framework
• Apache Parquet
  • Efficient columnar data format optimized for HDFS
  • Supports nested and repeated fields, compression, encoding schemes
  • Co-developed by Twitter and Cloudera
  • Reference impl’s in Impala (C++), and standalone Java/Scala (used in Spark)
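To make the contrast between text rows and a columnar binary layout concrete, here is a minimal standard-library sketch. It is not Avro or Parquet: the field names (`user_id`, `amount`), the record count, and the in-memory layout are illustrative assumptions. It stores the same records as JSON lines and as typed column buffers, then sums one field from each.

```python
import json
from array import array

# Hypothetical records; the field names are made up for illustration.
records = [{"user_id": i, "amount": i * 0.5} for i in range(1000)]

# Row-oriented text: every record repeats its field names and stores numbers as text.
json_lines = "\n".join(json.dumps(r) for r in records).encode("utf-8")

# Column-oriented binary: one typed, contiguous buffer per field.
columns = {
    "user_id": array("q", (r["user_id"] for r in records)),  # 64-bit signed ints
    "amount": array("d", (r["amount"] for r in records)),    # 64-bit floats
}
columnar_bytes = sum(len(col.tobytes()) for col in columns.values())

# Summing one field from the row layout means parsing every full record...
total_from_rows = sum(json.loads(line)["amount"] for line in json_lines.splitlines())
# ...while the columnar layout touches only the buffer it needs.
total_from_columns = sum(columns["amount"])

print(len(json_lines), columnar_bytes)
print(total_from_rows == total_from_columns)
```

Real columnar formats add compression, encodings, and nested-field support on top of this basic idea; the sketch only shows why scanning a single column is cheaper when values of one field sit together.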
We’re living in a JVM world
• Scala rapidly taking over big data analytics
  • Functional, concise, good for building high level DSLs
  • Build nice Scala APIs to clunkier Java frameworks
• JVM legitimately good for concurrent, distributed systems
• Binary interface with Python a major issue
Dremel, baby, Dremel…
• VLDB 2010: Dremel: Interactive Analysis of Web-Scale Datasets
  • Inspiration for Parquet (cf. blog “Dremel made easy with Parquet”)
  • Peta-scale analytics directly on nested data
• Google BigQuery said to be an IaaS-ification of Dremel
  • Supports SQL variant + new user-defined functions with JavaScript + V8
SELECT COUNT(c1 > c2) FROM (SELECT SUM(a.b.c.d) WITHIN RECORD AS c1,
SUM(a.b.p.q.r) WITHIN RECORD AS c2
FROM T3)
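For readers unfamiliar with `WITHIN RECORD`, the following Python sketch mimics the query above on a few hand-made nested records. The sample records and the helper names (`iter_path`, `sum_within_record`) are hypothetical, and treating `COUNT(c1 > c2)` as "count the records where c1 exceeds c2" is one plausible reading rather than a statement of Dremel's exact semantics.

```python
def iter_path(node, path):
    """Yield every leaf value along a dotted path, fanning out over repeated (list) fields."""
    if isinstance(node, list):
        for item in node:
            yield from iter_path(item, path)
    elif not path:
        yield node
    elif isinstance(node, dict) and path[0] in node:
        yield from iter_path(node[path[0]], path[1:])


def sum_within_record(record, dotted_path):
    """Aggregate inside a single record, in the spirit of SUM(...) WITHIN RECORD.

    Note: Python's sum() returns 0 for a missing field, where SQL would give NULL.
    """
    return sum(iter_path(record, dotted_path.split(".")))


# Hypothetical stand-in for table T3: field b is repeated, and so is p.q.
T3 = [
    {"a": {"b": [{"c": {"d": 5}, "p": {"q": [{"r": 1}, {"r": 2}]}},
                 {"c": {"d": 4}}]}},
    {"a": {"b": [{"c": {"d": 1}, "p": {"q": [{"r": 10}]}}]}},
    {"a": {"b": []}},
]

per_record = [
    (sum_within_record(rec, "a.b.c.d"), sum_within_record(rec, "a.b.p.q.r"))
    for rec in T3
]
count = sum(1 for c1, c2 in per_record if c1 > c2)
print(per_record, count)
```

The point of the sketch is that each record is aggregated on its own, without first flattening the nested data into one row per leaf value.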
Cloudera Impala
• Open-source interactive SQL for Hadoop
• Analytical query processor written in C++ with LLVM code generation
  • Optimized to scan tables (best as Parquet format) in HDFS
  • SQL front-end and query optimizer / planner
  • User-defined function API (C++)
  • impyla enables Python UDFs to be compiled with Numba to LLVM IR
Cloudera Impala (cont’d)
• For high performance big data analytics, Impala could be Python’s best friend
• C++/LLVM backend is lower-level than SQL
• Nested data support is coming
Set point: Hadley Wickham
• R has upped its game with dplyr, tidyr, and other new projects
• New standard for a uniform interface to either in-memory or in-database data processing
• Composable table primitive operations
• Multiple major versions shipped, getting adopted
80dc69b 2012-10-28 | Initial commit of dplyr [hadley]
tbl %>% filter(c == 'bar') %>% group_by(a, b) %>% summarise(metric = mean(d - f)) %>% arrange(desc(metric))
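To show what "composable table primitive operations" can look like on the Python side, here is a standard-library sketch of the same filter / group / summarise / arrange chain. The helper names (`pipe`, `filter_rows`, `group_summarise`, `arrange_desc`) and the sample rows are invented for illustration; this is not dplyr's or pandas' API.

```python
from collections import defaultdict
from functools import reduce
from statistics import mean


def pipe(table, *steps):
    """Apply each step to the result of the previous one, like a chain of %>%."""
    return reduce(lambda tbl, step: step(tbl), steps, table)


def filter_rows(pred):
    return lambda rows: [r for r in rows if pred(r)]


def group_summarise(keys, **aggs):
    """Group rows by `keys`, then compute one named aggregate per group."""
    def step(rows):
        groups = defaultdict(list)
        for r in rows:
            groups[tuple(r[k] for k in keys)].append(r)
        return [
            dict(zip(keys, key), **{name: fn(grp) for name, fn in aggs.items()})
            for key, grp in groups.items()
        ]
    return step


def arrange_desc(col):
    return lambda rows: sorted(rows, key=lambda r: r[col], reverse=True)


# Hypothetical table with the columns used in the R example.
tbl = [
    {"a": "x", "b": 1, "c": "bar", "d": 10.0, "f": 4.0},
    {"a": "x", "b": 1, "c": "bar", "d": 8.0, "f": 4.0},
    {"a": "y", "b": 2, "c": "bar", "d": 20.0, "f": 5.0},
    {"a": "y", "b": 2, "c": "foo", "d": 99.0, "f": 0.0},
]

result = pipe(
    tbl,
    filter_rows(lambda r: r["c"] == "bar"),
    group_summarise(("a", "b"), metric=lambda grp: mean(r["d"] - r["f"] for r in grp)),
    arrange_desc("metric"),
)
print(result)
```

Each step takes a table and returns a table, which is what lets the operations be chained in any order and, in principle, be translated to a different backend.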
Blaze
• Shares some semantics with dplyr
• Uses a generalized datashape protocol
• Fresh start in 2014 under Matthew Rocklin’s (Continuum) direction
• Deferred expression API
• Support for piping data between storage systems
• Multiple backends (pandas, SQL, MongoDB, PySpark, …)
• Growing support for out-of-core analytics
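The idea of a deferred expression with multiple backends can be sketched in a few lines of plain Python. This is not Blaze's API: the classes (`Column`, `Greater`, `Filter`), the two toy backends, and the `accounts` table are hypothetical and exist only to show an expression being built first and evaluated later by whichever backend is chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str

    def __gt__(self, value):
        # Comparing a column does no work; it just records the intent.
        return Greater(self, value)


@dataclass(frozen=True)
class Greater:
    column: Column
    value: float


@dataclass(frozen=True)
class Filter:
    table: str
    predicate: Greater
    project: str


def to_sql(expr):
    """Backend 1: translate the expression into a SQL string."""
    pred = expr.predicate
    return f"SELECT {expr.project} FROM {expr.table} WHERE {pred.column.name} > {pred.value}"


def run_in_memory(expr, tables):
    """Backend 2: evaluate the same expression over lists of dicts."""
    pred = expr.predicate
    return [
        row[expr.project]
        for row in tables[expr.table]
        if row[pred.column.name] > pred.value
    ]


# Building the expression computes nothing yet.
expr = Filter("accounts", Column("balance") > 100, project="name")

sql = to_sql(expr)
names = run_in_memory(
    expr,
    {"accounts": [{"name": "Alice", "balance": 250}, {"name": "Bob", "balance": 50}]},
)
print(sql)
print(names)
```

Separating the description of the computation from its execution is what allows one user-facing API to target in-memory data, a SQL database, or a cluster.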
libdynd
• Led by Mark Wiebe at Continuum Analytics
• Pure C++11 modern reimagining of NumPy
• Python bindings
• Supports variadic data cells and nested types (datashape protocol)
• Development has focused on the data container design over analytics
PySpark
• Popularity may exceed official Scala API
• Spark was not exactly designed to be an ideal companion to Python
• General architecture
  • Users build Spark deferred expression graphs in Python
  • User-supplied functions are serialized and broadcast around the cluster
  • Spark plans job and breaks work into tasks executed by Python worker jobs
  • Data is managed / shuffled by the Spark Scala master process
  • Python used largely as a black box to transform input to output
PySpark: Some more gory details
• Spark master controlled using py4j
  • Py4J docs: “If performance is critical to your application, accessing Java objects from Python programs might not be the best idea”
• Data is marshalled mostly with files with various serialization protocols (pickle + bespoke formats)
• Does not natively interface with NumPy (yet)
• But, the in-memory benefits of Spark over Hadoop Streaming alternatives massively outweigh the downsides
# pass large object by py4j is very slow and need much memory
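The "Python as a black box" pattern described on these slides can be imitated in a single process. The sketch below is a simulation under stated assumptions, not PySpark code: `python_worker` is a hypothetical stand-in for a worker process, the partitions are made up, and the function is shipped with stdlib `pickle`, which can only send importable functions by reference.

```python
import math
import pickle


def python_worker(func_bytes, batch_bytes):
    """Stand-in for a Python worker: it sees only bytes in and bytes out."""
    func = pickle.loads(func_bytes)
    rows = pickle.loads(batch_bytes)
    return pickle.dumps([func(row) for row in rows])


# Hypothetical partitions of a dataset held by the (non-Python) engine.
partitions = [[1.0, 4.0], [9.0, 16.0]]

# The user-supplied function is serialized once and sent to every worker.
func_bytes = pickle.dumps(math.sqrt)

results = []
for part in partitions:
    # Every batch is serialized on the way in and deserialized on the way out.
    out_bytes = python_worker(func_bytes, pickle.dumps(part))
    results.extend(pickle.loads(out_bytes))

print(results)
```

Each value crosses a serialization boundary twice per task, which is the overhead the slides point at when they mention marshalling and the lack of a native NumPy interface.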
Spartan
• http://github.com/spartan-array/spartan
• Python distributed array expression evaluator (“distributed NumPy”)
• Developed by Russell Power & others at NYU
• Uses ZeroMQ and custom RPC implementation
Things I think we should do
• Create high fidelity data structures for Dremel-style data
• Get serious about Avro, Parquet, and other new data format standards
• Invest in the Python-Impala-LLVM relationship
• Efficient binary protocols to receive and emit data from Python processes
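As a rough illustration of the last point, the sketch below frames typed columns as a small fixed header followed by raw bytes and reads them back from a stream. The framing (`HEADER`, `write_column`, `read_column`) is a made-up toy format, not an existing protocol, and it assumes both ends share the same native byte order for the payload.

```python
import io
import struct
from array import array

# Hypothetical frame header: 1-byte array type code + 8-byte element count.
HEADER = struct.Struct("<cQ")


def write_column(stream, values):
    """Emit one typed column as header + raw buffer, with no per-value encoding."""
    stream.write(HEADER.pack(values.typecode.encode("ascii"), len(values)))
    stream.write(values.tobytes())


def read_column(stream):
    """Read one column back; the payload is copied straight into a typed array."""
    typecode, count = HEADER.unpack(stream.read(HEADER.size))
    values = array(typecode.decode("ascii"))
    values.frombytes(stream.read(count * values.itemsize))
    return values


# Round trip through an in-memory stream standing in for a pipe or socket.
buf = io.BytesIO()
write_column(buf, array("d", [1.5, 2.5, 3.5]))
write_column(buf, array("q", [1, 2, 3]))
bytes_written = buf.tell()

buf.seek(0)
amounts = read_column(buf)
ids = read_column(buf)
print(bytes_written, amounts.tolist(), ids.tolist())
```

Because the receiver knows the type and count up front, it can load a whole column in one call instead of parsing or unpickling value by value.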
Conclusions
• Python + PyData stack is as strong as ever, and still gaining momentum
• The time for a “dark horse” Python-centric big data solution has probably passed us by. Maybe better to pursue alliances.
• Focused work is needed to still be relevant in 2020. Some of our competitive advantages are eroding.