20161215 Python / Pandas / Spark: Assorted Topics

Post on 23-Jan-2018


Python, Pandas, Spark 2.0

Sky

• Python 2000

(**)

• db tech showcase MongoDB

• FB: Ryuji Tamagawa

• Twitter: tamagawa_ryuji

2017

• Python Spark

• Python / Pandas

• Spark 2.0

Part 1 :

csv

Python

Pandas Python

Jupyter Notebook

Jenkins

Spark 2.0

• Spark API: RDD (~1.3), DataFrame / DataSet (1.4~)

• DataFrame API

RDD API Python Spark

DataFrame

• RDB /

• R Pandas Spark

Spark

R / Pandas

Spark +

Part 2 :

CSV zip

RDB

Parquet

Excel

CSV

Feather

Spark

Pandas / Spark

• CPU

• Pandas read_csv zip CSV

Pandas

2

• CSV CPU

Pandas zip CSV

CPU …

• Parquet !
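The bullets above mention reading a zip-compressed CSV with Pandas `read_csv`. A minimal sketch of that follows, assuming a zip archive that holds exactly one CSV file. The file names and sample rows are hypothetical and exist only for this illustration.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Hypothetical sample data, used only for this illustration.
csv_text = "id,name\n1,alpha\n2,beta\n3,gamma\n"

tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "sample.zip")

# Store a single CSV member inside a zip archive.
with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("sample.csv", csv_text)

# read_csv decompresses the archive while parsing; the archive must
# contain exactly one file. compression="zip" is spelled out here,
# although the default "infer" would also detect it from ".zip".
df = pd.read_csv(zip_path, compression="zip")
print(df)
```

Decompression happens inside the same process that parses the CSV, so reading this way trades disk I/O for extra CPU work.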

: Parquet

I/O

• Spark Parquet

• Python Parquet

[Figure: decision chart with Yes / No branches; labels: HDFS / S3, SSD, HDD, Parquet]

• I/O Pandas

• Spark

• DataFrame Pandas → Spark

Spark → Pandas Pandas → Spark

• Apache Arrow

CPU

~2010

2010~ SSD

CPU

Apache Spark 2.0

• 1.x

• 2.0

1.x

• DataFrame API Python

• databricks

http://go.databricks.com/mastering-apache-spark-2.0

Spark 2.0

• CPU

• CPU

• SQL DataFrame

• + SSD

• CSV zip

Pandas read_csv

Python + Spark

• Python serialize

• DataFrame API UDF

UDF Scala/Java

• http://www.slideshare.net/dragan10/performant-data-processing-with-pyspark-sparkr-and-dataframe-api

Executor

JVM

DataFrame, Cached

Python

lambda items: items[0] == 'abc'

transfer

DataFrame, result

transfer

Driver