20161215 python pandas-spark四方山話 (Python, Pandas, and Spark: assorted talk)

Transcript of 20161215 python pandas-spark四方山話

Page 1: 20161215 python pandas-spark四方山話

Python, Pandas, Spark 2.0

Sky

Page 2: 20161215 python pandas-spark四方山話
Page 3: 20161215 python pandas-spark四方山話

• Python 2000

(**)

• db tech showcase MongoDB

• FB: Ryuji Tamagawa

• Twitter: tamagawa_ryuji

Page 4: 20161215 python pandas-spark四方山話
Page 5: 20161215 python pandas-spark四方山話

2017

Page 6: 20161215 python pandas-spark四方山話

• Python Spark

Page 7: 20161215 python pandas-spark四方山話

• Python / Pandas

• Spark 2.0

Page 8: 20161215 python pandas-spark四方山話

Part 1 :

Page 9: 20161215 python pandas-spark四方山話

csv

Page 10: 20161215 python pandas-spark四方山話

Python

Pandas Python

Jupyter Notebook

Jenkins

Spark 2.0

Page 11: 20161215 python pandas-spark四方山話

• Spark API: RDD (~1.3), DataFrame / Dataset (1.4~)

• DataFrame API

RDD API Python Spark

Page 12: 20161215 python pandas-spark四方山話

DataFrame

• RDB /

• R Pandas Spark

Spark

R / Pandas

Spark +

Page 13: 20161215 python pandas-spark四方山話

Part 2 :

Page 14: 20161215 python pandas-spark四方山話

CSV zip

RDB

Parquet

Excel

CSV

Feather

Spark

Pandas / Spark

Page 15: 20161215 python pandas-spark四方山話

• CPU

• Pandas read_csv zip CSV

Pandas
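
As a small illustration of the bullet above, pandas can read a zip-compressed CSV directly through `read_csv`. The sketch below builds its own tiny archive so that it runs as-is; the file names and the columns are made up for illustration and do not come from the slides.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny zip archive holding a single CSV (hypothetical sample data).
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "sample.zip")
with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("sample.csv", "id,name\n1,alpha\n2,beta\n")

# read_csv decompresses the archive on the fly.
# The archive is expected to contain exactly one data file.
df = pd.read_csv(zip_path, compression="zip")
print(df)
```

Decompression and parsing both run on the CPU, so a compressed CSV trades less disk I/O for more CPU work.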

Page 16: 20161215 python pandas-spark四方山話

2

• CSV CPU

Pandas zip CSV

CPU …

• Parquet !

Page 17: 20161215 python pandas-spark四方山話

: Parquet

I/O

• Spark Parquet

• Python Parquet

Page 18: 20161215 python pandas-spark四方山話

HDFS / S3

Parquet Parquet

Page 19: 20161215 python pandas-spark四方山話

SSD

Parquet Parquet

Page 20: 20161215 python pandas-spark四方山話

Parquet

No

No

Yes

HDD

Page 21: 20161215 python pandas-spark四方山話

• I/O Pandas

• Spark

• DataFrame Pandas → Spark

Spark → Pandas Pandas → Spark

• Apache Arrow

Page 22: 20161215 python pandas-spark四方山話

CPU

~2010

2010~ SSD

CPU

Page 23: 20161215 python pandas-spark四方山話

Apache Spark 2.0

• 1.x

• 2.0

1.x

• DataFrame API Python

• databricks

http://go.databricks.com/mastering-apache-spark-2.0

Page 24: 20161215 python pandas-spark四方山話

Spark 2.0

• CPU

• CPU

• SQL DataFrame

• + SSD

• CSV zip

Pandas read_csv

Page 25: 20161215 python pandas-spark四方山話

Python + Spark

• Python serialize

• DataFrame API UDF UDF Scala/Java

• http://www.slideshare.net/dragan10/performant-data-processing-with-pyspark-sparkr-and-dataframe-api

Executor

JVM

DataFrame, Cached

Python

lambda items: items[0] == 'abc'

transfer

DataFrame, result

transfer

Driver

Page 26: 20161215 python pandas-spark四方山話