Real World Data Analysis PANDAS · 5.03.2019

Real World Data Analysis PANDAS

PYTHON PACKAGE FOR FAST, FLEXIBLE, AND EXPRESSIVE DATA STRUCTURES

DR. SYED IMTIYAZ HASSAN

ASSISTANT PROFESSOR, DEPARTMENT OF CSE, JAMIA HAMDARD (DEEMED TO BE UNIVERSITY), NEW DELHI, INDIA.

https://syedimtiyazhassan.org [email protected] http://www.jamiahamdard.edu



INTRODUCTION

pandas is a Python package providing fast, flexible, and expressive data structures.

Designed to make working with “relational” or “labeled” data both easy and intuitive.

Prepared from:

https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html


WELL SUITED FOR

Tabular data with heterogeneously-typed columns.

Ordered and unordered (not necessarily fixed-frequency) time series data.

Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.

Any other form of observational / statistical data sets.

The data need not be labeled at all to be placed into a pandas data structure.


DATA STRUCTURES

Series: 1D labeled homogeneously-typed array.

DataFrame: General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns.


SERIES

Create a Series by passing a list of values, letting pandas create a default integer index.


import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
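A minimal sketch of what the default index and the missing value do to the resulting Series. The second Series and its labels "a", "b", "c" are illustrative assumptions, not part of the slides.

```python
import numpy as np
import pandas as pd

# A list of values gets a default integer index: 0, 1, 2, ...
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# The missing value forces the whole Series to a floating-point dtype.
print(s.dtype)          # float64
print(list(s.index))    # [0, 1, 2, 3, 4, 5]

# An explicit index (hypothetical labels) can be supplied instead.
labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(labeled["b"])     # 20
```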


OBJECT CREATION

Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns.

NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column.

dates = pd.date_range('20130101', periods=6)
dates

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
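The per-column dtype point can be checked with a short sketch. The seeded generator and the extra "label" column are assumptions added here for reproducibility and illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
rng = np.random.default_rng(0)  # seeded generator, an assumption for reproducibility
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

# Adding a string column (hypothetical) leaves the float columns untouched,
# because each column keeps its own dtype.
df["label"] = "x"
print(df.dtypes)
```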


OBJECT CREATION

Create a DataFrame by passing a dict of objects that can be converted to a series-like structure.

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})

df2

df2.dtypes


VIEWING DATA

df.head()

df.tail(3)

df.index

df.columns

df.describe()

df.T

df.to_numpy()
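The viewing calls above can be tried on a small frame. This is a sketch with made-up columns "x" and "y", chosen so the results are easy to verify by hand.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [10, 20, 30, 40]})

print(df.head(2))          # first two rows
print(df.tail(1))          # last row
print(df.T.shape)          # (2, 4): rows and columns swapped

stats = df.describe()
print(stats.loc["mean", "x"])   # 2.5

# Mixed int/float columns are upcast to a single dtype in the array.
arr = df.to_numpy()
print(arr.dtype)           # float64
```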


SORTING

By Axis

df.sort_index(axis=1, ascending=False)

By Values

df.sort_values(by='B')
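A small sketch of the two sorting modes on hand-picked values. The row labels "r1" to "r3" are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"A": [9, 8, 7], "B": [3, 1, 2]}, index=["r1", "r2", "r3"])

by_axis = df.sort_index(axis=1, ascending=False)  # column labels, descending
by_value = df.sort_values(by="B")                 # rows ordered by column B

print(list(by_axis.columns))   # ['B', 'A']
print(list(by_value.index))    # ['r2', 'r3', 'r1']
```

Both calls return new objects; df itself keeps its original order.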


SELECTION

Getting

Selection by Label: df.loc, df.at

Selection by Position: df.iloc, df.iat

Boolean Indexing


GETTING

Selecting a single column, which yields a Series, equivalent to df.A.

df['A']

df.A

Selecting via [], which slices the rows.

df[0:3]

df['20130102':'20130104']


SELECTION BY LABEL

Selecting on a multi-axis by label.


df1 = pd.DataFrame(np.random.randn(6, 4))

df1.loc[0]

df.loc[dates[0]]

df.loc[:, ['A', 'B']]

df.loc['20130102':'20130104', ['A', 'B']]

df.loc['20130102', ['A', 'B']]

df.loc[dates[0], 'A']

df.at[dates[0], 'A']  # fast access to a single scalar (same value as the line above)


SELECTION BY POSITION


df.iloc[3]

df.iloc[3:5, 0:2]

df.iloc[[1, 2, 4], [0, 2]]

df.iloc[1:3, :]

df.iloc[:, 1:3]

df.iat[1, 1]

df.iloc[1, 1]
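One difference between the two selection styles is worth seeing side by side: label slices with loc include both endpoints, while position slices with iloc exclude the end. A minimal sketch, using np.arange values (an assumption made so the cells are predictable):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

by_label = df.loc["20130102":"20130104", ["A", "B"]]  # both endpoints included
by_position = df.iloc[1:4, 0:2]                        # end position excluded

print(by_label.equals(by_position))   # True
print(df.at[dates[0], "A"])           # 0
print(df.iat[1, 1])                   # 5
```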


BOOLEAN INDEXING


df[df.A > 0]

df[df > 0]

df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2

df2[df2['E'].isin(['two', 'four'])]
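The three boolean-indexing forms behave differently, which a small frame with known signs makes visible. The values below are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"A": [-1.0, 2.0, -3.0, 4.0], "B": [5.0, -6.0, 7.0, -8.0]})

rows = df[df.A > 0]     # keeps only rows where column A is positive
masked = df[df > 0]     # keeps the shape; cells failing the test become NaN

print(list(rows.index))                  # [1, 3]
print(int(masked.isna().sum().sum()))    # 4

df["E"] = ["one", "two", "three", "four"]
picked = df[df["E"].isin(["two", "four"])]   # membership filter
print(list(picked.index))                # [1, 3]
```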


SETTING


s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
s1

df['F'] = s1

df.at[dates[0], 'A'] = 0

df.iat[0, 1] = 0
df

df2 = df.copy()
df2[df2 > 0] = -df2
df2
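Assigning a Series as a new column aligns it on the index rather than on position. The sketch below assumes a zero-filled frame so that only the new column matters.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.zeros((6, 2)), index=dates, columns=["A", "B"])

# s1 starts one day later than df, so the first row of df has no match
# and the last value of s1 has nowhere to go.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
df["F"] = s1

print(df["F"].isna().tolist())  # [True, False, False, False, False, False]
print(df.loc[dates[1], "F"])    # 1.0
```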


MISSING DATA

df1 = df.copy()

Drop any rows that have missing data.

df1.dropna(how='any')

Filling missing data.

df1.fillna(value=5)

Get the Boolean mask where values are nan.

pd.isna(df1)
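A minimal sketch of the three missing-data calls on a frame with two known gaps. It also shows that dropna and fillna return new frames and leave the original unchanged.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

dropped = df1.dropna(how="any")   # returns a new frame
filled = df1.fillna(value=5)      # returns a new frame
mask = pd.isna(df1)

print(len(dropped))                 # 1
print(filled.loc[1, "A"])           # 5.0
print(int(mask.sum().sum()))        # 2
print(int(df1.isna().sum().sum()))  # 2, df1 itself is unchanged
```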


OPERATIONS

Stats

Apply

Concat

Join

Append

Grouping


STATS AND APPLY

Operations in general exclude missing data.

df.mean()

Same operation on the other axis.

df.mean(1)

Applying functions to the data.

df.apply(np.cumsum)

df.apply(lambda x: x.max() - x.min())
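The claim that operations skip missing data can be verified on a tiny frame. The values are assumptions picked so the means are easy to compute by hand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, np.nan], "B": [4.0, 6.0, 8.0]})

print(df.mean().tolist())        # [1.5, 6.0]  the NaN in A is skipped
print(df.mean(axis=1).tolist())  # [2.5, 4.0, 8.0]

# apply runs the function once per column by default.
spread = df.apply(lambda x: x.max() - x.min())
print(spread.tolist())           # [1.0, 4.0]
```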


HISTOGRAM

s = pd.Series(np.random.randint(0, 7, size=10))
s

s.value_counts()
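Because the slide uses random integers, the counts differ on every run. A sketch with fixed values (an assumption) makes the result predictable:

```python
import pandas as pd

s = pd.Series([4, 2, 4, 0, 4, 2])

# value_counts tallies each distinct value, most frequent first.
counts = s.value_counts()
print(counts.to_dict())   # {4: 3, 2: 2, 0: 1}
```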


CONCAT

df = pd.DataFrame(np.random.randn(10, 4))

df

pieces = [df[:3], df[3:7], df[7:]]
pieces

pd.concat(pieces)
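A quick check, using np.arange values in place of random ones (an assumption), that concatenating the row slices reproduces the original frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(40).reshape(10, 4))

# Split the rows into three pieces, then stack them back together.
pieces = [df[:3], df[3:7], df[7:]]
rebuilt = pd.concat(pieces)

print(rebuilt.equals(df))   # True
```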


JOIN

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
left

right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

right

pd.merge(left, right, on='key')


JOIN

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
left

right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
right

pd.merge(left, right, on='key')
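The two JOIN examples differ only in whether the key repeats. A sketch comparing their results (the variable names dup_left, uniq_left, and so on are introduced here for illustration):

```python
import pandas as pd

# Duplicate keys: every left 'foo' pairs with every right 'foo'.
dup_left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
dup_right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
dup_merged = pd.merge(dup_left, dup_right, on="key")
print(len(dup_merged))   # 4

# Unique keys: one output row per matching key.
uniq_left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
uniq_right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
uniq_merged = pd.merge(uniq_left, uniq_right, on="key")
print(uniq_merged.to_dict("records"))
```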


APPEND

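df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df

s = df.iloc[3]
s

# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0.
# On pandas 2.0 or later, use: pd.concat([df, s.to_frame().T], ignore_index=True)
df.append(s, ignore_index=True)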
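DataFrame.append was removed in pandas 2.0, so on current versions the same row append can be written with pd.concat. A minimal sketch, assuming np.arange values so the appended row is recognizable:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(32).reshape(8, 4), columns=["A", "B", "C", "D"])
s = df.iloc[3]

# s.to_frame().T turns the row Series back into a one-row DataFrame,
# which pd.concat can then stack under df.
appended = pd.concat([df, s.to_frame().T], ignore_index=True)

print(appended.shape)               # (9, 4)
print(appended.iloc[8].tolist())    # [12, 13, 14, 15]
```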


GROUPING

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})

df

df.groupby('A').sum()

df.groupby(['A', 'B']).sum()
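Grouping splits the rows by key, applies a function to each group, and combines the results. A sketch with small integer values (an assumption) so the sums can be checked by hand:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "one"],
    "C": [1, 2, 3, 4],
})

by_a = df.groupby("A")["C"].sum()          # one key
by_ab = df.groupby(["A", "B"])["C"].sum()  # two keys give a hierarchical index

print(by_a.to_dict())    # {'bar': 6, 'foo': 4}
print(by_ab.to_dict())   # {('bar', 'one'): 6, ('foo', 'one'): 1, ('foo', 'two'): 3}
```

Selecting column C before calling sum keeps the text column B out of the aggregation.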


PLOTTING


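# ts is not defined on this slide; the line below is an assumed definition,
# modeled on the pandas getting-started guide the deck was prepared from.
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df

df = df.cumsum()

df.plot()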


DATA FILES

Format Type   Data Description       Reader           Writer
text          CSV                    read_csv         to_csv
text          JSON                   read_json        to_json
text          HTML                   read_html        to_html
text          Local clipboard        read_clipboard   to_clipboard
binary        MS Excel               read_excel       to_excel
binary        HDF5 Format            read_hdf         to_hdf
binary        Feather Format         read_feather     to_feather
binary        Parquet Format         read_parquet     to_parquet
binary        Msgpack                read_msgpack     to_msgpack
binary        Stata                  read_stata       to_stata
binary        SAS                    read_sas
binary        Python Pickle Format   read_pickle      to_pickle
SQL           SQL                    read_sql         to_sql
SQL           Google Big Query       read_gbq         to_gbq

Note: read_msgpack and to_msgpack were removed in pandas 1.0.

df.to_csv('foo.csv')

pd.read_csv('foo.csv')

df.to_excel('foo.xlsx', sheet_name='Sheet1')

pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
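A reader and writer pair can be exercised without touching the disk. This sketch round-trips a small frame through CSV using an in-memory buffer; the buffer and the index=False choice are assumptions made to keep the example self-contained.

```python
import io
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False avoids writing the row labels
buffer.seek(0)                   # rewind before reading
restored = pd.read_csv(buffer)

print(list(restored.columns))    # ['A', 'B']
print(restored["A"].tolist())    # [1, 2]
```

Without index=False, to_csv writes the index as an extra unnamed column, which read_csv then returns as data unless index_col is given.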


THANK YOU