Post on 07-Jul-2018
8/18/2019 12 Useful Pandas Techniques in Python for Data Manipulation
12 Useful Pandas Techniques in Python for Data Manipulation
Introduction
Python is fast becoming the preferred language for data scientists, and for
good reasons. It provides the larger ecosystem of a programming language and
the depth of good scientific computation libraries. If you are starting to learn
Python, have a look at the learning path on Python.
Among its scientific computation libraries, I found Pandas to be the most useful
for data science operations. Pandas, along with Scikit-learn, provides almost the
entire stack needed by a data scientist. This article focuses on providing 12
ways for data manipulation in Python. I've also shared some tips &
tricks which will allow you to work faster.
I would recommend that you look at the codes for data exploration before going
ahead. To help you understand better, I've taken a data set to perform these
operations and manipulations.
Data Set: I've used the data set of the Loan Prediction problem. Download the data
set and get started.
Learning path on Python: http://www.analyticsvidhya.com/learning-paths-data-science-business-analytics-business-intelligence-big-data/learning-path-data-science-python/
Codes for data exploration: http://www.analyticsvidhya.com/blog/2015/02/data-exploration-preparation-model/
Loan Prediction data set: http://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction
Let's get started
I'll start by importing modules and loading the data set into the Python environment:
import pandas as pd
import numpy as np
data = pd.read_csv("train.csv", index_col="Loan_ID")
#1 – Boolean Indexing
What do you do, if you want to filter values of a column based on conditions
from another set of columns? For instance, we want a list of all females who
are not graduate and got a loan. Boolean indexing can help here. You can use
the following code:
data.loc[(data["Gender"]=="Female") & (data["Education"]=="Not Graduate") &
         (data["Loan_Status"]=="Y"), ["Gender","Education","Loan_Status"]]
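To see the pattern in isolation, here is a minimal sketch on a small made-up frame. The rows and the name `toy` are invented for illustration and are not taken from the loan data set. It shows that each condition needs its own parentheses before being combined with `&`, and that `DataFrame.query` is an equivalent way to write the same filter.

```python
import pandas as pd

# Hypothetical rows that mimic three columns of the loan data set.
toy = pd.DataFrame(
    {
        "Gender": ["Female", "Male", "Female", "Female"],
        "Education": ["Not Graduate", "Not Graduate", "Graduate", "Not Graduate"],
        "Loan_Status": ["Y", "Y", "Y", "N"],
    },
    index=["LP001", "LP002", "LP003", "LP004"],
)

# Each comparison yields a boolean Series; & combines them element-wise.
mask = (
    (toy["Gender"] == "Female")
    & (toy["Education"] == "Not Graduate")
    & (toy["Loan_Status"] == "Y")
)
selected = toy.loc[mask, ["Gender", "Education", "Loan_Status"]]

# The same filter written as a query string.
selected_q = toy.query(
    "Gender == 'Female' and Education == 'Not Graduate' and Loan_Status == 'Y'"
)
```

Dropping the parentheses around a comparison is the usual mistake here, because `&` binds more tightly than `==` in Python.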
Read More: Pandas Selecting and Indexing (http://pandas.pydata.org/pandas-docs/stable/indexing.html)
#2 – Apply Function
It is one of the commonly used functions for playing with data and creating new
variables. Apply returns some value after passing each row/column of a data
frame through some function. The function can be either built-in or user-defined. For
instance, here it can be used to find the #missing values in each row and
column.
#Create a new function:
def num_missing(x):
    return sum(x.isnull())

#Applying per column:
print("Missing values per column:")
print(data.apply(num_missing, axis=0)) #axis=0 defines that function is to be applied on each column

#Applying per row:
print("\nMissing values per row:")
print(data.apply(num_missing, axis=1).head()) #axis=1 defines that function is to be applied on each row
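Thus we get the desired result.
Note: head() function is used in second output because it contains many rows.
Read More: Pandas Reference (apply) (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html#pandas.DataFrame.apply)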
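As a self-contained illustration of what `axis` does, here is a sketch on an invented three-row frame (the frame `toy` and its values are made up). It also shows a vectorised spelling, `isnull().sum()`, which I would expect to give the same counts without calling a Python function once per row or column.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing entries.
toy = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0], "c": ["x", "y", None]}
)

def num_missing(x):
    # x is one column (axis=0) or one row (axis=1), passed in as a Series.
    return x.isnull().sum()

per_column = toy.apply(num_missing, axis=0)
per_row = toy.apply(num_missing, axis=1)

# Vectorised equivalents of the two apply calls.
per_column_fast = toy.isnull().sum(axis=0)
per_row_fast = toy.isnull().sum(axis=1)
```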
#3 – Imputing missing values
'fillna()' does it in one go. It is used for updating missing values with the overall
mean/mode/median of the column. Let's impute the 'Gender', 'Married' and
'Self_Employed' columns with their respective modes.
#First we import a function to determine the mode
from scipy.stats import mode
mode(data['Gender'])
Output: ModeResult(mode=array(['Male'], dtype=object), count=array([489]))
This returns both mode and count. Remember that mode can be an array as
there can be multiple values with high frequency. We will take the first one by
default always using:
mode(data['Gender']).mode[0]
Now we can fill the missing values and check using technique #2.
#Impute the values:
data['Gender'].fillna(mode(data['Gender']).mode[0], inplace=True)
data['Married'].fillna(mode(data['Married']).mode[0], inplace=True)
data['Self_Employed'].fillna(mode(data['Self_Employed']).mode[0], inplace=True)
#Now check the #missing values again to confirm:
print(data.apply(num_missing, axis=0))
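If you would rather stay inside pandas, the same imputation can be sketched with `Series.mode()`. This is an alternative to the scipy call above, not the approach used in this article, and the frame `toy` is invented for illustration. It assigns the filled column back instead of relying on `inplace=True`, which I find easier to reason about.

```python
import numpy as np
import pandas as pd

# Made-up frame with missing categorical values.
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Female", np.nan, "Male"],
        "Married": ["Yes", np.nan, "No", "Yes"],
    }
)

for col in ["Gender", "Married"]:
    # Series.mode() ignores NaN and returns the modes sorted; take the first.
    most_frequent = toy[col].mode()[0]
    # Assign the filled column back rather than filling in place.
    toy[col] = toy[col].fillna(most_frequent)
```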
If you notice the output of step #3, it has a strange property. Each index is made
up of a combination of 3 values. This is called Multi-Indexing. It helps in
performing operations really fast.
Continuing the example from #3, we have the values for each group but they
have not been imputed.
This can be done using the various techniques learned till now.
#iterate only through rows with missing LoanAmount
for i,row in data.loc[data['LoanAmount'].isnull(),:].iterrows():
    ind = tuple([row['Gender'],row['Married'],row['Self_Employed']])
    data.loc[i,'LoanAmount'] = impute_grps.loc[ind].values[0]
#Now check the #missing values again to confirm:
print(data.apply(num_missing, axis=0))
Note:
1. Multi-index requires a tuple for defining groups of indices in a loc statement. This is why a tuple is used in the code above.
2. The .values[0] suffix is required because, by default, a series element is
returned which has an index not matching with that of the dataframe. In this
case, a direct assignment gives an error.
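Here is a minimal sketch of the tuple lookup on made-up rows. I am assuming that impute_grps is a pivot table holding the median LoanAmount for each Gender / Married / Self_Employed group; the frame `toy` and all its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; the last two have a missing LoanAmount.
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Female", "Female", "Male"],
        "Married": ["Yes", "Yes", "No", "No", "Yes"],
        "Self_Employed": ["No", "No", "No", "No", "No"],
        "LoanAmount": [100.0, 140.0, 80.0, np.nan, np.nan],
    }
)

# Assumed stand-in for impute_grps: median LoanAmount per group (3-level index).
impute_grps = toy.pivot_table(
    values="LoanAmount",
    index=["Gender", "Married", "Self_Employed"],
    aggfunc="median",
)

for i, row in toy.loc[toy["LoanAmount"].isnull(), :].iterrows():
    # A full-depth tuple selects exactly one row of the multi-indexed table.
    ind = (row["Gender"], row["Married"], row["Self_Employed"])
    # .loc[ind] is a Series with one entry; .values[0] extracts the scalar.
    toy.loc[i, "LoanAmount"] = impute_grps.loc[ind].values[0]
```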
#6 – Crosstab
This function is used to get an initial "feel" (view) of the data. Credit_
Now, it is evident that people with a credit history have much higher chances of
getting a loan, as 80% of people with a credit history got a loan as compared to only
9% without a credit history.
But that's not it. It tells an interesting story. Since I know that having a credit
history is super important, what if I predict loan status to be Y for ones with
credit history and N otherwise? Surprisingly, we'll be right 82+378=460 times out
of 614, which is a whopping 75%!
I won't blame you if you're wondering why the hell we need statistical
models. But trust me, increasing the accuracy by even 0.001% beyond this
mark is a challenging task. Would you take this challenge?
Note: 75% is on the train set. The test set will be slightly different but close. Also, I
hope this gives some intuition into why even a 0.05% increase in accuracy can
result in a jump of 500 ranks on the Kaggle leaderboard.
Read More: Pandas Reference (crosstab) (http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.crosstab.html)
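For reference, a crosstab with totals and with row percentages can be sketched as follows. The eight rows in `toy` are invented and do not reproduce the loan data; only the last line uses numbers from the text, to spell out the 75% baseline arithmetic.

```python
import pandas as pd

# Made-up sample of the two columns being compared.
toy = pd.DataFrame(
    {
        "Credit_History": [1, 1, 1, 1, 0, 0, 1, 0],
        "Loan_Status": ["Y", "Y", "Y", "N", "N", "N", "Y", "Y"],
    }
)

# Absolute counts, with row and column totals labelled "All".
counts = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], margins=True)

# normalize="index" turns each row into fractions that sum to 1.
shares = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], normalize="index")

# Baseline quoted in the text: 82 + 378 correct predictions out of 614 rows.
baseline_accuracy = (82 + 378) / 614
```

Here `baseline_accuracy` comes out just under 0.75, which is the "whopping 75%" mentioned above.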
#7 – Merge DataFrames
Merging dataframes becomes essential when we have information coming from
different sources to be collated. Consider a hypothetical case where the
average property rates (INR per sq meter) are available for different property
types. Let's define a dataframe as:
#Hypothetical average rates (INR per sq meter) for each property type
prop_rates = pd.DataFrame([1000, 5000, 12000], index=['Rural','Semiurban','Urban'],columns=['rates'])
prop_rates
Now we can merge this information with the original dataframe as:
data_merged = data.merge(right=prop_rates, how='inner',left_on='Property_Area',right_index=True, sort=False)
data_merged.pivot_table(values='Credit_History',index=['Property_Area','rates'], aggfunc=len)
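The pivot table validates successful merge operation. Note that the 'values'
argument is irrelevant here because we are simply counting the values.
Read More: Pandas Reference (merge) (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html#pandas.DataFrame.merge)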
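Another way to check a merge, besides the pivot table, is to let pandas verify it. The sketch below uses made-up frames (`loans` and `rates` are hypothetical names and values) and two optional `merge` arguments: `validate`, which raises if the lookup table has duplicate keys, and `indicator`, which records where each row came from.

```python
import pandas as pd

# Made-up loan rows keyed by a hypothetical loan ID.
loans = pd.DataFrame(
    {"Property_Area": ["Urban", "Rural", "Urban", "Semiurban"]},
    index=["LP001", "LP002", "LP003", "LP004"],
)

# Hypothetical lookup table indexed by property type.
rates = pd.DataFrame(
    {"rates": [1000, 5000, 12000]}, index=["Rural", "Semiurban", "Urban"]
)

merged = loans.merge(
    rates,
    how="left",
    left_on="Property_Area",   # column in the left frame
    right_index=True,          # matched against the index of the right frame
    validate="many_to_one",    # each property type may appear only once in rates
    indicator=True,            # adds a _merge column: both / left_only
)
```

A left join keeps every loan row, so any row where `_merge` is not `both` points to a property type that is missing from the lookup table.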
#8 – Sorting DataFrames
Pandas allows easy sorting based on multiple columns. This can be done as:
data_sorted = data.sort_values(['ApplicantIncome','CoapplicantIncome'], ascending=False)
data_sorted[['ApplicantIncome','CoapplicantIncome']].head(10)
Note: Pandas "sort" function is now deprecated. We should use "sort_values"
instead.
Read More: Pandas Reference (sort_values)
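The `ascending` argument also accepts one flag per column, which is handy when ties should be broken in the opposite direction. A small sketch on invented incomes (the frame `toy` is made up):

```python
import pandas as pd

# Made-up incomes; LP001 and LP003 tie on ApplicantIncome.
toy = pd.DataFrame(
    {
        "ApplicantIncome": [5000, 8000, 5000, 3000],
        "CoapplicantIncome": [0.0, 1500.0, 2000.0, 0.0],
    },
    index=["LP001", "LP002", "LP003", "LP004"],
)

# Highest applicant income first; ties broken by lowest co-applicant income.
toy_sorted = toy.sort_values(
    ["ApplicantIncome", "CoapplicantIncome"], ascending=[False, True]
)
```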
#9 – Plotting (Boxplot & Histogram)
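data.hist(column="ApplicantIncome",by="Loan_Status",bins=30)
This shows that income is not a big deciding factor on its own as there is no
appreciable difference between the people who received and were denied the
loan.
Read More: Pandas Reference (hist) (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html#pandas.DataFrame.hist) | Pandas Reference (boxplot) (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.boxplot.html#pandas.DataFrame.boxplot)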
#10 – Cut function for binning
Sometimes numerical values make more sense if clustered together. For example, suppose
we're trying to model traffic (#cars on road) with time of the day
(minutes). The exact minute of an hour might not be that relevant for predicting
traffic as compared to the actual period of the day like "Morning", "Afternoon",
"Evening", "Night", "Late Night". Modeling traffic this way will be more intuitive
and will avoid overfitting.
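#Binning:
def binning(col, cut_points, labels=None):
    #Add the column's min and max around the cut points
    break_points = [col.min()] + cut_points + [col.max()]
    #Binning using cut function of pandas
    colBin = pd.cut(col,bins=break_points,right=False,labels=labels,include_lowest=True)
    return colBin

#Binning LoanAmount:
cut_points = [90,140,190]
labels = ["low","medium","high","very high"]
data["LoanAmount_Bin"] = binning(data["LoanAmount"], cut_points, labels)
print(data["LoanAmount_Bin"].value_counts(sort=False))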
Read More: Pandas Reference (cut) (http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.cut.html)
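The bin edges deserve a closer look, so here is a sketch on a handful of invented amounts (the series `amounts` is made up; the cut points and labels are the ones used above). With the default right=True each bin includes its upper edge and include_lowest=True keeps the minimum in the first bin. With right=False the bins become left-closed, and as far as I can tell the maximum value then falls outside the last bin and comes back as missing.

```python
import pandas as pd

# Made-up loan amounts, including values that sit exactly on the edges.
amounts = pd.Series([17, 90, 120, 140, 185, 190, 600])
cut_points = [90, 140, 190]
labels = ["low", "medium", "high", "very high"]

# Pad the cut points with the observed minimum and maximum.
break_points = [amounts.min()] + cut_points + [amounts.max()]

# Default right=True: bins are (a, b], and include_lowest closes the first bin.
binned = pd.cut(amounts, bins=break_points, labels=labels, include_lowest=True)

# right=False: bins are [a, b), so the maximum is not covered by any bin.
binned_left = pd.cut(amounts, bins=break_points, labels=labels, right=False)
```

It is worth checking the count of missing values after binning if you use right=False.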
#11 – Coding nominal data
Often, we find a case where we've to modify the categories of a nominal
variable. This can be due to various reasons:
1. Some algorithms (like Logistic Regression) require all inputs to be numeric.
So nominal variables are mostly coded as 0, 1….(n-1)
2. Sometimes a category might be represented in 2 ways. For e.g. temperature
might be recorded as "Medium", "Low", "
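print(data["Loan_Status_Coded"].value_counts())
Similar counts before and after prove the coding.
Read More: Pandas Reference (replace) (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html#pandas.DataFrame.replace)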
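A minimal sketch of coding a nominal variable with `replace` and a dictionary follows. The series `status` and the mapping `codes` are invented for illustration, and this is only one way to do the coding. It ends with the before/after count comparison described above.

```python
import pandas as pd

# Made-up loan statuses.
status = pd.Series(["Y", "N", "Y", "Y"], name="Loan_Status")

# Hypothetical coding scheme: one integer per category.
codes = {"N": 0, "Y": 1}

before = status.value_counts()
coded = status.replace(codes)
after = coded.value_counts()
```

If the sorted counts in `before` and `after` match, no category was merged or lost during coding.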
#12 – Iterating over rows of a dataframe
This is not a frequently used operation. Still, you don't want to get stuck. Right?
At times you may need to iterate through all rows using a for loop. For instance,
one common problem we face is the incorrect treatment of variables in Python.
This generally happens when:
1. Nominal variables with numeric categories are treated as numerical.
2. Numeric variables with characters entered in one of the rows (due to a data
error) are considered categorical.
So it's generally a good idea to manually define the column types. If we check
the data types of all columns:
#Check current type:
data.dtypes
#Note: astype is used to assign types
#Assumes colTypes has a 'feature' column (column name) and a 'type' column
for i, row in colTypes.iterrows(): #i: dataframe index; row: each row in series format
    if row['type']=="categorical":
        data[row['feature']]=data[row['feature']].astype(object)
    elif row['type']=="continuous":
        data[row['feature']]=data[row['feature']].astype(float)
print(data.dtypes)
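Now the credit history column is modified to 'object' type which is used for
representing nominal variables in Pandas.
Read More: Pandas Reference (iterrows) (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html#pandas.DataFrame.iterrows)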
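To make the loop concrete, here is a self-contained sketch. The frame `toy` and the type table `col_types` (with assumed columns `feature` and `type`) are invented for illustration. It also shows that `astype` accepts a column-to-dtype mapping, which gives the same result without an explicit loop.

```python
import pandas as pd

# Made-up data where a nominal variable is stored as integers.
toy = pd.DataFrame({"Credit_History": [1, 0, 1], "LoanAmount": [120, 66, 141]})

# Hypothetical type table: one row per column of toy.
col_types = pd.DataFrame(
    {
        "feature": ["Credit_History", "LoanAmount"],
        "type": ["categorical", "continuous"],
    }
)

for _, row in col_types.iterrows():
    # row is a Series holding one line of the type table.
    if row["type"] == "categorical":
        toy[row["feature"]] = toy[row["feature"]].astype(object)
    elif row["type"] == "continuous":
        toy[row["feature"]] = toy[row["feature"]].astype(float)

# Same conversion without a loop, starting again from the raw values.
raw = pd.DataFrame({"Credit_History": [1, 0, 1], "LoanAmount": [120, 66, 141]})
toy_alt = raw.astype({"Credit_History": object, "LoanAmount": float})
```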
End Notes
In this article, we covered various functions of Pandas which can make our life
easy while performing data exploration and feature engineering. Also, we
defined some generic functions which can be reused for achieving similar
objective on different datasets.
Also See: If you have any doubts pertaining to Pandas or Python in general,
feel free to discuss with us.
Did you find the article useful? Do you use some better (easier/faster)
techniques for performing the tasks discussed above? Do you think there are
better alternatives to Pandas in Python? We'll be glad if you share your
thoughts as comments below.
http://discuss.analyticsvidhya.com/