12 Useful Pandas Techniques in Python for Data Manipulation


  • 8/18/2019 12 Useful Pandas Techniques in Python for Data Manipulation


    12 Useful Pandas Techniques in Python for Data Manipulation

    Introduction

    Python is fast becoming the preferred language for data scientists, and for
    good reasons. It provides the larger ecosystem of a programming language and
    the depth of good scientific computation libraries. If you are starting to
    learn Python, have a look at the learning path on Python.

    Among its scientific computation libraries, I found Pandas to be the most
    useful for data science operations. Pandas, along with Scikit-learn, provides
    almost the entire stack needed by a data scientist. This article focuses on
    providing 12 ways for data manipulation in Python. I've also shared some tips &
    tricks which will allow you to work faster.

    I would recommend that you look at the codes for data exploration before going
    ahead. To help you understand better, I've taken a data set to perform these
    operations and manipulations.

    Data Set: I've used the data set of the Loan Prediction problem. Download the
    data set and get started.

    Learning path on Python: http://www.analyticsvidhya.com/learning-paths-data-science-business-analytics-business-intelligence-big-data/learning-path-data-science-python/
    Data exploration codes: http://www.analyticsvidhya.com/blog/2015/02/data-exploration-preparation-model/
    Loan Prediction data set: http://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction


    Let's get started

    I'll start by importing modules and loading the data set into the Python
    environment:

    import pandas as pd

    import numpy as np

    data = pd.read_csv("train.csv", index_col="Loan_ID")

     

    #1 – Boolean Indexing

    What do you do if you want to filter values of a column based on conditions
    from another set of columns? For instance, we want a list of all females who
    are not graduates and got a loan. Boolean indexing can help here. You can use
    the following code:

    data.loc[(data["Gender"]=="Female") & (data["Education"]=="Not Graduate") &
             (data["Loan_Status"]=="Y"), ["Gender","Education","Loan_Status"]]
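    To see the mechanics without the loan file, here is a minimal sketch on a
    made-up four-row frame. The frame `toy`, its index labels and the names `mask`
    and `selected` are assumptions for illustration only; the column names follow
    the article.

```python
import pandas as pd

# Hypothetical toy frame mirroring three columns of the loan data set.
toy = pd.DataFrame(
    {
        "Gender": ["Female", "Male", "Female", "Female"],
        "Education": ["Not Graduate", "Not Graduate", "Graduate", "Not Graduate"],
        "Loan_Status": ["Y", "Y", "Y", "N"],
    },
    index=["LP1", "LP2", "LP3", "LP4"],
)

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than ==.
mask = (
    (toy["Gender"] == "Female")
    & (toy["Education"] == "Not Graduate")
    & (toy["Loan_Status"] == "Y")
)

# .loc takes the row mask first and the list of columns to keep second.
selected = toy.loc[mask, ["Gender", "Education", "Loan_Status"]]
print(selected)
```

    Building the mask as a separate variable makes each condition easy to inspect
    (for example with `mask.sum()`) before it is used for selection.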


    Read More: Pandas Selecting and Indexing, http://pandas.pydata.org/pandas-docs/stable/indexing.html

     

    #2 – Apply Function

    It is one of the commonly used functions for playing with data and creating new
    variables. Apply passes each row or column of a data frame to a function and
    returns the result. The function can be either built-in or user-defined. For
    instance, here it can be used to find the number of missing values in each row
    and column.

    # Create a new function:
    def num_missing(x):
      return sum(x.isnull())

    # Applying per column:
    print("Missing values per column:")
    # axis=0 defines that function is to be applied on each column
    print(data.apply(num_missing, axis=0))

    # Applying per row:
    print("\nMissing values per row:")
    # axis=1 defines that function is to be applied on each row
    print(data.apply(num_missing, axis=1).head())

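    Thus we get the desired result.

    Note: the head() function is used in the second output because it contains
    many rows.

    Read More: Pandas Reference (apply), http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html#pandas.DataFrame.apply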
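    The same counting can be tried on a small made-up frame. This is only a
    sketch: `toy` and the result names are hypothetical, and the last two lines
    show a vectorised equivalent (`isnull().sum()`) that is not part of the
    article's code but gives the same counts without calling a Python function
    per row.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few holes in it.
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": [np.nan, np.nan, 6.0],
        "c": ["x", "y", None],
    }
)

def num_missing(x):
    # x is one column (axis=0) or one row (axis=1), passed in as a Series.
    return x.isnull().sum()

per_column = toy.apply(num_missing, axis=0)
per_row = toy.apply(num_missing, axis=1)

# Vectorised equivalents: same numbers, no per-row Python call.
per_column_fast = toy.isnull().sum()
per_row_fast = toy.isnull().sum(axis=1)

print(per_column)
print(per_row)
```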

     

    #3 – Imputing missing values

    'fillna()' does it in one go. It is used for updating missing values with the
    overall mean/mode/median of the column. Let's impute the 'Gender', 'Married'
    and 'Self_Employed' columns with their respective modes.

    # First we import a function to determine the mode
    from scipy.stats import mode
    mode(data['Gender'])


    Output: ModeResult(mode=array(['Male'], dtype=object), count=array([489]))

    This returns both mode and count. Remember that mode can be an array as
    there can be multiple values with high frequency. We will always take the
    first one by default using:

    mode(data['Gender']).mode[0]

    Now we can fill the missing values and check using technique #2.

    # Impute the values:
    data['Gender'] = data['Gender'].fillna(mode(data['Gender']).mode[0])
    data['Married'] = data['Married'].fillna(mode(data['Married']).mode[0])
    data['Self_Employed'] = data['Self_Employed'].fillna(mode(data['Self_Employed']).mode[0])

    # Now check the #missing values again to confirm:
    print(data.apply(num_missing, axis=0))
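    The steps above can also be sketched with pandas alone. This is an
    alternative, not the article's method: it swaps `scipy.stats.mode` for
    `Series.mode()`, which ignores missing values and works on string columns
    (newer SciPy releases may reject non-numeric input). The frame `toy` is made
    up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical columns with one gap each.
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Female", np.nan, "Male"],
        "Married": ["Yes", np.nan, "Yes", "No"],
    }
)

for col in ["Gender", "Married"]:
    # Series.mode() returns all most-frequent values, sorted; take the first.
    most_frequent = toy[col].mode()[0]
    # Assign the result back instead of using inplace=True on a column view.
    toy[col] = toy[col].fillna(most_frequent)

print(toy)
print(toy.isnull().sum())
```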


    If you notice the output of step #3, it has a strange property. Each index is
    made up of a combination of 3 values. This is called Multi-Indexing. It helps
    in performing operations really fast.

    Continuing the example from #3, we have the values for each group but they
    have not been imputed.

    This can be done using the various techniques learned till now.

    # iterate only through rows with missing LoanAmount
    for i,row in data.loc[data['LoanAmount'].isnull(),:].iterrows():
      ind = tuple([row['Gender'],row['Married'],row['Self_Employed']])
      data.loc[i,'LoanAmount'] = impute_grps.loc[ind].values[0]

    # Now check the #missing values again to confirm:
    print(data.apply(num_missing, axis=0))

    Note:

    1. Multi-index requires a tuple for defining groups of indices in a loc
       statement. This is why a tuple is built inside the loop.
    2. The .values[0] suffix is required because, by default, a series element is
       returned which has an index not matching that of the dataframe. In this
       case, a direct assignment gives an error.
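    A self-contained sketch of the loop above follows. The step that builds
    `impute_grps` is not part of this text, so the sketch assumes it is a pivot
    table of `LoanAmount` indexed by the three grouping columns; the use of the
    median as the aggregate and the frame `toy` are likewise assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two groups, each with one missing LoanAmount.
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Female", "Female", "Male"],
        "Married": ["Yes", "Yes", "No", "No", "Yes"],
        "Self_Employed": ["No", "No", "No", "No", "No"],
        "LoanAmount": [100.0, 140.0, 80.0, np.nan, np.nan],
    }
)

# Assumed construction of impute_grps: one value per (Gender, Married,
# Self_Employed) combination, which gives the result a 3-level MultiIndex.
impute_grps = toy.pivot_table(
    values="LoanAmount",
    index=["Gender", "Married", "Self_Employed"],
    aggfunc="median",
)

# Iterate only through rows with missing LoanAmount.
for i, row in toy.loc[toy["LoanAmount"].isnull(), :].iterrows():
    # A tuple with one entry per index level addresses exactly one group.
    ind = (row["Gender"], row["Married"], row["Self_Employed"])
    # .loc[ind] returns a one-element Series; .values[0] extracts the scalar.
    toy.loc[i, "LoanAmount"] = impute_grps.loc[ind].values[0]

print(toy)
```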

     

    #6 – Crosstab

    This function is used to get an initial “feel” (view) of the data. Credit_


    Now, it is evident that people with a credit history have much higher chances
    of getting a loan, as 80% of people with credit history got a loan as compared
    to only 9% without credit history.

    But that's not it. It tells an interesting story. Since I know that having a
    credit history is super important, what if I predict loan status to be Y for
    ones with credit history and N otherwise? Surprisingly, we'll be right
    82+378=460 times out of 614, which is a whopping 75%!
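    The reasoning above can be sketched on a tiny made-up sample. The eight rows
    in `toy` are invented for illustration and are not the loan data; only the
    last line reuses the figures quoted in the text (82+378 out of 614).

```python
import pandas as pd

# Hypothetical sample, not the real loan data.
toy = pd.DataFrame(
    {
        "Credit_History": [1, 1, 1, 1, 0, 0, 1, 0],
        "Loan_Status": ["Y", "Y", "Y", "N", "N", "N", "Y", "Y"],
    }
)

# Absolute counts, with row and column totals in the "All" margins.
counts = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], margins=True)

# Row-wise shares: each row sums to 1, so values read as "share within group".
row_pct = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], normalize="index")

# Baseline rule from the text: predict Y with a credit history, N without.
predicted = toy["Credit_History"].map({1: "Y", 0: "N"})
baseline_accuracy = (predicted == toy["Loan_Status"]).mean()

# The accuracy quoted in the text, recomputed from its own numbers.
quoted_accuracy = (82 + 378) / 614

print(counts)
print(row_pct)
print(baseline_accuracy, round(quoted_accuracy, 3))
```

    The diagonal of the count table (no history and N, history and Y) is exactly
    what the baseline rule gets right, which is why the accuracy can be read off
    the crosstab directly.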

    I won't blame you if you're wondering why the hell we need statistical
    models. But trust me, increasing the accuracy by even 0.001% beyond this
    mark is a challenging task. Would you take this challenge?

    Note: 75% is on the train set. The test set will be slightly different but
    close. Also, I hope this gives some intuition into why even a 0.05% increase
    in accuracy can result in a jump of 500 ranks on the Kaggle leaderboard.

    Read More: Pandas Reference (crosstab), http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.crosstab.html

     

    #7 – Merge DataFrames

    Merging dataframes becomes essential when we have information coming from
    different sources to be collated. Consider a hypothetical case where the
    average property rates (INR per sq meter) are available for different property
    types. Let's define a dataframe as:

    prop_rates = pd.DataFrame([1000, 5000, 12000], index=['Rural','Semiurban','Urban'],columns=['rates'])
    prop_rates

    Now we can merge this information with the original dataframe as:

    data_merged = data.merge(right=prop_rates, how='inner',left_on='Property_Area',right_index=True, sort=False)
    data_merged.pivot_table(values='Credit_History',index=['Property_Area','rates'], aggfunc=len)

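    The pivot table validates the successful merge operation. Note that the
    'values' argument is irrelevant here because we are simply counting the values.

    Read More: Pandas Reference (merge), http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html#pandas.DataFrame.merge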
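    A minimal sketch of the same merge on made-up data: the frame `toy`, its index
    labels and the rate values are placeholders for illustration. The point is how
    a column on the left (`left_on`) is matched against the index on the right
    (`right_index=True`).

```python
import pandas as pd

# Hypothetical applicants with a property type each.
toy = pd.DataFrame(
    {"Property_Area": ["Urban", "Rural", "Semiurban", "Urban"]},
    index=["LP1", "LP2", "LP3", "LP4"],
)

# Lookup table keyed by property type (illustrative rates only).
prop_rates = pd.DataFrame(
    {"rates": [1000, 5000, 12000]},
    index=["Rural", "Semiurban", "Urban"],
)

# Match toy["Property_Area"] against the index of prop_rates.
merged = toy.merge(
    prop_rates,
    how="inner",
    left_on="Property_Area",
    right_index=True,
    sort=False,
)

# Sanity check: every property type should map to exactly one rate.
check = merged.groupby(["Property_Area", "rates"]).size()

print(merged)
print(check)
```

    With `how="inner"`, rows whose property type is absent from the lookup table
    would be dropped, so comparing `len(merged)` with `len(toy)` is a quick way to
    spot unmatched keys.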

     

    #8 – Sorting DataFrames

    Pandas allows easy sorting based on multiple columns. This can be done as:

    data_sorted = data.sort_values(['ApplicantIncome','CoapplicantIncome'], ascending=False)
    data_sorted[['ApplicantIncome','CoapplicantIncome']].head(10)

    Note: Pandas “sort” function is now deprecated. We should use “sort_values”
    instead.

    Read More: Pandas Reference (sort_values)
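    As a small illustration on made-up incomes, the sketch below shows how ties in
    the first column are broken by the second, and how `ascending` can also take
    one flag per column. The frame `toy` and the result names are hypothetical.

```python
import pandas as pd

# Hypothetical incomes; two applicants tie on ApplicantIncome.
toy = pd.DataFrame(
    {
        "ApplicantIncome": [5000, 8000, 8000, 3000],
        "CoapplicantIncome": [0, 1500, 2500, 1000],
    }
)

# Both columns descending: ties on the first column fall back to the second.
both_desc = toy.sort_values(["ApplicantIncome", "CoapplicantIncome"], ascending=False)

# One flag per column: first descending, second ascending.
mixed = toy.sort_values(
    ["ApplicantIncome", "CoapplicantIncome"], ascending=[False, True]
)

print(both_desc)
print(mixed)
```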

     

    #9 – Plotting (Boxplot & Histogram)


    data.hist(column="ApplicantIncome",by="Loan_Status",bins=30)

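    This shows that income is not a big deciding factor on its own, as there is no
    appreciable difference between the people who received and were denied the
    loan.

    Read More: Pandas Reference (hist), http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html#pandas.DataFrame.hist
    Read More: Pandas Reference (boxplot), http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.boxplot.html#pandas.DataFrame.boxplot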

     

    #10 – Cut function for binning

    Sometimes numerical values make more sense if clustered together. For example,
    suppose we're trying to model traffic (#cars on road) with time of the day
    (minutes). The exact minute of an hour might not be that relevant for
    predicting traffic as compared to the actual period of the day like “Morning”,
    “Afternoon”, “Evening”, “Night”, “Late Night”. Modeling traffic this way will
    be more intuitive and will avoid overfitting.


      # Binning using cut function of pandas
      colBin = pd.cut(col,bins=break_points,right=False,labels=labels,include_lowest=True)
      return colBin

    # Binning LoanAmount:
    cut_points = [90,140,190]
    labels = ["low","medium","high","very high"]
    data["LoanAmount_Bin"] = binning(data["LoanAmount"], cut_points, labels)
    print(data["LoanAmount_Bin"].value_counts(sort=False))
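    Because only the tail of the binning function survives in this text, here is a
    hedged, self-contained sketch of such a helper. It assumes the break points
    are the column minimum, the cut points and the column maximum. Unlike the
    fragment above, it keeps pandas' default right-closed intervals, so that the
    maximum value lands in the last bin instead of falling outside it. The sample
    series and cut points are made up.

```python
import pandas as pd

def binning(col, cut_points, labels=None):
    # Assumed break points: column minimum, the cut points, column maximum.
    break_points = [col.min()] + list(cut_points) + [col.max()]
    # Fall back to integer codes 0..n when no labels are given.
    if labels is None:
        labels = list(range(len(cut_points) + 1))
    # Right-closed bins (pandas default); include_lowest=True keeps the minimum.
    return pd.cut(col, bins=break_points, labels=labels, include_lowest=True)

# Hypothetical loan amounts and cut points.
loan_amount = pd.Series([50, 90, 120, 140, 170, 190, 300])
labels = ["low", "medium", "high", "very high"]

binned = binning(loan_amount, [90, 140, 190], labels)
counts = binned.value_counts(sort=False)

print(binned)
print(counts)
```

    Three cut points produce four bins, which is why four labels are needed.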

    Read More: Pandas Reference (cut), http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.cut.html

     

    #11 – Coding nominal data

    Often, we find a case where we have to modify the categories of a nominal
    variable. This can be due to various reasons:

    1. Some algorithms (like Logistic Regression) require all inputs to be numeric.
       So nominal variables are mostly coded as 0, 1….(n-1)
    2. Sometimes a category might be represented in 2 ways. For e.g. temperature
       might be recorded as “Medium”, “Low”, “


    print(data["Loan_Status_Coded"].value_counts())

    Similar counts before and after prove the coding.
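    The coding step itself is not part of this text, so the following is only a
    sketch of how such a recoding could look with `Series.replace` and a
    dictionary. The series `loan_status` and the mapping `code_map` are
    assumptions for illustration.

```python
import pandas as pd

# Hypothetical nominal column.
loan_status = pd.Series(["Y", "N", "Y", "Y", "N"], name="Loan_Status")

# Assumed mapping from category to numeric code.
code_map = {"N": 0, "Y": 1}

# replace() swaps every key found in the Series for its mapped value.
coded = loan_status.replace(code_map)

# Counts per category before and after should line up one-to-one.
before = loan_status.value_counts().sort_index()
after = coded.value_counts().sort_index()

print(before)
print(after)
```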

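    Read More: Pandas Reference (replace), http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html#pandas.DataFrame.replace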

     

    #12 – Iterating over rows of a dataframe

    This is not a frequently used operation. Still, you don't want to get stuck.
    Right? At times you may need to iterate through all rows using a for loop. For
    instance, one common problem we face is the incorrect treatment of variables
    in Python. This generally happens when:

    1. Nominal variables with numeric categories are treated as numerical.
    2. Numeric variables with characters entered in one of the rows (due to a data
       error) are considered categorical.

    So it's generally a good idea to manually define the column types. If we check
    the data types of all columns:

    # Check current type:
    data.dtypes


    # Note: astype is used to assign types
    # (assumes colTypes has a 'feature' column with the column name and a
    #  'type' column holding "categorical" or "continuous")
    for i, row in colTypes.iterrows():  # i: dataframe index; row: each row in series format
      if row['type']=="categorical":
        data[row['feature']]=data[row['feature']].astype(object)
      elif row['type']=="continuous":
        data[row['feature']]=data[row['feature']].astype(float)

    print(data.dtypes)

    Now the credit history column is modified to 'object' type, which is used for
    representing nominal variables in Pandas.
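    A self-contained sketch of the type-assignment loop: the lookup table
    `col_types` and its two column names (`feature`, `type`) are assumptions made
    for illustration, as is the frame `toy`. The built-in `object` and `float`
    are used because the `np.object` and `np.float` aliases no longer exist in
    recent NumPy releases.

```python
import pandas as pd

# Hypothetical data: a numeric-looking nominal column and a numeric column
# that was read in as text.
toy = pd.DataFrame(
    {"Credit_History": [1, 0, 1], "LoanAmount": ["120", "66", "141"]}
)

# Assumed lookup table: one row per column, with the intended kind.
col_types = pd.DataFrame(
    {
        "feature": ["Credit_History", "LoanAmount"],
        "type": ["categorical", "continuous"],
    }
)

# iterrows() yields (index, row) pairs; each row is a Series.
for _, row in col_types.iterrows():
    if row["type"] == "categorical":
        toy[row["feature"]] = toy[row["feature"]].astype(object)
    elif row["type"] == "continuous":
        toy[row["feature"]] = toy[row["feature"]].astype(float)

print(toy.dtypes)
```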

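    Read More: Pandas Reference (iterrows), http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html#pandas.DataFrame.iterrows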

     

    End Notes

    In this article, we covered various functions of Pandas which can make our
    life easy while performing data exploration and feature engineering. Also, we
    defined some generic functions which can be reused for achieving similar
    objectives on different datasets.

    Also See: If you have any doubts pertaining to Pandas or Python in general,
    feel free to discuss with us.

    Did you find the article useful? Do you use some better (easier/faster)
    techniques for performing the tasks discussed above? Do you think there are
    better alternatives to Pandas in Python? We'll be glad if you share your
    thoughts as comments below.

    http://discuss.analyticsvidhya.com/