Transcript of DSC 201: Data Analysis & Visualization, Lecture 18: Data Merging (dkoop/dsc201-2018fa/lectures/lecture18.pdf)

Page 1:

DSC 201: Data Analysis & Visualization

Data Merging
Dr. David Koop

D. Koop, DSC 201, Fall 2018

Page 2:

Data Cleaning


Page 3:

FINDINGS

…we got about the future of data science, the most salient takeaway was how excited our respondents were about the evolution of the field. They cited things in their own practice, how they saw their jobs getting more interesting and less repetitive, all while expressing a real and broad enthusiasm about the value of the work in their organization.

As data science becomes more commonplace and simultaneously a bit demystified, we expect this trend to continue as well. After all, last year's respondents were just as excited about their work (about 79% were "satisfied" or better).

How a Data Scientist Spends Their Day

Here's where the popular view of data scientists diverges pretty significantly from reality. Generally, we think of data scientists building algorithms, exploring data, and doing predictive analysis. That's actually not what they spend most of their time doing, however.

As you can see from the chart above, 3 out of every 5 data scientists we surveyed actually spend the most time cleaning and organizing data. You may have heard this referred to as "data wrangling" or compared to digital janitor work. Everything from list verification to removing commas to debugging databases: that time adds up, and it adds up immensely. Messy data is by far the most time-consuming aspect of the typical data scientist's workflow. And nearly 60% said they simply spent too much time doing it.

[Chart: What data scientists spend the most time doing]
- Cleaning and organizing data: 60%
- Collecting data sets: 19%
- Mining data for patterns: 9%
- Refining algorithms: 4%
- Building training sets: 3%
- Other: 5%

[Chart: Data scientist job satisfaction]

This takes a lot of time!


[CrowdFlower Data Science Report, 2016]

Page 4:

Data Cleaning Operations
• Fill missing data
• Drop missing data
• Remove duplicates
• Modify data (map strings, arithmetic expressions)
• Replace values (e.g. -999 → NaN)
• Clamping values
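These operations map onto pandas calls like fillna, dropna, drop_duplicates, map, replace, and clip. A minimal sketch, assuming a made-up DataFrame with a -999 sentinel value (the column names and data are not from the slides):

import numpy as np
import pandas as pd

# toy data with a -999 sentinel, a missing value, and a duplicate row
df = pd.DataFrame({"temp": [72.0, -999.0, 68.0, np.nan, 68.0],
                   "city": ["Boston", "Boston", "cleveland", "Boston", "cleveland"]})

df["temp"] = df["temp"].replace(-999, np.nan)    # replace sentinel values with NaN
filled = df.fillna({"temp": df["temp"].mean()})  # fill missing data
dropped = df.dropna()                            # or drop rows with missing data
deduped = df.drop_duplicates()                   # remove duplicate rows
df["city"] = df["city"].map(str.title)           # modify data (map strings)
df["temp"] = df["temp"].clip(lower=0, upper=120) # clamp values to a range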


Page 5:

Computing Indicator Values
• Useful for machine learning
• Want to take possible values and map them to 0-1 indicators
• Example: Genres in movies



Computing Indicator/Dummy Variables

Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a "dummy" or "indicator" matrix. If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns containing all 1s and 0s. pandas has a get_dummies function for doing this, though devising one yourself is not difficult. Let's return to an earlier example DataFrame:

In [109]: df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
   .....:                    'data1': range(6)})

In [110]: pd.get_dummies(df['key'])
Out[110]:
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0

In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data. get_dummies has a prefix argument for doing this:

In [111]: dummies = pd.get_dummies(df['key'], prefix='key')

In [112]: df_with_dummy = df[['data1']].join(dummies)

In [113]: df_with_dummy
Out[113]:
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0

If a row in a DataFrame belongs to multiple categories, things are a bit more complicated. Let's look at the MovieLens 1M dataset, which is investigated in more detail in Chapter 14:


Page 6:

String Transformation
• One of the reasons for Python's popularity is string/text processing
• split(<delimiter>): break a string into pieces:
  - s = "12,13, 14"
    slist = s.split(',')  # ["12", "13", " 14"]
• <delimiter>.join([<str>]): join several strings by a delimiter
  - ":".join(slist)  # "12:13: 14"
• strip(): remove leading and trailing whitespace
  - [p.strip() for p in slist]  # ["12", "13", "14"]
• replace(<from>, <to>): change substrings to another substring
  - s.replace(',', ':')  # "12:13: 14"
• upper()/lower(): casing
  - "AbCd".upper()  # "ABCD"
  - "AbCd".lower()  # "abcd"


Page 7:

Regular Expressions
• AKA regex
• A syntax to better specify how to decompose strings
• Look for patterns rather than specific characters
• "31" in "The last day of December is 12/31/2016."
• May work for some questions, but now suppose I have other lines like: "The last day of September is 9/30/2016."
• …and I want to find dates that look like: <numbers>/<numbers>/<numbers>
• Cannot search for every combination!
• \d+/\d+/\d+


Page 8:

Regular Expressions in Python
• import re
• re.search(<pattern>, <str_to_check>)
  - Returns None if no match, information about the match otherwise
• Capturing information about what is in a string → parentheses
• (\d+)/\d+/\d+ will capture information about the month
• match = re.search('(\d+)/\d+/\d+', '12/31/2016')
  if match:
      match.group(1)  # '12'
• re.findall(<pattern>, <str_to_check>)
  - Finds all matches in the string; search only finds the first match
• Can pass in flags to alter methods: e.g. re.IGNORECASE
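A small self-contained sketch tying these pieces together; the sentence being searched comes from the previous slide, everything else is illustrative:

import re

text = ("The last day of December is 12/31/2016. "
        "The last day of September is 9/30/2016.")

# search finds only the first date; group(1) is the captured month
match = re.search(r'(\d+)/\d+/\d+', text)
if match:
    print(match.group())   # '12/31/2016' (the whole match)
    print(match.group(1))  # '12' (the captured month)

# findall returns every match; with a capture group it returns the captured parts
print(re.findall(r'\d+/\d+/\d+', text))    # ['12/31/2016', '9/30/2016']
print(re.findall(r'(\d+)/\d+/\d+', text))  # ['12', '9']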


Page 9:

Pandas String Methods
• Any column or series can have the string methods (e.g. replace, split) applied to the entire series
• Fast (vectorized) on whole columns or datasets
• Use .str.<method_name>
• .str is important!
  - data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                      'Rob': 'rob@gmail.com', 'Wes': np.nan})
    data.str.contains('gmail')
    data.str.split('@').str[1]
    data.str[-3:]


Page 10:

Assignment 4
• http://www.cis.umassd.edu/~dkoop/dsc201-2018fa/assignment4.html
• Clean and transform hurdat2.txt into hurdat2.csv


Page 11:

Quiz 2
• Next Tuesday
• Same format as Quiz 1: multiple choice and short answer
• Quiz at the beginning of the class
• Focus on material since the midterm
• Pandas, reading/writing data, data cleaning


Page 12:

Databases
• Databases:
  - Have been around for years
  - Organize data by tables, allow powerful queries
  - Most support concurrency: allowing multiple users to work with the database at once
  - Provide many features to ensure data integrity, security
• Database Management Systems (DBMS): software that manages databases and facilitates adding, updating, and removing data as well as queries over the data
• Main language used to interact with databases: Structured Query Language (SQL)
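As a rough illustration of talking to a DBMS with SQL from Python, here is a sketch using the standard-library sqlite3 module; the choice of SQLite and the team table are my own, not something named in the lecture:

import sqlite3

# in-memory SQLite database for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE team (id INTEGER PRIMARY KEY, name TEXT, wins INTEGER, losses INTEGER)")
conn.execute("INSERT INTO team VALUES (1, 'Boston', 11, 5), (2, 'Cleveland', 2, 14)")

# a simple SQL query over the table
for row in conn.execute("SELECT name, wins FROM team WHERE wins > 5"):
    print(row)   # ('Boston', 11)
conn.close()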


Page 13:

Relational Databases
• A specific model for databases [Codd, 1969]
• Extremely popular, supported by most major DBMSs (IBM DB2, SQL Server, MySQL, etc.)
• Consists of relations (tables) made up of tuples (rows)
• Relations reference each other!
  - Types of relationships: one-to-one, many-to-one, many-to-many
• Each tuple has a key; to reference a tuple in another relation, use a foreign key in the current relation


Page 14:

Example: Football Game Data
• Data about football games, teams, and players
  - Game is between two Teams
  - Each Team has Players
• For each game, we could specify every player and all of their information… why is this bad?


Page 15:

Example: Football Game Data
• Data about football games, teams, and players
  - Game is between two Teams
  - Each Team has Players
• For each game, we could specify every player and all of their information… why is this bad?
• Normalization: reduce redundancy, keep information that doesn't change separate
• 3 Relations: Team, Player, Game
• Each relation only encodes the data specific to what it represents


Player: Id, Name, Height, Weight
Team: Id, Name, Wins, Losses
Game: Id, Location, Date

Page 16:

Example: Football Game Data
• Have each game store the id of the home team and the id of the away team (one-to-one)
• Have each player store the id of the team he plays on (many-to-one)
• What happens if a player plays on 2+ teams?


Player: Id, Name, Height, Weight, TeamId
Team: Id, Name, Wins, Losses
Game: Id, Location, Date, Home, Away

Page 17:

How does this relate to pandas?
• DataFrames in pandas are ~relations (tables)
• We may wish to normalize data in a similar manner in pandas
• However, operating on 2+ DataFrames at the same time can be unwieldy; can we merge them together?
• Two potential operations:
  - Have football game data (just the Game table) from 2013, 2014, and 2015 and wish to merge the data into one data frame
  - Have football game data and wish to find the average temperature of the cities where the games were played


Page 18:

Concatenation
• Take two data frames with the same columns and add more rows
• pd.concat([data-frame-1, data-frame-2, …])
• Default is to add rows (axis=0), but can also add columns (axis=1)
• Can also concatenate Series into a data frame
• concat preserves the index, so this can be confusing if you have two default indices (0, 1, 2, 3, …): they will appear twice
  - Use ignore_index=True to get a 0, 1, 2, …
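A minimal sketch of these concat options, using made-up frames:

import pandas as pd

games_2014 = pd.DataFrame({"Location": ["Boston", "Cleveland"], "Home": [1, 12]})
games_2015 = pd.DataFrame({"Location": ["San Diego", "Boston"], "Home": [21, 1]})

# default: stack rows; both frames have default indices 0,1, so labels repeat
stacked = pd.concat([games_2014, games_2015])
print(stacked.index.tolist())   # [0, 1, 0, 1]

# ignore_index=True produces a fresh 0,1,2,... index
stacked = pd.concat([games_2014, games_2015], ignore_index=True)
print(stacked.index.tolist())   # [0, 1, 2, 3]

# axis=1 adds columns instead of rows
side_by_side = pd.concat([games_2014, games_2015], axis=1)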


Page 19:

Merges (aka Joins)
• Need to merge data from one DataFrame with data from another DataFrame
• Example: Football game data merged with temperature data


Game
Id  Location   Date  Home  Away
0   Boston     9/2   1     15
1   Boston     9/9   1     7
2   Cleveland  9/16  12    1
3   San Diego  9/23  21    1

Weather
wId  City       Date  Temp
0    Boston     9/2   72
1    Boston     9/3   68
…    …          …     …
7    Boston     9/9   75
…    …          …     …
21   Boston     9/23  54
…    …          …     …
36   Cleveland  9/16  81

No data for San Diego

Page 20:

Merges (aka Joins)
• Want to join the two tables based on the location and date
• Location and date are the keys for the join
• What happens when we have missing data?
• Merges are ordered: there is a left and a right side
• Four types of joins (see the sketch after this list):
  - Inner: intersection of keys (match on both sides)
  - Outer: union of keys (if there is no match on the other side, still include with NaN to indicate missing data)
  - Left: always have rows from the left table (no unmatched right data)
  - Right: like left, but with no unmatched left data
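A sketch of the four strategies on simplified versions of the Game and Weather tables above; the variable names are mine, and I call the weather city column Location (the slide's Weather table calls it City) so a shared on= key can be used:

import pandas as pd

game = pd.DataFrame({"Location": ["Boston", "Boston", "Cleveland", "San Diego"],
                     "Date": ["9/2", "9/9", "9/16", "9/23"],
                     "Home": [1, 1, 12, 21],
                     "Away": [15, 7, 1, 1]})
weather = pd.DataFrame({"Location": ["Boston", "Boston", "Boston", "Cleveland"],
                        "Date": ["9/2", "9/9", "9/3", "9/16"],
                        "Temp": [72, 75, 68, 81]})

keys = ["Location", "Date"]
inner = pd.merge(game, weather, on=keys, how="inner")  # only matched keys; San Diego drops out
outer = pd.merge(game, weather, on=keys, how="outer")  # union of keys; unmatched rows padded with NaN
left  = pd.merge(game, weather, on=keys, how="left")   # every game kept; San Diego gets Temp = NaN
right = pd.merge(game, weather, on=keys, how="right")  # every weather row kept; San Diego game dropped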


Page 21:

Inner Strategy


Merged
Id  Location   Date  Home  Away  Temp  wId
0   Boston     9/2   1     15    72    0
1   Boston     9/9   1     7     75    7
2   Cleveland  9/16  12    1     81    36

No San Diego entry

Page 22:

Outer Strategy


Merged
Id   Location   Date  Home  Away  Temp  wId
0    Boston     9/2   1     15    72    0
NaN  Boston     9/3   NaN   NaN   68    1
…    …          …     …     …     …     …
1    Boston     9/9   1     7     75    7
NaN  Boston     9/10  NaN   NaN   76    8
…    …          …     …     …     …     …
NaN  Cleveland  9/2   NaN   NaN   61    22
…    …          …     …     …     …     …
2    Cleveland  9/16  12    1     81    36
…    …          …     …     …     …     …
3    San Diego  9/23  21    1     NaN   NaN

Page 23:

Left Strategy


Merged
Id  Location   Date  Home  Away  Temp  wId
0   Boston     9/2   1     15    72    0
1   Boston     9/9   1     7     75    7
2   Cleveland  9/16  12    1     81    36
3   San Diego  9/23  21    1     NaN   NaN

Page 24:

Right Strategy


Merged
Id   Location   Date  Home  Away  Temp  wId
0    Boston     9/2   1     15    72    0
NaN  Boston     9/3   NaN   NaN   68    1
…    …          …     …     …     …     …
1    Boston     9/9   1     7     75    7
NaN  Boston     9/10  NaN   NaN   76    8
…    …          …     …     …     …     …
NaN  Cleveland  9/2   NaN   NaN   61    22
…    …          …     …     …     …     …
2    Cleveland  9/16  12    1     81    36
…    …          …     …     …     …     …

No San Diego entry

Page 25:

Hierarchical Indexing (Multiple Keys)
• We might have multiple keys to identify a single piece of data
• Example: Football records for each team over multiple years
  - Identify a specific row by both the team name and the year
  - Can think about this as a tuple (<team_name>, <year>)
• pandas supports this via hierarchical indexing (MultiIndex)
• Display mirrors the hierarchical nature of the data


Page 26:

Example

data = [{"W": 11, "L": 5}, {"W": 6, "L": 10},
        {"W": 12, "L": 4}, {"W": 8, "L": 8},
        {"W": 2, "L": 14}]
index = [["Boston", "Boston", "San Diego", "San Diego", "Cleveland"],
         [2007, 2008, 2007, 2008, 2007]]
df = pd.DataFrame(data, columns=["W", "L"],
                  index=pd.MultiIndex.from_arrays(index, names=('Team', 'Year')))


                 W   L
Team      Year
Boston    2007  11   5
          2008   6  10
San Diego 2007  12   4
          2008   8   8
Cleveland 2007   2  14

Page 27:

How do we access a row? Or a slice?


Page 28:

MultiIndex Row Access and Slicing
• df.loc["Boston", 2007]
• Remember that loc uses the index values, iloc uses integers
• Note: df.iloc[0] gets the first row, not df.iloc[0,0]
• Can get a subset of the data using partial indices
  - df.loc["Boston"] returns both 2007 and 2008 data
• What about slicing? (see the sketch after this list)
  - df.loc["Boston":"Cleveland"] → ERROR!
  - Need to have sorted data for this to make sense
  - df = df.sort_index()
  - df.loc["Boston":"Cleveland"] → inclusive!
  - df.loc[(slice("Boston", "Cleveland"), 2007), :]
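A runnable sketch of these access patterns, reusing the DataFrame constructed on the earlier Example slide; the comments describe the expected results:

import pandas as pd

data = [{"W": 11, "L": 5}, {"W": 6, "L": 10}, {"W": 12, "L": 4},
        {"W": 8, "L": 8}, {"W": 2, "L": 14}]
index = [["Boston", "Boston", "San Diego", "San Diego", "Cleveland"],
         [2007, 2008, 2007, 2008, 2007]]
df = pd.DataFrame(data, columns=["W", "L"],
                  index=pd.MultiIndex.from_arrays(index, names=('Team', 'Year')))

df.loc["Boston", 2007]   # single row: W=11, L=5
df.loc["Boston"]         # partial index: both the 2007 and 2008 rows

df = df.sort_index()     # slicing an unsorted MultiIndex raises an error
df.loc["Boston":"Cleveland"]                     # inclusive label slice over Team
df.loc[(slice("Boston", "Cleveland"), 2007), :]  # also restrict Year to 2007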


Page 29:

Reorganizing the MultiIndex
• swaplevel: switch the order of the levels
  - df = df.swaplevel("Year", "Team")
  - df.sort_index()
• Can do summary statistics by level
  - df.sum(level="Team")
• Reset the index (back to numbers)
  - df.reset_index()
• Promote columns to be the indices
  - df.set_index(["Team", "Year"])


                 W   L
Year Team
2007 Boston     11   5
     Cleveland   2  14
     San Diego  12   4
2008 Boston      6  10
     San Diego   8   8

Page 30:

Data Merging in Pandas
• pd.merge(left, right, …)
• Default merge: join on matching column names
• Better: specify the column name(s) to join on via on kwarg
  - If column names differ, use left_on and right_on
  - Multiple keys: use a list
• how kwarg specifies the type of join ("inner", "outer", "left", "right")
• Can add suffixes to column names when they appear in both tables but are not being joined on (see the sketch after this list)
• Can also merge using the index by setting left_index or right_index to True
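A brief sketch of the on and suffixes keyword arguments with made-up frames:

import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "data": [1, 2]})
right = pd.DataFrame({"key": ["a", "b"], "data": [10, 20]})

# 'data' appears in both frames but is not a join key, so it gets suffixed
merged = pd.merge(left, right, on="key", suffixes=("_left", "_right"))
print(merged.columns.tolist())   # ['key', 'data_left', 'data_right']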


Page 31:

Merge Arguments

Table 8-2. merge function arguments

left: DataFrame to be merged on the left side.
right: DataFrame to be merged on the right side.
how: One of 'inner', 'outer', 'left', or 'right'; defaults to 'inner'.
on: Column names to join on. Must be found in both DataFrame objects. If not specified and no other join keys are given, will use the intersection of the column names in left and right as the join keys.
left_on: Columns in left DataFrame to use as join keys.
right_on: Analogous to left_on for the right DataFrame.
left_index: Use row index in left as its join key (or keys, if a MultiIndex).
right_index: Analogous to left_index.
sort: Sort merged data lexicographically by join keys; True by default (disable to get better performance in some cases on large datasets).
suffixes: Tuple of string values to append to column names in case of overlap; defaults to ('_x', '_y') (e.g., if 'data' appears in both DataFrame objects, it would appear as 'data_x' and 'data_y' in the result).
copy: If False, avoid copying data into the resulting data structure in some exceptional cases; by default always copies.
indicator: Adds a special column _merge that indicates the source of each row; values will be 'left_only', 'right_only', or 'both' based on the origin of the joined data in each row.

Merging on Index

In some cases, the merge key(s) in a DataFrame will be found in its index. In this case, you can pass left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key:

In [56]: left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
   ....:                       'value': range(6)})

In [57]: right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

In [58]: left1
Out[58]:
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5

In [59]: right1
Out[59]:
   group_val
a        3.5
b        7.0
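The excerpt ends before the merge itself; as a sketch of my own (not copied from the book), joining left1's 'key' column against right1's index could look like this:

import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# join the 'key' column of left1 against the index of right1;
# the default inner join drops the unmatched 'c' row,
# while how='left' would keep it with NaN for group_val
merged = pd.merge(left1, right1, left_on='key', right_index=True)
print(merged)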




[W. McKinney, Python for Data Analysis]