Transcript of DSC 201: Data Analysis & Visualization, Lecture 18: Data Merging (dkoop/dsc201-2018fa/lectures/lecture18.pdf)

Page 1:

DSC 201: Data Analysis & Visualization

Data Merging
Dr. David Koop

D. Koop, DSC 201, Fall 2018

Page 2:

Data Cleaning


Page 3:

FINDINGS

…we got about the future of data science, the most salient takeaway was how excited our respondents were about the evolution of the field. They cited things in their own practice, how they saw their jobs getting more interesting and less repetitive, all while expressing a real and broad enthusiasm about the value of the work in their organization.

As data science becomes more commonplace and simultaneously a bit demystified, we expect this trend to continue as well. After all, last year's respondents were just as excited about their work (about 79% were "satisfied" or better).

How a Data Scientist Spends Their Day

Here's where the popular view of data scientists diverges pretty significantly from reality. Generally, we think of data scientists building algorithms, exploring data, and doing predictive analysis. That's actually not what they spend most of their time doing, however.

As you can see from the chart above, 3 out of every 5 data scientists we surveyed actually spend the most time cleaning and organizing data. You may have heard this referred to as "data wrangling" or compared to digital janitor work. Everything from list verification to removing commas to debugging databases: that time adds up, and it adds up immensely. Messy data is by far the most time-consuming aspect of the typical data scientist's workflow. And nearly 60% said they simply spent too much time doing it.

[Chart: What data scientists spend the most time doing]
- Cleaning and organizing data: 60%
- Collecting data sets: 19%
- Mining data for patterns: 9%
- Refining algorithms: 4%
- Building training sets: 3%
- Other: 5%

[Chart: Data scientist job satisfaction]

This takes a lot of time!


[CrowdFlower Data Science Report, 2016]

Page 4:

Data Cleaning Operations
• Fill missing data
• Drop missing data
• Remove duplicates
• Modify data (map strings, arithmetic expressions)
• Replace values (e.g. -999 → NaN)
• Clamping values
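These operations map onto pandas calls like fillna, dropna, drop_duplicates, map, replace, and clip. A minimal sketch, assuming a made-up DataFrame with a -999 sentinel value (the column names and data are not from the slides):

import numpy as np
import pandas as pd

# toy data with a -999 sentinel, a missing value, and a duplicate row
df = pd.DataFrame({"temp": [72.0, -999.0, 68.0, np.nan, 68.0],
                   "city": ["Boston", "Boston", "cleveland", "Boston", "cleveland"]})

df["temp"] = df["temp"].replace(-999, np.nan)    # replace sentinel values with NaN
filled = df.fillna({"temp": df["temp"].mean()})  # fill missing data
dropped = df.dropna()                            # or drop rows with missing data
deduped = df.drop_duplicates()                   # remove duplicate rows
df["city"] = df["city"].map(str.title)           # modify data (map strings)
df["temp"] = df["temp"].clip(lower=0, upper=120) # clamp values to a range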


Page 5:

Computing Indicator Values
• Useful for machine learning
• Want to take possible values and map them to 0-1 indicators
• Example: Genres in movies



Computing Indicator/Dummy Variables

Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a "dummy" or "indicator" matrix. If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns containing all 1s and 0s. pandas has a get_dummies function for doing this, though devising one yourself is not difficult. Let's return to an earlier example DataFrame:

In [109]: df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
   .....:                    'data1': range(6)})

In [110]: pd.get_dummies(df['key'])
Out[110]:
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0

In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data. get_dummies has a prefix argument for doing this:

In [111]: dummies = pd.get_dummies(df['key'], prefix='key')

In [112]: df_with_dummy = df[['data1']].join(dummies)

In [113]: df_with_dummy
Out[113]:
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0

If a row in a DataFrame belongs to multiple categories, things are a bit more complicated. Let's look at the MovieLens 1M dataset, which is investigated in more detail in Chapter 14:


Page 6:

String Transformation
• One of the reasons for Python's popularity is string/text processing
• split(<delimiter>): break a string into pieces:
  - s = "12,13, 14"
    slist = s.split(',')  # ["12", "13", " 14"]
• <delimiter>.join([<str>]): join several strings by a delimiter
  - ":".join(slist)  # "12:13: 14"
• strip(): remove leading and trailing whitespace
  - [p.strip() for p in slist]  # ["12", "13", "14"]
• replace(<from>, <to>): change substrings to another substring
  - s.replace(',', ':')  # "12:13: 14"
• upper()/lower(): casing
  - "AbCd".upper()  # "ABCD"
  - "AbCd".lower()  # "abcd"


Page 7:

Regular Expressions
• AKA regex
• A syntax to better specify how to decompose strings
• Look for patterns rather than specific characters
• "31" in "The last day of December is 12/31/2016."
• May work for some questions, but now suppose I have other lines like: "The last day of September is 9/30/2016."
• …and I want to find dates that look like: <numbers>/<numbers>/<numbers>
• Cannot search for every combination!
• \d+/\d+/\d+


Page 8:

Regular Expressions in Python
• import re
• re.search(<pattern>, <str_to_check>)
  - Returns None if no match, information about the match otherwise
• Capturing information about what is in a string → parentheses
• (\d+)/\d+/\d+ will capture information about the month
• match = re.search('(\d+)/\d+/\d+', '12/31/2016')
  if match:
      match.group(1)  # '12'
• re.findall(<pattern>, <str_to_check>)
  - Finds all matches in the string; search only finds the first match
• Can pass in flags to alter methods: e.g. re.IGNORECASE
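A small self-contained sketch tying these pieces together; the sentence being searched comes from the previous slide, everything else is illustrative:

import re

text = ("The last day of December is 12/31/2016. "
        "The last day of September is 9/30/2016.")

# search finds only the first date; group(1) is the captured month
match = re.search(r'(\d+)/\d+/\d+', text)
if match:
    print(match.group())   # '12/31/2016' (the whole match)
    print(match.group(1))  # '12' (the captured month)

# findall returns every match; with a capture group it returns the captured parts
print(re.findall(r'\d+/\d+/\d+', text))    # ['12/31/2016', '9/30/2016']
print(re.findall(r'(\d+)/\d+/\d+', text))  # ['12', '9']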


Page 9:

Pandas String Methods
• Any column or series can have the string methods (e.g. replace, split) applied to the entire series
• Fast (vectorized) on whole columns or datasets
• Use .str.<method_name>
• .str is important!
  - data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                      'Rob': 'rob@gmail.com', 'Wes': np.nan})
    data.str.contains('gmail')
    data.str.split('@').str[1]
    data.str[-3:]


Page 10:

Assignment 4
• http://www.cis.umassd.edu/~dkoop/dsc201-2018fa/assignment4.html
• Clean and transform hurdat2.txt into hurdat2.csv


Page 11:

Quiz 2
• Next Tuesday
• Same format as Quiz 1: multiple choice and short answer
• Quiz at the beginning of the class
• Focus on material since the midterm
• Pandas, reading/writing data, data cleaning


Page 12:

Databases
• Databases:
  - Have been around for years
  - Organize data by tables, allow powerful queries
  - Most support concurrency: allowing multiple users to work with the database at once
  - Provide many features to ensure data integrity, security
• Database Management Systems (DBMS): software that manages databases and facilitates adding, updating, and removing data as well as queries over the data
• Main language used to interact with databases: Structured Query Language (SQL)
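As a rough illustration of talking to a DBMS with SQL from Python, here is a sketch using the standard-library sqlite3 module; the choice of SQLite and the team table are my own, not something named in the lecture:

import sqlite3

# in-memory SQLite database for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE team (id INTEGER PRIMARY KEY, name TEXT, wins INTEGER, losses INTEGER)")
conn.execute("INSERT INTO team VALUES (1, 'Boston', 11, 5), (2, 'Cleveland', 2, 14)")

# a simple SQL query over the table
for row in conn.execute("SELECT name, wins FROM team WHERE wins > 5"):
    print(row)   # ('Boston', 11)
conn.close()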


Page 13:

Relational Databases
• A specific model for databases [Codd, 1969]
• Extremely popular, supported by most major DBMSs (IBM DB2, SQL Server, MySQL, etc.)
• Consists of relations (tables) made up of tuples (rows)
• Relations reference each other!
  - Types of relationships: one-to-one, many-to-one, many-to-many
• Each tuple has a key; to reference a tuple in another relation, use a foreign key in the current relation


Page 14:

Example: Football Game Data
• Data about football games, teams, and players
  - Game is between two Teams
  - Each Team has Players
• For each game, we could specify every player and all of their information… why is this bad?


Page 15:

Example: Football Game Data
• Data about football games, teams, and players
  - Game is between two Teams
  - Each Team has Players
• For each game, we could specify every player and all of their information… why is this bad?
• Normalization: reduce redundancy, keep information that doesn't change separate
• 3 Relations: Team, Player, Game
• Each relation only encodes the data specific to what it represents


Player: Id, Name, Height, Weight
Team: Id, Name, Wins, Losses
Game: Id, Location, Date

Page 16:

Example: Football Game Data
• Have each game store the id of the home team and the id of the away team (one-to-one)
• Have each player store the id of the team he plays on (many-to-one)
• What happens if a player plays on 2+ teams?


Player: Id, Name, Height, Weight, TeamId
Team: Id, Name, Wins, Losses
Game: Id, Location, Date, Home, Away

Page 17:

How does this relate to pandas?
• DataFrames in pandas are ~relations (tables)
• We may wish to normalize data in a similar manner in pandas
• However, operating on 2+ DataFrames at the same time can be unwieldy; can we merge them together?
• Two potential operations:
  - Have football game data (just the Game table) from 2013, 2014, and 2015 and wish to merge the data into one data frame
  - Have football game data and wish to find the average temperature of the cities where the games were played


Page 18:

Concatenation
• Take two data frames with the same columns and add more rows
• pd.concat([data-frame-1, data-frame-2, …])
• Default is to add rows (axis=0), but can also add columns (axis=1)
• Can also concatenate Series into a data frame
• concat preserves the index, so this can be confusing if you have two default indices (0, 1, 2, 3, …): they will appear twice
  - Use ignore_index=True to get a 0, 1, 2, …
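A minimal sketch of these concat options, using made-up frames:

import pandas as pd

games_2014 = pd.DataFrame({"Location": ["Boston", "Cleveland"], "Home": [1, 12]})
games_2015 = pd.DataFrame({"Location": ["San Diego", "Boston"], "Home": [21, 1]})

# default: stack rows; both frames have default indices 0,1, so labels repeat
stacked = pd.concat([games_2014, games_2015])
print(stacked.index.tolist())   # [0, 1, 0, 1]

# ignore_index=True produces a fresh 0,1,2,... index
stacked = pd.concat([games_2014, games_2015], ignore_index=True)
print(stacked.index.tolist())   # [0, 1, 2, 3]

# axis=1 adds columns instead of rows
side_by_side = pd.concat([games_2014, games_2015], axis=1)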


Page 19:

Merges (aka Joins)
• Need to merge data from one DataFrame with data from another DataFrame
• Example: Football game data merged with temperature data


Game
Id  Location   Date  Home  Away
0   Boston     9/2   1     15
1   Boston     9/9   1     7
2   Cleveland  9/16  12    1
3   San Diego  9/23  21    1

Weather
wId  City       Date  Temp
0    Boston     9/2   72
1    Boston     9/3   68
…    …          …     …
7    Boston     9/9   75
…    …          …     …
21   Boston     9/23  54
…    …          …     …
36   Cleveland  9/16  81

No data for San Diego

Page 20:

Merges (aka Joins)
• Want to join the two tables based on the location and date
• Location and date are the keys for the join
• What happens when we have missing data?
• Merges are ordered: there is a left and a right side
• Four types of joins (see the sketch after this list):
  - Inner: intersection of keys (match on both sides)
  - Outer: union of keys (if there is no match on the other side, still include with NaN to indicate missing data)
  - Left: always have rows from the left table (no unmatched right data)
  - Right: like left, but with no unmatched left data
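A sketch of the four strategies on simplified versions of the Game and Weather tables above; the variable names are mine, and I call the weather city column Location (the slide's Weather table calls it City) so a shared on= key can be used:

import pandas as pd

game = pd.DataFrame({"Location": ["Boston", "Boston", "Cleveland", "San Diego"],
                     "Date": ["9/2", "9/9", "9/16", "9/23"],
                     "Home": [1, 1, 12, 21],
                     "Away": [15, 7, 1, 1]})
weather = pd.DataFrame({"Location": ["Boston", "Boston", "Boston", "Cleveland"],
                        "Date": ["9/2", "9/9", "9/3", "9/16"],
                        "Temp": [72, 75, 68, 81]})

keys = ["Location", "Date"]
inner = pd.merge(game, weather, on=keys, how="inner")  # only matched keys; San Diego drops out
outer = pd.merge(game, weather, on=keys, how="outer")  # union of keys; unmatched rows padded with NaN
left  = pd.merge(game, weather, on=keys, how="left")   # every game kept; San Diego gets Temp = NaN
right = pd.merge(game, weather, on=keys, how="right")  # every weather row kept; San Diego game dropped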


Page 21:

Inner Strategy


Merged
Id  Location   Date  Home  Away  Temp  wId
0   Boston     9/2   1     15    72    0
1   Boston     9/9   1     7     75    7
2   Cleveland  9/16  12    1     81    36

No San Diego entry

Page 22:

Outer Strategy


Merged
Id   Location   Date  Home  Away  Temp  wId
0    Boston     9/2   1     15    72    0
NaN  Boston     9/3   NaN   NaN   68    1
…    …          …     …     …     …     …
1    Boston     9/9   1     7     75    7
NaN  Boston     9/10  NaN   NaN   76    8
…    …          …     …     …     …     …
NaN  Cleveland  9/2   NaN   NaN   61    22
…    …          …     …     …     …     …
2    Cleveland  9/16  12    1     81    36
…    …          …     …     …     …     …
3    San Diego  9/23  21    1     NaN   NaN

Page 23:

Left Strategy


Merged
Id  Location   Date  Home  Away  Temp  wId
0   Boston     9/2   1     15    72    0
1   Boston     9/9   1     7     75    7
2   Cleveland  9/16  12    1     81    36
3   San Diego  9/23  21    1     NaN   NaN

Page 24:

Right Strategy


Merged
Id   Location   Date  Home  Away  Temp  wId
0    Boston     9/2   1     15    72    0
NaN  Boston     9/3   NaN   NaN   68    1
…    …          …     …     …     …     …
1    Boston     9/9   1     7     75    7
NaN  Boston     9/10  NaN   NaN   76    8
…    …          …     …     …     …     …
NaN  Cleveland  9/2   NaN   NaN   61    22
…    …          …     …     …     …     …
2    Cleveland  9/16  12    1     81    36
…    …          …     …     …     …     …

No San Diego entry

Page 25:

Hierarchical Indexing (Multiple Keys)
• We might have multiple keys to identify a single piece of data
• Example: Football records for each team over multiple years
  - Identify a specific row by both the team name and the year
  - Can think about this as a tuple (<team_name>, <year>)
• pandas supports this via hierarchical indexing (MultiIndex)
• Display mirrors the hierarchical nature of the data


Page 26:

Example

data = [{"W": 11, "L": 5}, {"W": 6, "L": 10},
        {"W": 12, "L": 4}, {"W": 8, "L": 8},
        {"W": 2, "L": 14}]
index = [["Boston", "Boston", "San Diego", "San Diego", "Cleveland"],
         [2007, 2008, 2007, 2008, 2007]]
df = pd.DataFrame(data, columns=["W", "L"],
                  index=pd.MultiIndex.from_arrays(index, names=('Team', 'Year')))


                 W   L
Team      Year
Boston    2007  11   5
          2008   6  10
San Diego 2007  12   4
          2008   8   8
Cleveland 2007   2  14

Page 27:

How do we access a row? Or a slice?


Page 28:

MultiIndex Row Access and Slicing
• df.loc["Boston", 2007]
• Remember that loc uses the index values, iloc uses integers
• Note: df.iloc[0] gets the first row, not df.iloc[0,0]
• Can get a subset of the data using partial indices
  - df.loc["Boston"] returns both 2007 and 2008 data
• What about slicing? (see the sketch after this list)
  - df.loc["Boston":"Cleveland"] → ERROR!
  - Need to have sorted data for this to make sense
  - df = df.sort_index()
  - df.loc["Boston":"Cleveland"] → inclusive!
  - df.loc[(slice("Boston", "Cleveland"), 2007), :]
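A runnable sketch of these access patterns, reusing the DataFrame constructed on the earlier Example slide; the comments describe the expected results:

import pandas as pd

data = [{"W": 11, "L": 5}, {"W": 6, "L": 10}, {"W": 12, "L": 4},
        {"W": 8, "L": 8}, {"W": 2, "L": 14}]
index = [["Boston", "Boston", "San Diego", "San Diego", "Cleveland"],
         [2007, 2008, 2007, 2008, 2007]]
df = pd.DataFrame(data, columns=["W", "L"],
                  index=pd.MultiIndex.from_arrays(index, names=('Team', 'Year')))

df.loc["Boston", 2007]   # single row: W=11, L=5
df.loc["Boston"]         # partial index: both the 2007 and 2008 rows

df = df.sort_index()     # slicing an unsorted MultiIndex raises an error
df.loc["Boston":"Cleveland"]                     # inclusive label slice over Team
df.loc[(slice("Boston", "Cleveland"), 2007), :]  # also restrict Year to 2007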


Page 29:

Reorganizing the MultiIndex
• swaplevel: switch the order of the levels
  - df = df.swaplevel("Year", "Team")
  - df.sort_index()
• Can do summary statistics by level
  - df.sum(level="Team")
• Reset the index (back to numbers)
  - df.reset_index()
• Promote columns to be the indices
  - df.set_index(["Team", "Year"])


                 W   L
Year Team
2007 Boston     11   5
     Cleveland   2  14
     San Diego  12   4
2008 Boston      6  10
     San Diego   8   8

Page 30:

Data Merging in Pandas
• pd.merge(left, right, …)
• Default merge: join on matching column names
• Better: specify the column name(s) to join on via on kwarg
  - If column names differ, use left_on and right_on
  - Multiple keys: use a list
• how kwarg specifies the type of join ("inner", "outer", "left", "right")
• Can add suffixes to column names when they appear in both tables but are not being joined on (see the sketch after this list)
• Can also merge using the index by setting left_index or right_index to True
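A brief sketch of the on and suffixes keyword arguments with made-up frames:

import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "data": [1, 2]})
right = pd.DataFrame({"key": ["a", "b"], "data": [10, 20]})

# 'data' appears in both frames but is not a join key, so it gets suffixed
merged = pd.merge(left, right, on="key", suffixes=("_left", "_right"))
print(merged.columns.tolist())   # ['key', 'data_left', 'data_right']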


Page 31:

Merge Arguments

Table 8-2. merge function arguments

left: DataFrame to be merged on the left side.
right: DataFrame to be merged on the right side.
how: One of 'inner', 'outer', 'left', or 'right'; defaults to 'inner'.
on: Column names to join on. Must be found in both DataFrame objects. If not specified and no other join keys are given, will use the intersection of the column names in left and right as the join keys.
left_on: Columns in left DataFrame to use as join keys.
right_on: Analogous to left_on for the right DataFrame.
left_index: Use row index in left as its join key (or keys, if a MultiIndex).
right_index: Analogous to left_index.
sort: Sort merged data lexicographically by join keys; True by default (disable to get better performance in some cases on large datasets).
suffixes: Tuple of string values to append to column names in case of overlap; defaults to ('_x', '_y') (e.g., if 'data' appears in both DataFrame objects, it would appear as 'data_x' and 'data_y' in the result).
copy: If False, avoid copying data into the resulting data structure in some exceptional cases; by default always copies.
indicator: Adds a special column _merge that indicates the source of each row; values will be 'left_only', 'right_only', or 'both' based on the origin of the joined data in each row.

Merging on Index

In some cases, the merge key(s) in a DataFrame will be found in its index. In this case, you can pass left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key:

In [56]: left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
   ....:                       'value': range(6)})

In [57]: right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

In [58]: left1
Out[58]:
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5

In [59]: right1
Out[59]:
   group_val
a        3.5
b        7.0
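The excerpt ends before the merge itself; as a sketch of my own (not copied from the book), joining left1's 'key' column against right1's index could look like this:

import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# join the 'key' column of left1 against the index of right1;
# the default inner join drops the unmatched 'c' row,
# while how='left' would keep it with NaN for group_val
merged = pd.merge(left1, right1, left_on='key', right_index=True)
print(merged)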




[W. McKinney, Python for Data Analysis]