CDS 101 · spring18.cds101.com/doc/class08_slides.pdf · 2018-05-03
Class 8: Data wrangling I
February 15, 2018
These slides are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
General
Announcements
Reading for next class: R for Data Science - chapters 4 (short) and 5
Homework 1 posted on website, http://spring18.cds101.com, due Friday, February 23rd by 11:59pm
RStudio cheatsheet resource
Will post cheatsheets on website soon
What is data wrangling?
The word "wrangle"
wrangle
verb
to tend or round up (cattle, horses, or other livestock). (dictionary.com)
So, by analogy, "wrangling data" means to collect, clean, and organize digital information (tend and round up)
Also encompasses the act of transforming data as a processing step to facilitate analysis
Informal word, but data scientists will understand what you mean if you use it
Source: Digital image of a cowboy wrangling data, Digital image on likelihoodlog.com, accessed September 20, 2017, http://www.likelihoodlog.com/?p=1151
ggplot2 needs clean/tidy datasets
Datasets such as mpg or rail_trail (Assignment 1) are small and nicely organized
It would be nice if all datasets were like this! ...but they're the exceptions to the rule
Most raw datasets need cleaning, and this is where data scientists will spend most of their time
Source: Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says, Digital image on forbes.com, accessed September 20, 2017, https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/
The "data wrangling" pipeline
import → obtain data and get it into R
tidy → reshape rows and columns to follow the Tidy data rules
transform → cleaning the dataset (not the same as tidying) as well as "slicing and dicing" the dataset for exploration and analysis.
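The three stages can be sketched on a toy table. This is only an illustration under stated assumptions: the inline CSV text and the names station, jan, feb, month, and riders are made up, and the literal-text form of read_csv() with I() assumes readr 2.0 or later (normally a file path would go there).

```r
library(readr)
library(tidyr)
library(dplyr)

# import: parse CSV text into a tibble (hypothetical data; I() marks the
# string as literal data rather than a file path)
raw <- read_csv(I("station,jan,feb\nA,10,12\nB,7,NA\n"), show_col_types = FALSE)

# tidy: one row per station-month, so each variable gets its own column
tidy <- pivot_longer(raw, cols = c(jan, feb),
                     names_to = "month", values_to = "riders")

# transform: drop rows with missing counts, then sort from largest to smallest
kept  <- filter(tidy, !is.na(riders))
clean <- arrange(kept, desc(riders))

clean
```

The wide table has one column per month; after the tidy step, month is a variable like any other, which is the layout ggplot2 and dplyr expect.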
Data wrangling in R
A few bits of R history
The first stable version of R, v1.0.0, was released on February 29, 2000.
R itself is an implementation of the S programming language, which was designed at Bell Laboratories in the mid-1970s.
Base R was built for statisticians and for doing data analysis, but not necessarily for modern Data Science
Its age and legacy bring along old implementations of data structures and abbreviated function (command) names
Source: David Smith, Over 16 years of R project history, Revolutions blog, last updated on March 4, 2016, accessed September 20, 2017, http://blog.revolutionanalytics.com/2016/03/16-years-of-r-history.html
Modernizing R with tidyverse
Over the last 3 years, chief scientist at RStudio, Hadley Wickham, has brought R into the modern era with the tidyverse.
"The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying philosophy and common APIs." (Front page of the Tidyverse website)
In practice, this meant reducing everything to a small, core set of commands that all behave in a similar way.
Core tidyverse
ggplot2 : ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
dplyr : dplyr provides a grammar of data manipulation, providing a consistent set of verbs that solve the most common data manipulation challenges.
tidyr : tidyr provides a set of functions that help you get to tidy data. Tidy data is data with a consistent form: in brief, every variable goes in a column, and every column is a variable.
readr : readr provides a fast and friendly way to read rectangular data (like csv, tsv, and fwf). It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.
purrr : purrr enhances R's functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. Once you master the basic concepts, purrr allows you to replace many for loops with code that is easier to write and more expressive.
tibble : tibble is a modern re-imagining of the data frame, keeping what time has proven to be effective, and throwing out what it has not. Tibbles are data.frames that are lazy and surly: they do less and complain more forcing you to confront problems earlier, typically leading to cleaner, more expressive code.
Source: Tidyverse packages, tidyverse.org, accessed on September 20, 2017, https://www.tidyverse.org/packages/
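To make "do less and complain more" concrete, here is a small sketch comparing a base data frame with a tibble. The column names name and value are invented for illustration.

```r
library(tibble)

df <- data.frame(name = c("a", "b"), value = c(1, 2))
tb <- as_tibble(df)

# Base data frames partially match column names with `$`:
# `val` is quietly resolved to the `value` column
df$val

# Tibbles refuse partial matching: this returns NULL and warns
# about an unknown column
tb$val

# Selecting one column with `[`: the data frame drops to a bare vector,
# the tibble stays a tibble
class(df[, "value"])
class(tb[, "value"])
```

The tibble behavior is stricter, so a misspelled column name shows up as a warning right away instead of silently producing something else.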
dplyr package
Get copy of dplyr demo repository
On website, spring18.cds101.com, click Materials → Class 8
Obtain a copy of the linked repository
Load in RStudio, and follow along in demos
select()
select() demo
Follow along in RStudio
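The demo repository is not reproduced in these slides. As a rough sketch of what select() does, the examples below pick columns from the presidential dataset that ships with ggplot2 (assumed here because the slides use it in the pipe aside; its columns are name, start, end, and party).

```r
library(dplyr)
library(ggplot2)  # loaded only for the `presidential` dataset

# Keep two columns by name
select(presidential, name, party)

# Keep a contiguous range of columns, from `name` through `end`
select(presidential, name:end)

# Drop columns by prefixing them with a minus sign
select(presidential, -start, -end)
```

select() changes which columns are present, never which rows.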
%>% aside
Instead of this:
select(presidential, name, party)
We write this:
presidential %>% select(name, party)
Shows the order of transformations
Useful when we have to chain together many transformations!
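A short sketch of why the pipe helps once there is more than one step. It assumes the presidential dataset from ggplot2; head() is base R and is used here only as a simple second step.

```r
library(dplyr)
library(ggplot2)  # loaded only for the `presidential` dataset

# Nested calls have to be read from the inside out
nested <- head(select(presidential, name, party), 3)

# The pipe passes the left-hand result in as the first argument of the
# next call, so the steps read top to bottom in the order they run
piped <- presidential %>%
  select(name, party) %>%
  head(3)

# Both forms compute the same table
all.equal(nested, piped)
```

With two steps the difference is small; with five or six, the nested form becomes hard to follow while the piped form stays a simple list of steps.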
arrange()
arrange() demo
Follow along in RStudio
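The demo itself lives in the class repository. As a hedged sketch of what arrange() does, again assuming the presidential dataset from ggplot2:

```r
library(dplyr)
library(ggplot2)  # loaded only for the `presidential` dataset

# Sort rows by start date, earliest first (ascending is the default)
oldest_first <- arrange(presidential, start)

# desc() reverses the order: most recent first
newest_first <- arrange(presidential, desc(start))

# Extra columns break ties: sort by party, then by name within each party
by_party <- arrange(presidential, party, name)

newest_first
```

arrange() reorders rows only; the number of rows and columns stays the same.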
slice()
slice() demo
Follow along in RStudio
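A minimal sketch of slice(), which picks rows by position. It assumes the presidential dataset from ggplot2 and is not taken from the class demo.

```r
library(dplyr)
library(ggplot2)  # loaded only for the `presidential` dataset

# Rows 1 through 3
first_three <- slice(presidential, 1:3)

# Negative positions drop rows: everything except the first row
without_first <- slice(presidential, -1)

# n() is the total number of rows, so this keeps only the last row
last_row <- slice(presidential, n())

first_three
```

Because slice() works on positions, its result depends on the current row order, so it is often used right after arrange().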
filter()
Comparisons
Simple comparisons can be made using the following symbols:
> : greater than
>= : greater than or equal to
< : less than
<= : less than or equal to
!= : not equal
== : equal
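Each comparison returns TRUE or FALSE for every element, and filter() keeps the rows where the result is TRUE. The sketch below uses a made-up vector x and, as an assumption, the presidential dataset from ggplot2.

```r
library(dplyr)
library(ggplot2)  # loaded only for the `presidential` dataset

x <- c(1, 5, 10)
x > 5    # FALSE FALSE  TRUE
x >= 5   # FALSE  TRUE  TRUE
x != 5   #  TRUE FALSE  TRUE
x == 5   # FALSE  TRUE FALSE

# == tests equality (a single = would be read as an argument name instead)
republicans <- filter(presidential, party == "Republican")

# != keeps every row that does not match
others <- filter(presidential, party != "Republican")

# Comparisons also work on dates
since_1981 <- filter(presidential, start >= as.Date("1981-01-20"))
```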
Logical operators
Source: Digital image of logical operations, R for Data Science website, accessed September 20, 2017, http://r4ds.had.co.nz/transform.html#logical-operators
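The figure is not reproduced in this transcript. As a stand-in, the sketch below evaluates R's basic logical operators on two small vectors; the names x and y are chosen only for illustration.

```r
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

x & y      #  TRUE FALSE FALSE FALSE   (and: both must be TRUE)
x | y      #  TRUE  TRUE  TRUE FALSE   (or: at least one is TRUE)
!x         # FALSE FALSE  TRUE  TRUE   (not: flips each value)
xor(x, y)  # FALSE  TRUE  TRUE FALSE   (exactly one is TRUE)
x & !y     # FALSE  TRUE FALSE FALSE   (x but not y)
```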
filter() demo
Follow along in RStudio
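The class demo is in the linked repository. A rough sketch of how comparisons and logical operators combine inside filter(), assuming the presidential dataset from ggplot2:

```r
library(dplyr)
library(ggplot2)  # loaded only for the `presidential` dataset

# Conditions separated by commas must all hold (same as joining them with &)
with_comma <- filter(presidential,
                     party == "Democratic", start >= as.Date("1990-01-01"))
with_and   <- filter(presidential,
                     party == "Democratic" & start >= as.Date("1990-01-01"))

# | keeps rows that satisfy either condition
either <- filter(presidential, name == "Kennedy" | name == "Johnson")

# %in% is a shorthand for several == tests joined with |
either_in <- filter(presidential, name %in% c("Kennedy", "Johnson"))

either
```

filter() changes which rows are present and leaves the columns alone, which makes it the row-wise counterpart of select().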