
STAT 1261/2260: Principles of Data Science

Lecture 2 - Doing Data Science

Last lecture

• What is Data Science?

• Course webpage: http://pitt.edu/~jub69/material/index_PDS_F19.html

A case study: “More Tweets, More Votes?”

• Research question:

Is social media a valid indicator of political behavior?

• Data:

542,969 tweets mentioning candidates selected from a random sample of 3,570,054,618, as well as Federal Election Commission data from 795 competitive races in the 2010 and 2012 U.S. congressional elections.

Data and Code Available Here

Potential problems with social media data

• Social media content is largely focused on entertainment and emotional expression, potentially rendering it a poor measure of the behaviors and outcomes typically of interest to social scientists.

• Social media provide a self-selected sample of the electorate. They provide a biased, non-representative sample of the population:

– Twitter data do not accurately represent the sociodemographic makeup of the United States

– Right-leaning political communication channels are more active and densely connected than left-leaning channels

– Extroversion and openness to experiences are positively related to social media use, while emotional stability has a negative relationship

Previous studies

Many similar studies have been criticized for a variety of reasons, including:

• using a self-selected and biased sample of the population

• investigating only a small number of elections

• not using sociodemographic controls

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/23103


Any remedies?

The authors of “More Tweets, More Votes?” collected data on election outcomes from the 2010 and 2012 U.S. congressional elections from the Federal Election Commission.

Additionally, for 2010, they collected socio-demographic and electoral control variables commonly used in other research on electoral politics for all 435 U.S. House districts.

• Dependent variable: Republican Vote Share

• Independent variables:

– Republican Tweet Share

– Republican Incumbency (1 if the incumbent is a Republican and 0 otherwise)

– Baseline district partisanship (%McCain, percentage of the 2008 presidential vote share captured by John McCain)

– Conventional media coverage (CNN share)

– Demographic variables (Median Household Income, Median Age, %College Educated, %White, %Female)

Can you see how the researchers in this study are trying to address the issues raised about previous studies?

Variable Definitions

• Dependent variable: Republican vote share for each district i, denoted $v_S(i)$

$$v_S(i) = \frac{v_R(i)}{v_D(i) + v_R(i)} \times 100$$

• Independent variable: Republican Tweet Share

$$tw_S(i) = \frac{tw_R(i)}{tw_R(i) + tw_D(i)} \times 100$$

• where $tw_S(i)$ represents the percentage of Twitter attention given to a particular candidate over his or her opponent in a particular race.
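The two share definitions above have the same form, so a single helper covers both. The sketch below is in Python purely for illustration (the course itself uses R); the function name `share` and the district counts are hypothetical and not taken from the study.

```python
def share(own: float, other: float) -> float:
    """Percentage of the two-party total that belongs to `own`."""
    total = own + other
    if total == 0:
        raise ValueError("share is undefined when both counts are zero")
    return own / total * 100

# Hypothetical district: these counts are made up for illustration only.
votes_rep, votes_dem = 120_000, 80_000
tweets_rep, tweets_dem = 300, 700

vote_share = share(votes_rep, votes_dem)     # Republican vote share, in percent
tweet_share = share(tweets_rep, tweets_dem)  # Republican tweet share, in percent
print(f"vote share: {vote_share:.1f}%, tweet share: {tweet_share:.1f}%")
```

Because both quantities are two-party percentages, they are bounded between 0 and 100 and do not depend on district size or on the overall volume of tweets.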

Why do you think the variables are defined in the above ways?

Results

Let us discuss only the results from the 2010 election data.

• Bivariate model:

– Republican Tweet Share: 0.268 (0.022)***

– Adjusted R-square: 0.26

Any concern?

• Full model:

– Republican Tweet Share: 0.022 (0.01)*

– Adjusted R-square: 0.92

Any concern?
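One way to compare the two coefficient lines is the ratio of each estimate to the value in parentheses. The Python sketch below assumes, as is conventional in regression tables but not stated on the slide, that the parenthesized values are standard errors.

```python
# Estimate and parenthesized value for Republican Tweet Share, as reported above.
# Assumption: the parenthesized value is the standard error of the estimate.
models = {
    "bivariate": (0.268, 0.022),
    "full": (0.022, 0.01),
}

# Ratio of estimate to standard error (the usual t-ratio under that assumption).
ratios = {name: coef / se for name, (coef, se) in models.items()}

for name, ratio in ratios.items():
    print(f"{name}: estimate / standard error = {ratio:.1f}")
```

Under that assumption the ratio is much smaller in the full model, which is consistent with the drop from three stars to one.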

Interpretation of Results

• The results show that the percentage of Republican-candidate name mentions correlates with the Republican vote margin in the subsequent election.

• This finding persists even when controlling for incumbency, district partisanship, media coverage of the race, time, and demographic variables such as the district’s racial and gender composition.

Bivariate model

Let us examine the plot for the bivariate model.

Any violation of model assumptions?

• Constant variance

• Normality

• Linearity

• Outliers

• Influential points

Question: Can this model be used for prediction, for example, for the next election? Any concerns? Any ideas for improving the study?

Big Data and Data Science

• What is big data and what is data science?

– Is data science the science of Big Data?

– Is data science just an extension of statistics?

• From Wikipedia: Data Science is an interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining.

“Unstructured data” can include emails, videos, photos, social media, and other user-generated content.

• Data science often requires sorting through a great amount of information and writing algorithms to extract insights from this data.

Today

• How do we learn Data Science?

• Data visualization

https://en.wikipedia.org/wiki/Data_science

Textbooks

Required Textbook

• Baumer et al., Modern Data Science with R. CRC Press. [Textbook webpage: https://mdsr-book.github.io/index.html]

Other Resources

• Wickham and Grolemund, R for Data Science. O’Reilly. [http://r4ds.had.co.nz/]


Topics

1. Introduction to Data Science

2. Introduction to Data Science tools: R and RStudio

3. Data Visualization

4. Data Wrangling

5. Ethics in Data Science

6. Statistical thinking in Data Science

7. Regression modeling

8. Machine Learning, dimension reduction, clustering, classification

9. Professional Reporting and reproducible analysis

How do we learn?

• Learn data science by doing data science

• Use R and RStudio

• Two lectures and one recitation (lab/quiz) in a week

R

• R is a free software environment for statistical computing and graphics, and is arguably the best data science language to learn first.

• Two articles make this case:

– Why you should learn R first for data science

– WHY R IS THE BEST DATA SCIENCE LANGUAGE TO LEARN TODAY

• Download R (URL https://www.r-project.org/)


http://sharpsightlabs.com/blog/r-recommend-data-science/

https://www.r-bloggers.com/why-you-should-learn-r-first-for-data-science/

RStudio

RStudio is open-source, enterprise-ready professional software for R.

URL https://www.rstudio.com/


RStudio screen

Three sections:

• R console

• Workspace/History

• Files/Plots/Packages/Help

How to learn R and RStudio

• R is a language for data science.

• This entire course is about doing data science using R.

– Friday classes will meet in the STAT LAB (Posvar 1201).

– We will begin using R this Friday.

• If you’ve got a personal computer, install R and RStudio.

– Visit https://cran.r-project.org/ to install R, then visit https://www.rstudio.com/ to install RStudio.

– Take a look at the “INSTALLING R and R Studio” document (on the course webpage)

– Watch the Lynda.com video at https://www.lynda.com/R-tutorials/Up-Running-R/120612-2.html (use your Pitt ID to log in)

• Computers in the STAT LAB have R and RStudio.

– You can bring your laptop to the lab.

Data wrangling and data visualization

See the example data set, containing sex and height data for a group of people.


    Timestamp            Height   Sex
1   9/2/2014 13:40:36    75       Male
2   9/2/2014 13:46:59    70       Male
3   9/2/2014 13:59:20    68       Male
4   9/2/2014 14:51:53    74       Male
5   9/2/2014 15:16:15    61       Male
6   9/2/2014 15:16:16    65       Female

Motivating data wrangling

Note that some entries of Height are not in inches.

      Timestamp            Height   Sex
127   9/2/2014 15:16:56    5'7"     Male
150   9/2/2014 15:17:09    5'3"     Female
187   9/2/2014 15:18:00    5'8.11   Male
202   9/2/2014 15:19:48    5'11     Male
236   9/4/2014 0:46:45     5'9''    Male
55    9/2/2014 15:16:37    165cm    Female

Fixing this is part of what we call data wrangling.
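A minimal sketch of this kind of fix, written in Python for illustration (the course itself uses R). The function name `height_to_inches` and the parsing rules are assumptions, not the lecture's method. In particular, an entry such as 5'8.11 is read here as 5 feet 8.11 inches, which is only one possible interpretation.

```python
import re

CM_PER_INCH = 2.54

def height_to_inches(raw: str):
    """Convert one raw height entry to inches, or return None if unrecognized."""
    text = raw.strip().lower()
    # Plain number: assume it is already in inches.
    if re.fullmatch(r"\d+(\.\d+)?", text):
        return float(text)
    # Centimeters, e.g. "165cm".
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*cm", text)
    if m:
        return float(m.group(1)) / CM_PER_INCH
    # Feet and inches, e.g. 5'7", 5'11 or 5'9'' (closing mark optional).
    m = re.fullmatch(r"(\d+)'\s*(\d+(?:\.\d+)?)?\s*(?:\"|'')?", text)
    if m:
        feet = int(m.group(1))
        inches = float(m.group(2)) if m.group(2) else 0.0
        return feet * 12 + inches
    # Anything else is left for manual inspection.
    return None

raw_heights = ["75", "5'7\"", "5'3\"", "5'8.11", "5'11", "5'9''", "165cm"]
for raw in raw_heights:
    print(raw, "->", height_to_inches(raw))
```

Returning `None` for unrecognized entries, rather than guessing, keeps the questionable rows visible so that they can be reviewed.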

Data wrangling

After fixing the above issue, there are still some problems:

      Timestamp            Height     Sex
12    9/2/2014 15:16:23    6.00       Male
40    9/2/2014 15:16:32    5.30       Female
66    9/2/2014 15:16:41    511.00     Male
84    9/2/2014 15:16:46    6.00       Male
99    9/2/2014 15:16:50    2.00       Female
126   9/2/2014 15:16:56    9000.00    Male
194   9/2/2014 15:18:14    5.25       Female
231   9/3/2014 21:43:00    5.50       Male
235   9/3/2014 23:55:37    11111.00   Male
241   9/4/2014 5:15:28     6.00       Female
242   9/4/2014 6:31:03     6.50       Male
244   9/4/2014 9:24:41     150.00     Female

We sometimes have to fix these “by hand.”
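Before fixing anything by hand, it helps to flag which values need attention. The Python sketch below (illustrative only; the course uses R) sorts the heights from the table above into rough categories. The range bounds and the category names are assumptions chosen for illustration, not values from the lecture.

```python
# Heights from the table above, after conversion to numbers.
heights = [6.00, 5.30, 511.00, 6.00, 2.00, 9000.00,
           5.25, 5.50, 11111.00, 6.00, 6.50, 150.00]

# Assumed plausible range for adult height in inches (illustrative bounds).
LOW, HIGH = 48.0, 90.0
CM_PER_INCH = 2.54

def classify(h: float) -> str:
    """Rough guess at what a recorded height value might mean."""
    if LOW <= h <= HIGH:
        return "ok"
    if 4.0 <= h <= 7.5:
        return "maybe feet"       # e.g. 5.5 could be 5.5 ft or 5 ft 5 in
    if LOW * CM_PER_INCH <= h <= HIGH * CM_PER_INCH:
        return "maybe cm"
    return "review by hand"

for h in heights:
    print(f"{h:>9.2f}  {classify(h)}")
```

The labels are only hints: a value such as 5.50 is ambiguous between 5.5 feet and 5 feet 5 inches, which is one reason some entries still have to be resolved manually.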

Understanding Univariate Data

Look at the cumulative distribution function of univariate data:

$$F(a) = \mathrm{Prob}(\mathrm{Height} \le a)$$

Distributions

Histograms show $F(b) - F(a)$ for several intervals $(a, b]$

Easier to interpret than cumulative distribution functions
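The link between the cumulative distribution function and a histogram can be sketched in a few lines. The Python example below (illustrative; the course uses R) estimates F from the six heights shown earlier in the example data and computes F(b) − F(a) for a few intervals. The interval edges are arbitrary choices, and the helper name `ecdf` is hypothetical.

```python
def ecdf(data, a):
    """Empirical F(a): proportion of observations less than or equal to a."""
    return sum(1 for x in data if x <= a) / len(data)

# First six heights (inches) from the example data set.
heights = [75, 70, 68, 74, 61, 65]

# Arbitrary interval edges; each bin is the half-open interval (a, b].
edges = [60, 68, 72, 76]
proportions = [ecdf(heights, b) - ecdf(heights, a)
               for a, b in zip(edges, edges[1:])]

for (a, b), p in zip(zip(edges, edges[1:]), proportions):
    print(f"({a}, {b}]: {p:.2f}")
```

Each bar height is a difference of two values of F, so the proportions over bins that cover all the data add up to 1.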

Normal Distributions

The distributions of many outcomes in nature are approximated by the normal distribution, which has two parameters:

• μ is the average (also called the mean)

• σ is the standard deviation
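For reference, the standard form of the normal density with these two parameters, together with the corresponding cumulative distribution function in the $F(a)$ notation, can be written as:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad F(a) = \int_{-\infty}^{a} f(x)\,dx$$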

Normal Approximation

If our data follow the normal distribution, then μ and σ are a sufficient summary: they tell us everything!

All we need to know is μ and σ.

         Mean   SD
Male     70     3
Female   65     3
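To illustrate why the mean and SD suffice under the normal approximation, the Python sketch below (illustrative; the course uses R) computes cumulative probabilities from the two summary values alone, using the standard library's `statistics.NormalDist`. The helper name `prob_at_most` is hypothetical. The means and SDs are the rounded values from the table, so the results are approximate.

```python
from statistics import NormalDist

# Mean and SD of height in inches, from the summary table (rounded values).
summary = {"Male": (70, 3), "Female": (65, 3)}

def prob_at_most(sex: str, a: float) -> float:
    """Normal-approximation estimate of Prob(Height <= a) for one group."""
    mu, sigma = summary[sex]
    return NormalDist(mu, sigma).cdf(a)

# Proportion of males no taller than 73 inches (one SD above the mean).
p_male_73 = prob_at_most("Male", 73)
# Proportion of females no taller than the female mean.
p_female_65 = prob_at_most("Female", 65)
print(f"{p_male_73:.3f} {p_female_65:.3f}")
```

No individual observations are needed: any probability statement about heights follows from the two numbers per group, provided the normal assumption holds.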

How good is the normal approximation?

Here are the real (observed) and approximate values of the cumulative probability distribution for males.

    Height   Real   Approx
1   63       0.02   0.03
2   65       0.07   0.06
3   67       0.16   0.10
4   68       0.31   0.31
5   70       0.50   0.44
6   71       0.69   0.68
7   73       0.84   0.88
8   75       0.93   0.95
9   76       0.98   0.99

• Real is the observed (empirical) cumulative probability

• Approx is the cumulative probability assuming Height is normally distributed

QQ-plots

Observed versus normal approximation quantiles
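The points of a QQ-plot can be computed without any plotting library. The Python sketch below (illustrative; the course uses R) pairs each sorted observation with the quantile of a normal distribution fitted to the same data. The plotting positions (i + 0.5)/n are one common convention among several, and the six heights are the first rows of the example data set.

```python
from statistics import NormalDist, mean, stdev

# First six heights (inches) from the example data set.
heights = [75, 70, 68, 74, 61, 65]

observed = sorted(heights)
n = len(observed)

# Normal distribution with the sample mean and sample SD.
fitted = NormalDist(mean(heights), stdev(heights))

# Plotting positions: one probability per sorted observation.
probs = [(i + 0.5) / n for i in range(n)]
theoretical = [fitted.inv_cdf(p) for p in probs]

# Each pair is one point of the QQ-plot: (theoretical quantile, observed quantile).
qq_pairs = list(zip(theoretical, observed))
for t, o in qq_pairs:
    print(f"{t:6.2f}  {o}")
```

If the data are close to normal, the pairs fall near the line y = x. Systematic curvature suggests skewness or heavy tails.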

Two variables (scatterplot)

Normal approximation for two variables

Many data are bivariate normal.

• The blue line is called the regression line

Anscombe’s quartet

Anscombe’s quartet comprises four data sets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed.

They were constructed by the statistician Francis Anscombe to demonstrate:

• The importance of graphing data before analyzing it

• The effect of outliers and other influential observations on statistical properties

Anscombe’s quartet (cont.)

• Top left is a simple linear relationship, corresponding to two variables correlated and following the assumption of normality.

• Top right is not distributed normally; while a relationship between the two variables is obvious, it is not linear, and the Pearson correlation coefficient is not relevant.

• Bottom left: the relationship is linear, but the calculated regression line is offset by the one outlier, which exerts enough influence to lower the correlation coefficient from 1 to 0.816.

• Bottom right shows an example when one high-leverage point is enough to produce a high correlation coefficient, even though the other data points do not indicate any relationship between the variables.
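The claim that the four data sets share nearly identical summary statistics can be checked directly. The Python sketch below (illustrative; the course uses R, where the same data ship as the built-in `anscombe` data frame) uses the published values of the quartet and computes the Pearson correlation and the least-squares line for each set. The helper name `summarize` is hypothetical.

```python
from math import sqrt

# Anscombe's quartet; sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Return (Pearson correlation, slope, intercept) of the least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return sxy / sqrt(sxx * syy), slope, my - slope * mx

results = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, (r, slope, intercept) in results.items():
    print(f"{name}: r = {r:.3f}, line: y = {intercept:.2f} + {slope:.3f} x")
```

All four sets give a correlation close to 0.816 and almost the same fitted line, even though only the first set looks like a well-behaved linear relationship when plotted.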

Most data are not normal

For example, look at compensation for 199 US CEOs (2000)

• The distribution is severely right-skewed.

• The average is $600,000, but 84% of the CEOs, not 50%, make less than that.

• The normal approximation is not appropriate here.
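The gap between the average and the "typical" value is easy to reproduce with a small right-skewed sample. The Python sketch below (illustrative; the course uses R) uses made-up compensation figures, not the CEO data from the lecture, to show how a few very large values pull the mean above most of the observations.

```python
from statistics import mean, median

# Hypothetical compensation values in thousands of dollars (made up for illustration).
pay = [100, 150, 200, 250, 300, 400, 500, 800, 1500, 6000]

avg = mean(pay)
mid = median(pay)
# Share of observations strictly below the mean.
share_below_mean = sum(1 for p in pay if p < avg) / len(pay)

print(f"mean = {avg}, median = {mid}, share below mean = {share_below_mean:.0%}")
```

For symmetric data such as a normal distribution, about half of the observations fall below the mean. Here the share is far higher, which is the same pattern as in the CEO example.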
