Wed, sept 16, 2015 using large data sets. announcements Presentation this Friday, Sept. 18 @ noon in...


wed, sept 16, 2015

using large data sets

announcements

• Presentation this Friday, Sept. 18 @ noon in our classroom, room 208
– Should we archive Facebook? Why the users are wrong and the NSA is right
– Cathy Marshall, Texas A&M (formerly of Microsoft Research)

Assignment: Analyzing a research article

• Select one of the seven research articles linked from our class schedule (listed under today’s readings, Sept 16)

• Using the Analyzing Research Articles handout (need one?), critique the article

• Focus on the purpose of the study, description of study design (participants, methods, how they collected data), data analysis and conclusions

• Don’t worry about specific statistical analysis methods
• Due next Wednesday, Sept 23 – print or email to me by class time
• Format – whatever works for you (bullets, address some but not necessarily all questions/points from handout)
• Counts as one pop quiz (worth up to 2 points)

Rationale for assignment

• Use evaluative strategies and vocabulary to examine a scholarly research product – (the “analyzing information” part of our class title)

• Explore research that has been conducted using the same dataset that we will be using for our project, the General Social Survey

today’s line-up…

• Survey research as a method
• General Social Survey GSS (dataset for our project)
• Project details, variables, think about groups
• SPSS & Virtual Lab

What is a survey?

• A survey is a:
– systematic method for gathering information
– from (a sample of) entities
– for the purposes of constructing quantitative descriptors
– of the attributes of the larger population of which the entities are members

typically, surveys…

• Gather information by asking people questions

• Collect information by either (1) interviewers asking questions and recording responses or (2) respondents reading and recording their own answers

• Collect information from a subset of the population, a sample, rather than from all members

The SAGE Handbook of Online Research Methods (2008)

Based on probability design*

*units are selected using a method that ensures that each unit has a known, nonzero probability of being included
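To make the footnote concrete, here is a minimal sketch of the simplest probability design, a simple random sample. The sampling frame of 1,000 unit IDs, the sample size of 50, and the function name are assumptions made up for illustration; they do not come from the reading.

```python
import random

def simple_random_sample(frame, n, seed=0):
    """Draw n units without replacement.

    Every unit in the frame has the same known, nonzero
    inclusion probability: n / len(frame).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(frame, n)

# Hypothetical sampling frame: 1,000 unit IDs (an assumption for this sketch).
frame = list(range(1, 1001))
sample = simple_random_sample(frame, 50)

inclusion_probability = len(sample) / len(frame)
print(len(sample), inclusion_probability)  # 50 0.05
```

Real surveys usually use more elaborate designs (stratified or multi-stage), but the defining property is the same: the chance of selection is known for every unit.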

The reading spends a great length of time introducing the history of survey development.

Why is it important to understand the history of survey?

-Wanyi

1936 election and the Literary Digest survey

• The magazine had correctly predicted every presidential election since 1916

• Sent out 10 million surveys, and 2.4 million responded

• They said: Landon would win 57% of the vote

• What happened: 62% Roosevelt landslide

What went wrong?

• Sample not representative
• Lists came from subscriptions, phone directories, club members
• Phones were a luxury in 1936
• Selection bias toward the rich
• Voluntary response: Republicans were angry and more likely to respond
• Context: Great Depression
– 9 million unemployed
– Real income down 33%
– Massive discontent, strike waves
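The point that a huge sample cannot rescue a biased list can be sketched with a small simulation. Everything numeric here is an assumption chosen for illustration (a 30% "wealthy" share, 40% Roosevelt support among the wealthy, about 71% among everyone else); it is not historical data, and it models only the biased list, not the voluntary-response problem.

```python
import random

rng = random.Random(1936)  # fixed seed so the sketch is repeatable

def make_voter():
    """Return (is_wealthy, votes_roosevelt) for one hypothetical voter."""
    wealthy = rng.random() < 0.30               # assumed share of wealthy voters
    p_roosevelt = 0.40 if wealthy else 0.714    # assumed support levels
    return wealthy, rng.random() < p_roosevelt

voters = [make_voter() for _ in range(200_000)]

def roosevelt_share(group):
    """Fraction of a group of voters who vote for Roosevelt."""
    return sum(vote for _, vote in group) / len(group)

true_share = roosevelt_share(voters)

# Biased frame: only wealthy voters appear on the mailing lists.
wealthy_frame = [v for v in voters if v[0]]
biased_sample = rng.sample(wealthy_frame, 20_000)

# Probability sample: much smaller, but drawn from the whole electorate.
random_sample = rng.sample(voters, 2_000)

print(f"true share:            {true_share:.3f}")
print(f"biased, n = 20,000:    {roosevelt_share(biased_sample):.3f}")
print(f"random, n = 2,000:     {roosevelt_share(random_sample):.3f}")
```

Under these assumptions the large biased sample calls the election for the wrong candidate, while the sample one tenth its size lands close to the true share.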

Polls vs. Surveys

• No clear distinction between the two terms
– “Poll” often used for private sector opinion studies
   • Use many of the same design features as studies that would be called surveys
– “Poll” rarely used to describe government or scientific surveys
• To me, the term poll implies either
– A commercial or less-scientific study, or
– A quick turn-around survey whose results may be of short-term interest

Steps in conducting a survey

• Clearly state research objective(s)
• Decide on survey mode(s)
– How will you contact potential respondents? (web, email, phone, etc.)
– In what media will the survey be given? (web, email, phone, etc.)
• Determine fielding strategy (how to maximize response rates?)
• Design the survey questions and the survey instrument
• Determine sample size and sampling strategy
• Obtain Institutional Review Board (IRB) or other approval as necessary
– Are respondents promised confidentiality?
– What is the impact if their survey responses become known?
• Pre-test
– Give the survey to some test subjects and get their feedback
   • What works and what doesn’t?
   • Are you getting correct data/information?
– Revise and re-pre-test as necessary


General Social Survey

• The GSS (General Social Survey) is a biennial (every other year) personal interview survey of U.S. households conducted by the National Opinion Research Center (NORC). The first survey took place in 1972.

• Approximately 3000 American adults are interviewed in person for about 90 minutes and asked around 450 questions.

http://www3.norc.org/gss+website/

Purpose of GSS

• gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes over time

• to compare the United States to other societies

General Social Survey

• demographics & attitudes
– The questionnaire contains a standard core of demographic and attitudinal variables, plus certain topics of special interest selected for rotation (called "topical modules")
– Items include national spending priorities, drinking behavior, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and membership in voluntary associations

Data to Story Project

• Go over project description
• Components
– Project description
– Search log
– Annotated bibliography
– Presentation
– Participation in group

[handout]

variables

• variable – a characteristic that can vary in value among subjects in a sample or a population. We are interested in similarities and differences - variance

• types of variables
– categorical
– quantitative

categorical variable

• scale for measurement is a set of categories
• examples:
– Racial-ethnic group (white, black, Hispanic)
– Political party identification (Dem., Repub., Indep.)
– Vegetarian? (yes, no)
– Mental health evaluation (well, mild symptom formation, moderate symptom formation, impaired)
– Happiness (very happy, pretty happy, not too happy)
– Religious affiliation
– Major

SPANKING: Categorical (Single) Do you strongly agree, agree, disagree, or strongly disagree that it is sometimes necessary to discipline a child with a good, hard spanking?

Categories:                                 Code as:
{strongly_agree}      Strongly agree        5
{agree}               Agree                 4
{disagree}            Disagree              3
{strongly_disagree}   Strongly disagree     2
{dontknow}            DON'T KNOW            1
{refused}             REFUSED               0

Sample question from GSS

scales of measurement

for categorical variables, two types:

nominal scale – unordered categories
   o preference for president, race, gender, religious affiliation, major, opinion items (favor vs. oppose, yes vs. no)

ordinal scale – ordered categories
   o political ideology (very liberal, liberal, moderate, conservative, very conservative)
   o anxiety, stress, self esteem (high, medium, low)
   o mental impairment (none, mild, moderate, severe)
   o government spending on environment (up, same, down)

PRES12: Categorical (Single) Did you vote for Obama or Romney?

Categories:                      Code as:
Obama                            5
Romney                           4
Other Candidate (Specify)        3
Didn’t vote for president        2
Don’t know                       1
Refused                          0

nominal scale – unordered categories

POLVIEWS: Categorical (Single) We hear a lot of talk these days about liberals and conservatives. I'm going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal--point 1--to extremely conservative--point 7. Where would you place yourself on this scale?

Categories:                      Code as:
Extremely liberal                7
Liberal                          6
Slightly liberal                 5
Moderate, middle of the road     4
Slightly conservative            3
Conservative                     2
Extremely conservative           1
DON'T KNOW                       0
REFUSED                          8

ordinal scale – ordered categories
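A short sketch of what these codes mean in practice, using the code values printed for the SPANKING sample question above. The list of responses is made up for illustration, and treating "don't know" and "refused" as missing is an assumption about how an analyst would handle them, not a rule stated in the slides.

```python
from collections import Counter
from statistics import median_low

# Codes as printed on the SPANKING sample question.
SPANKING_LABELS = {
    5: "Strongly agree",
    4: "Agree",
    3: "Disagree",
    2: "Strongly disagree",
    1: "Don't know",
    0: "Refused",
}
MISSING_CODES = {0, 1}  # assumption: exclude these from the analysis

# Hypothetical coded responses from eleven respondents.
responses = [5, 4, 4, 3, 1, 2, 4, 0, 3, 5, 4]

# Keep only substantive answers.
valid = [code for code in responses if code not in MISSING_CODES]

# Frequencies are meaningful for any categorical variable.
counts = Counter(SPANKING_LABELS[code] for code in valid)

# Because the four remaining categories are ordered (ordinal), a median
# category is meaningful; median_low always returns an actual code.
median_label = SPANKING_LABELS[median_low(valid)]

print(counts)
print("median category:", median_label)  # median category: Agree
```

For a nominal variable such as PRES12 only the frequency count would make sense, since the codes carry no order.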

quantitative variable

• possible values differ in magnitude
• examples:
– Age, height
– Annual income
– Time spent online each week
– Reaction time to a stimulus (e.g., cell phone use while driving in an experiment)
– Number of “life events” in past year

use of statistics to describe, summarize, and explain or make sense of a given set of data

• Mean
– Uses all of the data
– Has desirable statistical properties
– Affected by extreme high or low values (outliers - example)
– May not best characterize skewed distributions
• Median
– Not affected by outliers
– May better characterize skewed distributions

Comparison of mean and median

UNC Geography Majors Salaries Example

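In the mid-1980s at the University of North Carolina, the average starting salary of geography majors was well over $100,000. A single very high earner can pull the mean far above what a typical graduate earned; the median would not be affected in the same way.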
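The salary example can be reproduced with a few made-up numbers. The ten salaries below are hypothetical, chosen only to show one extreme value dragging the mean; they are not the actual UNC figures.

```python
from statistics import mean, median

# Hypothetical starting salaries: nine typical graduates and one outlier.
salaries = [18_000, 19_000, 20_000, 21_000, 22_000,
            22_000, 23_000, 24_000, 25_000, 1_000_000]

mean_salary = mean(salaries)      # 119400
median_salary = median(salaries)  # 22000.0

# Dropping the outlier barely moves the median but collapses the mean.
typical = salaries[:-1]
mean_without_outlier = mean(typical)      # about 21,556
median_without_outlier = median(typical)  # 22000

print(mean_salary, median_salary)
print(round(mean_without_outlier), median_without_outlier)
```

With the outlier included, the "average" salary is well over $100,000 even though nine of the ten graduates earn $25,000 or less.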

Correlation ≠ Causation

sample patterns from GSS data
– median income of female respondents compared with average income of male respondents
– median level of education of respondents who own a gun
– number of female respondents who own a gun compared with number of male respondents who own a gun
– average age of respondents who indicated the government should spend more on space exploration
– self-reported level of happiness compared with income level
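Comparisons like these boil down to splitting respondents into groups and summarizing each group. Here is a minimal sketch with eight invented respondents; the field names (`sex`, `income`, `owns_gun`) are hypothetical stand-ins, not actual GSS variable names, and in class the same thing would be done in SPSS.

```python
from collections import defaultdict
from statistics import median

# Hypothetical respondents (invented values, not GSS data).
respondents = [
    {"sex": "female", "income": 32_000, "owns_gun": False},
    {"sex": "female", "income": 41_000, "owns_gun": True},
    {"sex": "female", "income": 28_000, "owns_gun": False},
    {"sex": "female", "income": 55_000, "owns_gun": False},
    {"sex": "male",   "income": 30_000, "owns_gun": True},
    {"sex": "male",   "income": 47_000, "owns_gun": True},
    {"sex": "male",   "income": 39_000, "owns_gun": False},
    {"sex": "male",   "income": 62_000, "owns_gun": False},
]

# Group incomes and count gun owners by sex.
incomes_by_sex = defaultdict(list)
gun_owners_by_sex = defaultdict(int)
for person in respondents:
    incomes_by_sex[person["sex"]].append(person["income"])
    if person["owns_gun"]:
        gun_owners_by_sex[person["sex"]] += 1

median_income = {sex: median(values) for sex, values in incomes_by_sex.items()}

print(median_income)            # {'female': 36500.0, 'male': 43000.0}
print(dict(gun_owners_by_sex))  # {'female': 1, 'male': 2}
```

A difference between groups describes a pattern in the sample; by itself it says nothing about what causes the difference.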

sample characteristics of the GSS

• The sampling frame of the General Social Survey is all U.S. adults living in households. The sampling frame includes 97.3% of all U.S. adults.

• Who does not live in a household?
– college students in dorms
– military personnel in barracks
– prisoners
– elderly persons in retirement homes

Does the GSS sample really draw from all the adults in its sample frame?

• After the GSS is sampled, only 70% of persons in the sample actually respond to the survey
– 23% refuse or cut the survey off in the middle
– 2% are unavailable or can’t be found
– 5% are missing for other reasons

• In general, a response rate of 60% or more is considered minimally acceptable, but you should check your results in any way you can.
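As a worked piece of arithmetic, the percentages above can be combined with the earlier figure of roughly 3,000 completed interviews to estimate how many people must be selected. Pairing those two numbers is an assumption made for illustration, so the resulting counts are rough estimates and not official GSS figures.

```python
# Outcome percentages from the slide above.
disposition_pct = {
    "completed": 70,
    "refused or broke off": 23,
    "unavailable or not found": 2,
    "missing for other reasons": 5,
}
assert sum(disposition_pct.values()) == 100  # the outcomes cover everyone sampled

# Assumption: about 3,000 completed interviews, as stated earlier in the deck.
completed_interviews = 3000

# If only 70% respond, the selected sample must be larger than 3,000.
selected_needed = completed_interviews * 100 / disposition_pct["completed"]

# Rough head count for each outcome.
estimated_counts = {
    outcome: round(selected_needed * pct / 100)
    for outcome, pct in disposition_pct.items()
}

MINIMUM_ACCEPTABLE_PCT = 60  # rule of thumb quoted in the slide
meets_minimum = disposition_pct["completed"] >= MINIMUM_ACCEPTABLE_PCT

print(round(selected_needed))  # 4286
print(estimated_counts)
print(meets_minimum)           # True
```

So roughly 4,300 adults would need to be selected to end up with about 3,000 interviews at a 70% response rate.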

Let’s look at a GSS questionnaire

Start at page 31

where can you access SPSS?

• Odum Institute – Davis Library 2nd floor – ask lab assistant
• https://virtuallab.unc.edu
• Lab in the Undergraduate Library (need to confirm)

Notes…

• Bring a flash drive to the Odum lab on Monday – you may want to save your work
• Davis Library >> Room 219
• The dataset that we are using (GSS 20124) is available for download on our class website
– schedule>>sept 21
– the dataset is in .sav format, which opens with SPSS