Problem Set 1 Sol


PS II 2015 Problem Set 1 : Sampling Solutions

1. Identify the relevant underlying population and suggest an appropriate sampling design/scheme that should be used to estimate

a) The average age of the current batch of PGP I students.

Population : All current PGP I students.

Sampling scheme : (i) Select a simple random sample (without replacement) from the above population and calculate the average age; (ii) if each study group can be assumed to contain a heterogeneous mix of students of all ages, the groups can be treated as clusters: select a simple random sample of groups and calculate the average age of all students in the selected groups.
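The two schemes above can be sketched in Python. This is only an illustration: the study groups and ages below are hypothetical, and the function names `srs_mean` and `cluster_mean` are made up for this sketch.

```python
import random

def srs_mean(ages, n, seed=None):
    """Mean age from a simple random sample of n students (without replacement)."""
    rng = random.Random(seed)
    sample = rng.sample(ages, n)  # random.sample draws without replacement
    return sum(sample) / n

def cluster_mean(groups, k, seed=None):
    """Mean age over ALL students in k randomly selected study groups."""
    rng = random.Random(seed)
    chosen = rng.sample(list(groups), k)  # sample group labels, not students
    ages = [age for g in chosen for age in groups[g]]
    return sum(ages) / len(ages)

# Hypothetical data: four study groups, each a mix of ages
groups = {
    "G1": [22, 24, 27, 23, 30],
    "G2": [21, 25, 26, 29, 23],
    "G3": [24, 22, 28, 31, 23],
    "G4": [23, 26, 22, 27, 25],
}
all_ages = [a for ages in groups.values() for a in ages]

print("SRS estimate:    ", srs_mean(all_ages, 8, seed=1))
print("Cluster estimate:", cluster_mean(groups, 2, seed=1))
```

The cluster estimate is only reasonable here because each hypothetical group spans roughly the same age range as the whole batch, which is the heterogeneity assumption stated in (ii).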

b) The average number of hours an IIMA student (PGP I, II, X, FPM, FDP, AFP) studies per week day outside of class.

Population : All current PGP I, II, X, FPM, FDP & AFP students of IIMA.

Sampling scheme : Use stratified sampling (with proportional allocation) from each of the above groups, i.e., first decide on the sample sizes to be drawn from each group using the proportional allocation rule and then select simple random samples of those sizes from the groups. Calculate the average study hours for each of the samples and combine them into an overall estimate by weighting each group's average by that group's share of the population.
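A minimal sketch of the proportional allocation rule and the resulting weighted estimate follows. The group sizes are hypothetical (actual enrolment figures are not given in the problem), and `proportional_allocation` and `stratified_mean` are illustrative names.

```python
def proportional_allocation(stratum_sizes, n):
    """Split a total sample size n across strata in proportion to stratum size."""
    N = sum(stratum_sizes.values())
    quotas = {s: n * size / N for s, size in stratum_sizes.items()}
    alloc = {s: int(q) for s, q in quotas.items()}  # round down first
    # Hand out any remaining units to the strata with the largest remainders
    leftover = n - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1
    return alloc

def stratified_mean(stratum_sizes, stratum_means):
    """Overall estimate: stratum sample means weighted by population share."""
    N = sum(stratum_sizes.values())
    return sum(stratum_sizes[s] * stratum_means[s] for s in stratum_sizes) / N

# Hypothetical group sizes, for illustration only
sizes = {"PGP I": 400, "PGP II": 400, "PGPX": 100, "FPM": 60, "FDP": 30, "AFP": 10}
print(proportional_allocation(sizes, 100))
```

With these assumed sizes the smallest group receives a single unit, so in practice one might set a minimum sample size per group.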

c) The proportion of non-vegetarians in Ahmedabad in 2013

Population : All permanent residents of Ahmedabad in 2013.

Sampling scheme : Use stratified sampling (with proportional allocation, if possible) based on religion/caste/ethnicity since those may have an effect on food habits.

2. Suppose you want to know the proportion of IIMA PGP I students who went home during the last weekend. To estimate it you ask your close friends whether they went home or not.

a) What kind of study are you conducting?

i) Experimental ii) Simple random sampling iii) Observational study iv) Stratified random sampling v) Convenience study

b) What can be a possible source of bias in your study?

i) Sampling bias ii) Non-response bias iii) Response bias iv) All of the above

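Here, it is very unlikely that anyone would refuse to respond or distort their responses. Moreover, the question itself would likely be quite straightforward. Thus, response and non-response biases can be discounted. Close friends, however, are not a random sample of PGP I students, which leaves sampling bias (option i) as the likely source.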

3. For each of the following studies, explain whether an experiment or observational study would be more appropriate:

a) Whether or not smoking has an effect on coronary heart disease

Observational study : it is unethical to randomize subjects into smoking and non-smoking groups and follow them up over time to check/compare the proportion of those affected with coronary heart disease. Rather, a random sample of patients (with coronary heart disease) and healthy subjects can be interviewed with respect to their smoking history.

b) Whether class X scores tend to be positively associated with CAT scores

Observational study : it is not feasible to randomize students to higher and lower class X score categories and follow them up to check their CAT scores. Rather, it would be more realistic to select a random sample of students, record their CAT and class X scores to check for any association between the two.

c) Whether or not a special coupon attached to the outside of a catalogue makes recipients more likely to order products from a mail-order company

Experimental study : a random sample of subjects can be mailed the catalogue with coupon and another random group can be given the catalogue without the coupon. The proportion in each group who order the products can then be compared.
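The random assignment described above can be sketched as follows. The subject IDs and the set of subjects who ordered are hypothetical, and `randomize` and `order_rate` are illustrative names.

```python
import random

def randomize(subjects, seed=None):
    """Randomly split subjects into a coupon group and a no-coupon group."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def order_rate(group, ordered):
    """Proportion of a group that placed an order."""
    return sum(1 for s in group if s in ordered) / len(group)

# Hypothetical mailing list of 20 recipients and hypothetical orders
recipients = list(range(1, 21))
coupon, no_coupon = randomize(recipients, seed=42)
ordered = {2, 5, 9, 14, 17}  # made-up outcome data, for illustration only

print("Coupon group order rate:   ", order_rate(coupon, ordered))
print("No-coupon group order rate:", order_rate(no_coupon, ordered))
```

Because chance alone decides who gets the coupon, a systematic difference between the two order rates can be attributed to the coupon rather than to how the groups were formed.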

d) Whether longer hours on Facebook tend to be associated with lower grades

Observational study : it is not realistic to randomize students into different Facebook-use categories and follow them up to check their grades. Rather it would be more realistic to select a random sample of students, record the number of hours/minutes they browse Facebook and their grades and analyse whether these are associated (or not).

e) Whether or not taking Aspirin can reduce heart attacks.

Experimental study : a group of patients at risk of heart ailments can be randomized to receive Aspirin or placebo and followed up over time to check what percentage in each group have heart attacks.

4. Suppose the following are the values in some population: 5, 27, 4, 17, 4.5, 19, 2, 11, 3, 6, 13, 18. A sample of size 4 is taken, and is observed to be 3, 4, 4.5, 2. Is it most likely to be (a) a simple random sample, (b) a stratified sample or (c) a clustered sample? Give a reason for your answer.

A stratified sample should have units from every stratum; a cluster sample should have all units from the sampled clusters (each of which is heterogeneous). Hence this can only be a simple random sample, as in either b) or c) there would be a mix of 1-digit and 2-digit numbers.
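The argument can be checked by counting. The sketch below assumes, as the answer implicitly does, that the strata would be "values below 10" and "values of 10 or more"; that split is an assumption, not something stated in the problem.

```python
from itertools import combinations
from math import comb

population = [5, 27, 4, 17, 4.5, 19, 2, 11, 3, 6, 13, 18]
observed = [3, 4, 4.5, 2]

# Assumed strata (not given in the problem): below 10 vs. 10 and above
small = [x for x in population if x < 10]
large = [x for x in population if x >= 10]

# Every size-4 subset is equally likely under simple random sampling
total_samples = comb(len(population), 4)
all_small = sum(
    1 for s in combinations(population, 4) if all(x < 10 for x in s)
)

print("Possible simple random samples of size 4:", total_samples)
print("Of these, samples with only values below 10:", all_small)
print("Observed sample lies entirely in the 'small' stratum:",
      set(observed) <= set(small))
```

Under simple random sampling such an all-small sample is possible (though not common), whereas a stratified sample drawn from both assumed strata would always contain at least one value of 10 or more.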

5. IIMA income : An agency wants to estimate the mean monthly income of the IIMA employees. The agency designs the following sampling plan to get the estimate:

Divide the IIMA employees into five different groups: senior faculty, junior faculty, officers, supervisors, and contract workers. The number of people from whom the information is collected in each category is proportional to the share of the employees in that category. On one weekday morning, the agency surveyors go around all the IIMA offices and collect income information from employees of different category until they reach the specified number for each category.

a) In this plan, is it right to classify the employees into different groups? Why or Why not?

Yes, stratification by employee category makes sense because average salaries are expected to vary substantially across employee categories. However, more groups/strata may be necessary since, for example, income of officers may vary quite a bit based on rank and/or seniority.

b) What kind of sample is this? Is this sample likely to give the agency a reasonable estimate of the mean income? Why or Why not? Can you think of a better sampling plan than this?

This is a stratified sample, but the units are not randomly selected within each stratum (in effect, a quota sample). The sample may be biased, for example, towards faculty who are likely to come to office in the mornings (it does not take into account those employees who come late to the office and hence suffers from undercoverage). Basically, the above sampling process does not ensure that every employee in each category is equally likely to be in the sample. A better plan would be to obtain the list of employees in each category and draw a simple random sample of the allocated size from each list.

c) Someone recommends to the agency that they randomly select one of the employee categories and sample everyone in that category to estimate the mean income. Would this plan make sense to you?

Not at all; this is an example of cluster sampling (with very bad clusters since strata are being treated as clusters). It does not make sense because any one cluster of this kind will not represent the entire population of employees.
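A small numerical illustration of why a single category cannot stand in for the whole population is given below. The head counts and mean incomes are entirely hypothetical (arbitrary income units); only the five category names come from the problem.

```python
# Hypothetical (number of employees, mean monthly income) per category
categories = {
    "senior faculty":   (50, 20),
    "junior faculty":   (50, 12),
    "officers":         (100, 8),
    "supervisors":      (200, 4),
    "contract workers": (600, 2),
}

total_employees = sum(n for n, _ in categories.values())
population_mean = sum(n * m for n, m in categories.values()) / total_employees

# Sampling everyone in ONE randomly chosen category yields that category's mean
single_category_estimates = {name: m for name, (_, m) in categories.items()}

print("Population mean income:", population_mean)
for name, estimate in single_category_estimates.items():
    print(f"Estimate if only '{name}' is sampled: {estimate}")
```

With these made-up figures the estimate ranges from 2 to 20 depending on which category happens to be chosen, while the population mean is 4.4.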

6. Break ups : Married people seem to have better health than their single peers. Break-ups take their toll with divorced or separated people reporting higher rates of illness (compared to their “happily-married” counterparts). This conclusion was based on a nationwide survey conducted by Statistics Canada* of 9755 people, aged 20 to 64 about their physical and mental health and relationship status, at 2 year intervals, beginning 1990.

a) What is the population of interest in this study ?

All Canadian adults aged 20-64 years.

b) Identify the explanatory and response variable.

Response variable : (a measure of) physical and mental health status.

Explanatory variable : relationship status

c) Was this an experiment or observational study? Explain.

Observational study since physical/mental health and relationship status were observed/collected at regular time intervals for 9755 individuals.

* Wade, T. J. and Pevalin, D. J., “Marital transitions and mental health”, Journal of Health and Social Behavior (2004), 45(2), 155-170.

7. Identify the bias : Suppose Ahmedabad Mirror (AM) designs a survey to estimate the proportion of the city’s adult residents who favor the legalization of drinking. Accordingly, it takes a list of the 1000 Amdavadis who have subscribed to this paper the longest, and sends each of them a questionnaire that asks “Do you think it is a good idea to legalize drinking in Ahmedabad, which will eventually broaden the tax base and hence contribute money to education and other critical infrastructure needs ?” After analysing results from the 50 people who respond, they report that 95% of Amdavadis are against legalization of drinking. Identify the bias that results from

a) Undercoverage : those who do not subscribe to AM or are recent subscribers were left out; thus a large chunk of the Ahmedabad population was not even considered.

b) Sampling design : this is essentially a volunteer/convenience sample, NOT a simple random sample of the city’s adult residents.

c) Nonresponse : out of 1000 sampled subjects, only 50 responded, i.e., 950 (95%) did not respond. Perhaps only those who felt strongly about the issue (of legalization of drinking) responded.

d) Response : the question is somewhat convoluted and hence could have influenced the responses; it is quite likely that some of the negative responses were given because a positive response would have been socially unacceptable.

8. Stock markets and mental health : An internet survey of 545 Hong Kong residents suggested that close daily monitoring of volatile financial affairs may not be good for your mental health. Subjects who felt that their financial future was out of control had the poorest mental health whereas those who felt that their financial future was secure had the best mental health.

a) Identify the population of interest for this survey

All Hong Kong (HK) residents who monitor stock markets/volatile financial affairs on a daily basis.

b) Is this an experiment or an observational study ?

Observational study since it is based on responses from an internet survey.

c) Is there any possibility of bias in the survey results? Explain.

Yes, there are some possibilities of biases as follows :

Sampling bias/undercoverage : this is a volunteer/convenience sample, NOT a random sample of HK residents. Moreover, the survey did not reach those HK residents who do not have access to the internet.

Non-response bias : since this is an internet survey, a large proportion of those reached might not have responded (the number of internet users in HK is far higher than 545!). In fact, only those who felt strongly about this issue might have responded.

Response bias : the responses might have been affected by the manner/language of the question asked; there is also the possibility that some respondents may have lied about their financial condition and/or mental health (since both of these are sensitive issues).