A Theoretical Framework for Adaptive Collection Designs

Jean-François Beaumont, Statistics Canada

David Haziza, Université de Montréal

International Total Survey Error Workshop

Québec, June 19-22, 2011

Overview

Selected literature review

Framework

• Definition of the problem

• Choice of quality indicator and cost function

• Mathematical formulation of the problem

Solution and discussion

Conclusion

Literature review: Groves & Heeringa (2006, JRSS, Series A)

Responsive designs: Use paradata to guide changes in the features of data collection in order to achieve higher quality estimates per unit cost

• Paradata: Data about data collection process

• Examples of features: mode of data collection, use of incentives, …

• Need to define quality and determine quality indicators

• Two main concepts: phase and phase capacity

Literature review: Groves & Heeringa (2006, JRSS, Series A)

Phase: Period of data collection during which the same set of methods is used

• Phase 1: gather information about design features

• Phases 2+: alter features (e.g., subsampling of nonrespondents, larger incentives, …)

A phase is continued until its phase capacity is reached

• Judged by the stability of an indicator as the phase matures

Literature review: Schouten, Cobben & Bethlehem (2009, SM)

Goal: determine an indicator of nonresponse bias as an alternative to response rates

Proposed a quality indicator, called the R-indicator:

$$R(\rho) = 1 - 2 \times \mathrm{Pop.Std.Dev.}(\rho_i,\ i \in U), \qquad 0 \le R(\rho) \le 1$$

• Population standard deviation must be estimated

• Response probabilities, $\rho_i$, must be estimated using some model

An issue: indicator depends on the proper choice of model (choice of auxiliary variables)
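As a rough illustration of the indicator, the sketch below computes it from a list of estimated response probabilities, assuming the form R = 1 - 2 × (population standard deviation of the probabilities). The function name `r_indicator` and the input values are hypothetical, and the unweighted standard deviation (divisor N) is used for simplicity.

```python
import statistics

def r_indicator(rho):
    """R-indicator, assuming R = 1 - 2 * (population std. dev. of rho)."""
    # statistics.pstdev uses the divisor N (population standard deviation).
    return 1.0 - 2.0 * statistics.pstdev(rho)

# Hypothetical estimated response probabilities.
equal = [0.6, 0.6, 0.6, 0.6]      # no variation between units
unequal = [0.2, 0.2, 1.0, 1.0]    # strong variation between units

print(r_indicator(equal))    # no variation, so the indicator is at its maximum
print(r_indicator(unequal))  # lower value: less balanced response
```

The more the response probabilities vary between units, the lower the indicator, which is why it depends on the model used to estimate them.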

Literature review: Schouten, Cobben & Bethlehem (2009, SM)

Another issue: indicator does not depend on the variables of interest but nonresponse bias does

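Maximal bias of $\hat{\mu}_{NA}$:

$$\frac{\{1 - R(\rho)\}\, S(y)}{2 \bar{\rho}}$$

where $\bar{\rho}$ is the average response probability and $\hat{\mu}_{NA}$ is the unadjusted estimator of the population mean:

$$\hat{\mu}_{NA} = \frac{\sum_{i \in s_r} w_i y_i}{\sum_{i \in s_r} w_i}$$

Two limitations of maximal bias (and R-indicator):

• unadjusted estimator is rarely used in practice

• depends on proper specification of $\rho_i$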
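To make the two quantities on this slide concrete, here is a minimal sketch. It assumes the unadjusted estimator is the weighted respondent mean and that the maximal bias takes the form (1 - R) S(y) / (2 × mean response probability), with S(y) the population standard deviation of y. The function names and the data are hypothetical.

```python
import statistics

def unadjusted_mean(weights, y):
    """Weighted respondent mean: sum(w_i * y_i) / sum(w_i) over respondents."""
    return sum(w * v for w, v in zip(weights, y)) / sum(weights)

def maximal_bias(rho, y_pop):
    """Assumed bound (1 - R) * S(y) / (2 * mean(rho)), with R = 1 - 2 * S(rho)."""
    r = 1.0 - 2.0 * statistics.pstdev(rho)
    return (1.0 - r) * statistics.pstdev(y_pop) / (2.0 * statistics.mean(rho))

# Hypothetical respondent data (design weights and observed values).
w_resp = [2.0, 2.0]
y_resp = [1.0, 3.0]
print(unadjusted_mean(w_resp, y_resp))

# Hypothetical response probabilities and population values.
rho = [0.2, 0.2, 1.0, 1.0]
y_pop = [1.0, 3.0, 1.0, 3.0]
print(maximal_bias(rho, y_pop))
```

Note that the bound needs both the response probabilities and the variability of y, neither of which is known in practice.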

Literature review: Peytchev, Riley, Rosen, Murphy & Lindblad (2010, SRM)

Goal: Reduce nonresponse bias through case prioritization

Suggest targeting individuals with lower estimated response probabilities

• For instance, give them larger incentives or give interviewer incentives

• Their approach is basically equivalent to trying to increase the R-indicator (or achieving a more balanced sample)

Recommend using auxiliary variables that are associated with the variables of interest

Literature review: Laflamme & Karaganis (2010, ECQ)

Development and implementation of responsive designs for CATI surveys at Statistics Canada

Planning phase:

• before data collection starts (determination of strategies, analyses of previous data, …)

Initial collection phase:

• evaluate different indicators to determine when the next phase should start

Two Responsive Design (RD) phases

Literature review: Laflamme & Karaganis (2010, ECQ)

RD phase 1:

• prioritize cases (based on paradata or other information) with the objective of improving response rates

• increase the number of respondents (desirable)

RD phase 2:

• prioritize cases with the objective of reducing the variability of response rates between domains of interest (increasing R-indicator)

• likely reduce the variability of weight adjustments (desirable)

Literature review: Schouten, Calinescu & Luiten (2011, Stat. Netherlands)

First paper to propose a theoretical framework for adaptive survey designs

Suggest:

• Maximizing quality for a given cost; or

• Minimizing cost for a given quality

Requires a quality indicator (e.g., overall response rate, R-indicator, Maximal bias, …)

• Which one to use?

Definition of the problem

Adaptive collection design: Any procedure for call prioritization or resource allocation that is dynamic as data collection progresses

• Uses paradata (or other information) to adapt to what is observed during data collection

• Focus on call prioritization

Our objective: Maximize quality for a given cost

Context: CATI surveys

Choice of quality indicator

Focus of the literature: Find collection designs that reduce nonresponse bias (or maximize R-indicator) of an unadjusted estimator

We think the focus should not be on nonresponse bias. Why?

• Any bias that can be removed at the collection stage can also be removed at the estimation stage

We suggest reducing nonresponse variance of an estimator adjusted for nonresponse

Quality indicator

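Suppose we want to estimate the total:

$$t = \sum_{i \in U} y_i$$

Assuming that nonresponse is uniform within cells, an asymptotically unbiased estimator is:

$$\hat{t}_A = \sum_{g=1}^{G} \sum_{i \in s_{rg}} \frac{w_{gi}}{\hat{\rho}_g}\, y_{gi} \quad \text{with} \quad \hat{\rho}_g = \frac{n_{rg}}{n_g}$$

Quality indicator: The nonresponse variance

$$\mathrm{var}_q(\hat{t}_A \mid s) \approx \sum_{g=1}^{G} n_g \left( \rho_g^{-1} - 1 \right) S_{wy,g}^2$$

$$\rho_g = E_q(\hat{\rho}_g \mid s) = E_q(n_{rg} \mid s) / n_g$$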
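The following sketch shows how the cell-adjusted estimator and the quality indicator could be evaluated. It assumes the estimator inflates each cell's weighted respondent total by n_g / n_rg, and that the nonresponse variance is approximated by the sum over cells of n_g (1/ρ_g - 1) S²_wy,g. Reading S²_wy,g as the within-cell variance of the weighted values w·y is an assumption here, and the dictionary keys and numbers are hypothetical.

```python
def adjusted_total(cells):
    """Sum over cells of (n_g / n_rg) * (sum of w*y over respondents)."""
    total = 0.0
    for cell in cells:
        rho_hat = len(cell["wy_resp"]) / cell["n"]   # observed response rate n_rg / n_g
        total += sum(cell["wy_resp"]) / rho_hat
    return total

def nonresponse_variance(cells):
    """Sum over cells of n_g * (1 / rho_g - 1) * S2_wy_g."""
    return sum(c["n"] * (1.0 / c["rho"] - 1.0) * c["s2_wy"] for c in cells)

# Hypothetical cells: sample size, weighted respondent values,
# response probability and within-cell variance of w*y.
cells = [
    {"n": 4, "wy_resp": [10.0, 20.0], "rho": 0.5, "s2_wy": 50.0},
    {"n": 2, "wy_resp": [5.0, 5.0], "rho": 1.0, "s2_wy": 0.0},
]

print(adjusted_total(cells))
print(nonresponse_variance(cells))
```

A cell contributes nothing to the nonresponse variance when everyone responds (ρ_g = 1) or when the weighted values do not vary within the cell.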

Overall cost

Overall cost:

$$C_{TOT,g} = \sum_{i \in s_g - s_{rg}} m_{gi}\, C_{NR,g} + \sum_{i \in s_{rg}} \left\{ C_{R,g} + (m_{gi} - 1)\, C_{NR,g} \right\}$$

$$C_{TOT} = \sum_{g=1}^{G} C_{TOT,g}$$

$m_{gi}$: total number of attempts for unit $i$

$C_{NR,g}$: cost of an unsuccessful attempt

$C_{R,g}$: cost of an interview
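As a small worked example of the cost accounting, the sketch below assumes that a nonrespondent costs one unsuccessful-attempt cost per attempt, and that a respondent costs one interview plus an unsuccessful-attempt cost for each earlier attempt. The function name `cell_cost` and the figures are hypothetical.

```python
def cell_cost(attempts, responded, c_nr, c_r):
    """Overall cost for one cell.

    attempts[i]  : total number of attempts made for unit i
    responded[i] : True if unit i was eventually interviewed
    c_nr, c_r    : cost of an unsuccessful attempt, cost of an interview
    """
    cost = 0.0
    for m, resp in zip(attempts, responded):
        if resp:
            # The last attempt is the interview; the m - 1 earlier ones failed.
            cost += c_r + (m - 1) * c_nr
        else:
            # Every attempt failed.
            cost += m * c_nr
    return cost

# Hypothetical cell: two respondents (3 and 1 attempts), one nonrespondent (5 attempts).
attempts = [3, 1, 5]
responded = [True, True, False]
print(cell_cost(attempts, responded, c_nr=2.0, c_r=10.0))
```

Summing `cell_cost` over all cells gives the overall cost of the collection.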

Expected overall cost

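Expected overall cost:

$$\bar{C}_{TOT} = E_q(C_{TOT} \mid s) = \sum_{g=1}^{G} \bar{C}_{TOT,g}$$

$$\bar{C}_{TOT,g} = (C_{R,g} - C_{NR,g})\, n_g \rho_g + C_{NR,g} \sum_{i \in s_g} \bar{m}_{gi}$$

$$\bar{m}_{gi} = E_q(m_{gi} \mid s) = m(p_{gi}, M_{gi})$$

Assumption: $\bar{m}_{gi}$ does not depend on $\rho_g$

$$\bar{C}_{TOT} = \beta_0 + \sum_{g=1}^{G} \beta_{1g}\, n_g \rho_g$$

with $\beta_0 = \sum_{g=1}^{G} C_{NR,g} \sum_{i \in s_g} \bar{m}_{gi}$ and $\beta_{1g} = C_{R,g} - C_{NR,g}$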
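The expected cost of a cell can be sketched as below, assuming it equals (C_R - C_NR) n ρ plus C_NR times the sum of the expected numbers of attempts. The slide does not give the form of the expected number of attempts, so the truncated-geometric formula in `expected_attempts` (constant per-attempt response probability `p`, at most `max_attempts` attempts) is only an assumption made for illustration. All names and values are hypothetical.

```python
def expected_attempts(p, max_attempts):
    """Expected number of attempts when each attempt succeeds with
    probability p and calling stops at the first success or after
    max_attempts attempts (assumed model, not from the slides)."""
    return (1.0 - (1.0 - p) ** max_attempts) / p

def expected_cell_cost(n, rho, c_r, c_nr, m_bar):
    """(c_r - c_nr) * n * rho + c_nr * (sum of expected attempts)."""
    return (c_r - c_nr) * n * rho + c_nr * sum(m_bar)

# Hypothetical cell of two units, each with p = 0.5 and at most 2 attempts,
# so the response probability is 1 - 0.5**2 = 0.75.
m_bar = [expected_attempts(0.5, 2), expected_attempts(0.5, 2)]
print(m_bar)
print(expected_cell_cost(n=2, rho=0.75, c_r=10.0, c_nr=2.0, m_bar=m_bar))
```

Once the expected numbers of attempts are treated as fixed, the expected cost is linear in the cell response probabilities, which keeps the optimization problem simple.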

Mathematical formulation

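Objective: Find $\rho_g$, $g = 1, \ldots, G$, that minimizes the nonresponse variance $\mathrm{var}_q(\hat{t}_A \mid s)$ subject to a fixed expected overall cost, $\bar{C}_{TOT} = K$

Solution, for the linear expected cost $\bar{C}_{TOT} = \beta_0 + \sum_{g=1}^{G} \beta_{1g}\, n_g \rho_g$:

$$\rho_g = \frac{(K - \beta_0)\, \beta_{1g}^{-1/2}\, S_{wy,g}}{\sum_{h=1}^{G} n_h\, \beta_{1h}^{1/2}\, S_{wy,h}}$$

Note: Equivalent to maximizing the R-indicator only in a very special scenario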
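The optimization can be sketched numerically under two assumptions: the nonresponse variance is the sum over cells of n_g (1/ρ_g - 1) S²_wy,g, and the expected cost is linear, β_0 + Σ β_1g n_g ρ_g. A Lagrange-multiplier argument then gives target probabilities proportional to S_wy,g / sqrt(β_1g), scaled so that the budget is met exactly. The names `beta0`, `beta1` and `budget` are hypothetical, and the sketch does not handle the case where a computed target exceeds 1.

```python
import math

def optimal_targets(n, s2_wy, beta1, beta0, budget):
    """Target response probabilities proportional to S_wy_g / sqrt(beta1_g),
    scaled so that beta0 + sum(beta1_g * n_g * rho_g) equals the budget."""
    denom = sum(ng * math.sqrt(b * s2) for ng, s2, b in zip(n, s2_wy, beta1))
    scale = (budget - beta0) / denom
    return [scale * math.sqrt(s2 / b) for s2, b in zip(s2_wy, beta1)]

# Hypothetical two-cell example with equal cost coefficients.
n = [100, 100]
s2_wy = [4.0, 1.0]      # within-cell variances of the weighted values
beta1 = [8.0, 8.0]      # cost coefficient per expected respondent
beta0 = 200.0           # part of the cost that does not depend on the targets
budget = 1000.0

targets = optimal_targets(n, s2_wy, beta1, beta0, budget)
print(targets)
```

In this example the more variable cell receives the higher target, so the targets are equal across cells (the situation favoured by the R-indicator) only when S_wy,g / sqrt(β_1g) is the same in every cell.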

Implementation

Find the effort (number of attempts) $e_{gi}$ necessary to achieve the target response probability $\rho_g$:

$$e_{gi} = \frac{\ln(1 - \rho_g)}{\ln(1 - p_{gi})}$$

Procedure: Select cases to be interviewed with probability proportional to the effort $e_{gi}$

Issues:

1) Avoid small estimated $p_{gi}$ to avoid an unduly large effort $e_{gi}$

2) Might want to ensure that a certain time has elapsed between two consecutive calls
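The procedure above might be sketched as follows, assuming the effort for a unit is ln(1 - ρ) / ln(1 - p), where ρ is the target response probability of its cell and p its estimated per-attempt response probability (both strictly below 1). The floor `p_floor` used to avoid very small estimated probabilities, and the helper `select_case`, are hypothetical choices rather than part of the original procedure.

```python
import math
import random

def effort(rho_target, p, p_floor=0.05):
    """Number of attempts needed to reach rho_target when each attempt
    succeeds with probability p; p is floored to limit the effort."""
    p = max(p, p_floor)
    return math.log(1.0 - rho_target) / math.log(1.0 - p)

def select_case(case_ids, efforts, rng):
    """Draw one case with probability proportional to its effort."""
    return rng.choices(case_ids, weights=efforts, k=1)[0]

# Hypothetical cases in a cell with target response probability 0.75.
case_ids = ["a", "b", "c"]
p_hat = [0.5, 0.25, 0.01]
efforts = [effort(0.75, p) for p in p_hat]
print(efforts)

rng = random.Random(0)   # seeded only to make the example repeatable
print(select_case(case_ids, efforts, rng))
```

Cases that are harder to reach get a larger effort and are therefore more likely to be called next. Enforcing a minimum delay between two calls to the same case would need extra bookkeeping that is not shown here.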

Graph of variance vs cost

[Figure: minimum nonresponse variance plotted against expected overall cost]

Revised solution

Solution of the optimization problem is found before data collection starts

May be a good idea to revise the solution periodically (e.g., daily)

• Some parameters might need to be modified

• Update remaining budget and expected overall cost

• The revised optimization problem is similar to the initial one

Revised solution

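Solution (same as before), with $K$, $\beta_0$ and $\beta_{1g}$ taken from the revised problem with linear expected cost $\beta_0 + \sum_{g=1}^{G} \beta_{1g}\, n_g \rho_g$:

$$\rho_g = \frac{(K - \beta_0)\, \beta_{1g}^{-1/2}\, S_{wy,g}}{\sum_{h=1}^{G} n_h\, \beta_{1h}^{1/2}\, S_{wy,h}}$$

Revised target response probability:

$$\tilde{\rho}_g = \frac{n_g \rho_g - n_{rg}}{n_g - n_{rg}}$$

• Could be negative

Effort:

$$e_{gi} = \frac{\ln(1 - \tilde{\rho}_g)}{\ln(1 - p_{gi})}$$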
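The periodic revision can be illustrated with a short sketch. It assumes the revised target for the cases still open in a cell is (n ρ - n_r) / (n - n_r), where n is the cell sample size, n_r the number of respondents obtained so far and ρ the overall target for the cell. The function name and numbers are hypothetical, and how to treat a negative value is left open, as in the slides.

```python
def revised_target(n, n_resp, rho):
    """Target response probability among the remaining nonrespondents.

    n      : sample size of the cell
    n_resp : respondents obtained so far in the cell
    rho    : overall target response probability for the cell
    """
    remaining = n - n_resp
    if remaining == 0:
        return 0.0   # nobody left to call in this cell
    return (n * rho - n_resp) / remaining

# Hypothetical cell of 100 units with an overall target of 0.7.
print(revised_target(100, 40, 0.7))   # target not yet reached: positive value
print(revised_target(100, 80, 0.7))   # target already exceeded: negative value
```

A negative value signals that the cell has already exceeded its overall target, so no further effort is needed there under this rule.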

Conclusion

Next steps:

• Simulation study

• Adapt the theory for practical applications

• Test in a real production environment

Which quality indicator? Nonresponse variance? Others?

Reduction of nonresponse bias: subsampling of nonrespondents

• Our approach could be used within the subsample

Thanks - Merci

For more information, please contact:

Jean-François Beaumont ([email protected])

David Haziza ([email protected])
