A survey design for a sensitive binary variable correlated ... · Title A survey design for a...

13
Title A survey design for a sensitive binary variable correlated with another nonsensitive binary variable Author(s) Yu, JW; Lu, Y; Tian, G Citation Journal of Probability and Statistics, 2013, v. 2013, article no. 827048 Issued Date 2013 URL http://hdl.handle.net/10722/191507 Rights Creative Commons: Attribution 3.0 Hong Kong License

Transcript of A survey design for a sensitive binary variable correlated ... · Title A survey design for a...

Page 1: A survey design for a sensitive binary variable correlated ... · Title A survey design for a sensitive binary variable correlated with another nonsensitive binary variable Author(s)

Title A survey design for a sensitive binary variable correlated withanother nonsensitive binary variable

Author(s) Yu JW Lu Y Tian G

Citation Journal of Probability and Statistics 2013 v 2013 article no827048

Issued Date 2013

URL httphdlhandlenet10722191507

Rights Creative Commons Attribution 30 Hong Kong License

Hindawi Publishing CorporationJournal of Probability and StatisticsVolume 2013 Article ID 827048 11 pageshttpdxdoiorg1011552013827048

Research ArticleA Survey Design for a Sensitive Binary Variable Correlated withAnother Nonsensitive Binary Variable

Jun-Wu Yu1 Yang Lu2 and Guo-Liang Tian3

1 School of Mathematics and Computational Science Hunan University of Science and Technology Xiangtan Hunan 411201 China2 School of Mathematics and Statistics Central China Normal University Wuhan Hubei 430079 China3Department of Statistics and Actuarial Science The University of Hong Kong Pokfulam Road Hong Kong

Correspondence should be addressed to Guo-Liang Tian gltianhkuhk

Received 30 August 2012 Accepted 19 May 2013

Academic Editor Zhidong Bai

Copyright copy 2013 Jun-Wu Yu et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Tian et al (2007) introduced a so-called hidden sensitivity model for evaluating the association of two sensitive questions withbinary outcomes However in practice we sometimes need to assess the association between one sensitive binary variable (egwhether or not a drug user the number of sex partner being ⩽1 or gt1 and so on) and one nonsensitive binary variable (eg good orpoor health status with or without cervical cancer and so on) To address this issue by sufficiently utilizing the information con-tained in the non-sensitive binary variable in this paper we propose a new survey scheme called combination questionnairedesignmodel which consists of a main questionnaire and a supplemental questionnaire The introduction of the supplementalquestionnaire which is indeed a design of direct questioning can effectively reduce the noncompliance behavior since morerespondents will not be faced with the sensitive question Likelihood-based inferences including maximum likelihood estimatesvia the expectation-maximization algorithm asymptotic confidence intervals and bootstrap confidence intervals of parameters ofinterest are derived A likelihood ratio test is provided to test the association between the two binary random variables Bayesianinferences are also discussed Simulation studies are performed and a cervical cancer data set in Atlanta is used to illustrate theproposed methods

1 Introduction

Warner [1] introduced a randomized response technique toobtain truthful answers to questions with sensitive attributesUsing the Warner design Kraemer [2] derived a bivariatecorrelation between an attribute with polytomous responsesand an attribute with normally distributed responses Fox andTracy [3] derived estimation of the Pearson product momentcorrelation coefficient between two sensitive questions byassuming that randomized response observations can betreated as individual-level scores that are contaminated byrandom measurement error Edgell et al [4] consideredthe correlation between two sensitive questions using theunrelated question design or the additive constants designChristofides [5] presented a randomized response techniquewith two randomization devices to estimate the proportionof individuals having two sensitive characteristics at thesame time Kim and Warde [6] considered a multinomial

randomized response model which can handle untruthfulresponses They also derived the Pearson product momentcorrelation estimator which may be used to quantify thelinear relationship between two variables when multinomialresponse data are observed according to a randomizedresponse procedure However all these randomized responseprocedures make use of one or two randomizing deviceswhich (i) entail extra costs in both efficiency and complexity(ii) increase the cognitive load of randomized responsetechniques and (iii) allow for new sources of error such asmisunderstanding the randomized response procedures orcheating on the procedures [7]

From the perspective of incomplete categorical datadesign Tian et al [8] proposed a nonrandomized responsemodel (called the hidden sensitivity model) for assessing theassociation of two sensitive questions with binary outcomesTo protect respondentsrsquo privacy and to avoid the use of anyrandomization device they utilized a non-sensitive question

2 Journal of Probability and Statistics

in the questionnaire to indirectly obtain respondentsrsquo answersto the two sensitive questions In the hidden sensitivitymodel they implicitly assumed that all respondents arewilling to follow the design instructions In other words thenoncompliance behavior will not occur However in practicewe sometimes need to assess the association between onesensitive binary variable (eg whether or not a drug user thenumber of sex partner being ⩽1 or gt1 and so on) and onenonsensitive binary variable (eg good or poor health statuswith or without cervical cancer and so on) To our know-ledge the survey design for addressing this issue and cor-responding statistical analysis methods are not availableAlthough we could directly adopt the hidden sensitivitymodel the information contained in the nonsensitive binaryvariable cannot be utilized in the design Intuitively suchinformation can be used to enhance the degree of privacy pro-tection so that more respondents will not be faced with thesensitive questionThemajor objective of this paper is to pro-pose a new survey design and to develop corresponding sta-tistical methods for analyzing sensitive data collected by thistechnique

The rest of the paper is organized as follows In Section 2without using any randomizing device we propose a surveyscheme called combination questionnaire designmodelwhich consists of a main questionnaire and a supplementalquestionnaire Likelihood-based inferences including maxi-mum likelihood estimates via an expectationndashmaximization(EM) algorithm asymptotic confidence intervals and boot-strap confidence intervals of parameters of interest arederived in Section 3 A likelihood ratio test is also provided totest the association between the two binary random variablesIn Section 4we discuss Bayesian inferenceswhen prior infor-mation on parameters is available In Section 5 two simu-lation studies are performed to compare the efficiency ofthe proposed combination questionnaire model with that ofthe existing hidden sensitivity model of Tianet al [8] (iethe main questionnaire only) A cervical cancer data set inAtlanta is used in Section 6 to illustrate the proposed meth-ods A discussion and an appendix on the mode of a groupDirichlet density and a sampling method from it are alsopresented

2 The Survey Design

Assume that 119883 is a sensitive binary random variable 119884 isa non-sensitive binary random variable and they are corre-lated Let 119883 = 1 denote the sensitive class (eg 119883 = 1 if arespondent is a drug user) and 119883 = 0 denote the non-sen-sitive class (eg 119883 = 0 if a respondent is not a drug user)Furthermore let both 119884 = 1 (eg 119884 = 1 if a respondentreceives at least some college training) and 119884 = 0 (eg119884 = 0 if a respondent graduates at most from some highschool) be non-sensitive classes Define 120579 = (120579

1 120579

4)⊤

where 1205791= Pr(119883 = 0 119884 = 0) 120579

2= Pr(119883 = 1 119884 = 0)

1205793= Pr(119883 = 1 119884 = 1) and 120579

4= Pr(119883 = 0 119884 = 1) then

120579 isin T4 where

T119899= (119909

1 119909

119899)⊤

119909119894⩾ 0 119894 = 1 119899

119899

sum

119894=1

119909119894= 1 (1)

Table 1 The main questionnaire for the combination questionnairemodel

Category 119882 = 1 119882 = 2 119882 = 3

I 119883 = 0 119884 = 0 Block 1 mdash Block 2 mdash Block 3 mdashII 119883 = 1 119884 = 0 Category II Please put a tick in Block 2III 119883 = 1 119884 = 1 Category III Please put a tick in Block 3IV 119883 = 0 119884 = 1 Block 4 mdashNote Only 119883 = 1 is a sensitive class while 119883 = 0 119884 = 0 and 119884 = 1are nonsensitive classes

Table 2 The supplemental questionnaire for the combinationquestionnaire model

Category

119884 = 0Please put a tick in Block 5 mdash if you belong to

119884 = 0

119884 = 1Please put a tick in Block 6 mdash if you belong to

119884 = 1

Note Both the 119884 = 0 and 119884 = 1 are nonsensitive classes

denotes the 119899-dimensional closed simplex in R119899 The objec-tive is to make inferences on 120579 120579

119909= Pr (119883 = 1) = 120579

2+ 1205793

120579119910= Pr(119884 = 1) = 120579

4+ 1205793and the odds ratio 120595 = 120579

11205793(12057921205794)

The survey scheme consists of a main questionnaire anda supplemental questionnaire To design the main question-naire which is to be assigned to group 1 with 119899 respondents(119899 is specified by the investigators) we first introduce a non-sensitive question (say 119876

119882) with three possible answers

Assume that 119882 is a non-sensitive variate with trichotomousoutcomes associated with the 119876

119882and 119882 is independent of

both 119883 and 119884 Define 119901119894= Pr(119882 = 119894) for 119894 = 1 2 3 For

example let119882 = 1 (2 3) denote that a respondent was bornin JanuaryndashApril (MayndashAugust SeptemberndashDecember) andthus we could assume that 119901

119894asymp 13 The main questionnaire

is shown in Table 1 under which each respondent is asked toanswer the non-sensitive question 119876

119882

On the one hand since Category I (ie 119883 = 0 119884 = 0)and Category IV (ie 119883 = 0 119884 = 1) are non-sensitive toeach respondent it is reasonable to assume that a respondentis willing to provide hisher truthful answer by putting a tickin Block 119894 (119894 = 1 4) according to hisher true statusOn the other hand Category II (ie 119883 = 1 119884 = 0) andCategory III (ie 119883 = 1 119884 = 1) are usually sensitive torespondents In this case if a respondent belongs to CategoryII (III) heshe is designed to put a tick in Block 2 (3)

The supplemental questionnaire is designed as shown inTable 2 under which 119898 respondents (119898 is also specified bythe investigators) in group 2 are asked to put a tick in Block5 or Block 6 depending on their true status that is 119884 = 0

or 119884 = 1 Since both the 119884 = 0 and 119884 = 1 are non-sensitive classes the supplemental questionnaire is in fact adesign of direct questioningTherefore we call this design thecombination questionnaire model

Table 3 shows the cell probabilities 1205791198944

119894=1 the observed

frequencies 1198991198944

119894=1and the unobservable frequencies 119885

1198943

119894=1

for the main questionnaire The observed frequency 1198992is the

sumof the frequency of respondents belonging to Block 2 and

Journal of Probability and Statistics 3

Table 3 Cell probabilities observed and unobservable counts forthe main questionnaire

Category 119882 = 1 119882 = 2 119882 = 3 TotalI 119883 = 0 119884 = 0 119901

11205791

11990121205791

11990131205791

1205791(1198851)

II 119883 = 1 119884 = 0 1205792(1198852)

III 119883 = 1 119884 = 1 1205793(1198853)

IV 119883 = 0 119884 = 1 1205794(1198994) 120579

4(1198994)

Total 11990111205791(1198991) 11990121205791+ 1205792(1198992) 11990131205791+ 1205793(1198993) 1(119899)

Note 119899 = sum4119894=11198991198941198851 = 119899minus(1198852 +1198853)minus1198994 where (1198852 1198853) are unobservable

Table 4Cell probabilities andobserved counts for the supplementalquestionnaire

Category Total119884 = 0 Block 5 mdash 120579

1+ 1205792(1198980)

119884 = 1 Block 6 mdash 1205793+ 1205794(1198981)

Total 1 (119898)

Note119898 = 1198980 + 1198981

the frequency of those belonging toCategory IIThe observedfrequency 119899

3is the sum of the frequency of respondents

belonging to Block 3 and the frequency of those belongingto Category III Note that 119899 = sum

4

119894=1119899119894= sum3

119894=1119885119894+ 1198994 we have

1198851= 119899minus(119885

2+1198853)minus1198994Thus only119885

2and119885

3are unobservable

Table 4 shows the cell probabilities and observed counts forthe supplemental questionnaire

Remark 1 The design of the main questionnaire is similar tothat of the hidden sensitivitymodel of Tian et al [8] while thedesign of the supplemental questionnaire is indeed a designof direct questioning since both the 119884 = 0 and 119884 = 1 arenon-sensitive classes Table 1 shows that Categories II and IIIare two sensitive subclassesTherefore putting a tick in Block2 or Block 3 implies that the respondent could be suspectedwith the sensitive attribute Let 120579

1= sdot sdot sdot = 120579

4asymp 025

(see Table 3) then around half of the say 2119899 respondentswill be suspected with the sensitive attribute if only themain questionnaire is employed However besides the mainquestionnaire (with 119899 respondents) if the supplemental ques-tionnaire (see Table 2) with 119898 = 119899 respondents is also usedthen only half of the 119899 respondents will be suspected withthe sensitive attribute In other words in the proposed com-bination questionnaire model the information of the non-sensitive binary variable 119884 can be used to enhance the degreeof privacy protection so that more respondents will not befacedwith the sensitive questionThis is whywe introduce thesupplemental questionnaire besides the main questionnaire

Remark 2 In practice to simplify the design itself we suggestthat both the sample size 119899 in the main questionnaire andthe sample size 119898 in the supplemental questionnaire shouldbe fixed in advance rather than the total sample size 119873 =

119899 + 119898 being fixed In this way survey data can be collectedin two independent groups resulting in a relatively simplerstatistical analysis In addition the interviewees are randomlyassigned to either the group 1 or group 2

3 Likelihood-Based Inferences

In this section maximum likelihood estimates (MLEs) of the120579 and the odds ratio 120595 are derived by using the EM algo-rithm In addition asymptotic confidence intervals and thebootstrap confidence intervals of an arbitrary function of 120579are also provided Finally a likelihood ratio test is presentedfor testing the association between the two binary randomvariables

31 MLEs via the EM Algorithm A total of 119873 = 119899 + 119898

respondents are classified into two groups by a randomizationapproach such that 119899 respondents answer the questionsin the main questionnaire and 119898 respondents answer thequestions in the supplemental questionnaire Let 119884obsM =

119899 1198991 119899

4 denote the observed counts collected in the

main questionnaire (see Table 3) where sum4

119894=1119899119894= 119899 The

likelihood function of 120579 based on 119884obs119872 is

119871 (120579 | 119884obs119872) = (119899

1198991 1198992 1198993 1198994

)

times (11990111205791)1198991

3

prod

119894=2

(1199011198941205791+ 120579119894)119899119894

1205791198994

4

(2)

Let 119884obs119878 = 1198981198980 1198981 denote the observed counts gathered

in the supplemental questionnaire (see Table 4) where 1198980+

1198981= 119898 The likelihood function of 120579 based on 119884obs119878 is

119871 (120579 | 119884obs119878) = (119898

1198980

) (1205791+ 1205792)1198980

(1205793+ 1205794)1198981

(3)

Let 119884obs = 119884obs119872 119884obs119878 Since 119884obs119872 and 119884obs119878 are inde-pendent the observed-data likelihood function of 120579 isin T

4is

119871CQ (120579 | 119884obs119878) prop 1205791198991

1

3

prod

119894=2

(1199011198941205791+ 120579119894)119899119894

1205791198994

4

times (1205791+ 1205792)1198980

(1205793+ 1205794)1198981

(4)

where the subscript ldquoCQrdquodenotes the ldquocombination question-nairerdquo model

Since the corresponding cell probabilities to the observedcounts 119899

2and 1198993in the group 1 are in the form of summation

(ie 1199011198941205791+ 120579119894 119894 = 2 3) we cannot obtain the explicit

expressions of the MLEs of 120579 from the score equations of(4) By treating the observed counts 119899

2and 119899

3as incomplete

data we use the EM algorithm [9] to find the MLE of 120579The counts 119885

1198943

119894=1in Table 3 can be viewed as missing data

Briefly119885111988521198853 and 119899

4represent the counts of the respond-

ents belonging to Categories I II III and IV respectivelyThus we denote the latent data by 119884mis = 119911

2 1199113 and the

complete data by 119884com = 119884obs 119884mis Note that all 119901i areknown Consequently the complete-data likelihood functionfor 120579 is

119871CQ (120579 | 119884COM) prop (

3

prod

119894=1

120579119911119894

119894)1205791198994

4times (1205791+ 1205792)1198980

(1205793+ 1205794)1198981

(5)

where 1199111= 119899 minus (119911

2+ 1199113) minus 1198994

4 Journal of Probability and Statistics

By treating 1205791198944

119894=1as random variables we note that

the complete-data likelihood function (5) has the densityform of a grouped Dirichlet distribution [10] Ng et al [11]derived the mode of a grouped Dirichlet density with explicitexpressions (see the appendix) Hence from (A4) and (A5)the complete-data MLEs for 120579 are given by

1205791= 1 minus 120579

2minus 1205793minus 1205794

1205792=1199112

119873(1 +

1198980

119899 minus 1199113minus 1198994

)

1205793=1199113

119873(1 +

1198981

1199113+ 1198994

)

1205794=1198994

119873(1 +

1198981

1199113+ 1198994

)

(6)

Given 119884obs and 120579 119885119894 follows the binomial distribution withparameters 119899i and 120579

119894(1199011198941205791+ 120579119894) that is

119885119894| (119884obs 120579) sim Binomial(119899

119894

120579119894

1199011198941205791+ 120579119894

) 119894 = 2 3 (7)

Therefore the E-step of the EM algorithm computes thefollowing conditional expectations

119864 (119885119894| 119884obs 120579) =

119899119894120579119894

1199011198941205791+ 120579i

119894 = 2 3 (8)

and the M-step updates (6) by replacing 1199112and 119911

3with

previous conditional expectations

Remark 3 Based on the observed-data likelihood function(4) we could use the Newton-Raphson algorithm to findthe MLEs of 120579 However it is well known that the Newton-Raphson algorithm does not necessarily increase the loglikelihood leading even to divergence sometimes [12 page172] In addition the NewtonndashRaphson algorithm is sensitiveto the initial values One advantage of using theEM algorithmin the current situation is that both the E- and M-stephave closed-form expressions More importantly the EMalgorithm and the data augmentation algorithm of Tannerand Wong [13] share the same data augmentation structurein the Bayesian settings (see Section 4 for more details)

32 Asymptotic Confidence Intervals Let 120579minus4

= (1205791 1205792 1205793)⊤

The asymptotic variance-covariance matrix of theMLE minus4

isthen given by Iminus1obs(minus4) where

Iobs (120579minus4) = minus1205972ℓCQ (120579 | 119884obs)

120597120579minus4120597120579⊤

minus4

(9)

denotes the observed information matrix and ℓCQ(120579 |

119884obs) = log 119871CQ(120579 | 119884obs) is the observed-data log-likelihoodfunction From (4) we have

ℓCQ (120579 | 119884obs) = 1198991log 1205791+ 1198992log (119901

21205791+ 1205792)

+ 1198993log (119901

31205791+ 1205793)

+ 1198994log (1 minus 120579

1minus 1205792minus 1205793)

+ 1198980log (120579

1+ 1205792) + 1198981log (1 minus 120579

1minus 1205792)

(10)

It is easy to show that

120597ℓCQ (120579 | 119884obs)

1205971205791

=1198991

1205791

minus1198994

1205794

+

3

sum

119894=2

119899i119901i119901i1205791 + 120579i

+1198980

1205791+ 1205792

minus1198981

1 minus 1205791minus 1205792

120597ℓCQ (120579 | 119884obs)

1205971205792

= minus1198994

1205794

+1198992

11990121205791+ 1205792

+1198980

1205791+ 1205792

minus1198981

1 minus 1205791minus 1205792

120597ℓCQ (120579 | 119884obs)

1205971205793

= minus1198994

1205794

+1198993

11990131205791+ 1205793

minus1205972ℓCQ (120579 | 119884obs)

1205971205792

1

=1198991

1205792

1

+1198994

1205792

4

+

3

sum

119894=2

1198991198941199012

119894

(1199011198941205791+ 120579119894)2+ 120601

minus1205972ℓCQ (120579 | 119884obs)

1205971205792

2

=1198994

1205792

4

+1198992

(11990121205791+ 1205792)2+ 120601

minus1205972ℓCQ (120579 | 119884obs)

1205971205792

3

=1198994

1205792

4

+1198993

(11990131205791+ 1205793)2

minus1205972ℓCQ (120579 | 119884obs)

12059712057911205971205792

=1198994

1205792

4

+11989921199012

(11990121205791+ 1205792)2+ 120601

minus1205972ℓCQ (120579 | 119884obs)

12059712057911205971205793

=1198994

1205792

4

+11989931199013

(11990131205791+ 1205793)2

minus1205972ℓCQ (120579 | 119884obs)

12059712057921205971205793

=1198994

1205792

4

(11)

where

120601 =1198980

(1205791+ 1205792)2+

1198981

(1 minus 1205791minus 1205792)2 (12)

Hence the observed information matrix can be expressed as

Iobs (120579minus4) = diag(1198991

1205792

1

0 0) +1198994

1205792

4

times 131⊤3+ A (13)

Journal of Probability and Statistics 5

whereA =

((((

(

3

sum

119894=2

1198991198941199012

119894

(1199011198941205791+ 120579119894)2+120601

11989921199012

(11990121205791+ 1205792)2+120601

11989931199013

(11990131205791+1205793)2

11989921199012

(11990121205791+ 1205792)2+120601

1198992

(11990121205791+ 1205792)2+120601 0

11989931199013

(11990131205791+ 1205793)2 0

1198993

(11990131205791+1205793)2

))))

)

(14)

Let se(120579119894) denote the standard error of 120579

119894for 119894 = 1 2 3

Note that se(120579119894) can be estimated by the square root of the 119894th

diagonal element of Iminus1obs(minus4) We denote the estimated valueof se(120579

119894) by se(120579

119894)Thus a 95normal-based asymptotic con-

fidence interval for 120579i can be constructed as

[120579119894minus 196 times se (120579

119894) 120579119894+ 196 times se (120579

119894)] 119894 = 1 2 3 (15)

Let 120599 = ℎ(120579minus4) be an arbitrary differentiable function of

120579minus4 For example 120579

4= 1 minus sum

3

119894=1120579i and the odds ratio 120595 =

12057911205793(12057921205794)The deltamethod (eg [14 page 34]) can be used

to approximate the standard error of 120599 = ℎ(minus4) and a 95

normal-based asymptotic confidence interval for 120599 is given by

[120599 minus 196 times se (120599) 120599 + 196 times se (120599)] (16)

where

se (120599) = (120597120599

120597120579minus4

)

Iminus1obs (120579minus4) (120597120599

120597120579minus4

)

10038161003816100381610038161003816100381610038161003816120579minus4= minus4

12

(17)

33 Bootstrap Confidence Intervals When the normal-basedasymptotic confidence interval like (15) is beyond the lowbound zero or the upper bound one the bootstrap approach[15] can be used to construct the bootstrap confidence inter-val of 120599 = ℎ(120579

minus4) Based on the obtained MLE we inde-

pendently generate

(119899lowast

1 119899

lowast

4)⊤

sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)

(18)

119898lowast

0sim Binomial (119898 120579

1+ 1205792) (19)

Having obtained 119884lowast

obs119872 = 119899 119899lowast

1 119899

lowast

4 and 119884

lowast

obs119878 =

119898119898lowast

0 119898lowast

1 where 119898

lowast

1= 119898 minus 119898

lowast

0 we can calculate the

bootstrap replication 120599lowast= ℎ(120579

lowast

minus4) based on 119884

lowast

obs = 119884lowast

obs119872

119884lowast

obs119878 via theEM algorithm specified by (6) and (8) Indepen-dently repeating this process 119866 times we obtain 119866 bootstrapreplications 120599lowast

119892119866

119892=1 Consequently a (1 minus 120572)100 bootstrap

confidence interval for 120599 is given by

[120599119871 120599119880] (20)

where 120599119871and 120599119880are the 100(1205722) and 100(1minus1205722) percentiles

of 120599lowast119892119866

119892=1 respectively

34 The Likelihood Ratio Test for Testing Association Thelikelihood ratio statistic can be used to test whether thetwo binary random variables 119883 and 119884 are independentcor-related The corresponding null and alternative hypothesesare [16 page 45]

1198670 120595 = 1 against 119867

1 120595 = 1 (21)

The likelihood ratio statistic is defined by

Λ = minus2 ℓCQ (119877| 119884obs) minus ℓCQ ( | 119884obs) (22)

where 119877denotes the restrictedMLE of 120579 under119867

0 denotes

the MLE of 120579 which can be obtained by the EM algorithmspecified by (6) and (8) and ℓCQ(120579 | 119884obs) = log 119871CQ(120579 | 119884obs)

To find the restricted MLE 119877 we also employ the 119864119872

algorithm Under1198670 12057911205793= 12057921205794 we have

1205791= (1 minus 120579

119909) (1 minus 120579

119910)

1205792= 120579119909(1 minus 120579

119910)

1205793= 120579119909120579119910

1205794= (1 minus 120579

119909) 120579119910

(23)

In otherwords under1198670we only have two free parameters 120579

119909

and 120579119910 Having obtained the restrictedMLEs 120579

119909119877and 120579119910119877

wecan compute the restricted MLE

119877= (1205791119877

1205794119877

)⊤ from

(23) by

1205791119877

= (1 minus 120579119909119877

) (1 minus 120579119910119877

)

1205792119877

= 120579119909119877

(1 minus 120579119910119877

)

1205793119877

= 120579119909119877

120579119910119877

1205794119877

= (1 minus 120579119909119877

) 120579119910119877

(24)

In what follows we consider the computation of therestricted MLEs 120579

119909119877and 120579

119910119877 Now the complete-data like-

lihood function (5) becomes

119871CQ (120579119909 120579119910| 119884com 1198670)

prop (1 minus 120579119909) (1 minus 120579

119910)1199111

120579119909(1 minus 120579

119910)1199112

times (120579119909120579119910)1199113

(1 minus 120579119909) 1205791199101198994

(1 minus 120579119910)1198980

1205791198981

119910

= 1205791199112+1199113

119909(1 minus 120579

119909)1199111+1198994

1205791199113+1198994+1198981

119910(1 minus 120579

119910)1199111+1199112+1198980

(25)

so that the restricted MLEs of 120579119909and 120579

119910based on the

complete-data are given by

120579119909=1199112+ 1199113

119899 120579

119910=1199113+ 1198994+ 1198981

119873 (26)

respectivelyThus the119872-step of the119864119872 algorithm calculates(26) and the 119864-step computes the conditional expectationsgiven in (8) where 120579

119894 are defined in (23) Finally under119867

0

Λ asymptotically follows chi-squared distribution with onedegree of freedom

6 Journal of Probability and Statistics

4 Bayesian Inferences

To derive the posterior mode of 120579 we employ the EM algo-rithm again The latent data 119884mis = 119911

2 1199113 are the same as

those in Section 31 Based on the complete-data likelihoodfunction (5) if theDirichlet distributionDirichlet (119886

1 119886

4)

is adopted as the prior distribution of 120579 then the complete-data posterior distribution is a grouped Dirichlet (GD) distri-bution with the formal definition given by (A2) that is

119891 (120579 | 119884obs 119884mis) = GD422

(120579minus4

| a b) (27)

where a = (1198861+1199111 1198862+1199112 1198863+1199113 1198864+1198994)⊤ b = (119898

0 1198981)⊤

and 1199111= 119899 minus (119911

2+ 1199113) minus 1198994 The conditional predictive dis-

tribution is

119891 (119884mis | 119884obs 120579) =3

prod

119894=2

Binomial(119911119894

10038161003816100381610038161003816100381610038161003816

119899119894

120579119894

1199011198941205791+ 120579119894

) (28)

Therefore the119872-step of the EM algorithm is to calculate thecomplete-data posterior mode

1205791= 1 minus 120579

2minus 1205793minus 1205794

1205792=

1198862minus 1 + 119911

2

119886+minus 4 + 119873

(1 +1198980

1198861+ 1198862minus 2 + 119899 minus 119911

3minus 1198994

)

1205793=

1198863minus 1 + 119911

3

119886+minus 4 + 119873

(1 +1198981

1198863+ 1198864minus 2 + 119911

3+ 1198994

)

1205794=

1198864minus 1 + 119899

4

119886+minus 4 + 119873

(1 +1198981

1198863+ 1198864minus 2 + 119911

3+ 1198994

)

(29)

where 119886+= sum4

119894=1119886119894 and the 119864-step is to replace 119911

1198943

119894=2by the

conditional expectations given by (8)In addition based on (27) and (28) the data augmen-

tation algorithm of Tanner and Wong [13] can be used togenerate posterior samples of 120579 A sampling method from(27) is given in the appendix

5 Simulation Studies

In this section two simulation studies are conducted to com-pare the efficiency of the proposed combination question-naire model with that of the hidden sensitivity model of Tianet al [8] (ie the main questionnaire model only) where(1199011 1199012 1199013) are assumed to be (13 13 13) and 119901

119894=Pr(119882 =

119894) for 119894 = 1 2 3 In the first simulated example let the totalsample size in the combination questionnaire model be thesame as the sample size in the hidden sensitivitymodel In thesecond example the sample size for themain questionnaire inthe combination questionnaire model is assumed to be equalto the sample size in the hidden sensitivity model

In the first simulated example let a total of119873 = 119899 + 119898 =

50 + 50 = 100 participants be interviewed by using the com-bination questionnaire model The true values of 120579

1198944

119894=1are

listed in the second column of Table 5 We first generate

(1198991 119899

4)⊤

sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)

(30)

and1198980sim Binomial (119898 120579

1+1205792) so that119898

1= 119898minus119898

0The119864119872

algorithm (6) and (8) is used to calculate the MLEs of 1205791198944

119894=1

We repeated this experiment 1000 times The average MLEsof 1205791198944

119894=1are reported in the third column of Table 5 The

corresponding bias variance and mean square error (MSE)are displayed in the fourth fifth and sixth columns of Table5 Next let119873 = 119899 + 119898 = 100 + 0 = 100 participants be inter-viewed by using the hidden sensitivity model (ie the mainquestionnaire only) The corresponding results are reportedin the last four columns of Table 5

From Table 5 we can see that both the MLEs of 1205791198944

119894=1

in the combination questionnaire model and the hiddensensitivity model are very close to their true values whilethe MSEs of 120579

1198944

119894=1in the combination questionnaire model

are slightly larger than those in the hidden sensitivity modelThese numerical results are not surprising since in the hiddensensitivity model Tian et al [8] implicitly assumed that allrespondents must strictly follow the design instructions Inother words the noncompliance behavior will not occurThe introduction of the supplemental questionnaire in thecombination questionnaire model can effectively reduce thenon-compliance behavior since more respondents will not befaced with the sensitive question while the cost for intro-ducing such a supplemental questionnaire is thatwe definitelylose a little of efficiency

In the second simulated example we assume that a total of119873 = 119899+119898 = 100+100 = 200 participants are interviewed byusing the combination questionnaire model while only119873 =

119899 + 119898 = 100 + 0 = 100 are interviewed by using the hiddensensitivity model We repeat this experiment 1000 times Thecorresponding results are reported in Table 6

From Table 6 we can see that the MSEs of 1205791198944

119894=1in the

combination questionnaire model are smaller than those inthe hidden sensitivity model In addition by comparing thefifth columns in Tables 5 and 6 we can see that the precisionsof 1205791198944

119894=1for the proposed combination questionnaire model

in the second simulated example are significantly improvedwhen compared with those in the first simulated example

6 Analyzing Cervical Cancer Data in Atlanta

Williamson and Haber [17] reported a study which examinedthe relationship between disease status of cervical cancer andthe number of sex partners and other risk factors Caseswere 20ndash79-year-old women of Fulton or DeKalb county inAtlanta Georgia They were diagnosed and were ascertainedto have invasive cervical cancer Controls were randomlychosen from the same counties and the same age rangesTable 7 gives the cross-classification of number of sex partners(ldquofew 0ndash3rdquo or ldquomany ⩾4rdquo denoted by 119883 = 0 or 119883 = 1) anddisease status (control or case denoted by 119884 = 0 or 119884 = 1)Generally a sizable proportion (135 in this example) of theresponses would be missing because of the sensitive questionabout the number of sex partners in a telephone interviewThe objective is to examine if association exists between thenumber of sex partners and disease status of cervical cancer

For the purpose of illustration we presume that 119883 = 0

is non-sensitive although the number 0ndash3 of sex partners

Journal of Probability and Statistics 7

Table 5 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901

1 1199012 1199013) = (13 13 13)

Parameter True valueThe combination questionnaire model The hidden sensitivity model

(119873 = 119899 + 119898 = 50 + 50) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE

1205791

010 01013 00013 00113 00113 01010 00010 00064 000641205792

050 04985 minus00015 00089 00089 04975 minus00025 00041 000411205793

020 02008 00008 00037 00037 02001 00001 00029 000291205794

020 01994 minus00006 00026 00026 02014 00014 00016 000161205791

020 01957 minus00043 00090 00090 01985 minus00015 00055 000551205792

030 02999 minus00001 00070 00070 02999 minus00001 00034 000341205793

040 04038 00038 00040 00040 04025 00025 00037 000371205794

010 01006 00006 00017 00017 00991 minus00009 00009 000091205791

030 03021 00021 00107 00107 03026 00026 00081 000811205792

025 02483 minus00017 00077 00077 02499 minus00001 00039 000391205793

015 01474 minus00026 00042 00042 01476 minus00024 00032 000321205794

030 03021 00021 00033 00033 02999 minus00001 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]

2

+ Var(120579119894)

Table 6 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901

1 1199012 1199013) = (13 13 13)

Parameter True valueThe combination questionnaire model The hidden sensitivity model

(119873 = 119899 + 119898 = 100 + 100) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE

1205791

010 01003 00003 00030 00030 01013 00013 00064 000641205792

050 05015 00015 00029 00029 05019 00019 00041 000411205793

020 01986 minus00014 00017 00017 01969 minus00031 00028 000281205794

020 01995 minus00005 00013 00013 01999 minus00001 00016 000161205791

020 02007 00007 00041 00041 02049 00049 00057 000571205792

030 02989 minus00011 00033 00033 02965 minus00035 00034 000341205793

040 04007 00007 00020 00020 03992 minus00008 00037 000371205794

010 00997 minus00003 00009 00009 00994 minus00006 00009 000091205791

030 03032 00032 00054 00054 02965 minus00036 00079 000791205792

025 02488 minus00012 00039 00039 02501 00001 00038 000381205793

015 01481 minus00019 00021 00021 01531 00031 00032 000321205794

030 03000 00000 00017 00017 03003 00003 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]

2

+ Var(120579119894)

Table 7 Cervical cancer data fromWilliamson and Haber [17]

Number of sex partners Disease status of cervical cancer119884 = 0 (control) 119884 = 1 (case)

119883 = 0 (few 0ndash3) 165 (1199031 1205791) 103 (119903

4 1205794)

119883 = 1 (many ⩾4) 164 (1199032 1205792) 221 (119903

3 1205793)

Missing 43 (11990312 1205791+ 1205792) 59 (119903

34 1205793+ 1205794)

Note The observed counts and the corresponding cell probabilities are inparentheses 119883 is a sensitive binary variate and 119884 is a non-sensitive binaryvariate

is somewhat sensitive for some respondents To illustratethe proposed design and approaches we let 119882 = 1 (2 3)

if a respondent was born in JanuaryndashApril (MayndashAugust

SeptemberndashDecember) It is then reasonable to assume that119901119896= Pr(119882 = 119896) asymp 13 for 119896 = 1 2 3 and119882 is independent

of the sensitive binary variate X and the the non-sensitivebinary variate 119884 For the ideal situation (ie no samplingerrors) the observed counts from the main questionnaire asshown in Tables 1 and 3 would be 119899

1= 11990313 = 55 119899

2=

55 + 1199032= 219 119899

3= 55 + 119903

3= 276 119899

4= 1199034= 103 that is

119884obs119872 = 119899 1198991 119899

4 = 653 55 219 276 103 (31)

On the other hand we can view themissing data in Table 7 asthe observed counts from the supplemental questionnaire asshown in Table 2 From Table 4 we have 119898

0= 11990312

= 43 and1198981= 11990334

= 59 that is 119884obs119878 = 1198981198980 1198981 = 102 43 59

Therefore we obtain the observed data 119884obs = 119884obs119872 119884obs119878

8 Journal of Probability and Statistics

Table 8 MLEs and 95 confidence intervals of 120579 and 120595

Parameter MLE std 95 asymptotic 95 bootstrapCI CI

1205791

023793 002937 [018036 029551] [018006 029879]1205792

024916 002261 [020484 029348] [020375 029359]1205793

035196 002247 [030790 039601] [030731 039666]1205794

016095 001434 [013284 018905] [013400 018908]120595 208830 043971 [122646 295011] [135361 318593]

61 Likelihood-Based Inferences Using 120579(0) = 144 as the

initial values the EM algorithm in (6) and (8) converged in29 iterations The resultant MLEs of 120579 and 120595 are given in thesecond column of Table 8 From (13) and (14) the asymptoticvariance-covariance matrix of the MLEs

minus4is

Iminus1obs (minus4)

= (

000086302 minus0000442447 minus0000384452

minus000044245 0000511253 minus0000010026

minus000038445 minus0000010026 0000505241

)

(32)

The estimated standard errors of 120579119894(119894 = 1 2 3) are square

roots of the main diagonal elements of the above matrixFrom (17) the estimated standard errors of 120579

4= 1minus120579

1minus1205792minus1205793

and = 12057911205793(12057921205794) are given by

se (1205794) = 1⊤

3Iminus1obs(minus4) 13

12

se () = 120572⊤Iminus1obs (minus4)120572

12

(33)

respectively where

120572 = (

1205793(1 minus 120579

2minus 1205793)

12057921205792

4

12057911205793(1205792minus 1205794)

(12057921205794)2

1205791(1 minus 120579

1minus 1205792)

12057921205792

4

)

(34)

These estimated standard errors are listed in the third columnof Table 8 From (15) and (16) we can obtain the 95asymptotic confidence intervals of 120579 and120595 which are showedin the fourth column of Table 8

Based on (18) and (19) we generate 119866 = 10000 bootstrapsamples The corresponding 95 bootstrap confidence inter-vals of 120579 and 120595 are displayed in the last column of Table 8

To test the null hypothesis 1198670 120595 = 1 against 119867

1 120595 = 1

we need to obtain the restricted MLE 119877 Using

(120579(0)

119909 120579(0)

119910)⊤

= (05 05)⊤ (35)

as the initial values the EM algorithm in (26) converged to

120579119909119877

= 064807 120579119910119877

= 052948 (36)

Table 9 Posterior modes and estimates of parameters for thecervical cancer data

Parameter Posterior Bayesian Bayesian 95 Bayesianmode mean std credible interval

1205791

023793 024025 002934 [018574 030050]1205792

024916 024789 002251 [020365 029192]1205793

035196 035029 002246 [030621 039432]1205794

016095 016155 001431 [013448 019042]120595 208830 214538 045838 [139399 318678]

in 19 iterations From (24) the restricted MLEs of 120579 areobtained as

119877= (016559 030493 034314 018634)

⊤ (37)

The log-likelihood ratio statistic Λ is equal to 12469 and the119875-value is 00004137 Since this 119875-value is far less than 005the 119867

0is rejected at the 005 level of significance Thus we

can conclude that there is an association between sex partnersand cervical cancer status based on the current data Thisconclusion is identical to that from the two 95 confidenceintervals of the odds ratio as shown in Table 8 where bothconfidence intervals exclude the value 1

62 Bayesian Inferences When Dirichlet (1 1 1 1) (ie theuniform distribution on T

4) is adopted as the prior dis-

tribution of 120579 the posterior modes of 120579are equal to thecorresponding MLEs Using 120579(0) = 1

44 as the initial values

we employ the data augmentation algorithm to generate1000000 posterior samples of 120579 and discard the first halfof the samples The Bayesian estimates of 120579 and 120595 are givenin Table 9 Since the lower bound of the Bayesian credibleinterval of the 120595 is larger than 1 we believe that there isan association between the number of sexual partners andcervical cancer status

The posterior densities of the 1205791198944

119894=1and 120595 estimated by a

kernel density smoother are plotted in Figures 1 and 2

7 Discussion

In this paper we develop a general framework of designand analysis for the combination questionnaire model whichconsists of a main questionnaire and a supplemental ques-tionnaire In fact the main questionnaire (see Table 1) is ageneralization of the nonrandomized triangular model [18]and is then a special case of the multicategory triangularmodel [19] The supplemental questionnaire (see Table 2) isa design of direct questioning The introduction of the sup-plemental questionnaire can effectively reduce the noncom-pliance behavior since more respondents will not be facedwith the sensitive question while the cost for introducingsuch a supplemental questionnaire is that we definitely losea little of efficiency The combination questionnaire modelcan be used to gather information to evaluate the associationbetween one sensitive binary variable and one non-sensitivebinary variable

We note that the proposed combination questionnairemodel has one limitation in applications that is it cannot be

Journal of Probability and Statistics 9

0 02 04 06 08 1

0

2

4

6

8

10

14

12

The p

oste

rior d

ensit

y

1205791

(a)

0 02 04 06 08 1

0

5

10

15

The p

oste

rior d

ensit

y

1205792

(b)

0 02 04 06 08 1

0

5

10

15

The p

oste

rior d

ensit

y

1205793

(c)

0 02 04 06 08 1

0

5

10

15

20

25

The p

oste

rior d

ensit

y

1205794

(d)

Figure 1 The posterior densities of the 1205791198944

119894=1estimated by a kernel density smoother based on the second 500000 posterior samples of

120579generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1) (a) The posterior density of 1205791 (b) The

posterior density of 1205792 (c) The posterior density of 120579

3 (d) The posterior density of 120579

4

1 2 3 4 5

0

02

04

06

08

The p

oste

rior d

ensit

y

120595

Figure 2 The posterior density of the odds ratio 120595 estimated by a kernel density smoother based on the second 500000 posterior samplesof 120579 generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1)

10 Journal of Probability and Statistics

applied to situationwhere two categories 119883 = 0 and 119883 = 1

are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches

Appendix

The Mode of a Group Dirichlet Density anda Sampling Method from a GDD

Let

V119899minus1

= (1199091 119909

119899minus1)⊤

119909119894⩾ 0

119894 = 1 119899 minus 1

119899minus1

sum

119894=1

119909119894⩽ 1

(A1)

denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X

1 X

119899)⊤

isin T119899is said to follow a

group Dirichlet distribution (GDD) with two partitions if thedensity of x

minus119899= (1198831 119883

119899minus1)⊤isin V119899minus1

is

GD1198992119904

(119909minus119899

| a b)

= 119888minus1

GD (

119899

prod

119894=1

119909119886119894minus1

119894)(

119904

sum

119894=1

119909i)

1198871

(

119899

sum

119894=119904+1

119909i)

1198872

(A2)

where 119909minus119899

= (1199091 119909

119899minus1)⊤ a = (119886

1 119886

119899)⊤ is a positive

parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter

vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by

119888GD = 119861 (1198861 119886

119904) 119861 (119886119904+1

119886119899)

times 119861(

119904

sum

119894=1

119886119894+ 1198871

119899

sum

119894=119904+1

119886119894+ 1198872)

(A3)

We write x sim GD1198992119904

(a b) on T119899or xminus119899

sim GD1198992119904

(a b) onV119899minus1

to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density

(A2) is given by [11 22]

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198871

sum119904

119895=1(119886119895minus 1)

119894 = 1 119904

(A4)

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198872

sum119899

119895=119904+1(119886119895minus 1)

119894 = 119904 + 1 119899

(A5)

The following procedure can be used to generate iidsamples from a GDD [11] Let

(1) y(1) sim Dirichlet (1198861 119886

119904) on T

119904

(2) y(2) sim Dirichlet (119886119904+1

119886119899) on T

119899minus119904

(3) 119877 sim Beta(sum119904119894=1

119886119894+ 1198871 sum119899

119894=119904+1119886119894+ 1198872) and

(4) y(1) y(2) and 119877 are mutually independent

Define

x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)

Then

x = (x(1)x(2)) sim GD

1198992119904(a b) (A7)

on T119899 where a = (119886

1 119886

119899)⊤ and b = (119887

1 1198872)⊤

Acknowledgments

The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region

References

[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965

[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980

[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984

[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986

[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005

[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005

[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005

Journal of Probability and Statistics 11

[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007

[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977

[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003

[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008

[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984

[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987

[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996

[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993

[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002

[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994

[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008

[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009

[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013

[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press

[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 2: A survey design for a sensitive binary variable correlated ... · Title A survey design for a sensitive binary variable correlated with another nonsensitive binary variable Author(s)

Hindawi Publishing CorporationJournal of Probability and StatisticsVolume 2013 Article ID 827048 11 pageshttpdxdoiorg1011552013827048

Research ArticleA Survey Design for a Sensitive Binary Variable Correlated withAnother Nonsensitive Binary Variable

Jun-Wu Yu1 Yang Lu2 and Guo-Liang Tian3

1 School of Mathematics and Computational Science Hunan University of Science and Technology Xiangtan Hunan 411201 China2 School of Mathematics and Statistics Central China Normal University Wuhan Hubei 430079 China3Department of Statistics and Actuarial Science The University of Hong Kong Pokfulam Road Hong Kong

Correspondence should be addressed to Guo-Liang Tian gltianhkuhk

Received 30 August 2012 Accepted 19 May 2013

Academic Editor Zhidong Bai

Copyright copy 2013 Jun-Wu Yu et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Tian et al (2007) introduced a so-called hidden sensitivity model for evaluating the association of two sensitive questions withbinary outcomes However in practice we sometimes need to assess the association between one sensitive binary variable (egwhether or not a drug user the number of sex partner being ⩽1 or gt1 and so on) and one nonsensitive binary variable (eg good orpoor health status with or without cervical cancer and so on) To address this issue by sufficiently utilizing the information con-tained in the non-sensitive binary variable in this paper we propose a new survey scheme called combination questionnairedesignmodel which consists of a main questionnaire and a supplemental questionnaire The introduction of the supplementalquestionnaire which is indeed a design of direct questioning can effectively reduce the noncompliance behavior since morerespondents will not be faced with the sensitive question Likelihood-based inferences including maximum likelihood estimatesvia the expectation-maximization algorithm asymptotic confidence intervals and bootstrap confidence intervals of parameters ofinterest are derived A likelihood ratio test is provided to test the association between the two binary random variables Bayesianinferences are also discussed Simulation studies are performed and a cervical cancer data set in Atlanta is used to illustrate theproposed methods

1 Introduction

Warner [1] introduced a randomized response technique toobtain truthful answers to questions with sensitive attributesUsing the Warner design Kraemer [2] derived a bivariatecorrelation between an attribute with polytomous responsesand an attribute with normally distributed responses Fox andTracy [3] derived estimation of the Pearson product momentcorrelation coefficient between two sensitive questions byassuming that randomized response observations can betreated as individual-level scores that are contaminated byrandom measurement error Edgell et al [4] consideredthe correlation between two sensitive questions using theunrelated question design or the additive constants designChristofides [5] presented a randomized response techniquewith two randomization devices to estimate the proportionof individuals having two sensitive characteristics at thesame time Kim and Warde [6] considered a multinomial

randomized response model which can handle untruthfulresponses They also derived the Pearson product momentcorrelation estimator which may be used to quantify thelinear relationship between two variables when multinomialresponse data are observed according to a randomizedresponse procedure However all these randomized responseprocedures make use of one or two randomizing deviceswhich (i) entail extra costs in both efficiency and complexity(ii) increase the cognitive load of randomized responsetechniques and (iii) allow for new sources of error such asmisunderstanding the randomized response procedures orcheating on the procedures [7]

From the perspective of incomplete categorical datadesign Tian et al [8] proposed a nonrandomized responsemodel (called the hidden sensitivity model) for assessing theassociation of two sensitive questions with binary outcomesTo protect respondentsrsquo privacy and to avoid the use of anyrandomization device they utilized a non-sensitive question

2 Journal of Probability and Statistics

in the questionnaire to indirectly obtain respondentsrsquo answersto the two sensitive questions In the hidden sensitivitymodel they implicitly assumed that all respondents arewilling to follow the design instructions In other words thenoncompliance behavior will not occur However in practicewe sometimes need to assess the association between onesensitive binary variable (eg whether or not a drug user thenumber of sex partner being ⩽1 or gt1 and so on) and onenonsensitive binary variable (eg good or poor health statuswith or without cervical cancer and so on) To our know-ledge the survey design for addressing this issue and cor-responding statistical analysis methods are not availableAlthough we could directly adopt the hidden sensitivitymodel the information contained in the nonsensitive binaryvariable cannot be utilized in the design Intuitively suchinformation can be used to enhance the degree of privacy pro-tection so that more respondents will not be faced with thesensitive questionThemajor objective of this paper is to pro-pose a new survey design and to develop corresponding sta-tistical methods for analyzing sensitive data collected by thistechnique

The rest of the paper is organized as follows In Section 2without using any randomizing device we propose a surveyscheme called combination questionnaire designmodelwhich consists of a main questionnaire and a supplementalquestionnaire Likelihood-based inferences including maxi-mum likelihood estimates via an expectationndashmaximization(EM) algorithm asymptotic confidence intervals and boot-strap confidence intervals of parameters of interest arederived in Section 3 A likelihood ratio test is also provided totest the association between the two binary random variablesIn Section 4we discuss Bayesian inferenceswhen prior infor-mation on parameters is available In Section 5 two simu-lation studies are performed to compare the efficiency ofthe proposed combination questionnaire model with that ofthe existing hidden sensitivity model of Tianet al [8] (iethe main questionnaire only) A cervical cancer data set inAtlanta is used in Section 6 to illustrate the proposed meth-ods A discussion and an appendix on the mode of a groupDirichlet density and a sampling method from it are alsopresented

2 The Survey Design

Assume that 119883 is a sensitive binary random variable 119884 isa non-sensitive binary random variable and they are corre-lated Let 119883 = 1 denote the sensitive class (eg 119883 = 1 if arespondent is a drug user) and 119883 = 0 denote the non-sen-sitive class (eg 119883 = 0 if a respondent is not a drug user)Furthermore let both 119884 = 1 (eg 119884 = 1 if a respondentreceives at least some college training) and 119884 = 0 (eg119884 = 0 if a respondent graduates at most from some highschool) be non-sensitive classes Define 120579 = (120579

1 120579

4)⊤

where 1205791= Pr(119883 = 0 119884 = 0) 120579

2= Pr(119883 = 1 119884 = 0)

1205793= Pr(119883 = 1 119884 = 1) and 120579

4= Pr(119883 = 0 119884 = 1) then

120579 isin T4 where

T119899= (119909

1 119909

119899)⊤

119909119894⩾ 0 119894 = 1 119899

119899

sum

119894=1

119909119894= 1 (1)

Table 1 The main questionnaire for the combination questionnairemodel

Category 119882 = 1 119882 = 2 119882 = 3

I 119883 = 0 119884 = 0 Block 1 mdash Block 2 mdash Block 3 mdashII 119883 = 1 119884 = 0 Category II Please put a tick in Block 2III 119883 = 1 119884 = 1 Category III Please put a tick in Block 3IV 119883 = 0 119884 = 1 Block 4 mdashNote Only 119883 = 1 is a sensitive class while 119883 = 0 119884 = 0 and 119884 = 1are nonsensitive classes

Table 2 The supplemental questionnaire for the combinationquestionnaire model

Category

119884 = 0Please put a tick in Block 5 mdash if you belong to

119884 = 0

119884 = 1Please put a tick in Block 6 mdash if you belong to

119884 = 1

Note Both the 119884 = 0 and 119884 = 1 are nonsensitive classes

denotes the 119899-dimensional closed simplex in R119899 The objec-tive is to make inferences on 120579 120579

119909= Pr (119883 = 1) = 120579

2+ 1205793

120579119910= Pr(119884 = 1) = 120579

4+ 1205793and the odds ratio 120595 = 120579

11205793(12057921205794)

The survey scheme consists of a main questionnaire anda supplemental questionnaire To design the main question-naire which is to be assigned to group 1 with 119899 respondents(119899 is specified by the investigators) we first introduce a non-sensitive question (say 119876

119882) with three possible answers

Assume that 119882 is a non-sensitive variate with trichotomousoutcomes associated with the 119876

119882and 119882 is independent of

both 119883 and 119884 Define 119901119894= Pr(119882 = 119894) for 119894 = 1 2 3 For

example let119882 = 1 (2 3) denote that a respondent was bornin JanuaryndashApril (MayndashAugust SeptemberndashDecember) andthus we could assume that 119901

119894asymp 13 The main questionnaire

is shown in Table 1 under which each respondent is asked toanswer the non-sensitive question 119876

119882

On the one hand since Category I (ie 119883 = 0 119884 = 0)and Category IV (ie 119883 = 0 119884 = 1) are non-sensitive toeach respondent it is reasonable to assume that a respondentis willing to provide hisher truthful answer by putting a tickin Block 119894 (119894 = 1 4) according to hisher true statusOn the other hand Category II (ie 119883 = 1 119884 = 0) andCategory III (ie 119883 = 1 119884 = 1) are usually sensitive torespondents In this case if a respondent belongs to CategoryII (III) heshe is designed to put a tick in Block 2 (3)

The supplemental questionnaire is designed as shown inTable 2 under which 119898 respondents (119898 is also specified bythe investigators) in group 2 are asked to put a tick in Block5 or Block 6 depending on their true status that is 119884 = 0

or 119884 = 1 Since both the 119884 = 0 and 119884 = 1 are non-sensitive classes the supplemental questionnaire is in fact adesign of direct questioningTherefore we call this design thecombination questionnaire model

Table 3 shows the cell probabilities 1205791198944

119894=1 the observed

frequencies 1198991198944

119894=1and the unobservable frequencies 119885

1198943

119894=1

for the main questionnaire The observed frequency 1198992is the

sumof the frequency of respondents belonging to Block 2 and

Journal of Probability and Statistics 3

Table 3 Cell probabilities observed and unobservable counts forthe main questionnaire

Category 119882 = 1 119882 = 2 119882 = 3 TotalI 119883 = 0 119884 = 0 119901

11205791

11990121205791

11990131205791

1205791(1198851)

II 119883 = 1 119884 = 0 1205792(1198852)

III 119883 = 1 119884 = 1 1205793(1198853)

IV 119883 = 0 119884 = 1 1205794(1198994) 120579

4(1198994)

Total 11990111205791(1198991) 11990121205791+ 1205792(1198992) 11990131205791+ 1205793(1198993) 1(119899)

Note 119899 = sum4119894=11198991198941198851 = 119899minus(1198852 +1198853)minus1198994 where (1198852 1198853) are unobservable

Table 4Cell probabilities andobserved counts for the supplementalquestionnaire

Category Total119884 = 0 Block 5 mdash 120579

1+ 1205792(1198980)

119884 = 1 Block 6 mdash 1205793+ 1205794(1198981)

Total 1 (119898)

Note119898 = 1198980 + 1198981

the frequency of those belonging toCategory IIThe observedfrequency 119899

3is the sum of the frequency of respondents

belonging to Block 3 and the frequency of those belongingto Category III Note that 119899 = sum

4

119894=1119899119894= sum3

119894=1119885119894+ 1198994 we have

1198851= 119899minus(119885

2+1198853)minus1198994Thus only119885

2and119885

3are unobservable

Table 4 shows the cell probabilities and observed counts forthe supplemental questionnaire

Remark 1 The design of the main questionnaire is similar tothat of the hidden sensitivitymodel of Tian et al [8] while thedesign of the supplemental questionnaire is indeed a designof direct questioning since both the 119884 = 0 and 119884 = 1 arenon-sensitive classes Table 1 shows that Categories II and IIIare two sensitive subclassesTherefore putting a tick in Block2 or Block 3 implies that the respondent could be suspectedwith the sensitive attribute Let 120579

1= sdot sdot sdot = 120579

4asymp 025

(see Table 3) then around half of the say 2119899 respondentswill be suspected with the sensitive attribute if only themain questionnaire is employed However besides the mainquestionnaire (with 119899 respondents) if the supplemental ques-tionnaire (see Table 2) with 119898 = 119899 respondents is also usedthen only half of the 119899 respondents will be suspected withthe sensitive attribute In other words in the proposed com-bination questionnaire model the information of the non-sensitive binary variable 119884 can be used to enhance the degreeof privacy protection so that more respondents will not befacedwith the sensitive questionThis is whywe introduce thesupplemental questionnaire besides the main questionnaire

Remark 2 In practice to simplify the design itself we suggestthat both the sample size 119899 in the main questionnaire andthe sample size 119898 in the supplemental questionnaire shouldbe fixed in advance rather than the total sample size 119873 =

119899 + 119898 being fixed In this way survey data can be collectedin two independent groups resulting in a relatively simplerstatistical analysis In addition the interviewees are randomlyassigned to either the group 1 or group 2

3 Likelihood-Based Inferences

In this section maximum likelihood estimates (MLEs) of the120579 and the odds ratio 120595 are derived by using the EM algo-rithm In addition asymptotic confidence intervals and thebootstrap confidence intervals of an arbitrary function of 120579are also provided Finally a likelihood ratio test is presentedfor testing the association between the two binary randomvariables

31 MLEs via the EM Algorithm A total of 119873 = 119899 + 119898

respondents are classified into two groups by a randomizationapproach such that 119899 respondents answer the questionsin the main questionnaire and 119898 respondents answer thequestions in the supplemental questionnaire Let 119884obsM =

119899 1198991 119899

4 denote the observed counts collected in the

main questionnaire (see Table 3) where sum4

119894=1119899119894= 119899 The

likelihood function of 120579 based on 119884obs119872 is

119871 (120579 | 119884obs119872) = (119899

1198991 1198992 1198993 1198994

)

times (11990111205791)1198991

3

prod

119894=2

(1199011198941205791+ 120579119894)119899119894

1205791198994

4

(2)

Let 119884obs119878 = 1198981198980 1198981 denote the observed counts gathered

in the supplemental questionnaire (see Table 4) where 1198980+

1198981= 119898 The likelihood function of 120579 based on 119884obs119878 is

119871 (120579 | 119884obs119878) = (119898

1198980

) (1205791+ 1205792)1198980

(1205793+ 1205794)1198981

(3)

Let 119884obs = 119884obs119872 119884obs119878 Since 119884obs119872 and 119884obs119878 are inde-pendent the observed-data likelihood function of 120579 isin T

4is

119871CQ (120579 | 119884obs119878) prop 1205791198991

1

3

prod

119894=2

(1199011198941205791+ 120579119894)119899119894

1205791198994

4

times (1205791+ 1205792)1198980

(1205793+ 1205794)1198981

(4)

where the subscript ldquoCQrdquodenotes the ldquocombination question-nairerdquo model

Since the corresponding cell probabilities to the observedcounts 119899

2and 1198993in the group 1 are in the form of summation

(ie 1199011198941205791+ 120579119894 119894 = 2 3) we cannot obtain the explicit

expressions of the MLEs of 120579 from the score equations of(4) By treating the observed counts 119899

2and 119899

3as incomplete

data we use the EM algorithm [9] to find the MLE of 120579The counts 119885

1198943

119894=1in Table 3 can be viewed as missing data

Briefly119885111988521198853 and 119899

4represent the counts of the respond-

ents belonging to Categories I II III and IV respectivelyThus we denote the latent data by 119884mis = 119911

2 1199113 and the

complete data by 119884com = 119884obs 119884mis Note that all 119901i areknown Consequently the complete-data likelihood functionfor 120579 is

119871CQ (120579 | 119884COM) prop (

3

prod

119894=1

120579119911119894

119894)1205791198994

4times (1205791+ 1205792)1198980

(1205793+ 1205794)1198981

(5)

where 1199111= 119899 minus (119911

2+ 1199113) minus 1198994

4 Journal of Probability and Statistics

By treating 1205791198944

119894=1as random variables we note that

the complete-data likelihood function (5) has the densityform of a grouped Dirichlet distribution [10] Ng et al [11]derived the mode of a grouped Dirichlet density with explicitexpressions (see the appendix) Hence from (A4) and (A5)the complete-data MLEs for 120579 are given by

1205791= 1 minus 120579

2minus 1205793minus 1205794

1205792=1199112

119873(1 +

1198980

119899 minus 1199113minus 1198994

)

1205793=1199113

119873(1 +

1198981

1199113+ 1198994

)

1205794=1198994

119873(1 +

1198981

1199113+ 1198994

)

(6)

Given 119884obs and 120579 119885119894 follows the binomial distribution withparameters 119899i and 120579

119894(1199011198941205791+ 120579119894) that is

119885119894| (119884obs 120579) sim Binomial(119899

119894

120579119894

1199011198941205791+ 120579119894

) 119894 = 2 3 (7)

Therefore the E-step of the EM algorithm computes thefollowing conditional expectations

119864 (119885119894| 119884obs 120579) =

119899119894120579119894

1199011198941205791+ 120579i

119894 = 2 3 (8)

and the M-step updates (6) by replacing 1199112and 119911

3with

previous conditional expectations

Remark 3 Based on the observed-data likelihood function(4) we could use the Newton-Raphson algorithm to findthe MLEs of 120579 However it is well known that the Newton-Raphson algorithm does not necessarily increase the loglikelihood leading even to divergence sometimes [12 page172] In addition the NewtonndashRaphson algorithm is sensitiveto the initial values One advantage of using theEM algorithmin the current situation is that both the E- and M-stephave closed-form expressions More importantly the EMalgorithm and the data augmentation algorithm of Tannerand Wong [13] share the same data augmentation structurein the Bayesian settings (see Section 4 for more details)

32 Asymptotic Confidence Intervals Let 120579minus4

= (1205791 1205792 1205793)⊤

The asymptotic variance-covariance matrix of theMLE minus4

isthen given by Iminus1obs(minus4) where

Iobs (120579minus4) = minus1205972ℓCQ (120579 | 119884obs)

120597120579minus4120597120579⊤

minus4

(9)

denotes the observed information matrix and ℓCQ(120579 |

119884obs) = log 119871CQ(120579 | 119884obs) is the observed-data log-likelihoodfunction From (4) we have

ℓCQ (120579 | 119884obs) = 1198991log 1205791+ 1198992log (119901

21205791+ 1205792)

+ 1198993log (119901

31205791+ 1205793)

+ 1198994log (1 minus 120579

1minus 1205792minus 1205793)

+ 1198980log (120579

1+ 1205792) + 1198981log (1 minus 120579

1minus 1205792)

(10)

It is easy to show that

120597ℓCQ (120579 | 119884obs)

1205971205791

=1198991

1205791

minus1198994

1205794

+

3

sum

119894=2

119899i119901i119901i1205791 + 120579i

+1198980

1205791+ 1205792

minus1198981

1 minus 1205791minus 1205792

120597ℓCQ (120579 | 119884obs)

1205971205792

= minus1198994

1205794

+1198992

11990121205791+ 1205792

+1198980

1205791+ 1205792

minus1198981

1 minus 1205791minus 1205792

120597ℓCQ (120579 | 119884obs)

1205971205793

= minus1198994

1205794

+1198993

11990131205791+ 1205793

minus1205972ℓCQ (120579 | 119884obs)

1205971205792

1

=1198991

1205792

1

+1198994

1205792

4

+

3

sum

119894=2

1198991198941199012

119894

(1199011198941205791+ 120579119894)2+ 120601

minus1205972ℓCQ (120579 | 119884obs)

1205971205792

2

=1198994

1205792

4

+1198992

(11990121205791+ 1205792)2+ 120601

minus1205972ℓCQ (120579 | 119884obs)

1205971205792

3

=1198994

1205792

4

+1198993

(11990131205791+ 1205793)2

minus1205972ℓCQ (120579 | 119884obs)

12059712057911205971205792

=1198994

1205792

4

+11989921199012

(11990121205791+ 1205792)2+ 120601

minus1205972ℓCQ (120579 | 119884obs)

12059712057911205971205793

=1198994

1205792

4

+11989931199013

(11990131205791+ 1205793)2

minus1205972ℓCQ (120579 | 119884obs)

12059712057921205971205793

=1198994

1205792

4

(11)

where

120601 =1198980

(1205791+ 1205792)2+

1198981

(1 minus 1205791minus 1205792)2 (12)

Hence the observed information matrix can be expressed as

Iobs (120579minus4) = diag(1198991

1205792

1

0 0) +1198994

1205792

4

times 131⊤3+ A (13)

Journal of Probability and Statistics 5

whereA =

((((

(

3

sum

119894=2

1198991198941199012

119894

(1199011198941205791+ 120579119894)2+120601

11989921199012

(11990121205791+ 1205792)2+120601

11989931199013

(11990131205791+1205793)2

11989921199012

(11990121205791+ 1205792)2+120601

1198992

(11990121205791+ 1205792)2+120601 0

11989931199013

(11990131205791+ 1205793)2 0

1198993

(11990131205791+1205793)2

))))

)

(14)

Let se(120579119894) denote the standard error of 120579

119894for 119894 = 1 2 3

Note that se(120579119894) can be estimated by the square root of the 119894th

diagonal element of Iminus1obs(minus4) We denote the estimated valueof se(120579

119894) by se(120579

119894)Thus a 95normal-based asymptotic con-

fidence interval for 120579i can be constructed as

[120579119894minus 196 times se (120579

119894) 120579119894+ 196 times se (120579

119894)] 119894 = 1 2 3 (15)

Let 120599 = ℎ(120579minus4) be an arbitrary differentiable function of

120579minus4 For example 120579

4= 1 minus sum

3

119894=1120579i and the odds ratio 120595 =

12057911205793(12057921205794)The deltamethod (eg [14 page 34]) can be used

to approximate the standard error of 120599 = ℎ(minus4) and a 95

normal-based asymptotic confidence interval for 120599 is given by

[120599 minus 196 times se (120599) 120599 + 196 times se (120599)] (16)

where

se (120599) = (120597120599

120597120579minus4

)

Iminus1obs (120579minus4) (120597120599

120597120579minus4

)

10038161003816100381610038161003816100381610038161003816120579minus4= minus4

12

(17)

33 Bootstrap Confidence Intervals When the normal-basedasymptotic confidence interval like (15) is beyond the lowbound zero or the upper bound one the bootstrap approach[15] can be used to construct the bootstrap confidence inter-val of 120599 = ℎ(120579

minus4) Based on the obtained MLE we inde-

pendently generate

(119899lowast

1 119899

lowast

4)⊤

sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)

(18)

119898lowast

0sim Binomial (119898 120579

1+ 1205792) (19)

Having obtained 119884lowast

obs119872 = 119899 119899lowast

1 119899

lowast

4 and 119884

lowast

obs119878 =

119898119898lowast

0 119898lowast

1 where 119898

lowast

1= 119898 minus 119898

lowast

0 we can calculate the

bootstrap replication 120599lowast= ℎ(120579

lowast

minus4) based on 119884

lowast

obs = 119884lowast

obs119872

119884lowast

obs119878 via theEM algorithm specified by (6) and (8) Indepen-dently repeating this process 119866 times we obtain 119866 bootstrapreplications 120599lowast

119892119866

119892=1 Consequently a (1 minus 120572)100 bootstrap

confidence interval for 120599 is given by

[120599119871 120599119880] (20)

where 120599119871and 120599119880are the 100(1205722) and 100(1minus1205722) percentiles

of 120599lowast119892119866

119892=1 respectively

34 The Likelihood Ratio Test for Testing Association Thelikelihood ratio statistic can be used to test whether thetwo binary random variables 119883 and 119884 are independentcor-related The corresponding null and alternative hypothesesare [16 page 45]

1198670 120595 = 1 against 119867

1 120595 = 1 (21)

The likelihood ratio statistic is defined by

Λ = minus2 ℓCQ (119877| 119884obs) minus ℓCQ ( | 119884obs) (22)

where 119877denotes the restrictedMLE of 120579 under119867

0 denotes

the MLE of 120579 which can be obtained by the EM algorithmspecified by (6) and (8) and ℓCQ(120579 | 119884obs) = log 119871CQ(120579 | 119884obs)

To find the restricted MLE 119877 we also employ the 119864119872

algorithm Under1198670 12057911205793= 12057921205794 we have

1205791= (1 minus 120579

119909) (1 minus 120579

119910)

1205792= 120579119909(1 minus 120579

119910)

1205793= 120579119909120579119910

1205794= (1 minus 120579

119909) 120579119910

(23)

In otherwords under1198670we only have two free parameters 120579

119909

and 120579119910 Having obtained the restrictedMLEs 120579

119909119877and 120579119910119877

wecan compute the restricted MLE

119877= (1205791119877

1205794119877

)⊤ from

(23) by

1205791119877

= (1 minus 120579119909119877

) (1 minus 120579119910119877

)

1205792119877

= 120579119909119877

(1 minus 120579119910119877

)

1205793119877

= 120579119909119877

120579119910119877

1205794119877

= (1 minus 120579119909119877

) 120579119910119877

(24)

In what follows we consider the computation of therestricted MLEs 120579

119909119877and 120579

119910119877 Now the complete-data like-

lihood function (5) becomes

119871CQ (120579119909 120579119910| 119884com 1198670)

prop (1 minus 120579119909) (1 minus 120579

119910)1199111

120579119909(1 minus 120579

119910)1199112

times (120579119909120579119910)1199113

(1 minus 120579119909) 1205791199101198994

(1 minus 120579119910)1198980

1205791198981

119910

= 1205791199112+1199113

119909(1 minus 120579

119909)1199111+1198994

1205791199113+1198994+1198981

119910(1 minus 120579

119910)1199111+1199112+1198980

(25)

so that the restricted MLEs of 120579119909and 120579

119910based on the

complete-data are given by

120579119909=1199112+ 1199113

119899 120579

119910=1199113+ 1198994+ 1198981

119873 (26)

respectivelyThus the119872-step of the119864119872 algorithm calculates(26) and the 119864-step computes the conditional expectationsgiven in (8) where 120579

119894 are defined in (23) Finally under119867

0

Λ asymptotically follows chi-squared distribution with onedegree of freedom

6 Journal of Probability and Statistics

4 Bayesian Inferences

To derive the posterior mode of 120579 we employ the EM algo-rithm again The latent data 119884mis = 119911

2 1199113 are the same as

those in Section 31 Based on the complete-data likelihoodfunction (5) if theDirichlet distributionDirichlet (119886

1 119886

4)

is adopted as the prior distribution of 120579 then the complete-data posterior distribution is a grouped Dirichlet (GD) distri-bution with the formal definition given by (A2) that is

119891 (120579 | 119884obs 119884mis) = GD422

(120579minus4

| a b) (27)

where a = (1198861+1199111 1198862+1199112 1198863+1199113 1198864+1198994)⊤ b = (119898

0 1198981)⊤

and 1199111= 119899 minus (119911

2+ 1199113) minus 1198994 The conditional predictive dis-

tribution is

119891 (119884mis | 119884obs 120579) =3

prod

119894=2

Binomial(119911119894

10038161003816100381610038161003816100381610038161003816

119899119894

120579119894

1199011198941205791+ 120579119894

) (28)

Therefore the119872-step of the EM algorithm is to calculate thecomplete-data posterior mode

1205791= 1 minus 120579

2minus 1205793minus 1205794

1205792=

1198862minus 1 + 119911

2

119886+minus 4 + 119873

(1 +1198980

1198861+ 1198862minus 2 + 119899 minus 119911

3minus 1198994

)

1205793=

1198863minus 1 + 119911

3

119886+minus 4 + 119873

(1 +1198981

1198863+ 1198864minus 2 + 119911

3+ 1198994

)

1205794=

1198864minus 1 + 119899

4

119886+minus 4 + 119873

(1 +1198981

1198863+ 1198864minus 2 + 119911

3+ 1198994

)

(29)

where 119886+= sum4

119894=1119886119894 and the 119864-step is to replace 119911

1198943

119894=2by the

conditional expectations given by (8)In addition based on (27) and (28) the data augmen-

tation algorithm of Tanner and Wong [13] can be used togenerate posterior samples of 120579 A sampling method from(27) is given in the appendix

5 Simulation Studies

In this section two simulation studies are conducted to com-pare the efficiency of the proposed combination question-naire model with that of the hidden sensitivity model of Tianet al [8] (ie the main questionnaire model only) where(1199011 1199012 1199013) are assumed to be (13 13 13) and 119901

119894=Pr(119882 =

119894) for 119894 = 1 2 3 In the first simulated example let the totalsample size in the combination questionnaire model be thesame as the sample size in the hidden sensitivitymodel In thesecond example the sample size for themain questionnaire inthe combination questionnaire model is assumed to be equalto the sample size in the hidden sensitivity model

In the first simulated example let a total of119873 = 119899 + 119898 =

50 + 50 = 100 participants be interviewed by using the com-bination questionnaire model The true values of 120579

1198944

119894=1are

listed in the second column of Table 5 We first generate

(1198991 119899

4)⊤

sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)

(30)

and1198980sim Binomial (119898 120579

1+1205792) so that119898

1= 119898minus119898

0The119864119872

algorithm (6) and (8) is used to calculate the MLEs of 1205791198944

119894=1

We repeated this experiment 1000 times The average MLEsof 1205791198944

119894=1are reported in the third column of Table 5 The

corresponding bias variance and mean square error (MSE)are displayed in the fourth fifth and sixth columns of Table5 Next let119873 = 119899 + 119898 = 100 + 0 = 100 participants be inter-viewed by using the hidden sensitivity model (ie the mainquestionnaire only) The corresponding results are reportedin the last four columns of Table 5

From Table 5 we can see that both the MLEs of 1205791198944

119894=1

in the combination questionnaire model and the hiddensensitivity model are very close to their true values whilethe MSEs of 120579

1198944

119894=1in the combination questionnaire model

are slightly larger than those in the hidden sensitivity modelThese numerical results are not surprising since in the hiddensensitivity model Tian et al [8] implicitly assumed that allrespondents must strictly follow the design instructions Inother words the noncompliance behavior will not occurThe introduction of the supplemental questionnaire in thecombination questionnaire model can effectively reduce thenon-compliance behavior since more respondents will not befaced with the sensitive question while the cost for intro-ducing such a supplemental questionnaire is thatwe definitelylose a little of efficiency

In the second simulated example we assume that a total of119873 = 119899+119898 = 100+100 = 200 participants are interviewed byusing the combination questionnaire model while only119873 =

119899 + 119898 = 100 + 0 = 100 are interviewed by using the hiddensensitivity model We repeat this experiment 1000 times Thecorresponding results are reported in Table 6

From Table 6 we can see that the MSEs of 1205791198944

119894=1in the

combination questionnaire model are smaller than those inthe hidden sensitivity model In addition by comparing thefifth columns in Tables 5 and 6 we can see that the precisionsof 1205791198944

119894=1for the proposed combination questionnaire model

in the second simulated example are significantly improvedwhen compared with those in the first simulated example

6 Analyzing Cervical Cancer Data in Atlanta

Williamson and Haber [17] reported a study which examinedthe relationship between disease status of cervical cancer andthe number of sex partners and other risk factors Caseswere 20ndash79-year-old women of Fulton or DeKalb county inAtlanta Georgia They were diagnosed and were ascertainedto have invasive cervical cancer Controls were randomlychosen from the same counties and the same age rangesTable 7 gives the cross-classification of number of sex partners(ldquofew 0ndash3rdquo or ldquomany ⩾4rdquo denoted by 119883 = 0 or 119883 = 1) anddisease status (control or case denoted by 119884 = 0 or 119884 = 1)Generally a sizable proportion (135 in this example) of theresponses would be missing because of the sensitive questionabout the number of sex partners in a telephone interviewThe objective is to examine if association exists between thenumber of sex partners and disease status of cervical cancer

For the purpose of illustration we presume that 119883 = 0

is non-sensitive although the number 0ndash3 of sex partners

Journal of Probability and Statistics 7

Table 5 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901

1 1199012 1199013) = (13 13 13)

Parameter True valueThe combination questionnaire model The hidden sensitivity model

(119873 = 119899 + 119898 = 50 + 50) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE

1205791

010 01013 00013 00113 00113 01010 00010 00064 000641205792

050 04985 minus00015 00089 00089 04975 minus00025 00041 000411205793

020 02008 00008 00037 00037 02001 00001 00029 000291205794

020 01994 minus00006 00026 00026 02014 00014 00016 000161205791

020 01957 minus00043 00090 00090 01985 minus00015 00055 000551205792

030 02999 minus00001 00070 00070 02999 minus00001 00034 000341205793

040 04038 00038 00040 00040 04025 00025 00037 000371205794

010 01006 00006 00017 00017 00991 minus00009 00009 000091205791

030 03021 00021 00107 00107 03026 00026 00081 000811205792

025 02483 minus00017 00077 00077 02499 minus00001 00039 000391205793

015 01474 minus00026 00042 00042 01476 minus00024 00032 000321205794

030 03021 00021 00033 00033 02999 minus00001 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]

2

+ Var(120579119894)

Table 6 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901

1 1199012 1199013) = (13 13 13)

Parameter True valueThe combination questionnaire model The hidden sensitivity model

(119873 = 119899 + 119898 = 100 + 100) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE

1205791

010 01003 00003 00030 00030 01013 00013 00064 000641205792

050 05015 00015 00029 00029 05019 00019 00041 000411205793

020 01986 minus00014 00017 00017 01969 minus00031 00028 000281205794

020 01995 minus00005 00013 00013 01999 minus00001 00016 000161205791

020 02007 00007 00041 00041 02049 00049 00057 000571205792

030 02989 minus00011 00033 00033 02965 minus00035 00034 000341205793

040 04007 00007 00020 00020 03992 minus00008 00037 000371205794

010 00997 minus00003 00009 00009 00994 minus00006 00009 000091205791

030 03032 00032 00054 00054 02965 minus00036 00079 000791205792

025 02488 minus00012 00039 00039 02501 00001 00038 000381205793

015 01481 minus00019 00021 00021 01531 00031 00032 000321205794

030 03000 00000 00017 00017 03003 00003 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]

2

+ Var(120579119894)

Table 7 Cervical cancer data fromWilliamson and Haber [17]

Number of sex partners Disease status of cervical cancer119884 = 0 (control) 119884 = 1 (case)

119883 = 0 (few 0ndash3) 165 (1199031 1205791) 103 (119903

4 1205794)

119883 = 1 (many ⩾4) 164 (1199032 1205792) 221 (119903

3 1205793)

Missing 43 (11990312 1205791+ 1205792) 59 (119903

34 1205793+ 1205794)

Note The observed counts and the corresponding cell probabilities are inparentheses 119883 is a sensitive binary variate and 119884 is a non-sensitive binaryvariate

is somewhat sensitive for some respondents To illustratethe proposed design and approaches we let 119882 = 1 (2 3)

if a respondent was born in JanuaryndashApril (MayndashAugust

SeptemberndashDecember) It is then reasonable to assume that119901119896= Pr(119882 = 119896) asymp 13 for 119896 = 1 2 3 and119882 is independent

of the sensitive binary variate X and the the non-sensitivebinary variate 119884 For the ideal situation (ie no samplingerrors) the observed counts from the main questionnaire asshown in Tables 1 and 3 would be 119899

1= 11990313 = 55 119899

2=

55 + 1199032= 219 119899

3= 55 + 119903

3= 276 119899

4= 1199034= 103 that is

119884obs119872 = 119899 1198991 119899

4 = 653 55 219 276 103 (31)

On the other hand we can view themissing data in Table 7 asthe observed counts from the supplemental questionnaire asshown in Table 2 From Table 4 we have 119898

0= 11990312

= 43 and1198981= 11990334

= 59 that is 119884obs119878 = 1198981198980 1198981 = 102 43 59

Therefore we obtain the observed data 119884obs = 119884obs119872 119884obs119878

8 Journal of Probability and Statistics

Table 8 MLEs and 95 confidence intervals of 120579 and 120595

Parameter MLE std 95 asymptotic 95 bootstrapCI CI

1205791

023793 002937 [018036 029551] [018006 029879]1205792

024916 002261 [020484 029348] [020375 029359]1205793

035196 002247 [030790 039601] [030731 039666]1205794

016095 001434 [013284 018905] [013400 018908]120595 208830 043971 [122646 295011] [135361 318593]

61 Likelihood-Based Inferences Using 120579(0) = 144 as the

initial values the EM algorithm in (6) and (8) converged in29 iterations The resultant MLEs of 120579 and 120595 are given in thesecond column of Table 8 From (13) and (14) the asymptoticvariance-covariance matrix of the MLEs

minus4is

Iminus1obs (minus4)

= (

000086302 minus0000442447 minus0000384452

minus000044245 0000511253 minus0000010026

minus000038445 minus0000010026 0000505241

)

(32)

The estimated standard errors of 120579119894(119894 = 1 2 3) are square

roots of the main diagonal elements of the above matrixFrom (17) the estimated standard errors of 120579

4= 1minus120579

1minus1205792minus1205793

and = 12057911205793(12057921205794) are given by

se (1205794) = 1⊤

3Iminus1obs(minus4) 13

12

se () = 120572⊤Iminus1obs (minus4)120572

12

(33)

respectively where

120572 = (

1205793(1 minus 120579

2minus 1205793)

12057921205792

4

12057911205793(1205792minus 1205794)

(12057921205794)2

1205791(1 minus 120579

1minus 1205792)

12057921205792

4

)

(34)

These estimated standard errors are listed in the third columnof Table 8 From (15) and (16) we can obtain the 95asymptotic confidence intervals of 120579 and120595 which are showedin the fourth column of Table 8

Based on (18) and (19) we generate 119866 = 10000 bootstrapsamples The corresponding 95 bootstrap confidence inter-vals of 120579 and 120595 are displayed in the last column of Table 8

To test the null hypothesis 1198670 120595 = 1 against 119867

1 120595 = 1

we need to obtain the restricted MLE 119877 Using

(120579(0)

119909 120579(0)

119910)⊤

= (05 05)⊤ (35)

as the initial values the EM algorithm in (26) converged to

120579119909119877

= 064807 120579119910119877

= 052948 (36)

Table 9 Posterior modes and estimates of parameters for thecervical cancer data

Parameter Posterior Bayesian Bayesian 95 Bayesianmode mean std credible interval

1205791

023793 024025 002934 [018574 030050]1205792

024916 024789 002251 [020365 029192]1205793

035196 035029 002246 [030621 039432]1205794

016095 016155 001431 [013448 019042]120595 208830 214538 045838 [139399 318678]

in 19 iterations From (24) the restricted MLEs of 120579 areobtained as

119877= (016559 030493 034314 018634)

⊤ (37)

The log-likelihood ratio statistic Λ is equal to 12469 and the119875-value is 00004137 Since this 119875-value is far less than 005the 119867

0is rejected at the 005 level of significance Thus we

can conclude that there is an association between sex partnersand cervical cancer status based on the current data Thisconclusion is identical to that from the two 95 confidenceintervals of the odds ratio as shown in Table 8 where bothconfidence intervals exclude the value 1

62 Bayesian Inferences When Dirichlet (1 1 1 1) (ie theuniform distribution on T

4) is adopted as the prior dis-

tribution of 120579 the posterior modes of 120579are equal to thecorresponding MLEs Using 120579(0) = 1

44 as the initial values

we employ the data augmentation algorithm to generate1000000 posterior samples of 120579 and discard the first halfof the samples The Bayesian estimates of 120579 and 120595 are givenin Table 9 Since the lower bound of the Bayesian credibleinterval of the 120595 is larger than 1 we believe that there isan association between the number of sexual partners andcervical cancer status

The posterior densities of the 1205791198944

119894=1and 120595 estimated by a

kernel density smoother are plotted in Figures 1 and 2

7 Discussion

In this paper we develop a general framework of designand analysis for the combination questionnaire model whichconsists of a main questionnaire and a supplemental ques-tionnaire In fact the main questionnaire (see Table 1) is ageneralization of the nonrandomized triangular model [18]and is then a special case of the multicategory triangularmodel [19] The supplemental questionnaire (see Table 2) isa design of direct questioning The introduction of the sup-plemental questionnaire can effectively reduce the noncom-pliance behavior since more respondents will not be facedwith the sensitive question while the cost for introducingsuch a supplemental questionnaire is that we definitely losea little of efficiency The combination questionnaire modelcan be used to gather information to evaluate the associationbetween one sensitive binary variable and one non-sensitivebinary variable

We note that the proposed combination questionnairemodel has one limitation in applications that is it cannot be

Journal of Probability and Statistics 9

0 02 04 06 08 1

0

2

4

6

8

10

14

12

The p

oste

rior d

ensit

y

1205791

(a)

0 02 04 06 08 1

0

5

10

15

The p

oste

rior d

ensit

y

1205792

(b)

0 02 04 06 08 1

0

5

10

15

The p

oste

rior d

ensit

y

1205793

(c)

0 02 04 06 08 1

0

5

10

15

20

25

The p

oste

rior d

ensit

y

1205794

(d)

Figure 1 The posterior densities of the 1205791198944

119894=1estimated by a kernel density smoother based on the second 500000 posterior samples of

120579generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1) (a) The posterior density of 1205791 (b) The

posterior density of 1205792 (c) The posterior density of 120579

3 (d) The posterior density of 120579

4

1 2 3 4 5

0

02

04

06

08

The p

oste

rior d

ensit

y

120595

Figure 2 The posterior density of the odds ratio 120595 estimated by a kernel density smoother based on the second 500000 posterior samplesof 120579 generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1)

10 Journal of Probability and Statistics

applied to situationwhere two categories 119883 = 0 and 119883 = 1

are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches

Appendix

The Mode of a Group Dirichlet Density anda Sampling Method from a GDD

Let

V119899minus1

= (1199091 119909

119899minus1)⊤

119909119894⩾ 0

119894 = 1 119899 minus 1

119899minus1

sum

119894=1

119909119894⩽ 1

(A1)

denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X

1 X

119899)⊤

isin T119899is said to follow a

group Dirichlet distribution (GDD) with two partitions if thedensity of x

minus119899= (1198831 119883

119899minus1)⊤isin V119899minus1

is

GD1198992119904

(119909minus119899

| a b)

= 119888minus1

GD (

119899

prod

119894=1

119909119886119894minus1

119894)(

119904

sum

119894=1

119909i)

1198871

(

119899

sum

119894=119904+1

119909i)

1198872

(A2)

where 119909minus119899

= (1199091 119909

119899minus1)⊤ a = (119886

1 119886

119899)⊤ is a positive

parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter

vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by

119888GD = 119861 (1198861 119886

119904) 119861 (119886119904+1

119886119899)

times 119861(

119904

sum

119894=1

119886119894+ 1198871

119899

sum

119894=119904+1

119886119894+ 1198872)

(A3)

We write x sim GD1198992119904

(a b) on T119899or xminus119899

sim GD1198992119904

(a b) onV119899minus1

to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density

(A2) is given by [11 22]

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198871

sum119904

119895=1(119886119895minus 1)

119894 = 1 119904

(A4)

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198872

sum119899

119895=119904+1(119886119895minus 1)

119894 = 119904 + 1 119899

(A5)

The following procedure can be used to generate iidsamples from a GDD [11] Let

(1) y(1) sim Dirichlet (1198861 119886

119904) on T

119904

(2) y(2) sim Dirichlet (119886119904+1

119886119899) on T

119899minus119904

(3) 119877 sim Beta(sum119904119894=1

119886119894+ 1198871 sum119899

119894=119904+1119886119894+ 1198872) and

(4) y(1) y(2) and 119877 are mutually independent

Define

x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)

Then

x = (x(1)x(2)) sim GD

1198992119904(a b) (A7)

on T119899 where a = (119886

1 119886

119899)⊤ and b = (119887

1 1198872)⊤

Acknowledgments

The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region

References

[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965

[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980

[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984

[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986

[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005

[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005

[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005

Journal of Probability and Statistics 11

[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007

[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977

[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003

[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008

[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984

[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987

[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996

[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993

[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002

[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994

[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008

[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009

[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013

[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press

[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 3: A survey design for a sensitive binary variable correlated ... · Title A survey design for a sensitive binary variable correlated with another nonsensitive binary variable Author(s)

2 Journal of Probability and Statistics

in the questionnaire to indirectly obtain respondentsrsquo answersto the two sensitive questions In the hidden sensitivitymodel they implicitly assumed that all respondents arewilling to follow the design instructions In other words thenoncompliance behavior will not occur However in practicewe sometimes need to assess the association between onesensitive binary variable (eg whether or not a drug user thenumber of sex partner being ⩽1 or gt1 and so on) and onenonsensitive binary variable (eg good or poor health statuswith or without cervical cancer and so on) To our know-ledge the survey design for addressing this issue and cor-responding statistical analysis methods are not availableAlthough we could directly adopt the hidden sensitivitymodel the information contained in the nonsensitive binaryvariable cannot be utilized in the design Intuitively suchinformation can be used to enhance the degree of privacy pro-tection so that more respondents will not be faced with thesensitive questionThemajor objective of this paper is to pro-pose a new survey design and to develop corresponding sta-tistical methods for analyzing sensitive data collected by thistechnique

The rest of the paper is organized as follows In Section 2without using any randomizing device we propose a surveyscheme called combination questionnaire designmodelwhich consists of a main questionnaire and a supplementalquestionnaire Likelihood-based inferences including maxi-mum likelihood estimates via an expectationndashmaximization(EM) algorithm asymptotic confidence intervals and boot-strap confidence intervals of parameters of interest arederived in Section 3 A likelihood ratio test is also provided totest the association between the two binary random variablesIn Section 4we discuss Bayesian inferenceswhen prior infor-mation on parameters is available In Section 5 two simu-lation studies are performed to compare the efficiency ofthe proposed combination questionnaire model with that ofthe existing hidden sensitivity model of Tianet al [8] (iethe main questionnaire only) A cervical cancer data set inAtlanta is used in Section 6 to illustrate the proposed meth-ods A discussion and an appendix on the mode of a groupDirichlet density and a sampling method from it are alsopresented

2 The Survey Design

Assume that 119883 is a sensitive binary random variable 119884 isa non-sensitive binary random variable and they are corre-lated Let 119883 = 1 denote the sensitive class (eg 119883 = 1 if arespondent is a drug user) and 119883 = 0 denote the non-sen-sitive class (eg 119883 = 0 if a respondent is not a drug user)Furthermore let both 119884 = 1 (eg 119884 = 1 if a respondentreceives at least some college training) and 119884 = 0 (eg119884 = 0 if a respondent graduates at most from some highschool) be non-sensitive classes Define 120579 = (120579

1 120579

4)⊤

where 1205791= Pr(119883 = 0 119884 = 0) 120579

2= Pr(119883 = 1 119884 = 0)

1205793= Pr(119883 = 1 119884 = 1) and 120579

4= Pr(119883 = 0 119884 = 1) then

120579 isin T4 where

T119899= (119909

1 119909

119899)⊤

119909119894⩾ 0 119894 = 1 119899

119899

sum

119894=1

119909119894= 1 (1)

Table 1 The main questionnaire for the combination questionnairemodel

Category 119882 = 1 119882 = 2 119882 = 3

I 119883 = 0 119884 = 0 Block 1 mdash Block 2 mdash Block 3 mdashII 119883 = 1 119884 = 0 Category II Please put a tick in Block 2III 119883 = 1 119884 = 1 Category III Please put a tick in Block 3IV 119883 = 0 119884 = 1 Block 4 mdashNote Only 119883 = 1 is a sensitive class while 119883 = 0 119884 = 0 and 119884 = 1are nonsensitive classes

Table 2 The supplemental questionnaire for the combinationquestionnaire model

Category

119884 = 0Please put a tick in Block 5 mdash if you belong to

119884 = 0

119884 = 1Please put a tick in Block 6 mdash if you belong to

119884 = 1

Note Both the 119884 = 0 and 119884 = 1 are nonsensitive classes

denotes the 119899-dimensional closed simplex in R119899 The objec-tive is to make inferences on 120579 120579

119909= Pr (119883 = 1) = 120579

2+ 1205793

120579119910= Pr(119884 = 1) = 120579

4+ 1205793and the odds ratio 120595 = 120579

11205793(12057921205794)

The survey scheme consists of a main questionnaire anda supplemental questionnaire To design the main question-naire which is to be assigned to group 1 with 119899 respondents(119899 is specified by the investigators) we first introduce a non-sensitive question (say 119876

119882) with three possible answers

Assume that 119882 is a non-sensitive variate with trichotomousoutcomes associated with the 119876

119882and 119882 is independent of

both 119883 and 119884 Define 119901119894= Pr(119882 = 119894) for 119894 = 1 2 3 For

example let119882 = 1 (2 3) denote that a respondent was bornin JanuaryndashApril (MayndashAugust SeptemberndashDecember) andthus we could assume that 119901

119894asymp 13 The main questionnaire

is shown in Table 1 under which each respondent is asked toanswer the non-sensitive question 119876

119882

On the one hand since Category I (ie 119883 = 0 119884 = 0)and Category IV (ie 119883 = 0 119884 = 1) are non-sensitive toeach respondent it is reasonable to assume that a respondentis willing to provide hisher truthful answer by putting a tickin Block 119894 (119894 = 1 4) according to hisher true statusOn the other hand Category II (ie 119883 = 1 119884 = 0) andCategory III (ie 119883 = 1 119884 = 1) are usually sensitive torespondents In this case if a respondent belongs to CategoryII (III) heshe is designed to put a tick in Block 2 (3)

The supplemental questionnaire is designed as shown inTable 2 under which 119898 respondents (119898 is also specified bythe investigators) in group 2 are asked to put a tick in Block5 or Block 6 depending on their true status that is 119884 = 0

or 119884 = 1 Since both the 119884 = 0 and 119884 = 1 are non-sensitive classes the supplemental questionnaire is in fact adesign of direct questioningTherefore we call this design thecombination questionnaire model

Table 3 shows the cell probabilities 1205791198944

119894=1 the observed

frequencies 1198991198944

119894=1and the unobservable frequencies 119885

1198943

119894=1

for the main questionnaire The observed frequency 1198992is the

sumof the frequency of respondents belonging to Block 2 and

Journal of Probability and Statistics 3

Table 3 Cell probabilities observed and unobservable counts forthe main questionnaire

Category 119882 = 1 119882 = 2 119882 = 3 TotalI 119883 = 0 119884 = 0 119901

11205791

11990121205791

11990131205791

1205791(1198851)

II 119883 = 1 119884 = 0 1205792(1198852)

III 119883 = 1 119884 = 1 1205793(1198853)

IV 119883 = 0 119884 = 1 1205794(1198994) 120579

4(1198994)

Total 11990111205791(1198991) 11990121205791+ 1205792(1198992) 11990131205791+ 1205793(1198993) 1(119899)

Note 119899 = sum4119894=11198991198941198851 = 119899minus(1198852 +1198853)minus1198994 where (1198852 1198853) are unobservable

Table 4Cell probabilities andobserved counts for the supplementalquestionnaire

Category Total119884 = 0 Block 5 mdash 120579

1+ 1205792(1198980)

119884 = 1 Block 6 mdash 1205793+ 1205794(1198981)

Total 1 (119898)

Note119898 = 1198980 + 1198981

the frequency of those belonging toCategory IIThe observedfrequency 119899

3is the sum of the frequency of respondents

belonging to Block 3 and the frequency of those belongingto Category III Note that 119899 = sum

4

119894=1119899119894= sum3

119894=1119885119894+ 1198994 we have

1198851= 119899minus(119885

2+1198853)minus1198994Thus only119885

2and119885

3are unobservable

Table 4 shows the cell probabilities and observed counts forthe supplemental questionnaire

Remark 1 The design of the main questionnaire is similar tothat of the hidden sensitivitymodel of Tian et al [8] while thedesign of the supplemental questionnaire is indeed a designof direct questioning since both the 119884 = 0 and 119884 = 1 arenon-sensitive classes Table 1 shows that Categories II and IIIare two sensitive subclassesTherefore putting a tick in Block2 or Block 3 implies that the respondent could be suspectedwith the sensitive attribute Let 120579

1= sdot sdot sdot = 120579

4asymp 025

(see Table 3) then around half of the say 2119899 respondentswill be suspected with the sensitive attribute if only themain questionnaire is employed However besides the mainquestionnaire (with 119899 respondents) if the supplemental ques-tionnaire (see Table 2) with 119898 = 119899 respondents is also usedthen only half of the 119899 respondents will be suspected withthe sensitive attribute In other words in the proposed com-bination questionnaire model the information of the non-sensitive binary variable 119884 can be used to enhance the degreeof privacy protection so that more respondents will not befacedwith the sensitive questionThis is whywe introduce thesupplemental questionnaire besides the main questionnaire

Remark 2 In practice to simplify the design itself we suggestthat both the sample size 119899 in the main questionnaire andthe sample size 119898 in the supplemental questionnaire shouldbe fixed in advance rather than the total sample size 119873 =

119899 + 119898 being fixed In this way survey data can be collectedin two independent groups resulting in a relatively simplerstatistical analysis In addition the interviewees are randomlyassigned to either the group 1 or group 2

3 Likelihood-Based Inferences

In this section maximum likelihood estimates (MLEs) of the120579 and the odds ratio 120595 are derived by using the EM algo-rithm In addition asymptotic confidence intervals and thebootstrap confidence intervals of an arbitrary function of 120579are also provided Finally a likelihood ratio test is presentedfor testing the association between the two binary randomvariables

31 MLEs via the EM Algorithm A total of 119873 = 119899 + 119898

respondents are classified into two groups by a randomizationapproach such that 119899 respondents answer the questionsin the main questionnaire and 119898 respondents answer thequestions in the supplemental questionnaire Let 119884obsM =

119899 1198991 119899

4 denote the observed counts collected in the

main questionnaire (see Table 3) where sum4

119894=1119899119894= 119899 The

likelihood function of 120579 based on 119884obs119872 is

119871 (120579 | 119884obs119872) = (119899

1198991 1198992 1198993 1198994

)

times (11990111205791)1198991

3

prod

119894=2

(1199011198941205791+ 120579119894)119899119894

1205791198994

4

(2)

Let 119884obs119878 = 1198981198980 1198981 denote the observed counts gathered

in the supplemental questionnaire (see Table 4) where 1198980+

1198981= 119898 The likelihood function of 120579 based on 119884obs119878 is

119871 (120579 | 119884obs119878) = (119898

1198980

) (1205791+ 1205792)1198980

(1205793+ 1205794)1198981

(3)

Let 119884obs = 119884obs119872 119884obs119878 Since 119884obs119872 and 119884obs119878 are inde-pendent the observed-data likelihood function of 120579 isin T

4is

119871CQ (120579 | 119884obs119878) prop 1205791198991

1

3

prod

119894=2

(1199011198941205791+ 120579119894)119899119894

1205791198994

4

times (1205791+ 1205792)1198980

(1205793+ 1205794)1198981

(4)

where the subscript ldquoCQrdquodenotes the ldquocombination question-nairerdquo model

Since the corresponding cell probabilities to the observedcounts 119899

2and 1198993in the group 1 are in the form of summation

(ie 1199011198941205791+ 120579119894 119894 = 2 3) we cannot obtain the explicit

expressions of the MLEs of 120579 from the score equations of(4) By treating the observed counts 119899

2and 119899

3as incomplete

data we use the EM algorithm [9] to find the MLE of 120579The counts 119885

1198943

119894=1in Table 3 can be viewed as missing data

Briefly119885111988521198853 and 119899

4represent the counts of the respond-

ents belonging to Categories I II III and IV respectivelyThus we denote the latent data by 119884mis = 119911

2 1199113 and the

complete data by 119884com = 119884obs 119884mis Note that all 119901i areknown Consequently the complete-data likelihood functionfor 120579 is

119871CQ (120579 | 119884COM) prop (

3

prod

119894=1

120579119911119894

119894)1205791198994

4times (1205791+ 1205792)1198980

(1205793+ 1205794)1198981

(5)

where 1199111= 119899 minus (119911

2+ 1199113) minus 1198994

4 Journal of Probability and Statistics

By treating 1205791198944

119894=1as random variables we note that

the complete-data likelihood function (5) has the densityform of a grouped Dirichlet distribution [10] Ng et al [11]derived the mode of a grouped Dirichlet density with explicitexpressions (see the appendix) Hence from (A4) and (A5)the complete-data MLEs for 120579 are given by

1205791= 1 minus 120579

2minus 1205793minus 1205794

1205792=1199112

119873(1 +

1198980

119899 minus 1199113minus 1198994

)

1205793=1199113

119873(1 +

1198981

1199113+ 1198994

)

1205794=1198994

119873(1 +

1198981

1199113+ 1198994

)

(6)

Given 119884obs and 120579 119885119894 follows the binomial distribution withparameters 119899i and 120579

119894(1199011198941205791+ 120579119894) that is

119885119894| (119884obs 120579) sim Binomial(119899

119894

120579119894

1199011198941205791+ 120579119894

) 119894 = 2 3 (7)

Therefore the E-step of the EM algorithm computes thefollowing conditional expectations

119864 (119885119894| 119884obs 120579) =

119899119894120579119894

1199011198941205791+ 120579i

119894 = 2 3 (8)

and the M-step updates (6) by replacing 1199112and 119911

3with

previous conditional expectations

Remark 3 Based on the observed-data likelihood function(4) we could use the Newton-Raphson algorithm to findthe MLEs of 120579 However it is well known that the Newton-Raphson algorithm does not necessarily increase the loglikelihood leading even to divergence sometimes [12 page172] In addition the NewtonndashRaphson algorithm is sensitiveto the initial values One advantage of using theEM algorithmin the current situation is that both the E- and M-stephave closed-form expressions More importantly the EMalgorithm and the data augmentation algorithm of Tannerand Wong [13] share the same data augmentation structurein the Bayesian settings (see Section 4 for more details)

32 Asymptotic Confidence Intervals Let 120579minus4

= (1205791 1205792 1205793)⊤

The asymptotic variance-covariance matrix of theMLE minus4

isthen given by Iminus1obs(minus4) where

Iobs (120579minus4) = minus1205972ℓCQ (120579 | 119884obs)

120597120579minus4120597120579⊤

minus4

(9)

denotes the observed information matrix and ℓCQ(120579 |

119884obs) = log 119871CQ(120579 | 119884obs) is the observed-data log-likelihoodfunction From (4) we have

ℓCQ (120579 | 119884obs) = 1198991log 1205791+ 1198992log (119901

21205791+ 1205792)

+ 1198993log (119901

31205791+ 1205793)

+ 1198994log (1 minus 120579

1minus 1205792minus 1205793)

+ 1198980log (120579

1+ 1205792) + 1198981log (1 minus 120579

1minus 1205792)

(10)

It is easy to show that

120597ℓCQ (120579 | 119884obs)

1205971205791

=1198991

1205791

minus1198994

1205794

+

3

sum

119894=2

119899i119901i119901i1205791 + 120579i

+1198980

1205791+ 1205792

minus1198981

1 minus 1205791minus 1205792

120597ℓCQ (120579 | 119884obs)

1205971205792

= minus1198994

1205794

+1198992

11990121205791+ 1205792

+1198980

1205791+ 1205792

minus1198981

1 minus 1205791minus 1205792

120597ℓCQ (120579 | 119884obs)

1205971205793

= minus1198994

1205794

+1198993

11990131205791+ 1205793

minus1205972ℓCQ (120579 | 119884obs)

1205971205792

1

=1198991

1205792

1

+1198994

1205792

4

+

3

sum

119894=2

1198991198941199012

119894

(1199011198941205791+ 120579119894)2+ 120601

minus1205972ℓCQ (120579 | 119884obs)

1205971205792

2

=1198994

1205792

4

+1198992

(11990121205791+ 1205792)2+ 120601

minus1205972ℓCQ (120579 | 119884obs)

1205971205792

3

=1198994

1205792

4

+1198993

(11990131205791+ 1205793)2

minus1205972ℓCQ (120579 | 119884obs)

12059712057911205971205792

=1198994

1205792

4

+11989921199012

(11990121205791+ 1205792)2+ 120601

minus1205972ℓCQ (120579 | 119884obs)

12059712057911205971205793

=1198994

1205792

4

+11989931199013

(11990131205791+ 1205793)2

minus1205972ℓCQ (120579 | 119884obs)

12059712057921205971205793

=1198994

1205792

4

(11)

where

120601 =1198980

(1205791+ 1205792)2+

1198981

(1 minus 1205791minus 1205792)2 (12)

Hence the observed information matrix can be expressed as

Iobs (120579minus4) = diag(1198991

1205792

1

0 0) +1198994

1205792

4

times 131⊤3+ A (13)

Journal of Probability and Statistics 5

whereA =

((((

(

3

sum

119894=2

1198991198941199012

119894

(1199011198941205791+ 120579119894)2+120601

11989921199012

(11990121205791+ 1205792)2+120601

11989931199013

(11990131205791+1205793)2

11989921199012

(11990121205791+ 1205792)2+120601

1198992

(11990121205791+ 1205792)2+120601 0

11989931199013

(11990131205791+ 1205793)2 0

1198993

(11990131205791+1205793)2

))))

)

(14)

Let se(120579119894) denote the standard error of 120579

119894for 119894 = 1 2 3

Note that se(120579119894) can be estimated by the square root of the 119894th

diagonal element of Iminus1obs(minus4) We denote the estimated valueof se(120579

119894) by se(120579

119894)Thus a 95normal-based asymptotic con-

fidence interval for 120579i can be constructed as

[120579119894minus 196 times se (120579

119894) 120579119894+ 196 times se (120579

119894)] 119894 = 1 2 3 (15)

Let 120599 = ℎ(120579minus4) be an arbitrary differentiable function of

120579minus4 For example 120579

4= 1 minus sum

3

119894=1120579i and the odds ratio 120595 =

12057911205793(12057921205794)The deltamethod (eg [14 page 34]) can be used

to approximate the standard error of 120599 = ℎ(minus4) and a 95

normal-based asymptotic confidence interval for 120599 is given by

[120599 minus 196 times se (120599) 120599 + 196 times se (120599)] (16)

where

se (120599) = (120597120599

120597120579minus4

)

Iminus1obs (120579minus4) (120597120599

120597120579minus4

)

10038161003816100381610038161003816100381610038161003816120579minus4= minus4

12

(17)

33 Bootstrap Confidence Intervals When the normal-basedasymptotic confidence interval like (15) is beyond the lowbound zero or the upper bound one the bootstrap approach[15] can be used to construct the bootstrap confidence inter-val of 120599 = ℎ(120579

minus4) Based on the obtained MLE we inde-

pendently generate

(119899lowast

1 119899

lowast

4)⊤

sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)

(18)

119898lowast

0sim Binomial (119898 120579

1+ 1205792) (19)

Having obtained 119884lowast

obs119872 = 119899 119899lowast

1 119899

lowast

4 and 119884

lowast

obs119878 =

119898119898lowast

0 119898lowast

1 where 119898

lowast

1= 119898 minus 119898

lowast

0 we can calculate the

bootstrap replication 120599lowast= ℎ(120579

lowast

minus4) based on 119884

lowast

obs = 119884lowast

obs119872

119884lowast

obs119878 via theEM algorithm specified by (6) and (8) Indepen-dently repeating this process 119866 times we obtain 119866 bootstrapreplications 120599lowast

119892119866

119892=1 Consequently a (1 minus 120572)100 bootstrap

confidence interval for 120599 is given by

[120599119871 120599119880] (20)

where 120599119871and 120599119880are the 100(1205722) and 100(1minus1205722) percentiles

of 120599lowast119892119866

119892=1 respectively

34 The Likelihood Ratio Test for Testing Association Thelikelihood ratio statistic can be used to test whether thetwo binary random variables 119883 and 119884 are independentcor-related The corresponding null and alternative hypothesesare [16 page 45]

1198670 120595 = 1 against 119867

1 120595 = 1 (21)

The likelihood ratio statistic is defined by

Λ = minus2 ℓCQ (119877| 119884obs) minus ℓCQ ( | 119884obs) (22)

where 119877denotes the restrictedMLE of 120579 under119867

0 denotes

the MLE of 120579 which can be obtained by the EM algorithmspecified by (6) and (8) and ℓCQ(120579 | 119884obs) = log 119871CQ(120579 | 119884obs)

To find the restricted MLE 119877 we also employ the 119864119872

algorithm Under1198670 12057911205793= 12057921205794 we have

1205791= (1 minus 120579

119909) (1 minus 120579

119910)

1205792= 120579119909(1 minus 120579

119910)

1205793= 120579119909120579119910

1205794= (1 minus 120579

119909) 120579119910

(23)

In otherwords under1198670we only have two free parameters 120579

119909

and 120579119910 Having obtained the restrictedMLEs 120579

119909119877and 120579119910119877

wecan compute the restricted MLE

119877= (1205791119877

1205794119877

)⊤ from

(23) by

1205791119877

= (1 minus 120579119909119877

) (1 minus 120579119910119877

)

1205792119877

= 120579119909119877

(1 minus 120579119910119877

)

1205793119877

= 120579119909119877

120579119910119877

1205794119877

= (1 minus 120579119909119877

) 120579119910119877

(24)

In what follows we consider the computation of therestricted MLEs 120579

119909119877and 120579

119910119877 Now the complete-data like-

lihood function (5) becomes

119871CQ (120579119909 120579119910| 119884com 1198670)

prop (1 minus 120579119909) (1 minus 120579

119910)1199111

120579119909(1 minus 120579

119910)1199112

times (120579119909120579119910)1199113

(1 minus 120579119909) 1205791199101198994

(1 minus 120579119910)1198980

1205791198981

119910

= 1205791199112+1199113

119909(1 minus 120579

119909)1199111+1198994

1205791199113+1198994+1198981

119910(1 minus 120579

119910)1199111+1199112+1198980

(25)

so that the restricted MLEs of 120579119909and 120579

119910based on the

complete-data are given by

120579119909=1199112+ 1199113

119899 120579

119910=1199113+ 1198994+ 1198981

119873 (26)

respectivelyThus the119872-step of the119864119872 algorithm calculates(26) and the 119864-step computes the conditional expectationsgiven in (8) where 120579

119894 are defined in (23) Finally under119867

0

Λ asymptotically follows chi-squared distribution with onedegree of freedom

6 Journal of Probability and Statistics

4 Bayesian Inferences

To derive the posterior mode of 120579 we employ the EM algo-rithm again The latent data 119884mis = 119911

2 1199113 are the same as

those in Section 31 Based on the complete-data likelihoodfunction (5) if theDirichlet distributionDirichlet (119886

1 119886

4)

is adopted as the prior distribution of 120579 then the complete-data posterior distribution is a grouped Dirichlet (GD) distri-bution with the formal definition given by (A2) that is

119891 (120579 | 119884obs 119884mis) = GD422

(120579minus4

| a b) (27)

where a = (1198861+1199111 1198862+1199112 1198863+1199113 1198864+1198994)⊤ b = (119898

0 1198981)⊤

and 1199111= 119899 minus (119911

2+ 1199113) minus 1198994 The conditional predictive dis-

tribution is

119891 (119884mis | 119884obs 120579) =3

prod

119894=2

Binomial(119911119894

10038161003816100381610038161003816100381610038161003816

119899119894

120579119894

1199011198941205791+ 120579119894

) (28)

Therefore the119872-step of the EM algorithm is to calculate thecomplete-data posterior mode

1205791= 1 minus 120579

2minus 1205793minus 1205794

1205792=

1198862minus 1 + 119911

2

119886+minus 4 + 119873

(1 +1198980

1198861+ 1198862minus 2 + 119899 minus 119911

3minus 1198994

)

1205793=

1198863minus 1 + 119911

3

119886+minus 4 + 119873

(1 +1198981

1198863+ 1198864minus 2 + 119911

3+ 1198994

)

1205794=

1198864minus 1 + 119899

4

119886+minus 4 + 119873

(1 +1198981

1198863+ 1198864minus 2 + 119911

3+ 1198994

)

(29)

where 119886+= sum4

119894=1119886119894 and the 119864-step is to replace 119911

1198943

119894=2by the

conditional expectations given by (8)In addition based on (27) and (28) the data augmen-

tation algorithm of Tanner and Wong [13] can be used togenerate posterior samples of 120579 A sampling method from(27) is given in the appendix

5 Simulation Studies

In this section two simulation studies are conducted to com-pare the efficiency of the proposed combination question-naire model with that of the hidden sensitivity model of Tianet al [8] (ie the main questionnaire model only) where(1199011 1199012 1199013) are assumed to be (13 13 13) and 119901

119894=Pr(119882 =

119894) for 119894 = 1 2 3 In the first simulated example let the totalsample size in the combination questionnaire model be thesame as the sample size in the hidden sensitivitymodel In thesecond example the sample size for themain questionnaire inthe combination questionnaire model is assumed to be equalto the sample size in the hidden sensitivity model

In the first simulated example let a total of119873 = 119899 + 119898 =

50 + 50 = 100 participants be interviewed by using the com-bination questionnaire model The true values of 120579

1198944

119894=1are

listed in the second column of Table 5 We first generate

(1198991 119899

4)⊤

sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)

(30)

and1198980sim Binomial (119898 120579

1+1205792) so that119898

1= 119898minus119898

0The119864119872

algorithm (6) and (8) is used to calculate the MLEs of 1205791198944

119894=1

We repeated this experiment 1000 times The average MLEsof 1205791198944

119894=1are reported in the third column of Table 5 The

corresponding bias variance and mean square error (MSE)are displayed in the fourth fifth and sixth columns of Table5 Next let119873 = 119899 + 119898 = 100 + 0 = 100 participants be inter-viewed by using the hidden sensitivity model (ie the mainquestionnaire only) The corresponding results are reportedin the last four columns of Table 5

From Table 5 we can see that both the MLEs of 1205791198944

119894=1

in the combination questionnaire model and the hiddensensitivity model are very close to their true values whilethe MSEs of 120579

1198944

119894=1in the combination questionnaire model

are slightly larger than those in the hidden sensitivity modelThese numerical results are not surprising since in the hiddensensitivity model Tian et al [8] implicitly assumed that allrespondents must strictly follow the design instructions Inother words the noncompliance behavior will not occurThe introduction of the supplemental questionnaire in thecombination questionnaire model can effectively reduce thenon-compliance behavior since more respondents will not befaced with the sensitive question while the cost for intro-ducing such a supplemental questionnaire is thatwe definitelylose a little of efficiency

In the second simulated example we assume that a total of119873 = 119899+119898 = 100+100 = 200 participants are interviewed byusing the combination questionnaire model while only119873 =

119899 + 119898 = 100 + 0 = 100 are interviewed by using the hiddensensitivity model We repeat this experiment 1000 times Thecorresponding results are reported in Table 6

From Table 6 we can see that the MSEs of 1205791198944

119894=1in the

combination questionnaire model are smaller than those inthe hidden sensitivity model In addition by comparing thefifth columns in Tables 5 and 6 we can see that the precisionsof 1205791198944

119894=1for the proposed combination questionnaire model

in the second simulated example are significantly improvedwhen compared with those in the first simulated example

6 Analyzing Cervical Cancer Data in Atlanta

Williamson and Haber [17] reported a study which examinedthe relationship between disease status of cervical cancer andthe number of sex partners and other risk factors Caseswere 20ndash79-year-old women of Fulton or DeKalb county inAtlanta Georgia They were diagnosed and were ascertainedto have invasive cervical cancer Controls were randomlychosen from the same counties and the same age rangesTable 7 gives the cross-classification of number of sex partners(ldquofew 0ndash3rdquo or ldquomany ⩾4rdquo denoted by 119883 = 0 or 119883 = 1) anddisease status (control or case denoted by 119884 = 0 or 119884 = 1)Generally a sizable proportion (135 in this example) of theresponses would be missing because of the sensitive questionabout the number of sex partners in a telephone interviewThe objective is to examine if association exists between thenumber of sex partners and disease status of cervical cancer

For the purpose of illustration we presume that 119883 = 0

is non-sensitive although the number 0ndash3 of sex partners

Journal of Probability and Statistics 7

Table 5 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901

1 1199012 1199013) = (13 13 13)

Parameter True valueThe combination questionnaire model The hidden sensitivity model

(119873 = 119899 + 119898 = 50 + 50) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE

1205791

010 01013 00013 00113 00113 01010 00010 00064 000641205792

050 04985 minus00015 00089 00089 04975 minus00025 00041 000411205793

020 02008 00008 00037 00037 02001 00001 00029 000291205794

020 01994 minus00006 00026 00026 02014 00014 00016 000161205791

020 01957 minus00043 00090 00090 01985 minus00015 00055 000551205792

030 02999 minus00001 00070 00070 02999 minus00001 00034 000341205793

040 04038 00038 00040 00040 04025 00025 00037 000371205794

010 01006 00006 00017 00017 00991 minus00009 00009 000091205791

030 03021 00021 00107 00107 03026 00026 00081 000811205792

025 02483 minus00017 00077 00077 02499 minus00001 00039 000391205793

015 01474 minus00026 00042 00042 01476 minus00024 00032 000321205794

030 03021 00021 00033 00033 02999 minus00001 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]

2

+ Var(120579119894)

Table 6 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901

1 1199012 1199013) = (13 13 13)

Parameter True valueThe combination questionnaire model The hidden sensitivity model

(119873 = 119899 + 119898 = 100 + 100) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE

1205791

010 01003 00003 00030 00030 01013 00013 00064 000641205792

050 05015 00015 00029 00029 05019 00019 00041 000411205793

020 01986 minus00014 00017 00017 01969 minus00031 00028 000281205794

020 01995 minus00005 00013 00013 01999 minus00001 00016 000161205791

020 02007 00007 00041 00041 02049 00049 00057 000571205792

030 02989 minus00011 00033 00033 02965 minus00035 00034 000341205793

040 04007 00007 00020 00020 03992 minus00008 00037 000371205794

010 00997 minus00003 00009 00009 00994 minus00006 00009 000091205791

030 03032 00032 00054 00054 02965 minus00036 00079 000791205792

025 02488 minus00012 00039 00039 02501 00001 00038 000381205793

015 01481 minus00019 00021 00021 01531 00031 00032 000321205794

030 03000 00000 00017 00017 03003 00003 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]

2

+ Var(120579119894)

Table 7 Cervical cancer data fromWilliamson and Haber [17]

Number of sex partners Disease status of cervical cancer119884 = 0 (control) 119884 = 1 (case)

119883 = 0 (few 0ndash3) 165 (1199031 1205791) 103 (119903

4 1205794)

119883 = 1 (many ⩾4) 164 (1199032 1205792) 221 (119903

3 1205793)

Missing 43 (11990312 1205791+ 1205792) 59 (119903

34 1205793+ 1205794)

Note The observed counts and the corresponding cell probabilities are inparentheses 119883 is a sensitive binary variate and 119884 is a non-sensitive binaryvariate

is somewhat sensitive for some respondents To illustratethe proposed design and approaches we let 119882 = 1 (2 3)

if a respondent was born in JanuaryndashApril (MayndashAugust

SeptemberndashDecember) It is then reasonable to assume that119901119896= Pr(119882 = 119896) asymp 13 for 119896 = 1 2 3 and119882 is independent

of the sensitive binary variate X and the the non-sensitivebinary variate 119884 For the ideal situation (ie no samplingerrors) the observed counts from the main questionnaire asshown in Tables 1 and 3 would be 119899

1= 11990313 = 55 119899

2=

55 + 1199032= 219 119899

3= 55 + 119903

3= 276 119899

4= 1199034= 103 that is

119884obs119872 = 119899 1198991 119899

4 = 653 55 219 276 103 (31)

On the other hand we can view themissing data in Table 7 asthe observed counts from the supplemental questionnaire asshown in Table 2 From Table 4 we have 119898

0= 11990312

= 43 and1198981= 11990334

= 59 that is 119884obs119878 = 1198981198980 1198981 = 102 43 59

Therefore we obtain the observed data 119884obs = 119884obs119872 119884obs119878

8 Journal of Probability and Statistics

Table 8 MLEs and 95 confidence intervals of 120579 and 120595

Parameter MLE std 95 asymptotic 95 bootstrapCI CI

1205791

023793 002937 [018036 029551] [018006 029879]1205792

024916 002261 [020484 029348] [020375 029359]1205793

035196 002247 [030790 039601] [030731 039666]1205794

016095 001434 [013284 018905] [013400 018908]120595 208830 043971 [122646 295011] [135361 318593]

61 Likelihood-Based Inferences Using 120579(0) = 144 as the

initial values the EM algorithm in (6) and (8) converged in29 iterations The resultant MLEs of 120579 and 120595 are given in thesecond column of Table 8 From (13) and (14) the asymptoticvariance-covariance matrix of the MLEs

minus4is

Iminus1obs (minus4)

= (

000086302 minus0000442447 minus0000384452

minus000044245 0000511253 minus0000010026

minus000038445 minus0000010026 0000505241

)

(32)

The estimated standard errors of 120579119894(119894 = 1 2 3) are square

roots of the main diagonal elements of the above matrixFrom (17) the estimated standard errors of 120579

4= 1minus120579

1minus1205792minus1205793

and = 12057911205793(12057921205794) are given by

se (1205794) = 1⊤

3Iminus1obs(minus4) 13

12

se () = 120572⊤Iminus1obs (minus4)120572

12

(33)

respectively where

120572 = (

1205793(1 minus 120579

2minus 1205793)

12057921205792

4

12057911205793(1205792minus 1205794)

(12057921205794)2

1205791(1 minus 120579

1minus 1205792)

12057921205792

4

)

(34)

These estimated standard errors are listed in the third columnof Table 8 From (15) and (16) we can obtain the 95asymptotic confidence intervals of 120579 and120595 which are showedin the fourth column of Table 8

Based on (18) and (19) we generate 119866 = 10000 bootstrapsamples The corresponding 95 bootstrap confidence inter-vals of 120579 and 120595 are displayed in the last column of Table 8

To test the null hypothesis 1198670 120595 = 1 against 119867

1 120595 = 1

we need to obtain the restricted MLE 119877 Using

(120579(0)

119909 120579(0)

119910)⊤

= (05 05)⊤ (35)

as the initial values the EM algorithm in (26) converged to

120579119909119877

= 064807 120579119910119877

= 052948 (36)

Table 9 Posterior modes and estimates of parameters for thecervical cancer data

Parameter Posterior Bayesian Bayesian 95 Bayesianmode mean std credible interval

1205791

023793 024025 002934 [018574 030050]1205792

024916 024789 002251 [020365 029192]1205793

035196 035029 002246 [030621 039432]1205794

016095 016155 001431 [013448 019042]120595 208830 214538 045838 [139399 318678]

in 19 iterations From (24) the restricted MLEs of 120579 areobtained as

119877= (016559 030493 034314 018634)

⊤ (37)

The log-likelihood ratio statistic Λ is equal to 12469 and the119875-value is 00004137 Since this 119875-value is far less than 005the 119867

0is rejected at the 005 level of significance Thus we

can conclude that there is an association between sex partnersand cervical cancer status based on the current data Thisconclusion is identical to that from the two 95 confidenceintervals of the odds ratio as shown in Table 8 where bothconfidence intervals exclude the value 1

62 Bayesian Inferences When Dirichlet (1 1 1 1) (ie theuniform distribution on T

4) is adopted as the prior dis-

tribution of 120579 the posterior modes of 120579are equal to thecorresponding MLEs Using 120579(0) = 1

44 as the initial values

we employ the data augmentation algorithm to generate1000000 posterior samples of 120579 and discard the first halfof the samples The Bayesian estimates of 120579 and 120595 are givenin Table 9 Since the lower bound of the Bayesian credibleinterval of the 120595 is larger than 1 we believe that there isan association between the number of sexual partners andcervical cancer status

The posterior densities of the 1205791198944

119894=1and 120595 estimated by a

kernel density smoother are plotted in Figures 1 and 2

7 Discussion

In this paper we develop a general framework of designand analysis for the combination questionnaire model whichconsists of a main questionnaire and a supplemental ques-tionnaire In fact the main questionnaire (see Table 1) is ageneralization of the nonrandomized triangular model [18]and is then a special case of the multicategory triangularmodel [19] The supplemental questionnaire (see Table 2) isa design of direct questioning The introduction of the sup-plemental questionnaire can effectively reduce the noncom-pliance behavior since more respondents will not be facedwith the sensitive question while the cost for introducingsuch a supplemental questionnaire is that we definitely losea little of efficiency The combination questionnaire modelcan be used to gather information to evaluate the associationbetween one sensitive binary variable and one non-sensitivebinary variable

We note that the proposed combination questionnairemodel has one limitation in applications that is it cannot be

Journal of Probability and Statistics 9

0 02 04 06 08 1

0

2

4

6

8

10

14

12

The p

oste

rior d

ensit

y

1205791

(a)

0 02 04 06 08 1

0

5

10

15

The p

oste

rior d

ensit

y

1205792

(b)

0 02 04 06 08 1

0

5

10

15

The p

oste

rior d

ensit

y

1205793

(c)

0 02 04 06 08 1

0

5

10

15

20

25

The p

oste

rior d

ensit

y

1205794

(d)

Figure 1 The posterior densities of the 1205791198944

119894=1estimated by a kernel density smoother based on the second 500000 posterior samples of

120579generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1) (a) The posterior density of 1205791 (b) The

posterior density of 1205792 (c) The posterior density of 120579

3 (d) The posterior density of 120579

4

1 2 3 4 5

0

02

04

06

08

The p

oste

rior d

ensit

y

120595

Figure 2 The posterior density of the odds ratio 120595 estimated by a kernel density smoother based on the second 500000 posterior samplesof 120579 generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1)

10 Journal of Probability and Statistics

applied to situationwhere two categories 119883 = 0 and 119883 = 1

are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches

Appendix

The Mode of a Group Dirichlet Density anda Sampling Method from a GDD

Let

V119899minus1

= (1199091 119909

119899minus1)⊤

119909119894⩾ 0

119894 = 1 119899 minus 1

119899minus1

sum

119894=1

119909119894⩽ 1

(A1)

denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X

1 X

119899)⊤

isin T119899is said to follow a

group Dirichlet distribution (GDD) with two partitions if thedensity of x

minus119899= (1198831 119883

119899minus1)⊤isin V119899minus1

is

GD1198992119904

(119909minus119899

| a b)

= 119888minus1

GD (

119899

prod

119894=1

119909119886119894minus1

119894)(

119904

sum

119894=1

119909i)

1198871

(

119899

sum

119894=119904+1

119909i)

1198872

(A2)

where 119909minus119899

= (1199091 119909

119899minus1)⊤ a = (119886

1 119886

119899)⊤ is a positive

parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter

vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by

119888GD = 119861 (1198861 119886

119904) 119861 (119886119904+1

119886119899)

times 119861(

119904

sum

119894=1

119886119894+ 1198871

119899

sum

119894=119904+1

119886119894+ 1198872)

(A3)

We write x sim GD1198992119904

(a b) on T119899or xminus119899

sim GD1198992119904

(a b) onV119899minus1

to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density

(A2) is given by [11 22]

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198871

sum119904

119895=1(119886119895minus 1)

119894 = 1 119904

(A4)

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198872

sum119899

119895=119904+1(119886119895minus 1)

119894 = 119904 + 1 119899

(A5)

The following procedure can be used to generate iidsamples from a GDD [11] Let

(1) y(1) sim Dirichlet (1198861 119886

119904) on T

119904

(2) y(2) sim Dirichlet (119886119904+1

119886119899) on T

119899minus119904

(3) 119877 sim Beta(sum119904119894=1

119886119894+ 1198871 sum119899

119894=119904+1119886119894+ 1198872) and

(4) y(1) y(2) and 119877 are mutually independent

Define

x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)

Then

x = (x(1)x(2)) sim GD

1198992119904(a b) (A7)

on T119899 where a = (119886

1 119886

119899)⊤ and b = (119887

1 1198872)⊤

Acknowledgments

The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region

References

[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965

[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980

[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984

[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986

[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005

[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005

[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005

Journal of Probability and Statistics 11

[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007

[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977

[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003

[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008

[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984

[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987

[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996

[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993

[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002

[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994

[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008

[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009

[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013

[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press

[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 4: A survey design for a sensitive binary variable correlated ... · Title A survey design for a sensitive binary variable correlated with another nonsensitive binary variable Author(s)

Journal of Probability and Statistics 3

Table 3 Cell probabilities observed and unobservable counts forthe main questionnaire

Category 119882 = 1 119882 = 2 119882 = 3 TotalI 119883 = 0 119884 = 0 119901

11205791

11990121205791

11990131205791

1205791(1198851)

II 119883 = 1 119884 = 0 1205792(1198852)

III 119883 = 1 119884 = 1 1205793(1198853)

IV 119883 = 0 119884 = 1 1205794(1198994) 120579

4(1198994)

Total 11990111205791(1198991) 11990121205791+ 1205792(1198992) 11990131205791+ 1205793(1198993) 1(119899)

Note 119899 = sum4119894=11198991198941198851 = 119899minus(1198852 +1198853)minus1198994 where (1198852 1198853) are unobservable

Table 4Cell probabilities andobserved counts for the supplementalquestionnaire

Category Total119884 = 0 Block 5 mdash 120579

1+ 1205792(1198980)

119884 = 1 Block 6 mdash 1205793+ 1205794(1198981)

Total 1 (119898)

Note119898 = 1198980 + 1198981

the frequency of those belonging toCategory IIThe observedfrequency 119899

3is the sum of the frequency of respondents

belonging to Block 3 and the frequency of those belongingto Category III Note that 119899 = sum

4

119894=1119899119894= sum3

119894=1119885119894+ 1198994 we have

1198851= 119899minus(119885

2+1198853)minus1198994Thus only119885

2and119885

3are unobservable

Table 4 shows the cell probabilities and observed counts forthe supplemental questionnaire

Remark 1 The design of the main questionnaire is similar tothat of the hidden sensitivitymodel of Tian et al [8] while thedesign of the supplemental questionnaire is indeed a designof direct questioning since both the 119884 = 0 and 119884 = 1 arenon-sensitive classes Table 1 shows that Categories II and IIIare two sensitive subclassesTherefore putting a tick in Block2 or Block 3 implies that the respondent could be suspectedwith the sensitive attribute Let 120579

1= sdot sdot sdot = 120579

4asymp 025

(see Table 3) then around half of the say 2119899 respondentswill be suspected with the sensitive attribute if only themain questionnaire is employed However besides the mainquestionnaire (with 119899 respondents) if the supplemental ques-tionnaire (see Table 2) with 119898 = 119899 respondents is also usedthen only half of the 119899 respondents will be suspected withthe sensitive attribute In other words in the proposed com-bination questionnaire model the information of the non-sensitive binary variable 119884 can be used to enhance the degreeof privacy protection so that more respondents will not befacedwith the sensitive questionThis is whywe introduce thesupplemental questionnaire besides the main questionnaire

Remark 2 In practice to simplify the design itself we suggestthat both the sample size 119899 in the main questionnaire andthe sample size 119898 in the supplemental questionnaire shouldbe fixed in advance rather than the total sample size 119873 =

119899 + 119898 being fixed In this way survey data can be collectedin two independent groups resulting in a relatively simplerstatistical analysis In addition the interviewees are randomlyassigned to either the group 1 or group 2

3 Likelihood-Based Inferences

In this section maximum likelihood estimates (MLEs) of the120579 and the odds ratio 120595 are derived by using the EM algo-rithm In addition asymptotic confidence intervals and thebootstrap confidence intervals of an arbitrary function of 120579are also provided Finally a likelihood ratio test is presentedfor testing the association between the two binary randomvariables

31 MLEs via the EM Algorithm A total of 119873 = 119899 + 119898

respondents are classified into two groups by a randomizationapproach such that 119899 respondents answer the questionsin the main questionnaire and 119898 respondents answer thequestions in the supplemental questionnaire Let 119884obsM =

119899 1198991 119899

4 denote the observed counts collected in the

main questionnaire (see Table 3) where sum4

119894=1119899119894= 119899 The

likelihood function of 120579 based on 119884obs119872 is

119871 (120579 | 119884obs119872) = (119899

1198991 1198992 1198993 1198994

)

times (11990111205791)1198991

3

prod

119894=2

(1199011198941205791+ 120579119894)119899119894

1205791198994

4

(2)

Let 119884obs119878 = 1198981198980 1198981 denote the observed counts gathered

in the supplemental questionnaire (see Table 4) where 1198980+

1198981= 119898 The likelihood function of 120579 based on 119884obs119878 is

119871 (120579 | 119884obs119878) = (119898

1198980

) (1205791+ 1205792)1198980

(1205793+ 1205794)1198981

(3)

Let 119884obs = 119884obs119872 119884obs119878 Since 119884obs119872 and 119884obs119878 are inde-pendent the observed-data likelihood function of 120579 isin T

4is

119871CQ (120579 | 119884obs119878) prop 1205791198991

1

3

prod

119894=2

(1199011198941205791+ 120579119894)119899119894

1205791198994

4

times (1205791+ 1205792)1198980

(1205793+ 1205794)1198981

(4)

where the subscript ldquoCQrdquodenotes the ldquocombination question-nairerdquo model

Since the corresponding cell probabilities to the observedcounts 119899

2and 1198993in the group 1 are in the form of summation

(ie 1199011198941205791+ 120579119894 119894 = 2 3) we cannot obtain the explicit

expressions of the MLEs of 120579 from the score equations of(4) By treating the observed counts 119899

2and 119899

3as incomplete

data we use the EM algorithm [9] to find the MLE of 120579The counts 119885

1198943

119894=1in Table 3 can be viewed as missing data

Briefly119885111988521198853 and 119899

4represent the counts of the respond-

ents belonging to Categories I II III and IV respectivelyThus we denote the latent data by 119884mis = 119911

2 1199113 and the

complete data by 119884com = 119884obs 119884mis Note that all 119901i areknown Consequently the complete-data likelihood functionfor 120579 is

119871CQ (120579 | 119884COM) prop (

3

prod

119894=1

120579119911119894

119894)1205791198994

4times (1205791+ 1205792)1198980

(1205793+ 1205794)1198981

(5)

where 1199111= 119899 minus (119911

2+ 1199113) minus 1198994

4 Journal of Probability and Statistics

By treating 1205791198944

119894=1as random variables we note that

the complete-data likelihood function (5) has the densityform of a grouped Dirichlet distribution [10] Ng et al [11]derived the mode of a grouped Dirichlet density with explicitexpressions (see the appendix) Hence from (A4) and (A5)the complete-data MLEs for 120579 are given by

1205791= 1 minus 120579

2minus 1205793minus 1205794

1205792=1199112

119873(1 +

1198980

119899 minus 1199113minus 1198994

)

1205793=1199113

119873(1 +

1198981

1199113+ 1198994

)

1205794=1198994

119873(1 +

1198981

1199113+ 1198994

)

(6)

Given 119884obs and 120579 119885119894 follows the binomial distribution withparameters 119899i and 120579

119894(1199011198941205791+ 120579119894) that is

119885119894| (119884obs 120579) sim Binomial(119899

119894

120579119894

1199011198941205791+ 120579119894

) 119894 = 2 3 (7)

Therefore the E-step of the EM algorithm computes thefollowing conditional expectations

119864 (119885119894| 119884obs 120579) =

119899119894120579119894

1199011198941205791+ 120579i

119894 = 2 3 (8)

and the M-step updates (6) by replacing 1199112and 119911

3with

previous conditional expectations

Remark 3 Based on the observed-data likelihood function(4) we could use the Newton-Raphson algorithm to findthe MLEs of 120579 However it is well known that the Newton-Raphson algorithm does not necessarily increase the loglikelihood leading even to divergence sometimes [12 page172] In addition the NewtonndashRaphson algorithm is sensitiveto the initial values One advantage of using theEM algorithmin the current situation is that both the E- and M-stephave closed-form expressions More importantly the EMalgorithm and the data augmentation algorithm of Tannerand Wong [13] share the same data augmentation structurein the Bayesian settings (see Section 4 for more details)

32 Asymptotic Confidence Intervals Let 120579minus4

= (1205791 1205792 1205793)⊤

The asymptotic variance-covariance matrix of theMLE minus4

isthen given by Iminus1obs(minus4) where

Iobs (120579minus4) = minus1205972ℓCQ (120579 | 119884obs)

120597120579minus4120597120579⊤

minus4

(9)

denotes the observed information matrix and ℓCQ(120579 |

119884obs) = log 119871CQ(120579 | 119884obs) is the observed-data log-likelihoodfunction From (4) we have

ℓCQ (120579 | 119884obs) = 1198991log 1205791+ 1198992log (119901

21205791+ 1205792)

+ 1198993log (119901

31205791+ 1205793)

+ 1198994log (1 minus 120579

1minus 1205792minus 1205793)

+ 1198980log (120579

1+ 1205792) + 1198981log (1 minus 120579

1minus 1205792)

(10)

It is easy to show that

120597ℓCQ (120579 | 119884obs)

1205971205791

=1198991

1205791

minus1198994

1205794

+

3

sum

119894=2

119899i119901i119901i1205791 + 120579i

+1198980

1205791+ 1205792

minus1198981

1 minus 1205791minus 1205792

120597ℓCQ (120579 | 119884obs)

1205971205792

= minus1198994

1205794

+1198992

11990121205791+ 1205792

+1198980

1205791+ 1205792

minus1198981

1 minus 1205791minus 1205792

120597ℓCQ (120579 | 119884obs)

1205971205793

= minus1198994

1205794

+1198993

11990131205791+ 1205793

minus1205972ℓCQ (120579 | 119884obs)

1205971205792

1

=1198991

1205792

1

+1198994

1205792

4

+

3

sum

119894=2

1198991198941199012

119894

(1199011198941205791+ 120579119894)2+ 120601

minus1205972ℓCQ (120579 | 119884obs)

1205971205792

2

=1198994

1205792

4

+1198992

(11990121205791+ 1205792)2+ 120601

minus1205972ℓCQ (120579 | 119884obs)

1205971205792

3

=1198994

1205792

4

+1198993

(11990131205791+ 1205793)2

minus1205972ℓCQ (120579 | 119884obs)

12059712057911205971205792

=1198994

1205792

4

+11989921199012

(11990121205791+ 1205792)2+ 120601

minus1205972ℓCQ (120579 | 119884obs)

12059712057911205971205793

=1198994

1205792

4

+11989931199013

(11990131205791+ 1205793)2

minus1205972ℓCQ (120579 | 119884obs)

12059712057921205971205793

=1198994

1205792

4

(11)

where

120601 =1198980

(1205791+ 1205792)2+

1198981

(1 minus 1205791minus 1205792)2 (12)

Hence the observed information matrix can be expressed as

Iobs (120579minus4) = diag(1198991

1205792

1

0 0) +1198994

1205792

4

times 131⊤3+ A (13)

Journal of Probability and Statistics 5

whereA =

((((

(

3

sum

119894=2

1198991198941199012

119894

(1199011198941205791+ 120579119894)2+120601

11989921199012

(11990121205791+ 1205792)2+120601

11989931199013

(11990131205791+1205793)2

11989921199012

(11990121205791+ 1205792)2+120601

1198992

(11990121205791+ 1205792)2+120601 0

11989931199013

(11990131205791+ 1205793)2 0

1198993

(11990131205791+1205793)2

))))

)

(14)

Let se(120579119894) denote the standard error of 120579

119894for 119894 = 1 2 3

Note that se(120579119894) can be estimated by the square root of the 119894th

diagonal element of Iminus1obs(minus4) We denote the estimated valueof se(120579

119894) by se(120579

119894)Thus a 95normal-based asymptotic con-

fidence interval for 120579i can be constructed as

[120579119894minus 196 times se (120579

119894) 120579119894+ 196 times se (120579

119894)] 119894 = 1 2 3 (15)

Let 120599 = ℎ(120579minus4) be an arbitrary differentiable function of

120579minus4 For example 120579

4= 1 minus sum

3

119894=1120579i and the odds ratio 120595 =

12057911205793(12057921205794)The deltamethod (eg [14 page 34]) can be used

to approximate the standard error of 120599 = ℎ(minus4) and a 95

normal-based asymptotic confidence interval for 120599 is given by

[120599 minus 196 times se (120599) 120599 + 196 times se (120599)] (16)

where

se (120599) = (120597120599

120597120579minus4

)

Iminus1obs (120579minus4) (120597120599

120597120579minus4

)

10038161003816100381610038161003816100381610038161003816120579minus4= minus4

12

(17)

33 Bootstrap Confidence Intervals When the normal-basedasymptotic confidence interval like (15) is beyond the lowbound zero or the upper bound one the bootstrap approach[15] can be used to construct the bootstrap confidence inter-val of 120599 = ℎ(120579

minus4) Based on the obtained MLE we inde-

pendently generate

(119899lowast

1 119899

lowast

4)⊤

sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)

(18)

119898lowast

0sim Binomial (119898 120579

1+ 1205792) (19)

Having obtained 119884lowast

obs119872 = 119899 119899lowast

1 119899

lowast

4 and 119884

lowast

obs119878 =

119898119898lowast

0 119898lowast

1 where 119898

lowast

1= 119898 minus 119898

lowast

0 we can calculate the

bootstrap replication 120599lowast= ℎ(120579

lowast

minus4) based on 119884

lowast

obs = 119884lowast

obs119872

119884lowast

obs119878 via theEM algorithm specified by (6) and (8) Indepen-dently repeating this process 119866 times we obtain 119866 bootstrapreplications 120599lowast

119892119866

119892=1 Consequently a (1 minus 120572)100 bootstrap

confidence interval for 120599 is given by

[120599119871 120599119880] (20)

where 120599119871and 120599119880are the 100(1205722) and 100(1minus1205722) percentiles

of 120599lowast119892119866

119892=1 respectively

34 The Likelihood Ratio Test for Testing Association Thelikelihood ratio statistic can be used to test whether thetwo binary random variables 119883 and 119884 are independentcor-related The corresponding null and alternative hypothesesare [16 page 45]

1198670 120595 = 1 against 119867

1 120595 = 1 (21)

The likelihood ratio statistic is defined by

Λ = minus2 ℓCQ (119877| 119884obs) minus ℓCQ ( | 119884obs) (22)

where 119877denotes the restrictedMLE of 120579 under119867

0 denotes

the MLE of 120579 which can be obtained by the EM algorithmspecified by (6) and (8) and ℓCQ(120579 | 119884obs) = log 119871CQ(120579 | 119884obs)

To find the restricted MLE 119877 we also employ the 119864119872

algorithm Under1198670 12057911205793= 12057921205794 we have

1205791= (1 minus 120579

119909) (1 minus 120579

119910)

1205792= 120579119909(1 minus 120579

119910)

1205793= 120579119909120579119910

1205794= (1 minus 120579

119909) 120579119910

(23)

In otherwords under1198670we only have two free parameters 120579

119909

and 120579119910 Having obtained the restrictedMLEs 120579

119909119877and 120579119910119877

wecan compute the restricted MLE

119877= (1205791119877

1205794119877

)⊤ from

(23) by

1205791119877

= (1 minus 120579119909119877

) (1 minus 120579119910119877

)

1205792119877

= 120579119909119877

(1 minus 120579119910119877

)

1205793119877

= 120579119909119877

120579119910119877

1205794119877

= (1 minus 120579119909119877

) 120579119910119877

(24)

In what follows we consider the computation of therestricted MLEs 120579

119909119877and 120579

119910119877 Now the complete-data like-

lihood function (5) becomes

119871CQ (120579119909 120579119910| 119884com 1198670)

prop (1 minus 120579119909) (1 minus 120579

119910)1199111

120579119909(1 minus 120579

119910)1199112

times (120579119909120579119910)1199113

(1 minus 120579119909) 1205791199101198994

(1 minus 120579119910)1198980

1205791198981

119910

= 1205791199112+1199113

119909(1 minus 120579

119909)1199111+1198994

1205791199113+1198994+1198981

119910(1 minus 120579

119910)1199111+1199112+1198980

(25)

so that the restricted MLEs of 120579119909and 120579

119910based on the

complete-data are given by

120579119909=1199112+ 1199113

119899 120579

119910=1199113+ 1198994+ 1198981

119873 (26)

respectivelyThus the119872-step of the119864119872 algorithm calculates(26) and the 119864-step computes the conditional expectationsgiven in (8) where 120579

119894 are defined in (23) Finally under119867

0

Λ asymptotically follows chi-squared distribution with onedegree of freedom

6 Journal of Probability and Statistics

4 Bayesian Inferences

To derive the posterior mode of 120579 we employ the EM algo-rithm again The latent data 119884mis = 119911

2 1199113 are the same as

those in Section 31 Based on the complete-data likelihoodfunction (5) if theDirichlet distributionDirichlet (119886

1 119886

4)

is adopted as the prior distribution of 120579 then the complete-data posterior distribution is a grouped Dirichlet (GD) distri-bution with the formal definition given by (A2) that is

119891 (120579 | 119884obs 119884mis) = GD422

(120579minus4

| a b) (27)

where a = (1198861+1199111 1198862+1199112 1198863+1199113 1198864+1198994)⊤ b = (119898

0 1198981)⊤

and 1199111= 119899 minus (119911

2+ 1199113) minus 1198994 The conditional predictive dis-

tribution is

119891 (119884mis | 119884obs 120579) =3

prod

119894=2

Binomial(119911119894

10038161003816100381610038161003816100381610038161003816

119899119894

120579119894

1199011198941205791+ 120579119894

) (28)

Therefore the119872-step of the EM algorithm is to calculate thecomplete-data posterior mode

1205791= 1 minus 120579

2minus 1205793minus 1205794

1205792=

1198862minus 1 + 119911

2

119886+minus 4 + 119873

(1 +1198980

1198861+ 1198862minus 2 + 119899 minus 119911

3minus 1198994

)

1205793=

1198863minus 1 + 119911

3

119886+minus 4 + 119873

(1 +1198981

1198863+ 1198864minus 2 + 119911

3+ 1198994

)

1205794=

1198864minus 1 + 119899

4

119886+minus 4 + 119873

(1 +1198981

1198863+ 1198864minus 2 + 119911

3+ 1198994

)

(29)

where 119886+= sum4

119894=1119886119894 and the 119864-step is to replace 119911

1198943

119894=2by the

conditional expectations given by (8)In addition based on (27) and (28) the data augmen-

tation algorithm of Tanner and Wong [13] can be used togenerate posterior samples of 120579 A sampling method from(27) is given in the appendix

5 Simulation Studies

In this section two simulation studies are conducted to com-pare the efficiency of the proposed combination question-naire model with that of the hidden sensitivity model of Tianet al [8] (ie the main questionnaire model only) where(1199011 1199012 1199013) are assumed to be (13 13 13) and 119901

119894=Pr(119882 =

119894) for 119894 = 1 2 3 In the first simulated example let the totalsample size in the combination questionnaire model be thesame as the sample size in the hidden sensitivitymodel In thesecond example the sample size for themain questionnaire inthe combination questionnaire model is assumed to be equalto the sample size in the hidden sensitivity model

In the first simulated example let a total of119873 = 119899 + 119898 =

50 + 50 = 100 participants be interviewed by using the com-bination questionnaire model The true values of 120579

1198944

119894=1are

listed in the second column of Table 5 We first generate

(1198991 119899

4)⊤

sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)

(30)

and1198980sim Binomial (119898 120579

1+1205792) so that119898

1= 119898minus119898

0The119864119872

algorithm (6) and (8) is used to calculate the MLEs of 1205791198944

119894=1

We repeated this experiment 1000 times The average MLEsof 1205791198944

119894=1are reported in the third column of Table 5 The

corresponding bias variance and mean square error (MSE)are displayed in the fourth fifth and sixth columns of Table5 Next let119873 = 119899 + 119898 = 100 + 0 = 100 participants be inter-viewed by using the hidden sensitivity model (ie the mainquestionnaire only) The corresponding results are reportedin the last four columns of Table 5

From Table 5 we can see that both the MLEs of 1205791198944

119894=1

in the combination questionnaire model and the hiddensensitivity model are very close to their true values whilethe MSEs of 120579

1198944

119894=1in the combination questionnaire model

are slightly larger than those in the hidden sensitivity modelThese numerical results are not surprising since in the hiddensensitivity model Tian et al [8] implicitly assumed that allrespondents must strictly follow the design instructions Inother words the noncompliance behavior will not occurThe introduction of the supplemental questionnaire in thecombination questionnaire model can effectively reduce thenon-compliance behavior since more respondents will not befaced with the sensitive question while the cost for intro-ducing such a supplemental questionnaire is thatwe definitelylose a little of efficiency

In the second simulated example we assume that a total of119873 = 119899+119898 = 100+100 = 200 participants are interviewed byusing the combination questionnaire model while only119873 =

119899 + 119898 = 100 + 0 = 100 are interviewed by using the hiddensensitivity model We repeat this experiment 1000 times Thecorresponding results are reported in Table 6

From Table 6 we can see that the MSEs of 1205791198944

119894=1in the

combination questionnaire model are smaller than those inthe hidden sensitivity model In addition by comparing thefifth columns in Tables 5 and 6 we can see that the precisionsof 1205791198944

119894=1for the proposed combination questionnaire model

in the second simulated example are significantly improvedwhen compared with those in the first simulated example

6 Analyzing Cervical Cancer Data in Atlanta

Williamson and Haber [17] reported a study which examinedthe relationship between disease status of cervical cancer andthe number of sex partners and other risk factors Caseswere 20ndash79-year-old women of Fulton or DeKalb county inAtlanta Georgia They were diagnosed and were ascertainedto have invasive cervical cancer Controls were randomlychosen from the same counties and the same age rangesTable 7 gives the cross-classification of number of sex partners(ldquofew 0ndash3rdquo or ldquomany ⩾4rdquo denoted by 119883 = 0 or 119883 = 1) anddisease status (control or case denoted by 119884 = 0 or 119884 = 1)Generally a sizable proportion (135 in this example) of theresponses would be missing because of the sensitive questionabout the number of sex partners in a telephone interviewThe objective is to examine if association exists between thenumber of sex partners and disease status of cervical cancer

For the purpose of illustration we presume that 119883 = 0

is non-sensitive although the number 0ndash3 of sex partners

Journal of Probability and Statistics 7

Table 5 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901

1 1199012 1199013) = (13 13 13)

Parameter True valueThe combination questionnaire model The hidden sensitivity model

(119873 = 119899 + 119898 = 50 + 50) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE

1205791

010 01013 00013 00113 00113 01010 00010 00064 000641205792

050 04985 minus00015 00089 00089 04975 minus00025 00041 000411205793

020 02008 00008 00037 00037 02001 00001 00029 000291205794

020 01994 minus00006 00026 00026 02014 00014 00016 000161205791

020 01957 minus00043 00090 00090 01985 minus00015 00055 000551205792

030 02999 minus00001 00070 00070 02999 minus00001 00034 000341205793

040 04038 00038 00040 00040 04025 00025 00037 000371205794

010 01006 00006 00017 00017 00991 minus00009 00009 000091205791

030 03021 00021 00107 00107 03026 00026 00081 000811205792

025 02483 minus00017 00077 00077 02499 minus00001 00039 000391205793

015 01474 minus00026 00042 00042 01476 minus00024 00032 000321205794

030 03021 00021 00033 00033 02999 minus00001 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]

2

+ Var(120579119894)

Table 6 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901

1 1199012 1199013) = (13 13 13)

Parameter True valueThe combination questionnaire model The hidden sensitivity model

(119873 = 119899 + 119898 = 100 + 100) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE

1205791

010 01003 00003 00030 00030 01013 00013 00064 000641205792

050 05015 00015 00029 00029 05019 00019 00041 000411205793

020 01986 minus00014 00017 00017 01969 minus00031 00028 000281205794

020 01995 minus00005 00013 00013 01999 minus00001 00016 000161205791

020 02007 00007 00041 00041 02049 00049 00057 000571205792

030 02989 minus00011 00033 00033 02965 minus00035 00034 000341205793

040 04007 00007 00020 00020 03992 minus00008 00037 000371205794

010 00997 minus00003 00009 00009 00994 minus00006 00009 000091205791

030 03032 00032 00054 00054 02965 minus00036 00079 000791205792

025 02488 minus00012 00039 00039 02501 00001 00038 000381205793

015 01481 minus00019 00021 00021 01531 00031 00032 000321205794

030 03000 00000 00017 00017 03003 00003 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]

2

+ Var(120579119894)

Table 7 Cervical cancer data fromWilliamson and Haber [17]

Number of sex partners Disease status of cervical cancer119884 = 0 (control) 119884 = 1 (case)

119883 = 0 (few 0ndash3) 165 (1199031 1205791) 103 (119903

4 1205794)

119883 = 1 (many ⩾4) 164 (1199032 1205792) 221 (119903

3 1205793)

Missing 43 (11990312 1205791+ 1205792) 59 (119903

34 1205793+ 1205794)

Note The observed counts and the corresponding cell probabilities are inparentheses 119883 is a sensitive binary variate and 119884 is a non-sensitive binaryvariate

is somewhat sensitive for some respondents To illustratethe proposed design and approaches we let 119882 = 1 (2 3)

if a respondent was born in JanuaryndashApril (MayndashAugust

SeptemberndashDecember) It is then reasonable to assume that119901119896= Pr(119882 = 119896) asymp 13 for 119896 = 1 2 3 and119882 is independent

of the sensitive binary variate X and the the non-sensitivebinary variate 119884 For the ideal situation (ie no samplingerrors) the observed counts from the main questionnaire asshown in Tables 1 and 3 would be 119899

1= 11990313 = 55 119899

2=

55 + 1199032= 219 119899

3= 55 + 119903

3= 276 119899

4= 1199034= 103 that is

119884obs119872 = 119899 1198991 119899

4 = 653 55 219 276 103 (31)

On the other hand we can view themissing data in Table 7 asthe observed counts from the supplemental questionnaire asshown in Table 2 From Table 4 we have 119898

0= 11990312

= 43 and1198981= 11990334

= 59 that is 119884obs119878 = 1198981198980 1198981 = 102 43 59

Therefore we obtain the observed data 119884obs = 119884obs119872 119884obs119878

8 Journal of Probability and Statistics

Table 8 MLEs and 95 confidence intervals of 120579 and 120595

Parameter MLE std 95 asymptotic 95 bootstrapCI CI

1205791

023793 002937 [018036 029551] [018006 029879]1205792

024916 002261 [020484 029348] [020375 029359]1205793

035196 002247 [030790 039601] [030731 039666]1205794

016095 001434 [013284 018905] [013400 018908]120595 208830 043971 [122646 295011] [135361 318593]

61 Likelihood-Based Inferences Using 120579(0) = 144 as the

initial values the EM algorithm in (6) and (8) converged in29 iterations The resultant MLEs of 120579 and 120595 are given in thesecond column of Table 8 From (13) and (14) the asymptoticvariance-covariance matrix of the MLEs

minus4is

Iminus1obs (minus4)

= (

000086302 minus0000442447 minus0000384452

minus000044245 0000511253 minus0000010026

minus000038445 minus0000010026 0000505241

)

(32)

The estimated standard errors of 120579119894(119894 = 1 2 3) are square

roots of the main diagonal elements of the above matrixFrom (17) the estimated standard errors of 120579

4= 1minus120579

1minus1205792minus1205793

and = 12057911205793(12057921205794) are given by

se (1205794) = 1⊤

3Iminus1obs(minus4) 13

12

se () = 120572⊤Iminus1obs (minus4)120572

12

(33)

respectively where

120572 = (

1205793(1 minus 120579

2minus 1205793)

12057921205792

4

12057911205793(1205792minus 1205794)

(12057921205794)2

1205791(1 minus 120579

1minus 1205792)

12057921205792

4

)

(34)

These estimated standard errors are listed in the third columnof Table 8 From (15) and (16) we can obtain the 95asymptotic confidence intervals of 120579 and120595 which are showedin the fourth column of Table 8

Based on (18) and (19) we generate 119866 = 10000 bootstrapsamples The corresponding 95 bootstrap confidence inter-vals of 120579 and 120595 are displayed in the last column of Table 8

To test the null hypothesis 1198670 120595 = 1 against 119867

1 120595 = 1

we need to obtain the restricted MLE 119877 Using

(120579(0)

119909 120579(0)

119910)⊤

= (05 05)⊤ (35)

as the initial values the EM algorithm in (26) converged to

120579119909119877

= 064807 120579119910119877

= 052948 (36)

Table 9 Posterior modes and estimates of parameters for thecervical cancer data

Parameter Posterior Bayesian Bayesian 95 Bayesianmode mean std credible interval

1205791

023793 024025 002934 [018574 030050]1205792

024916 024789 002251 [020365 029192]1205793

035196 035029 002246 [030621 039432]1205794

016095 016155 001431 [013448 019042]120595 208830 214538 045838 [139399 318678]

in 19 iterations From (24) the restricted MLEs of 120579 areobtained as

119877= (016559 030493 034314 018634)

⊤ (37)

The log-likelihood ratio statistic Λ is equal to 12469 and the119875-value is 00004137 Since this 119875-value is far less than 005the 119867

0is rejected at the 005 level of significance Thus we

can conclude that there is an association between sex partnersand cervical cancer status based on the current data Thisconclusion is identical to that from the two 95 confidenceintervals of the odds ratio as shown in Table 8 where bothconfidence intervals exclude the value 1

62 Bayesian Inferences When Dirichlet (1 1 1 1) (ie theuniform distribution on T

4) is adopted as the prior dis-

tribution of 120579 the posterior modes of 120579are equal to thecorresponding MLEs Using 120579(0) = 1

44 as the initial values

we employ the data augmentation algorithm to generate1000000 posterior samples of 120579 and discard the first halfof the samples The Bayesian estimates of 120579 and 120595 are givenin Table 9 Since the lower bound of the Bayesian credibleinterval of the 120595 is larger than 1 we believe that there isan association between the number of sexual partners andcervical cancer status

The posterior densities of the 1205791198944

119894=1and 120595 estimated by a

kernel density smoother are plotted in Figures 1 and 2

7 Discussion

In this paper we develop a general framework of designand analysis for the combination questionnaire model whichconsists of a main questionnaire and a supplemental ques-tionnaire In fact the main questionnaire (see Table 1) is ageneralization of the nonrandomized triangular model [18]and is then a special case of the multicategory triangularmodel [19] The supplemental questionnaire (see Table 2) isa design of direct questioning The introduction of the sup-plemental questionnaire can effectively reduce the noncom-pliance behavior since more respondents will not be facedwith the sensitive question while the cost for introducingsuch a supplemental questionnaire is that we definitely losea little of efficiency The combination questionnaire modelcan be used to gather information to evaluate the associationbetween one sensitive binary variable and one non-sensitivebinary variable

We note that the proposed combination questionnairemodel has one limitation in applications that is it cannot be

Journal of Probability and Statistics 9

0 02 04 06 08 1

0

2

4

6

8

10

14

12

The p

oste

rior d

ensit

y

1205791

(a)

0 02 04 06 08 1

0

5

10

15

The p

oste

rior d

ensit

y

1205792

(b)

0 02 04 06 08 1

0

5

10

15

The p

oste

rior d

ensit

y

1205793

(c)

0 02 04 06 08 1

0

5

10

15

20

25

The p

oste

rior d

ensit

y

1205794

(d)

Figure 1 The posterior densities of the 1205791198944

119894=1estimated by a kernel density smoother based on the second 500000 posterior samples of

120579generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1) (a) The posterior density of 1205791 (b) The

posterior density of 1205792 (c) The posterior density of 120579

3 (d) The posterior density of 120579

4

1 2 3 4 5

0

02

04

06

08

The p

oste

rior d

ensit

y

120595

Figure 2 The posterior density of the odds ratio 120595 estimated by a kernel density smoother based on the second 500000 posterior samplesof 120579 generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1)

10 Journal of Probability and Statistics

applied to situationwhere two categories 119883 = 0 and 119883 = 1

are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches

Appendix

The Mode of a Group Dirichlet Density anda Sampling Method from a GDD

Let

V119899minus1

= (1199091 119909

119899minus1)⊤

119909119894⩾ 0

119894 = 1 119899 minus 1

119899minus1

sum

119894=1

119909119894⩽ 1

(A1)

denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X

1 X

119899)⊤

isin T119899is said to follow a

group Dirichlet distribution (GDD) with two partitions if thedensity of x

minus119899= (1198831 119883

119899minus1)⊤isin V119899minus1

is

GD1198992119904

(119909minus119899

| a b)

= 119888minus1

GD (

119899

prod

119894=1

119909119886119894minus1

119894)(

119904

sum

119894=1

119909i)

1198871

(

119899

sum

119894=119904+1

119909i)

1198872

(A2)

where 119909minus119899

= (1199091 119909

119899minus1)⊤ a = (119886

1 119886

119899)⊤ is a positive

parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter

vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by

119888GD = 119861 (1198861 119886

119904) 119861 (119886119904+1

119886119899)

times 119861(

119904

sum

119894=1

119886119894+ 1198871

119899

sum

119894=119904+1

119886119894+ 1198872)

(A3)

We write x sim GD1198992119904

(a b) on T119899or xminus119899

sim GD1198992119904

(a b) onV119899minus1

to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density

(A2) is given by [11 22]

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198871

sum119904

119895=1(119886119895minus 1)

119894 = 1 119904

(A4)

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198872

sum119899

119895=119904+1(119886119895minus 1)

119894 = 119904 + 1 119899

(A5)

The following procedure can be used to generate iidsamples from a GDD [11] Let

(1) y(1) sim Dirichlet (1198861 119886

119904) on T

119904

(2) y(2) sim Dirichlet (119886119904+1

119886119899) on T

119899minus119904

(3) 119877 sim Beta(sum119904119894=1

119886119894+ 1198871 sum119899

119894=119904+1119886119894+ 1198872) and

(4) y(1) y(2) and 119877 are mutually independent

Define

x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)

Then

x = (x(1)x(2)) sim GD

1198992119904(a b) (A7)

on T119899 where a = (119886

1 119886

119899)⊤ and b = (119887

1 1198872)⊤

Acknowledgments

The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region

References

[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965

[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980

[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984

[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986

[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005

[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005

[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005

Journal of Probability and Statistics 11

[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007

[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977

[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003

[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008

[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984

[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987

[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996

[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993

[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002

[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994

[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008

[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009

[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013

[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press

[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 5: A survey design for a sensitive binary variable correlated ... · Title A survey design for a sensitive binary variable correlated with another nonsensitive binary variable Author(s)

4 Journal of Probability and Statistics

By treating 1205791198944

119894=1as random variables we note that

the complete-data likelihood function (5) has the densityform of a grouped Dirichlet distribution [10] Ng et al [11]derived the mode of a grouped Dirichlet density with explicitexpressions (see the appendix) Hence from (A4) and (A5)the complete-data MLEs for 120579 are given by

1205791= 1 minus 120579

2minus 1205793minus 1205794

1205792=1199112

119873(1 +

1198980

119899 minus 1199113minus 1198994

)

1205793=1199113

119873(1 +

1198981

1199113+ 1198994

)

1205794=1198994

119873(1 +

1198981

1199113+ 1198994

)

(6)

Given 119884obs and 120579 119885119894 follows the binomial distribution withparameters 119899i and 120579

119894(1199011198941205791+ 120579119894) that is

119885119894| (119884obs 120579) sim Binomial(119899

119894

120579119894

1199011198941205791+ 120579119894

) 119894 = 2 3 (7)

Therefore the E-step of the EM algorithm computes thefollowing conditional expectations

119864 (119885119894| 119884obs 120579) =

119899119894120579119894

1199011198941205791+ 120579i

119894 = 2 3 (8)

and the M-step updates (6) by replacing 1199112and 119911

3with

previous conditional expectations

Remark 3 Based on the observed-data likelihood function(4) we could use the Newton-Raphson algorithm to findthe MLEs of 120579 However it is well known that the Newton-Raphson algorithm does not necessarily increase the loglikelihood leading even to divergence sometimes [12 page172] In addition the NewtonndashRaphson algorithm is sensitiveto the initial values One advantage of using theEM algorithmin the current situation is that both the E- and M-stephave closed-form expressions More importantly the EMalgorithm and the data augmentation algorithm of Tannerand Wong [13] share the same data augmentation structurein the Bayesian settings (see Section 4 for more details)

32 Asymptotic Confidence Intervals Let 120579minus4

= (1205791 1205792 1205793)⊤

The asymptotic variance-covariance matrix of theMLE minus4

isthen given by Iminus1obs(minus4) where

Iobs (120579minus4) = minus1205972ℓCQ (120579 | 119884obs)

120597120579minus4120597120579⊤

minus4

(9)

denotes the observed information matrix and ℓCQ(120579 |

119884obs) = log 119871CQ(120579 | 119884obs) is the observed-data log-likelihoodfunction From (4) we have

ℓCQ (120579 | 119884obs) = 1198991log 1205791+ 1198992log (119901

21205791+ 1205792)

+ 1198993log (119901

31205791+ 1205793)

+ 1198994log (1 minus 120579

1minus 1205792minus 1205793)

+ 1198980log (120579

1+ 1205792) + 1198981log (1 minus 120579

1minus 1205792)

(10)

It is easy to show that

120597ℓCQ (120579 | 119884obs)

1205971205791

=1198991

1205791

minus1198994

1205794

+

3

sum

119894=2

119899i119901i119901i1205791 + 120579i

+1198980

1205791+ 1205792

minus1198981

1 minus 1205791minus 1205792

120597ℓCQ (120579 | 119884obs)

1205971205792

= minus1198994

1205794

+1198992

11990121205791+ 1205792

+1198980

1205791+ 1205792

minus1198981

1 minus 1205791minus 1205792

120597ℓCQ (120579 | 119884obs)

1205971205793

= minus1198994

1205794

+1198993

11990131205791+ 1205793

minus1205972ℓCQ (120579 | 119884obs)

1205971205792

1

=1198991

1205792

1

+1198994

1205792

4

+

3

sum

119894=2

1198991198941199012

119894

(1199011198941205791+ 120579119894)2+ 120601

minus1205972ℓCQ (120579 | 119884obs)

1205971205792

2

=1198994

1205792

4

+1198992

(11990121205791+ 1205792)2+ 120601

minus1205972ℓCQ (120579 | 119884obs)

1205971205792

3

=1198994

1205792

4

+1198993

(11990131205791+ 1205793)2

minus1205972ℓCQ (120579 | 119884obs)

12059712057911205971205792

=1198994

1205792

4

+11989921199012

(11990121205791+ 1205792)2+ 120601

minus1205972ℓCQ (120579 | 119884obs)

12059712057911205971205793

=1198994

1205792

4

+11989931199013

(11990131205791+ 1205793)2

minus1205972ℓCQ (120579 | 119884obs)

12059712057921205971205793

=1198994

1205792

4

(11)

where

120601 =1198980

(1205791+ 1205792)2+

1198981

(1 minus 1205791minus 1205792)2 (12)

Hence the observed information matrix can be expressed as

Iobs (120579minus4) = diag(1198991

1205792

1

0 0) +1198994

1205792

4

times 131⊤3+ A (13)

Journal of Probability and Statistics 5

whereA =

((((

(

3

sum

119894=2

1198991198941199012

119894

(1199011198941205791+ 120579119894)2+120601

11989921199012

(11990121205791+ 1205792)2+120601

11989931199013

(11990131205791+1205793)2

11989921199012

(11990121205791+ 1205792)2+120601

1198992

(11990121205791+ 1205792)2+120601 0

11989931199013

(11990131205791+ 1205793)2 0

1198993

(11990131205791+1205793)2

))))

)

(14)

Let se(120579119894) denote the standard error of 120579

119894for 119894 = 1 2 3

Note that se(120579119894) can be estimated by the square root of the 119894th

diagonal element of Iminus1obs(minus4) We denote the estimated valueof se(120579

119894) by se(120579

119894)Thus a 95normal-based asymptotic con-

fidence interval for 120579i can be constructed as

[120579119894minus 196 times se (120579

119894) 120579119894+ 196 times se (120579

119894)] 119894 = 1 2 3 (15)

Let 120599 = ℎ(120579minus4) be an arbitrary differentiable function of

120579minus4 For example 120579

4= 1 minus sum

3

119894=1120579i and the odds ratio 120595 =

12057911205793(12057921205794)The deltamethod (eg [14 page 34]) can be used

to approximate the standard error of 120599 = ℎ(minus4) and a 95

normal-based asymptotic confidence interval for 120599 is given by

[120599 minus 196 times se (120599) 120599 + 196 times se (120599)] (16)

where

se (120599) = (120597120599

120597120579minus4

)

Iminus1obs (120579minus4) (120597120599

120597120579minus4

)

10038161003816100381610038161003816100381610038161003816120579minus4= minus4

12

(17)

33 Bootstrap Confidence Intervals When the normal-basedasymptotic confidence interval like (15) is beyond the lowbound zero or the upper bound one the bootstrap approach[15] can be used to construct the bootstrap confidence inter-val of 120599 = ℎ(120579

minus4) Based on the obtained MLE we inde-

pendently generate

(119899lowast

1 119899

lowast

4)⊤

sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)

(18)

119898lowast

0sim Binomial (119898 120579

1+ 1205792) (19)

Having obtained 119884lowast

obs119872 = 119899 119899lowast

1 119899

lowast

4 and 119884

lowast

obs119878 =

119898119898lowast

0 119898lowast

1 where 119898

lowast

1= 119898 minus 119898

lowast

0 we can calculate the

bootstrap replication 120599lowast= ℎ(120579

lowast

minus4) based on 119884

lowast

obs = 119884lowast

obs119872

119884lowast

obs119878 via theEM algorithm specified by (6) and (8) Indepen-dently repeating this process 119866 times we obtain 119866 bootstrapreplications 120599lowast

119892119866

119892=1 Consequently a (1 minus 120572)100 bootstrap

confidence interval for 120599 is given by

[120599119871 120599119880] (20)

where 120599119871and 120599119880are the 100(1205722) and 100(1minus1205722) percentiles

of 120599lowast119892119866

119892=1 respectively

34 The Likelihood Ratio Test for Testing Association Thelikelihood ratio statistic can be used to test whether thetwo binary random variables 119883 and 119884 are independentcor-related The corresponding null and alternative hypothesesare [16 page 45]

1198670 120595 = 1 against 119867

1 120595 = 1 (21)

The likelihood ratio statistic is defined by

Λ = minus2 ℓCQ (119877| 119884obs) minus ℓCQ ( | 119884obs) (22)

where 119877denotes the restrictedMLE of 120579 under119867

0 denotes

the MLE of 120579 which can be obtained by the EM algorithmspecified by (6) and (8) and ℓCQ(120579 | 119884obs) = log 119871CQ(120579 | 119884obs)

To find the restricted MLE 119877 we also employ the 119864119872

algorithm Under1198670 12057911205793= 12057921205794 we have

1205791= (1 minus 120579

119909) (1 minus 120579

119910)

1205792= 120579119909(1 minus 120579

119910)

1205793= 120579119909120579119910

1205794= (1 minus 120579

119909) 120579119910

(23)

In otherwords under1198670we only have two free parameters 120579

119909

and 120579119910 Having obtained the restrictedMLEs 120579

119909119877and 120579119910119877

wecan compute the restricted MLE

119877= (1205791119877

1205794119877

)⊤ from

(23) by

1205791119877

= (1 minus 120579119909119877

) (1 minus 120579119910119877

)

1205792119877

= 120579119909119877

(1 minus 120579119910119877

)

1205793119877

= 120579119909119877

120579119910119877

1205794119877

= (1 minus 120579119909119877

) 120579119910119877

(24)

In what follows we consider the computation of therestricted MLEs 120579

119909119877and 120579

119910119877 Now the complete-data like-

lihood function (5) becomes

119871CQ (120579119909 120579119910| 119884com 1198670)

prop (1 minus 120579119909) (1 minus 120579

119910)1199111

120579119909(1 minus 120579

119910)1199112

times (120579119909120579119910)1199113

(1 minus 120579119909) 1205791199101198994

(1 minus 120579119910)1198980

1205791198981

119910

= 1205791199112+1199113

119909(1 minus 120579

119909)1199111+1198994

1205791199113+1198994+1198981

119910(1 minus 120579

119910)1199111+1199112+1198980

(25)

so that the restricted MLEs of 120579119909and 120579

119910based on the

complete-data are given by

120579119909=1199112+ 1199113

119899 120579

119910=1199113+ 1198994+ 1198981

119873 (26)

respectivelyThus the119872-step of the119864119872 algorithm calculates(26) and the 119864-step computes the conditional expectationsgiven in (8) where 120579

119894 are defined in (23) Finally under119867

0

Λ asymptotically follows chi-squared distribution with onedegree of freedom

6 Journal of Probability and Statistics

4 Bayesian Inferences

To derive the posterior mode of 120579 we employ the EM algo-rithm again The latent data 119884mis = 119911

2 1199113 are the same as

those in Section 31 Based on the complete-data likelihoodfunction (5) if theDirichlet distributionDirichlet (119886

1 119886

4)

is adopted as the prior distribution of 120579 then the complete-data posterior distribution is a grouped Dirichlet (GD) distri-bution with the formal definition given by (A2) that is

119891 (120579 | 119884obs 119884mis) = GD422

(120579minus4

| a b) (27)

where a = (1198861+1199111 1198862+1199112 1198863+1199113 1198864+1198994)⊤ b = (119898

0 1198981)⊤

and 1199111= 119899 minus (119911

2+ 1199113) minus 1198994 The conditional predictive dis-

tribution is

119891 (119884mis | 119884obs 120579) =3

prod

119894=2

Binomial(119911119894

10038161003816100381610038161003816100381610038161003816

119899119894

120579119894

1199011198941205791+ 120579119894

) (28)

Therefore the119872-step of the EM algorithm is to calculate thecomplete-data posterior mode

1205791= 1 minus 120579

2minus 1205793minus 1205794

1205792=

1198862minus 1 + 119911

2

119886+minus 4 + 119873

(1 +1198980

1198861+ 1198862minus 2 + 119899 minus 119911

3minus 1198994

)

1205793=

1198863minus 1 + 119911

3

119886+minus 4 + 119873

(1 +1198981

1198863+ 1198864minus 2 + 119911

3+ 1198994

)

1205794=

1198864minus 1 + 119899

4

119886+minus 4 + 119873

(1 +1198981

1198863+ 1198864minus 2 + 119911

3+ 1198994

)

(29)

where 119886+= sum4

119894=1119886119894 and the 119864-step is to replace 119911

1198943

119894=2by the

conditional expectations given by (8)In addition based on (27) and (28) the data augmen-

tation algorithm of Tanner and Wong [13] can be used togenerate posterior samples of 120579 A sampling method from(27) is given in the appendix

5 Simulation Studies

In this section two simulation studies are conducted to com-pare the efficiency of the proposed combination question-naire model with that of the hidden sensitivity model of Tianet al [8] (ie the main questionnaire model only) where(1199011 1199012 1199013) are assumed to be (13 13 13) and 119901

119894=Pr(119882 =

119894) for 119894 = 1 2 3 In the first simulated example let the totalsample size in the combination questionnaire model be thesame as the sample size in the hidden sensitivitymodel In thesecond example the sample size for themain questionnaire inthe combination questionnaire model is assumed to be equalto the sample size in the hidden sensitivity model

In the first simulated example let a total of119873 = 119899 + 119898 =

50 + 50 = 100 participants be interviewed by using the com-bination questionnaire model The true values of 120579

1198944

119894=1are

listed in the second column of Table 5 We first generate

(1198991 119899

4)⊤

sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)

(30)

and1198980sim Binomial (119898 120579

1+1205792) so that119898

1= 119898minus119898

0The119864119872

algorithm (6) and (8) is used to calculate the MLEs of 1205791198944

119894=1

We repeated this experiment 1000 times The average MLEsof 1205791198944

119894=1are reported in the third column of Table 5 The

corresponding bias variance and mean square error (MSE)are displayed in the fourth fifth and sixth columns of Table5 Next let119873 = 119899 + 119898 = 100 + 0 = 100 participants be inter-viewed by using the hidden sensitivity model (ie the mainquestionnaire only) The corresponding results are reportedin the last four columns of Table 5

From Table 5 we can see that both the MLEs of 1205791198944

119894=1

in the combination questionnaire model and the hiddensensitivity model are very close to their true values whilethe MSEs of 120579

1198944

119894=1in the combination questionnaire model

are slightly larger than those in the hidden sensitivity modelThese numerical results are not surprising since in the hiddensensitivity model Tian et al [8] implicitly assumed that allrespondents must strictly follow the design instructions Inother words the noncompliance behavior will not occurThe introduction of the supplemental questionnaire in thecombination questionnaire model can effectively reduce thenon-compliance behavior since more respondents will not befaced with the sensitive question while the cost for intro-ducing such a supplemental questionnaire is thatwe definitelylose a little of efficiency

In the second simulated example we assume that a total of119873 = 119899+119898 = 100+100 = 200 participants are interviewed byusing the combination questionnaire model while only119873 =

119899 + 119898 = 100 + 0 = 100 are interviewed by using the hiddensensitivity model We repeat this experiment 1000 times Thecorresponding results are reported in Table 6

From Table 6 we can see that the MSEs of 1205791198944

119894=1in the

combination questionnaire model are smaller than those inthe hidden sensitivity model In addition by comparing thefifth columns in Tables 5 and 6 we can see that the precisionsof 1205791198944

119894=1for the proposed combination questionnaire model

in the second simulated example are significantly improvedwhen compared with those in the first simulated example

6 Analyzing Cervical Cancer Data in Atlanta

Williamson and Haber [17] reported a study which examinedthe relationship between disease status of cervical cancer andthe number of sex partners and other risk factors Caseswere 20ndash79-year-old women of Fulton or DeKalb county inAtlanta Georgia They were diagnosed and were ascertainedto have invasive cervical cancer Controls were randomlychosen from the same counties and the same age rangesTable 7 gives the cross-classification of number of sex partners(ldquofew 0ndash3rdquo or ldquomany ⩾4rdquo denoted by 119883 = 0 or 119883 = 1) anddisease status (control or case denoted by 119884 = 0 or 119884 = 1)Generally a sizable proportion (135 in this example) of theresponses would be missing because of the sensitive questionabout the number of sex partners in a telephone interviewThe objective is to examine if association exists between thenumber of sex partners and disease status of cervical cancer

For the purpose of illustration we presume that 119883 = 0

is non-sensitive although the number 0ndash3 of sex partners

Journal of Probability and Statistics 7

Table 5 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901

1 1199012 1199013) = (13 13 13)

Parameter True valueThe combination questionnaire model The hidden sensitivity model

(119873 = 119899 + 119898 = 50 + 50) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE

1205791

010 01013 00013 00113 00113 01010 00010 00064 000641205792

050 04985 minus00015 00089 00089 04975 minus00025 00041 000411205793

020 02008 00008 00037 00037 02001 00001 00029 000291205794

020 01994 minus00006 00026 00026 02014 00014 00016 000161205791

020 01957 minus00043 00090 00090 01985 minus00015 00055 000551205792

030 02999 minus00001 00070 00070 02999 minus00001 00034 000341205793

040 04038 00038 00040 00040 04025 00025 00037 000371205794

010 01006 00006 00017 00017 00991 minus00009 00009 000091205791

030 03021 00021 00107 00107 03026 00026 00081 000811205792

025 02483 minus00017 00077 00077 02499 minus00001 00039 000391205793

015 01474 minus00026 00042 00042 01476 minus00024 00032 000321205794

030 03021 00021 00033 00033 02999 minus00001 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]

2

+ Var(120579119894)

Table 6 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901

1 1199012 1199013) = (13 13 13)

Parameter True valueThe combination questionnaire model The hidden sensitivity model

(119873 = 119899 + 119898 = 100 + 100) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE

1205791

010 01003 00003 00030 00030 01013 00013 00064 000641205792

050 05015 00015 00029 00029 05019 00019 00041 000411205793

020 01986 minus00014 00017 00017 01969 minus00031 00028 000281205794

020 01995 minus00005 00013 00013 01999 minus00001 00016 000161205791

020 02007 00007 00041 00041 02049 00049 00057 000571205792

030 02989 minus00011 00033 00033 02965 minus00035 00034 000341205793

040 04007 00007 00020 00020 03992 minus00008 00037 000371205794

010 00997 minus00003 00009 00009 00994 minus00006 00009 000091205791

030 03032 00032 00054 00054 02965 minus00036 00079 000791205792

025 02488 minus00012 00039 00039 02501 00001 00038 000381205793

015 01481 minus00019 00021 00021 01531 00031 00032 000321205794

030 03000 00000 00017 00017 03003 00003 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]

2

+ Var(120579119894)

Table 7 Cervical cancer data fromWilliamson and Haber [17]

Number of sex partners Disease status of cervical cancer119884 = 0 (control) 119884 = 1 (case)

119883 = 0 (few 0ndash3) 165 (1199031 1205791) 103 (119903

4 1205794)

119883 = 1 (many ⩾4) 164 (1199032 1205792) 221 (119903

3 1205793)

Missing 43 (11990312 1205791+ 1205792) 59 (119903

34 1205793+ 1205794)

Note The observed counts and the corresponding cell probabilities are inparentheses 119883 is a sensitive binary variate and 119884 is a non-sensitive binaryvariate

is somewhat sensitive for some respondents To illustratethe proposed design and approaches we let 119882 = 1 (2 3)

if a respondent was born in JanuaryndashApril (MayndashAugust

SeptemberndashDecember) It is then reasonable to assume that119901119896= Pr(119882 = 119896) asymp 13 for 119896 = 1 2 3 and119882 is independent

of the sensitive binary variate X and the the non-sensitivebinary variate 119884 For the ideal situation (ie no samplingerrors) the observed counts from the main questionnaire asshown in Tables 1 and 3 would be 119899

1= 11990313 = 55 119899

2=

55 + 1199032= 219 119899

3= 55 + 119903

3= 276 119899

4= 1199034= 103 that is

119884obs119872 = 119899 1198991 119899

4 = 653 55 219 276 103 (31)

On the other hand we can view themissing data in Table 7 asthe observed counts from the supplemental questionnaire asshown in Table 2 From Table 4 we have 119898

0= 11990312

= 43 and1198981= 11990334

= 59 that is 119884obs119878 = 1198981198980 1198981 = 102 43 59

Therefore we obtain the observed data 119884obs = 119884obs119872 119884obs119878

8 Journal of Probability and Statistics

Table 8 MLEs and 95 confidence intervals of 120579 and 120595

Parameter MLE std 95 asymptotic 95 bootstrapCI CI

1205791

023793 002937 [018036 029551] [018006 029879]1205792

024916 002261 [020484 029348] [020375 029359]1205793

035196 002247 [030790 039601] [030731 039666]1205794

016095 001434 [013284 018905] [013400 018908]120595 208830 043971 [122646 295011] [135361 318593]

61 Likelihood-Based Inferences Using 120579(0) = 144 as the

initial values the EM algorithm in (6) and (8) converged in29 iterations The resultant MLEs of 120579 and 120595 are given in thesecond column of Table 8 From (13) and (14) the asymptoticvariance-covariance matrix of the MLEs

minus4is

Iminus1obs (minus4)

= (

000086302 minus0000442447 minus0000384452

minus000044245 0000511253 minus0000010026

minus000038445 minus0000010026 0000505241

)

(32)

The estimated standard errors of 120579119894(119894 = 1 2 3) are square

roots of the main diagonal elements of the above matrixFrom (17) the estimated standard errors of 120579

4= 1minus120579

1minus1205792minus1205793

and = 12057911205793(12057921205794) are given by

se (1205794) = 1⊤

3Iminus1obs(minus4) 13

12

se () = 120572⊤Iminus1obs (minus4)120572

12

(33)

respectively where

120572 = (

1205793(1 minus 120579

2minus 1205793)

12057921205792

4

12057911205793(1205792minus 1205794)

(12057921205794)2

1205791(1 minus 120579

1minus 1205792)

12057921205792

4

)

(34)

These estimated standard errors are listed in the third columnof Table 8 From (15) and (16) we can obtain the 95asymptotic confidence intervals of 120579 and120595 which are showedin the fourth column of Table 8

Based on (18) and (19) we generate 119866 = 10000 bootstrapsamples The corresponding 95 bootstrap confidence inter-vals of 120579 and 120595 are displayed in the last column of Table 8

To test the null hypothesis 1198670 120595 = 1 against 119867

1 120595 = 1

we need to obtain the restricted MLE 119877 Using

(120579(0)

119909 120579(0)

119910)⊤

= (05 05)⊤ (35)

as the initial values the EM algorithm in (26) converged to

120579119909119877

= 064807 120579119910119877

= 052948 (36)

Table 9 Posterior modes and estimates of parameters for thecervical cancer data

Parameter Posterior Bayesian Bayesian 95 Bayesianmode mean std credible interval

1205791

023793 024025 002934 [018574 030050]1205792

024916 024789 002251 [020365 029192]1205793

035196 035029 002246 [030621 039432]1205794

016095 016155 001431 [013448 019042]120595 208830 214538 045838 [139399 318678]

in 19 iterations From (24) the restricted MLEs of 120579 areobtained as

119877= (016559 030493 034314 018634)

⊤ (37)

The log-likelihood ratio statistic Λ is equal to 12469 and the119875-value is 00004137 Since this 119875-value is far less than 005the 119867

0is rejected at the 005 level of significance Thus we

can conclude that there is an association between sex partnersand cervical cancer status based on the current data Thisconclusion is identical to that from the two 95 confidenceintervals of the odds ratio as shown in Table 8 where bothconfidence intervals exclude the value 1

62 Bayesian Inferences When Dirichlet (1 1 1 1) (ie theuniform distribution on T

4) is adopted as the prior dis-

tribution of 120579 the posterior modes of 120579are equal to thecorresponding MLEs Using 120579(0) = 1

44 as the initial values

we employ the data augmentation algorithm to generate1000000 posterior samples of 120579 and discard the first halfof the samples The Bayesian estimates of 120579 and 120595 are givenin Table 9 Since the lower bound of the Bayesian credibleinterval of the 120595 is larger than 1 we believe that there isan association between the number of sexual partners andcervical cancer status

The posterior densities of the 1205791198944

119894=1and 120595 estimated by a

kernel density smoother are plotted in Figures 1 and 2

7 Discussion

In this paper we develop a general framework of designand analysis for the combination questionnaire model whichconsists of a main questionnaire and a supplemental ques-tionnaire In fact the main questionnaire (see Table 1) is ageneralization of the nonrandomized triangular model [18]and is then a special case of the multicategory triangularmodel [19] The supplemental questionnaire (see Table 2) isa design of direct questioning The introduction of the sup-plemental questionnaire can effectively reduce the noncom-pliance behavior since more respondents will not be facedwith the sensitive question while the cost for introducingsuch a supplemental questionnaire is that we definitely losea little of efficiency The combination questionnaire modelcan be used to gather information to evaluate the associationbetween one sensitive binary variable and one non-sensitivebinary variable

We note that the proposed combination questionnairemodel has one limitation in applications that is it cannot be

Journal of Probability and Statistics 9

0 02 04 06 08 1

0

2

4

6

8

10

14

12

The p

oste

rior d

ensit

y

1205791

(a)

0 02 04 06 08 1

0

5

10

15

The p

oste

rior d

ensit

y

1205792

(b)

0 02 04 06 08 1

0

5

10

15

The p

oste

rior d

ensit

y

1205793

(c)

0 02 04 06 08 1

0

5

10

15

20

25

The p

oste

rior d

ensit

y

1205794

(d)

Figure 1 The posterior densities of the 1205791198944

119894=1estimated by a kernel density smoother based on the second 500000 posterior samples of

120579generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1) (a) The posterior density of 1205791 (b) The

posterior density of 1205792 (c) The posterior density of 120579

3 (d) The posterior density of 120579

4

1 2 3 4 5

0

02

04

06

08

The p

oste

rior d

ensit

y

120595

Figure 2 The posterior density of the odds ratio 120595 estimated by a kernel density smoother based on the second 500000 posterior samplesof 120579 generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1)

10 Journal of Probability and Statistics

applied to situationwhere two categories 119883 = 0 and 119883 = 1

are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches

Appendix

The Mode of a Group Dirichlet Density anda Sampling Method from a GDD

Let

V119899minus1

= (1199091 119909

119899minus1)⊤

119909119894⩾ 0

119894 = 1 119899 minus 1

119899minus1

sum

119894=1

119909119894⩽ 1

(A1)

denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X

1 X

119899)⊤

isin T119899is said to follow a

group Dirichlet distribution (GDD) with two partitions if thedensity of x

minus119899= (1198831 119883

119899minus1)⊤isin V119899minus1

is

GD1198992119904

(119909minus119899

| a b)

= 119888minus1

GD (

119899

prod

119894=1

119909119886119894minus1

119894)(

119904

sum

119894=1

119909i)

1198871

(

119899

sum

119894=119904+1

119909i)

1198872

(A2)

where 119909minus119899

= (1199091 119909

119899minus1)⊤ a = (119886

1 119886

119899)⊤ is a positive

parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter

vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by

119888GD = 119861 (1198861 119886

119904) 119861 (119886119904+1

119886119899)

times 119861(

119904

sum

119894=1

119886119894+ 1198871

119899

sum

119894=119904+1

119886119894+ 1198872)

(A3)

We write x sim GD1198992119904

(a b) on T119899or xminus119899

sim GD1198992119904

(a b) onV119899minus1

to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density

(A2) is given by [11 22]

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198871

sum119904

119895=1(119886119895minus 1)

119894 = 1 119904

(A4)

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198872

sum119899

119895=119904+1(119886119895minus 1)

119894 = 119904 + 1 119899

(A5)

The following procedure can be used to generate iidsamples from a GDD [11] Let

(1) y(1) sim Dirichlet (1198861 119886

119904) on T

119904

(2) y(2) sim Dirichlet (119886119904+1

119886119899) on T

119899minus119904

(3) 119877 sim Beta(sum119904119894=1

119886119894+ 1198871 sum119899

119894=119904+1119886119894+ 1198872) and

(4) y(1) y(2) and 119877 are mutually independent

Define

x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)

Then

x = (x(1)x(2)) sim GD

1198992119904(a b) (A7)

on T119899 where a = (119886

1 119886

119899)⊤ and b = (119887

1 1198872)⊤

Acknowledgments

The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region

References

[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965

[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980

[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984

[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986

[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005

[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005

[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005

Journal of Probability and Statistics 11

[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007

[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977

[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003

[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008

[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984

[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987

[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996

[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993

[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002

[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994

[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008

[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009

[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013

[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press

[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 6: A survey design for a sensitive binary variable correlated ... · Title A survey design for a sensitive binary variable correlated with another nonsensitive binary variable Author(s)

Journal of Probability and Statistics 5

whereA =

((((

(

3

sum

119894=2

1198991198941199012

119894

(1199011198941205791+ 120579119894)2+120601

11989921199012

(11990121205791+ 1205792)2+120601

11989931199013

(11990131205791+1205793)2

11989921199012

(11990121205791+ 1205792)2+120601

1198992

(11990121205791+ 1205792)2+120601 0

11989931199013

(11990131205791+ 1205793)2 0

1198993

(11990131205791+1205793)2

))))

)

(14)

Let se(120579119894) denote the standard error of 120579

119894for 119894 = 1 2 3

Note that se(120579119894) can be estimated by the square root of the 119894th

diagonal element of Iminus1obs(minus4) We denote the estimated valueof se(120579

119894) by se(120579

119894)Thus a 95normal-based asymptotic con-

fidence interval for 120579i can be constructed as

[120579119894minus 196 times se (120579

119894) 120579119894+ 196 times se (120579

119894)] 119894 = 1 2 3 (15)

Let 120599 = ℎ(120579minus4) be an arbitrary differentiable function of

120579minus4 For example 120579

4= 1 minus sum

3

119894=1120579i and the odds ratio 120595 =

12057911205793(12057921205794)The deltamethod (eg [14 page 34]) can be used

to approximate the standard error of 120599 = ℎ(minus4) and a 95

normal-based asymptotic confidence interval for 120599 is given by

[120599 minus 196 times se (120599) 120599 + 196 times se (120599)] (16)

where

se (120599) = (120597120599

120597120579minus4

)

Iminus1obs (120579minus4) (120597120599

120597120579minus4

)

10038161003816100381610038161003816100381610038161003816120579minus4= minus4

12

(17)

33 Bootstrap Confidence Intervals When the normal-basedasymptotic confidence interval like (15) is beyond the lowbound zero or the upper bound one the bootstrap approach[15] can be used to construct the bootstrap confidence inter-val of 120599 = ℎ(120579

minus4) Based on the obtained MLE we inde-

pendently generate

(119899lowast

1 119899

lowast

4)⊤

sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)

(18)

119898lowast

0sim Binomial (119898 120579

1+ 1205792) (19)

Having obtained 119884lowast

obs119872 = 119899 119899lowast

1 119899

lowast

4 and 119884

lowast

obs119878 =

119898119898lowast

0 119898lowast

1 where 119898

lowast

1= 119898 minus 119898

lowast

0 we can calculate the

bootstrap replication 120599lowast= ℎ(120579

lowast

minus4) based on 119884

lowast

obs = 119884lowast

obs119872

119884lowast

obs119878 via theEM algorithm specified by (6) and (8) Indepen-dently repeating this process 119866 times we obtain 119866 bootstrapreplications 120599lowast

119892119866

119892=1 Consequently a (1 minus 120572)100 bootstrap

confidence interval for 120599 is given by

[120599119871 120599119880] (20)

where 120599119871and 120599119880are the 100(1205722) and 100(1minus1205722) percentiles

of 120599lowast119892119866

119892=1 respectively

34 The Likelihood Ratio Test for Testing Association Thelikelihood ratio statistic can be used to test whether thetwo binary random variables 119883 and 119884 are independentcor-related The corresponding null and alternative hypothesesare [16 page 45]

1198670 120595 = 1 against 119867

1 120595 = 1 (21)

The likelihood ratio statistic is defined by

Λ = minus2 ℓCQ (119877| 119884obs) minus ℓCQ ( | 119884obs) (22)

where 119877denotes the restrictedMLE of 120579 under119867

0 denotes

the MLE of 120579 which can be obtained by the EM algorithmspecified by (6) and (8) and ℓCQ(120579 | 119884obs) = log 119871CQ(120579 | 119884obs)

To find the restricted MLE 119877 we also employ the 119864119872

algorithm Under1198670 12057911205793= 12057921205794 we have

1205791= (1 minus 120579

119909) (1 minus 120579

119910)

1205792= 120579119909(1 minus 120579

119910)

1205793= 120579119909120579119910

1205794= (1 minus 120579

119909) 120579119910

(23)

In otherwords under1198670we only have two free parameters 120579

119909

and 120579119910 Having obtained the restrictedMLEs 120579

119909119877and 120579119910119877

wecan compute the restricted MLE

119877= (1205791119877

1205794119877

)⊤ from

(23) by

1205791119877

= (1 minus 120579119909119877

) (1 minus 120579119910119877

)

1205792119877

= 120579119909119877

(1 minus 120579119910119877

)

1205793119877

= 120579119909119877

120579119910119877

1205794119877

= (1 minus 120579119909119877

) 120579119910119877

(24)

In what follows we consider the computation of therestricted MLEs 120579

119909119877and 120579

119910119877 Now the complete-data like-

lihood function (5) becomes

119871CQ (120579119909 120579119910| 119884com 1198670)

prop (1 minus 120579119909) (1 minus 120579

119910)1199111

120579119909(1 minus 120579

119910)1199112

times (120579119909120579119910)1199113

(1 minus 120579119909) 1205791199101198994

(1 minus 120579119910)1198980

1205791198981

119910

= 1205791199112+1199113

119909(1 minus 120579

119909)1199111+1198994

1205791199113+1198994+1198981

119910(1 minus 120579

119910)1199111+1199112+1198980

(25)

so that the restricted MLEs of 120579119909and 120579

119910based on the

complete-data are given by

120579119909=1199112+ 1199113

119899 120579

119910=1199113+ 1198994+ 1198981

119873 (26)

respectivelyThus the119872-step of the119864119872 algorithm calculates(26) and the 119864-step computes the conditional expectationsgiven in (8) where 120579

119894 are defined in (23) Finally under119867

0

Λ asymptotically follows chi-squared distribution with onedegree of freedom

6 Journal of Probability and Statistics

4 Bayesian Inferences

To derive the posterior mode of 120579 we employ the EM algo-rithm again The latent data 119884mis = 119911

2 1199113 are the same as

those in Section 31 Based on the complete-data likelihoodfunction (5) if theDirichlet distributionDirichlet (119886

1 119886

4)

is adopted as the prior distribution of 120579 then the complete-data posterior distribution is a grouped Dirichlet (GD) distri-bution with the formal definition given by (A2) that is

119891 (120579 | 119884obs 119884mis) = GD422

(120579minus4

| a b) (27)

where a = (1198861+1199111 1198862+1199112 1198863+1199113 1198864+1198994)⊤ b = (119898

0 1198981)⊤

and 1199111= 119899 minus (119911

2+ 1199113) minus 1198994 The conditional predictive dis-

tribution is

119891 (119884mis | 119884obs 120579) =3

prod

119894=2

Binomial(119911119894

10038161003816100381610038161003816100381610038161003816

119899119894

120579119894

1199011198941205791+ 120579119894

) (28)

Therefore the119872-step of the EM algorithm is to calculate thecomplete-data posterior mode

1205791= 1 minus 120579

2minus 1205793minus 1205794

1205792=

1198862minus 1 + 119911

2

119886+minus 4 + 119873

(1 +1198980

1198861+ 1198862minus 2 + 119899 minus 119911

3minus 1198994

)

1205793=

1198863minus 1 + 119911

3

119886+minus 4 + 119873

(1 +1198981

1198863+ 1198864minus 2 + 119911

3+ 1198994

)

1205794=

1198864minus 1 + 119899

4

119886+minus 4 + 119873

(1 +1198981

1198863+ 1198864minus 2 + 119911

3+ 1198994

)

(29)

where 119886+= sum4

119894=1119886119894 and the 119864-step is to replace 119911

1198943

119894=2by the

conditional expectations given by (8)In addition based on (27) and (28) the data augmen-

tation algorithm of Tanner and Wong [13] can be used togenerate posterior samples of 120579 A sampling method from(27) is given in the appendix

5 Simulation Studies

In this section two simulation studies are conducted to com-pare the efficiency of the proposed combination question-naire model with that of the hidden sensitivity model of Tianet al [8] (ie the main questionnaire model only) where(1199011 1199012 1199013) are assumed to be (13 13 13) and 119901

119894=Pr(119882 =

119894) for 119894 = 1 2 3 In the first simulated example let the totalsample size in the combination questionnaire model be thesame as the sample size in the hidden sensitivitymodel In thesecond example the sample size for themain questionnaire inthe combination questionnaire model is assumed to be equalto the sample size in the hidden sensitivity model

In the first simulated example let a total of119873 = 119899 + 119898 =

50 + 50 = 100 participants be interviewed by using the com-bination questionnaire model The true values of 120579

1198944

119894=1are

listed in the second column of Table 5 We first generate

(1198991 119899

4)⊤

sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)

(30)

and1198980sim Binomial (119898 120579

1+1205792) so that119898

1= 119898minus119898

0The119864119872

algorithm (6) and (8) is used to calculate the MLEs of 1205791198944

119894=1

We repeated this experiment 1000 times The average MLEsof 1205791198944

119894=1are reported in the third column of Table 5 The

corresponding bias variance and mean square error (MSE)are displayed in the fourth fifth and sixth columns of Table5 Next let119873 = 119899 + 119898 = 100 + 0 = 100 participants be inter-viewed by using the hidden sensitivity model (ie the mainquestionnaire only) The corresponding results are reportedin the last four columns of Table 5

From Table 5 we can see that both the MLEs of 1205791198944

119894=1

in the combination questionnaire model and the hiddensensitivity model are very close to their true values whilethe MSEs of 120579

1198944

119894=1in the combination questionnaire model

are slightly larger than those in the hidden sensitivity modelThese numerical results are not surprising since in the hiddensensitivity model Tian et al [8] implicitly assumed that allrespondents must strictly follow the design instructions Inother words the noncompliance behavior will not occurThe introduction of the supplemental questionnaire in thecombination questionnaire model can effectively reduce thenon-compliance behavior since more respondents will not befaced with the sensitive question while the cost for intro-ducing such a supplemental questionnaire is thatwe definitelylose a little of efficiency

In the second simulated example we assume that a total of119873 = 119899+119898 = 100+100 = 200 participants are interviewed byusing the combination questionnaire model while only119873 =

119899 + 119898 = 100 + 0 = 100 are interviewed by using the hiddensensitivity model We repeat this experiment 1000 times Thecorresponding results are reported in Table 6

From Table 6 we can see that the MSEs of 1205791198944

119894=1in the

combination questionnaire model are smaller than those inthe hidden sensitivity model In addition by comparing thefifth columns in Tables 5 and 6 we can see that the precisionsof 1205791198944

119894=1for the proposed combination questionnaire model

in the second simulated example are significantly improvedwhen compared with those in the first simulated example

6 Analyzing Cervical Cancer Data in Atlanta

Williamson and Haber [17] reported a study which examinedthe relationship between disease status of cervical cancer andthe number of sex partners and other risk factors Caseswere 20ndash79-year-old women of Fulton or DeKalb county inAtlanta Georgia They were diagnosed and were ascertainedto have invasive cervical cancer Controls were randomlychosen from the same counties and the same age rangesTable 7 gives the cross-classification of number of sex partners(ldquofew 0ndash3rdquo or ldquomany ⩾4rdquo denoted by 119883 = 0 or 119883 = 1) anddisease status (control or case denoted by 119884 = 0 or 119884 = 1)Generally a sizable proportion (135 in this example) of theresponses would be missing because of the sensitive questionabout the number of sex partners in a telephone interviewThe objective is to examine if association exists between thenumber of sex partners and disease status of cervical cancer

For the purpose of illustration we presume that 119883 = 0

is non-sensitive although the number 0ndash3 of sex partners

Journal of Probability and Statistics 7

Table 5 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901

1 1199012 1199013) = (13 13 13)

Parameter True valueThe combination questionnaire model The hidden sensitivity model

(119873 = 119899 + 119898 = 50 + 50) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE

1205791

010 01013 00013 00113 00113 01010 00010 00064 000641205792

050 04985 minus00015 00089 00089 04975 minus00025 00041 000411205793

020 02008 00008 00037 00037 02001 00001 00029 000291205794

020 01994 minus00006 00026 00026 02014 00014 00016 000161205791

020 01957 minus00043 00090 00090 01985 minus00015 00055 000551205792

030 02999 minus00001 00070 00070 02999 minus00001 00034 000341205793

040 04038 00038 00040 00040 04025 00025 00037 000371205794

010 01006 00006 00017 00017 00991 minus00009 00009 000091205791

030 03021 00021 00107 00107 03026 00026 00081 000811205792

025 02483 minus00017 00077 00077 02499 minus00001 00039 000391205793

015 01474 minus00026 00042 00042 01476 minus00024 00032 000321205794

030 03021 00021 00033 00033 02999 minus00001 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]

2

+ Var(120579119894)

Table 6 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901

1 1199012 1199013) = (13 13 13)

Parameter True valueThe combination questionnaire model The hidden sensitivity model

(119873 = 119899 + 119898 = 100 + 100) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE

1205791

010 01003 00003 00030 00030 01013 00013 00064 000641205792

050 05015 00015 00029 00029 05019 00019 00041 000411205793

020 01986 minus00014 00017 00017 01969 minus00031 00028 000281205794

020 01995 minus00005 00013 00013 01999 minus00001 00016 000161205791

020 02007 00007 00041 00041 02049 00049 00057 000571205792

030 02989 minus00011 00033 00033 02965 minus00035 00034 000341205793

040 04007 00007 00020 00020 03992 minus00008 00037 000371205794

010 00997 minus00003 00009 00009 00994 minus00006 00009 000091205791

030 03032 00032 00054 00054 02965 minus00036 00079 000791205792

025 02488 minus00012 00039 00039 02501 00001 00038 000381205793

015 01481 minus00019 00021 00021 01531 00031 00032 000321205794

030 03000 00000 00017 00017 03003 00003 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]

2

+ Var(120579119894)

Table 7 Cervical cancer data fromWilliamson and Haber [17]

Number of sex partners Disease status of cervical cancer119884 = 0 (control) 119884 = 1 (case)

119883 = 0 (few 0ndash3) 165 (1199031 1205791) 103 (119903

4 1205794)

119883 = 1 (many ⩾4) 164 (1199032 1205792) 221 (119903

3 1205793)

Missing 43 (11990312 1205791+ 1205792) 59 (119903

34 1205793+ 1205794)

Note The observed counts and the corresponding cell probabilities are inparentheses 119883 is a sensitive binary variate and 119884 is a non-sensitive binaryvariate

is somewhat sensitive for some respondents To illustratethe proposed design and approaches we let 119882 = 1 (2 3)

if a respondent was born in JanuaryndashApril (MayndashAugust

SeptemberndashDecember) It is then reasonable to assume that119901119896= Pr(119882 = 119896) asymp 13 for 119896 = 1 2 3 and119882 is independent

of the sensitive binary variate X and the the non-sensitivebinary variate 119884 For the ideal situation (ie no samplingerrors) the observed counts from the main questionnaire asshown in Tables 1 and 3 would be 119899

1= 11990313 = 55 119899

2=

55 + 1199032= 219 119899

3= 55 + 119903

3= 276 119899

4= 1199034= 103 that is

119884obs119872 = 119899 1198991 119899

4 = 653 55 219 276 103 (31)

On the other hand we can view themissing data in Table 7 asthe observed counts from the supplemental questionnaire asshown in Table 2 From Table 4 we have 119898

0= 11990312

= 43 and1198981= 11990334

= 59 that is 119884obs119878 = 1198981198980 1198981 = 102 43 59

Therefore we obtain the observed data 119884obs = 119884obs119872 119884obs119878

8 Journal of Probability and Statistics

Table 8 MLEs and 95 confidence intervals of 120579 and 120595

Parameter MLE std 95 asymptotic 95 bootstrapCI CI

1205791

023793 002937 [018036 029551] [018006 029879]1205792

024916 002261 [020484 029348] [020375 029359]1205793

035196 002247 [030790 039601] [030731 039666]1205794

016095 001434 [013284 018905] [013400 018908]120595 208830 043971 [122646 295011] [135361 318593]

61 Likelihood-Based Inferences Using 120579(0) = 144 as the

initial values the EM algorithm in (6) and (8) converged in29 iterations The resultant MLEs of 120579 and 120595 are given in thesecond column of Table 8 From (13) and (14) the asymptoticvariance-covariance matrix of the MLEs

minus4is

Iminus1obs (minus4)

= (

000086302 minus0000442447 minus0000384452

minus000044245 0000511253 minus0000010026

minus000038445 minus0000010026 0000505241

)

(32)

The estimated standard errors of 120579119894(119894 = 1 2 3) are square

roots of the main diagonal elements of the above matrixFrom (17) the estimated standard errors of 120579

4= 1minus120579

1minus1205792minus1205793

and = 12057911205793(12057921205794) are given by

se (1205794) = 1⊤

3Iminus1obs(minus4) 13

12

se () = 120572⊤Iminus1obs (minus4)120572

12

(33)

respectively where

120572 = (

1205793(1 minus 120579

2minus 1205793)

12057921205792

4

12057911205793(1205792minus 1205794)

(12057921205794)2

1205791(1 minus 120579

1minus 1205792)

12057921205792

4

)

(34)

These estimated standard errors are listed in the third columnof Table 8 From (15) and (16) we can obtain the 95asymptotic confidence intervals of 120579 and120595 which are showedin the fourth column of Table 8

Based on (18) and (19) we generate 119866 = 10000 bootstrapsamples The corresponding 95 bootstrap confidence inter-vals of 120579 and 120595 are displayed in the last column of Table 8

To test the null hypothesis 1198670 120595 = 1 against 119867

1 120595 = 1

we need to obtain the restricted MLE 119877 Using

(120579(0)

119909 120579(0)

119910)⊤

= (05 05)⊤ (35)

as the initial values the EM algorithm in (26) converged to

120579119909119877

= 064807 120579119910119877

= 052948 (36)

Table 9 Posterior modes and estimates of parameters for thecervical cancer data

Parameter Posterior Bayesian Bayesian 95 Bayesianmode mean std credible interval

1205791

023793 024025 002934 [018574 030050]1205792

024916 024789 002251 [020365 029192]1205793

035196 035029 002246 [030621 039432]1205794

016095 016155 001431 [013448 019042]120595 208830 214538 045838 [139399 318678]

in 19 iterations From (24) the restricted MLEs of 120579 areobtained as

119877= (016559 030493 034314 018634)

⊤ (37)

The log-likelihood ratio statistic Λ is equal to 12469 and the119875-value is 00004137 Since this 119875-value is far less than 005the 119867

0is rejected at the 005 level of significance Thus we

can conclude that there is an association between sex partnersand cervical cancer status based on the current data Thisconclusion is identical to that from the two 95 confidenceintervals of the odds ratio as shown in Table 8 where bothconfidence intervals exclude the value 1

62 Bayesian Inferences When Dirichlet (1 1 1 1) (ie theuniform distribution on T

4) is adopted as the prior dis-

tribution of 120579 the posterior modes of 120579are equal to thecorresponding MLEs Using 120579(0) = 1

44 as the initial values

we employ the data augmentation algorithm to generate1000000 posterior samples of 120579 and discard the first halfof the samples The Bayesian estimates of 120579 and 120595 are givenin Table 9 Since the lower bound of the Bayesian credibleinterval of the 120595 is larger than 1 we believe that there isan association between the number of sexual partners andcervical cancer status

The posterior densities of the 1205791198944

119894=1and 120595 estimated by a

kernel density smoother are plotted in Figures 1 and 2

7 Discussion

In this paper we develop a general framework of designand analysis for the combination questionnaire model whichconsists of a main questionnaire and a supplemental ques-tionnaire In fact the main questionnaire (see Table 1) is ageneralization of the nonrandomized triangular model [18]and is then a special case of the multicategory triangularmodel [19] The supplemental questionnaire (see Table 2) isa design of direct questioning The introduction of the sup-plemental questionnaire can effectively reduce the noncom-pliance behavior since more respondents will not be facedwith the sensitive question while the cost for introducingsuch a supplemental questionnaire is that we definitely losea little of efficiency The combination questionnaire modelcan be used to gather information to evaluate the associationbetween one sensitive binary variable and one non-sensitivebinary variable

We note that the proposed combination questionnairemodel has one limitation in applications that is it cannot be

Journal of Probability and Statistics 9

0 02 04 06 08 1

0

2

4

6

8

10

14

12

The p

oste

rior d

ensit

y

1205791

(a)

0 02 04 06 08 1

0

5

10

15

The p

oste

rior d

ensit

y

1205792

(b)

0 02 04 06 08 1

0

5

10

15

The p

oste

rior d

ensit

y

1205793

(c)

0 02 04 06 08 1

0

5

10

15

20

25

The p

oste

rior d

ensit

y

1205794

(d)

Figure 1 The posterior densities of the 1205791198944

119894=1estimated by a kernel density smoother based on the second 500000 posterior samples of

120579generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1) (a) The posterior density of 1205791 (b) The

posterior density of 1205792 (c) The posterior density of 120579

3 (d) The posterior density of 120579

4

1 2 3 4 5

0

02

04

06

08

The p

oste

rior d

ensit

y

120595

Figure 2 The posterior density of the odds ratio 120595 estimated by a kernel density smoother based on the second 500000 posterior samplesof 120579 generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1)

10 Journal of Probability and Statistics

applied to situationwhere two categories 119883 = 0 and 119883 = 1

are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches

Appendix

The Mode of a Group Dirichlet Density anda Sampling Method from a GDD

Let

V119899minus1

= (1199091 119909

119899minus1)⊤

119909119894⩾ 0

119894 = 1 119899 minus 1

119899minus1

sum

119894=1

119909119894⩽ 1

(A1)

denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X

1 X

119899)⊤

isin T119899is said to follow a

group Dirichlet distribution (GDD) with two partitions if thedensity of x

minus119899= (1198831 119883

119899minus1)⊤isin V119899minus1

is

GD1198992119904

(119909minus119899

| a b)

= 119888minus1

GD (

119899

prod

119894=1

119909119886119894minus1

119894)(

119904

sum

119894=1

119909i)

1198871

(

119899

sum

119894=119904+1

119909i)

1198872

(A2)

where 119909minus119899

= (1199091 119909

119899minus1)⊤ a = (119886

1 119886

119899)⊤ is a positive

parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter

vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by

119888GD = 119861 (1198861 119886

119904) 119861 (119886119904+1

119886119899)

times 119861(

119904

sum

119894=1

119886119894+ 1198871

119899

sum

119894=119904+1

119886119894+ 1198872)

(A3)

We write x sim GD1198992119904

(a b) on T119899or xminus119899

sim GD1198992119904

(a b) onV119899minus1

to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density

(A2) is given by [11 22]

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198871

sum119904

119895=1(119886119895minus 1)

119894 = 1 119904

(A4)

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198872

sum119899

119895=119904+1(119886119895minus 1)

119894 = 119904 + 1 119899

(A5)

The following procedure can be used to generate iidsamples from a GDD [11] Let

(1) y(1) sim Dirichlet (1198861 119886

119904) on T

119904

(2) y(2) sim Dirichlet (119886119904+1

119886119899) on T

119899minus119904

(3) 119877 sim Beta(sum119904119894=1

119886119894+ 1198871 sum119899

119894=119904+1119886119894+ 1198872) and

(4) y(1) y(2) and 119877 are mutually independent

Define

x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)

Then

x = (x(1)x(2)) sim GD

1198992119904(a b) (A7)

on T119899 where a = (119886

1 119886

119899)⊤ and b = (119887

1 1198872)⊤

Acknowledgments

The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region

References

[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965

[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980

[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984

[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986

[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005

[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005

[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005

Journal of Probability and Statistics 11

[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007

[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977

[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003

[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008

[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984

[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987

[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996

[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993

[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002

[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994

[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008

[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009

[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013

[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press

[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 7: A survey design for a sensitive binary variable correlated ... · Title A survey design for a sensitive binary variable correlated with another nonsensitive binary variable Author(s)

6 Journal of Probability and Statistics

4 Bayesian Inferences

To derive the posterior mode of 120579 we employ the EM algo-rithm again The latent data 119884mis = 119911

2 1199113 are the same as

those in Section 31 Based on the complete-data likelihoodfunction (5) if theDirichlet distributionDirichlet (119886

1 119886

4)

is adopted as the prior distribution of 120579 then the complete-data posterior distribution is a grouped Dirichlet (GD) distri-bution with the formal definition given by (A2) that is

119891 (120579 | 119884obs 119884mis) = GD422

(120579minus4

| a b) (27)

where a = (1198861+1199111 1198862+1199112 1198863+1199113 1198864+1198994)⊤ b = (119898

0 1198981)⊤

and 1199111= 119899 minus (119911

2+ 1199113) minus 1198994 The conditional predictive dis-

tribution is

119891 (119884mis | 119884obs 120579) =3

prod

119894=2

Binomial(119911119894

10038161003816100381610038161003816100381610038161003816

119899119894

120579119894

1199011198941205791+ 120579119894

) (28)

Therefore the119872-step of the EM algorithm is to calculate thecomplete-data posterior mode

1205791= 1 minus 120579

2minus 1205793minus 1205794

1205792=

1198862minus 1 + 119911

2

119886+minus 4 + 119873

(1 +1198980

1198861+ 1198862minus 2 + 119899 minus 119911

3minus 1198994

)

1205793=

1198863minus 1 + 119911

3

119886+minus 4 + 119873

(1 +1198981

1198863+ 1198864minus 2 + 119911

3+ 1198994

)

1205794=

1198864minus 1 + 119899

4

119886+minus 4 + 119873

(1 +1198981

1198863+ 1198864minus 2 + 119911

3+ 1198994

)

(29)

where 119886+= sum4

119894=1119886119894 and the 119864-step is to replace 119911

1198943

119894=2by the

conditional expectations given by (8)In addition based on (27) and (28) the data augmen-

tation algorithm of Tanner and Wong [13] can be used togenerate posterior samples of 120579 A sampling method from(27) is given in the appendix

5 Simulation Studies

In this section two simulation studies are conducted to com-pare the efficiency of the proposed combination question-naire model with that of the hidden sensitivity model of Tianet al [8] (ie the main questionnaire model only) where(1199011 1199012 1199013) are assumed to be (13 13 13) and 119901

119894=Pr(119882 =

119894) for 119894 = 1 2 3 In the first simulated example let the totalsample size in the combination questionnaire model be thesame as the sample size in the hidden sensitivitymodel In thesecond example the sample size for themain questionnaire inthe combination questionnaire model is assumed to be equalto the sample size in the hidden sensitivity model

In the first simulated example let a total of119873 = 119899 + 119898 =

50 + 50 = 100 participants be interviewed by using the com-bination questionnaire model The true values of 120579

1198944

119894=1are

listed in the second column of Table 5 We first generate

(1198991 119899

4)⊤

sim Multinomial (119899 11990111205791 11990121205791+ 1205792 11990131205791+ 1205793 1205794)

(30)

and1198980sim Binomial (119898 120579

1+1205792) so that119898

1= 119898minus119898

0The119864119872

algorithm (6) and (8) is used to calculate the MLEs of 1205791198944

119894=1

We repeated this experiment 1000 times The average MLEsof 1205791198944

119894=1are reported in the third column of Table 5 The

corresponding bias variance and mean square error (MSE)are displayed in the fourth fifth and sixth columns of Table5 Next let119873 = 119899 + 119898 = 100 + 0 = 100 participants be inter-viewed by using the hidden sensitivity model (ie the mainquestionnaire only) The corresponding results are reportedin the last four columns of Table 5

From Table 5 we can see that both the MLEs of 1205791198944

119894=1

in the combination questionnaire model and the hiddensensitivity model are very close to their true values whilethe MSEs of 120579

1198944

119894=1in the combination questionnaire model

are slightly larger than those in the hidden sensitivity modelThese numerical results are not surprising since in the hiddensensitivity model Tian et al [8] implicitly assumed that allrespondents must strictly follow the design instructions Inother words the noncompliance behavior will not occurThe introduction of the supplemental questionnaire in thecombination questionnaire model can effectively reduce thenon-compliance behavior since more respondents will not befaced with the sensitive question while the cost for intro-ducing such a supplemental questionnaire is thatwe definitelylose a little of efficiency

In the second simulated example we assume that a total of119873 = 119899+119898 = 100+100 = 200 participants are interviewed byusing the combination questionnaire model while only119873 =

119899 + 119898 = 100 + 0 = 100 are interviewed by using the hiddensensitivity model We repeat this experiment 1000 times Thecorresponding results are reported in Table 6

From Table 6 we can see that the MSEs of 1205791198944

119894=1in the

combination questionnaire model are smaller than those inthe hidden sensitivity model In addition by comparing thefifth columns in Tables 5 and 6 we can see that the precisionsof 1205791198944

119894=1for the proposed combination questionnaire model

in the second simulated example are significantly improvedwhen compared with those in the first simulated example

6 Analyzing Cervical Cancer Data in Atlanta

Williamson and Haber [17] reported a study which examinedthe relationship between disease status of cervical cancer andthe number of sex partners and other risk factors Caseswere 20ndash79-year-old women of Fulton or DeKalb county inAtlanta Georgia They were diagnosed and were ascertainedto have invasive cervical cancer Controls were randomlychosen from the same counties and the same age rangesTable 7 gives the cross-classification of number of sex partners(ldquofew 0ndash3rdquo or ldquomany ⩾4rdquo denoted by 119883 = 0 or 119883 = 1) anddisease status (control or case denoted by 119884 = 0 or 119884 = 1)Generally a sizable proportion (135 in this example) of theresponses would be missing because of the sensitive questionabout the number of sex partners in a telephone interviewThe objective is to examine if association exists between thenumber of sex partners and disease status of cervical cancer

For the purpose of illustration we presume that 119883 = 0

is non-sensitive although the number 0ndash3 of sex partners

Journal of Probability and Statistics 7

Table 5 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901

1 1199012 1199013) = (13 13 13)

Parameter True valueThe combination questionnaire model The hidden sensitivity model

(119873 = 119899 + 119898 = 50 + 50) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE

1205791

010 01013 00013 00113 00113 01010 00010 00064 000641205792

050 04985 minus00015 00089 00089 04975 minus00025 00041 000411205793

020 02008 00008 00037 00037 02001 00001 00029 000291205794

020 01994 minus00006 00026 00026 02014 00014 00016 000161205791

020 01957 minus00043 00090 00090 01985 minus00015 00055 000551205792

030 02999 minus00001 00070 00070 02999 minus00001 00034 000341205793

040 04038 00038 00040 00040 04025 00025 00037 000371205794

010 01006 00006 00017 00017 00991 minus00009 00009 000091205791

030 03021 00021 00107 00107 03026 00026 00081 000811205792

025 02483 minus00017 00077 00077 02499 minus00001 00039 000391205793

015 01474 minus00026 00042 00042 01476 minus00024 00032 000321205794

030 03021 00021 00033 00033 02999 minus00001 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]

2

+ Var(120579119894)

Table 6 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901

1 1199012 1199013) = (13 13 13)

Parameter True valueThe combination questionnaire model The hidden sensitivity model

(119873 = 119899 + 119898 = 100 + 100) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE

1205791

010 01003 00003 00030 00030 01013 00013 00064 000641205792

050 05015 00015 00029 00029 05019 00019 00041 000411205793

020 01986 minus00014 00017 00017 01969 minus00031 00028 000281205794

020 01995 minus00005 00013 00013 01999 minus00001 00016 000161205791

020 02007 00007 00041 00041 02049 00049 00057 000571205792

030 02989 minus00011 00033 00033 02965 minus00035 00034 000341205793

040 04007 00007 00020 00020 03992 minus00008 00037 000371205794

010 00997 minus00003 00009 00009 00994 minus00006 00009 000091205791

030 03032 00032 00054 00054 02965 minus00036 00079 000791205792

025 02488 minus00012 00039 00039 02501 00001 00038 000381205793

015 01481 minus00019 00021 00021 01531 00031 00032 000321205794

030 03000 00000 00017 00017 03003 00003 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]

2

+ Var(120579119894)

Table 7 Cervical cancer data fromWilliamson and Haber [17]

Number of sex partners Disease status of cervical cancer119884 = 0 (control) 119884 = 1 (case)

119883 = 0 (few 0ndash3) 165 (1199031 1205791) 103 (119903

4 1205794)

119883 = 1 (many ⩾4) 164 (1199032 1205792) 221 (119903

3 1205793)

Missing 43 (11990312 1205791+ 1205792) 59 (119903

34 1205793+ 1205794)

Note The observed counts and the corresponding cell probabilities are inparentheses 119883 is a sensitive binary variate and 119884 is a non-sensitive binaryvariate

is somewhat sensitive for some respondents To illustratethe proposed design and approaches we let 119882 = 1 (2 3)

if a respondent was born in JanuaryndashApril (MayndashAugust

SeptemberndashDecember) It is then reasonable to assume that119901119896= Pr(119882 = 119896) asymp 13 for 119896 = 1 2 3 and119882 is independent

of the sensitive binary variate X and the the non-sensitivebinary variate 119884 For the ideal situation (ie no samplingerrors) the observed counts from the main questionnaire asshown in Tables 1 and 3 would be 119899

1= 11990313 = 55 119899

2=

55 + 1199032= 219 119899

3= 55 + 119903

3= 276 119899

4= 1199034= 103 that is

119884obs119872 = 119899 1198991 119899

4 = 653 55 219 276 103 (31)

On the other hand we can view themissing data in Table 7 asthe observed counts from the supplemental questionnaire asshown in Table 2 From Table 4 we have 119898

0= 11990312

= 43 and1198981= 11990334

= 59 that is 119884obs119878 = 1198981198980 1198981 = 102 43 59

Therefore we obtain the observed data 119884obs = 119884obs119872 119884obs119878

8 Journal of Probability and Statistics

Table 8 MLEs and 95 confidence intervals of 120579 and 120595

Parameter MLE std 95 asymptotic 95 bootstrapCI CI

1205791

023793 002937 [018036 029551] [018006 029879]1205792

024916 002261 [020484 029348] [020375 029359]1205793

035196 002247 [030790 039601] [030731 039666]1205794

016095 001434 [013284 018905] [013400 018908]120595 208830 043971 [122646 295011] [135361 318593]

61 Likelihood-Based Inferences Using 120579(0) = 144 as the

initial values the EM algorithm in (6) and (8) converged in29 iterations The resultant MLEs of 120579 and 120595 are given in thesecond column of Table 8 From (13) and (14) the asymptoticvariance-covariance matrix of the MLEs

minus4is

Iminus1obs (minus4)

= (

000086302 minus0000442447 minus0000384452

minus000044245 0000511253 minus0000010026

minus000038445 minus0000010026 0000505241

)

(32)

The estimated standard errors of 120579119894(119894 = 1 2 3) are square

roots of the main diagonal elements of the above matrixFrom (17) the estimated standard errors of 120579

4= 1minus120579

1minus1205792minus1205793

and = 12057911205793(12057921205794) are given by

se (1205794) = 1⊤

3Iminus1obs(minus4) 13

12

se () = 120572⊤Iminus1obs (minus4)120572

12

(33)

respectively where

120572 = (

1205793(1 minus 120579

2minus 1205793)

12057921205792

4

12057911205793(1205792minus 1205794)

(12057921205794)2

1205791(1 minus 120579

1minus 1205792)

12057921205792

4

)

(34)

These estimated standard errors are listed in the third columnof Table 8 From (15) and (16) we can obtain the 95asymptotic confidence intervals of 120579 and120595 which are showedin the fourth column of Table 8

Based on (18) and (19) we generate 119866 = 10000 bootstrapsamples The corresponding 95 bootstrap confidence inter-vals of 120579 and 120595 are displayed in the last column of Table 8

To test the null hypothesis 1198670 120595 = 1 against 119867

1 120595 = 1

we need to obtain the restricted MLE 119877 Using

(120579(0)

119909 120579(0)

119910)⊤

= (05 05)⊤ (35)

as the initial values the EM algorithm in (26) converged to

120579119909119877

= 064807 120579119910119877

= 052948 (36)

Table 9 Posterior modes and estimates of parameters for thecervical cancer data

Parameter Posterior Bayesian Bayesian 95 Bayesianmode mean std credible interval

1205791

023793 024025 002934 [018574 030050]1205792

024916 024789 002251 [020365 029192]1205793

035196 035029 002246 [030621 039432]1205794

016095 016155 001431 [013448 019042]120595 208830 214538 045838 [139399 318678]

in 19 iterations From (24) the restricted MLEs of 120579 areobtained as

119877= (016559 030493 034314 018634)

⊤ (37)

The log-likelihood ratio statistic Λ is equal to 12469 and the119875-value is 00004137 Since this 119875-value is far less than 005the 119867

0is rejected at the 005 level of significance Thus we

can conclude that there is an association between sex partnersand cervical cancer status based on the current data Thisconclusion is identical to that from the two 95 confidenceintervals of the odds ratio as shown in Table 8 where bothconfidence intervals exclude the value 1

62 Bayesian Inferences When Dirichlet (1 1 1 1) (ie theuniform distribution on T

4) is adopted as the prior dis-

tribution of 120579 the posterior modes of 120579are equal to thecorresponding MLEs Using 120579(0) = 1

44 as the initial values

we employ the data augmentation algorithm to generate1000000 posterior samples of 120579 and discard the first halfof the samples The Bayesian estimates of 120579 and 120595 are givenin Table 9 Since the lower bound of the Bayesian credibleinterval of the 120595 is larger than 1 we believe that there isan association between the number of sexual partners andcervical cancer status

The posterior densities of the 1205791198944

119894=1and 120595 estimated by a

kernel density smoother are plotted in Figures 1 and 2

7 Discussion

In this paper we develop a general framework of designand analysis for the combination questionnaire model whichconsists of a main questionnaire and a supplemental ques-tionnaire In fact the main questionnaire (see Table 1) is ageneralization of the nonrandomized triangular model [18]and is then a special case of the multicategory triangularmodel [19] The supplemental questionnaire (see Table 2) isa design of direct questioning The introduction of the sup-plemental questionnaire can effectively reduce the noncom-pliance behavior since more respondents will not be facedwith the sensitive question while the cost for introducingsuch a supplemental questionnaire is that we definitely losea little of efficiency The combination questionnaire modelcan be used to gather information to evaluate the associationbetween one sensitive binary variable and one non-sensitivebinary variable

We note that the proposed combination questionnairemodel has one limitation in applications that is it cannot be

Journal of Probability and Statistics 9

0 02 04 06 08 1

0

2

4

6

8

10

14

12

The p

oste

rior d

ensit

y

1205791

(a)

0 02 04 06 08 1

0

5

10

15

The p

oste

rior d

ensit

y

1205792

(b)

0 02 04 06 08 1

0

5

10

15

The p

oste

rior d

ensit

y

1205793

(c)

0 02 04 06 08 1

0

5

10

15

20

25

The p

oste

rior d

ensit

y

1205794

(d)

Figure 1 The posterior densities of the 1205791198944

119894=1estimated by a kernel density smoother based on the second 500000 posterior samples of

120579generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1) (a) The posterior density of 1205791 (b) The

posterior density of 1205792 (c) The posterior density of 120579

3 (d) The posterior density of 120579

4

1 2 3 4 5

0

02

04

06

08

The p

oste

rior d

ensit

y

120595

Figure 2 The posterior density of the odds ratio 120595 estimated by a kernel density smoother based on the second 500000 posterior samplesof 120579 generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1)

10 Journal of Probability and Statistics

applied to situationwhere two categories 119883 = 0 and 119883 = 1

are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches

Appendix

The Mode of a Group Dirichlet Density anda Sampling Method from a GDD

Let

V119899minus1

= (1199091 119909

119899minus1)⊤

119909119894⩾ 0

119894 = 1 119899 minus 1

119899minus1

sum

119894=1

119909119894⩽ 1

(A1)

denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X

1 X

119899)⊤

isin T119899is said to follow a

group Dirichlet distribution (GDD) with two partitions if thedensity of x

minus119899= (1198831 119883

119899minus1)⊤isin V119899minus1

is

GD1198992119904

(119909minus119899

| a b)

= 119888minus1

GD (

119899

prod

119894=1

119909119886119894minus1

119894)(

119904

sum

119894=1

119909i)

1198871

(

119899

sum

119894=119904+1

119909i)

1198872

(A2)

where 119909minus119899

= (1199091 119909

119899minus1)⊤ a = (119886

1 119886

119899)⊤ is a positive

parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter

vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by

119888GD = 119861 (1198861 119886

119904) 119861 (119886119904+1

119886119899)

times 119861(

119904

sum

119894=1

119886119894+ 1198871

119899

sum

119894=119904+1

119886119894+ 1198872)

(A3)

We write x sim GD1198992119904

(a b) on T119899or xminus119899

sim GD1198992119904

(a b) onV119899minus1

to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density

(A2) is given by [11 22]

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198871

sum119904

119895=1(119886119895minus 1)

119894 = 1 119904

(A4)

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198872

sum119899

119895=119904+1(119886119895minus 1)

119894 = 119904 + 1 119899

(A5)

The following procedure can be used to generate iidsamples from a GDD [11] Let

(1) y(1) sim Dirichlet (1198861 119886

119904) on T

119904

(2) y(2) sim Dirichlet (119886119904+1

119886119899) on T

119899minus119904

(3) 119877 sim Beta(sum119904119894=1

119886119894+ 1198871 sum119899

119894=119904+1119886119894+ 1198872) and

(4) y(1) y(2) and 119877 are mutually independent

Define

x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)

Then

x = (x(1)x(2)) sim GD

1198992119904(a b) (A7)

on T119899 where a = (119886

1 119886

119899)⊤ and b = (119887

1 1198872)⊤

Acknowledgments

The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region

References

[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965

[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980

[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984

[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986

[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005

[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005

[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005

Journal of Probability and Statistics 11

[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007

[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977

[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003

[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008

[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984

[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987

[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996

[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993

[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002

[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994

[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008

[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009

[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013

[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press

[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 8: A survey design for a sensitive binary variable correlated ... · Title A survey design for a sensitive binary variable correlated with another nonsensitive binary variable Author(s)

Journal of Probability and Statistics 7

Table 5 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901

1 1199012 1199013) = (13 13 13)

Parameter True valueThe combination questionnaire model The hidden sensitivity model

(119873 = 119899 + 119898 = 50 + 50) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE

1205791

010 01013 00013 00113 00113 01010 00010 00064 000641205792

050 04985 minus00015 00089 00089 04975 minus00025 00041 000411205793

020 02008 00008 00037 00037 02001 00001 00029 000291205794

020 01994 minus00006 00026 00026 02014 00014 00016 000161205791

020 01957 minus00043 00090 00090 01985 minus00015 00055 000551205792

030 02999 minus00001 00070 00070 02999 minus00001 00034 000341205793

040 04038 00038 00040 00040 04025 00025 00037 000371205794

010 01006 00006 00017 00017 00991 minus00009 00009 000091205791

030 03021 00021 00107 00107 03026 00026 00081 000811205792

025 02483 minus00017 00077 00077 02499 minus00001 00039 000391205793

015 01474 minus00026 00042 00042 01476 minus00024 00032 000321205794

030 03021 00021 00033 00033 02999 minus00001 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]

2

+ Var(120579119894)

Table 6 Comparison of the efficiency of the proposed combination questionnaire model with that of the hidden sensitivity model of Tian etal [8] (ie the main questionnaire model only) with (119901

1 1199012 1199013) = (13 13 13)

Parameter True valueThe combination questionnaire model The hidden sensitivity model

(119873 = 119899 + 119898 = 100 + 100) (119873 = 119899 + 119898 = 100 + 0)MLE Bias Variance MSE MLE Bias Variance MSE

1205791

010 01003 00003 00030 00030 01013 00013 00064 000641205792

050 05015 00015 00029 00029 05019 00019 00041 000411205793

020 01986 minus00014 00017 00017 01969 minus00031 00028 000281205794

020 01995 minus00005 00013 00013 01999 minus00001 00016 000161205791

020 02007 00007 00041 00041 02049 00049 00057 000571205792

030 02989 minus00011 00033 00033 02965 minus00035 00034 000341205793

040 04007 00007 00020 00020 03992 minus00008 00037 000371205794

010 00997 minus00003 00009 00009 00994 minus00006 00009 000091205791

030 03032 00032 00054 00054 02965 minus00036 00079 000791205792

025 02488 minus00012 00039 00039 02501 00001 00038 000381205793

015 01481 minus00019 00021 00021 01531 00031 00032 000321205794

030 03000 00000 00017 00017 03003 00003 00021 00021Note Bias(120579119894) = 119864(120579119894) minus 120579119894 and MSE(120579119894) = [Bias(120579119894)]

2

+ Var(120579119894)

Table 7 Cervical cancer data fromWilliamson and Haber [17]

Number of sex partners Disease status of cervical cancer119884 = 0 (control) 119884 = 1 (case)

119883 = 0 (few 0ndash3) 165 (1199031 1205791) 103 (119903

4 1205794)

119883 = 1 (many ⩾4) 164 (1199032 1205792) 221 (119903

3 1205793)

Missing 43 (11990312 1205791+ 1205792) 59 (119903

34 1205793+ 1205794)

Note The observed counts and the corresponding cell probabilities are inparentheses 119883 is a sensitive binary variate and 119884 is a non-sensitive binaryvariate

is somewhat sensitive for some respondents To illustratethe proposed design and approaches we let 119882 = 1 (2 3)

if a respondent was born in JanuaryndashApril (MayndashAugust

SeptemberndashDecember) It is then reasonable to assume that119901119896= Pr(119882 = 119896) asymp 13 for 119896 = 1 2 3 and119882 is independent

of the sensitive binary variate X and the the non-sensitivebinary variate 119884 For the ideal situation (ie no samplingerrors) the observed counts from the main questionnaire asshown in Tables 1 and 3 would be 119899

1= 11990313 = 55 119899

2=

55 + 1199032= 219 119899

3= 55 + 119903

3= 276 119899

4= 1199034= 103 that is

119884obs119872 = 119899 1198991 119899

4 = 653 55 219 276 103 (31)

On the other hand we can view themissing data in Table 7 asthe observed counts from the supplemental questionnaire asshown in Table 2 From Table 4 we have 119898

0= 11990312

= 43 and1198981= 11990334

= 59 that is 119884obs119878 = 1198981198980 1198981 = 102 43 59

Therefore we obtain the observed data 119884obs = 119884obs119872 119884obs119878

8 Journal of Probability and Statistics

Table 8 MLEs and 95 confidence intervals of 120579 and 120595

Parameter MLE std 95 asymptotic 95 bootstrapCI CI

1205791

023793 002937 [018036 029551] [018006 029879]1205792

024916 002261 [020484 029348] [020375 029359]1205793

035196 002247 [030790 039601] [030731 039666]1205794

016095 001434 [013284 018905] [013400 018908]120595 208830 043971 [122646 295011] [135361 318593]

61 Likelihood-Based Inferences Using 120579(0) = 144 as the

initial values the EM algorithm in (6) and (8) converged in29 iterations The resultant MLEs of 120579 and 120595 are given in thesecond column of Table 8 From (13) and (14) the asymptoticvariance-covariance matrix of the MLEs

minus4is

Iminus1obs (minus4)

= (

000086302 minus0000442447 minus0000384452

minus000044245 0000511253 minus0000010026

minus000038445 minus0000010026 0000505241

)

(32)

The estimated standard errors of 120579119894(119894 = 1 2 3) are square

roots of the main diagonal elements of the above matrixFrom (17) the estimated standard errors of 120579

4= 1minus120579

1minus1205792minus1205793

and = 12057911205793(12057921205794) are given by

se (1205794) = 1⊤

3Iminus1obs(minus4) 13

12

se () = 120572⊤Iminus1obs (minus4)120572

12

(33)

respectively where

120572 = (

1205793(1 minus 120579

2minus 1205793)

12057921205792

4

12057911205793(1205792minus 1205794)

(12057921205794)2

1205791(1 minus 120579

1minus 1205792)

12057921205792

4

)

(34)

These estimated standard errors are listed in the third columnof Table 8 From (15) and (16) we can obtain the 95asymptotic confidence intervals of 120579 and120595 which are showedin the fourth column of Table 8

Based on (18) and (19) we generate 119866 = 10000 bootstrapsamples The corresponding 95 bootstrap confidence inter-vals of 120579 and 120595 are displayed in the last column of Table 8

To test the null hypothesis 1198670 120595 = 1 against 119867

1 120595 = 1

we need to obtain the restricted MLE 119877 Using

(120579(0)

119909 120579(0)

119910)⊤

= (05 05)⊤ (35)

as the initial values the EM algorithm in (26) converged to

120579119909119877

= 064807 120579119910119877

= 052948 (36)

Table 9 Posterior modes and estimates of parameters for thecervical cancer data

Parameter Posterior Bayesian Bayesian 95 Bayesianmode mean std credible interval

1205791

023793 024025 002934 [018574 030050]1205792

024916 024789 002251 [020365 029192]1205793

035196 035029 002246 [030621 039432]1205794

016095 016155 001431 [013448 019042]120595 208830 214538 045838 [139399 318678]

in 19 iterations From (24) the restricted MLEs of 120579 areobtained as

119877= (016559 030493 034314 018634)

⊤ (37)

The log-likelihood ratio statistic Λ is equal to 12469 and the119875-value is 00004137 Since this 119875-value is far less than 005the 119867

0is rejected at the 005 level of significance Thus we

can conclude that there is an association between sex partnersand cervical cancer status based on the current data Thisconclusion is identical to that from the two 95 confidenceintervals of the odds ratio as shown in Table 8 where bothconfidence intervals exclude the value 1

62 Bayesian Inferences When Dirichlet (1 1 1 1) (ie theuniform distribution on T

4) is adopted as the prior dis-

tribution of 120579 the posterior modes of 120579are equal to thecorresponding MLEs Using 120579(0) = 1

44 as the initial values

we employ the data augmentation algorithm to generate1000000 posterior samples of 120579 and discard the first halfof the samples The Bayesian estimates of 120579 and 120595 are givenin Table 9 Since the lower bound of the Bayesian credibleinterval of the 120595 is larger than 1 we believe that there isan association between the number of sexual partners andcervical cancer status

The posterior densities of the 1205791198944

119894=1and 120595 estimated by a

kernel density smoother are plotted in Figures 1 and 2

7 Discussion

In this paper we develop a general framework of designand analysis for the combination questionnaire model whichconsists of a main questionnaire and a supplemental ques-tionnaire In fact the main questionnaire (see Table 1) is ageneralization of the nonrandomized triangular model [18]and is then a special case of the multicategory triangularmodel [19] The supplemental questionnaire (see Table 2) isa design of direct questioning The introduction of the sup-plemental questionnaire can effectively reduce the noncom-pliance behavior since more respondents will not be facedwith the sensitive question while the cost for introducingsuch a supplemental questionnaire is that we definitely losea little of efficiency The combination questionnaire modelcan be used to gather information to evaluate the associationbetween one sensitive binary variable and one non-sensitivebinary variable

We note that the proposed combination questionnairemodel has one limitation in applications that is it cannot be

Journal of Probability and Statistics 9

0 02 04 06 08 1

0

2

4

6

8

10

14

12

The p

oste

rior d

ensit

y

1205791

(a)

0 02 04 06 08 1

0

5

10

15

The p

oste

rior d

ensit

y

1205792

(b)

0 02 04 06 08 1

0

5

10

15

The p

oste

rior d

ensit

y

1205793

(c)

0 02 04 06 08 1

0

5

10

15

20

25

The p

oste

rior d

ensit

y

1205794

(d)

Figure 1 The posterior densities of the 1205791198944

119894=1estimated by a kernel density smoother based on the second 500000 posterior samples of

120579generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1) (a) The posterior density of 1205791 (b) The

posterior density of 1205792 (c) The posterior density of 120579

3 (d) The posterior density of 120579

4

1 2 3 4 5

0

02

04

06

08

The p

oste

rior d

ensit

y

120595

Figure 2 The posterior density of the odds ratio 120595 estimated by a kernel density smoother based on the second 500000 posterior samplesof 120579 generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1)

10 Journal of Probability and Statistics

applied to situationwhere two categories 119883 = 0 and 119883 = 1

are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches

Appendix

The Mode of a Group Dirichlet Density anda Sampling Method from a GDD

Let

V119899minus1

= (1199091 119909

119899minus1)⊤

119909119894⩾ 0

119894 = 1 119899 minus 1

119899minus1

sum

119894=1

119909119894⩽ 1

(A1)

denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X

1 X

119899)⊤

isin T119899is said to follow a

group Dirichlet distribution (GDD) with two partitions if thedensity of x

minus119899= (1198831 119883

119899minus1)⊤isin V119899minus1

is

GD1198992119904

(119909minus119899

| a b)

= 119888minus1

GD (

119899

prod

119894=1

119909119886119894minus1

119894)(

119904

sum

119894=1

119909i)

1198871

(

119899

sum

119894=119904+1

119909i)

1198872

(A2)

where 119909minus119899

= (1199091 119909

119899minus1)⊤ a = (119886

1 119886

119899)⊤ is a positive

parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter

vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by

119888GD = 119861 (1198861 119886

119904) 119861 (119886119904+1

119886119899)

times 119861(

119904

sum

119894=1

119886119894+ 1198871

119899

sum

119894=119904+1

119886119894+ 1198872)

(A3)

We write x sim GD1198992119904

(a b) on T119899or xminus119899

sim GD1198992119904

(a b) onV119899minus1

to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density

(A2) is given by [11 22]

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198871

sum119904

119895=1(119886119895minus 1)

119894 = 1 119904

(A4)

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198872

sum119899

119895=119904+1(119886119895minus 1)

119894 = 119904 + 1 119899

(A5)

The following procedure can be used to generate iidsamples from a GDD [11] Let

(1) y(1) sim Dirichlet (1198861 119886

119904) on T

119904

(2) y(2) sim Dirichlet (119886119904+1

119886119899) on T

119899minus119904

(3) 119877 sim Beta(sum119904119894=1

119886119894+ 1198871 sum119899

119894=119904+1119886119894+ 1198872) and

(4) y(1) y(2) and 119877 are mutually independent

Define

x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)

Then

x = (x(1)x(2)) sim GD

1198992119904(a b) (A7)

on T119899 where a = (119886

1 119886

119899)⊤ and b = (119887

1 1198872)⊤

Acknowledgments

The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region

References

[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965

[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980

[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984

[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986

[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005

[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005

[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005

Journal of Probability and Statistics 11

[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007

[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977

[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003

[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008

[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984

[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987

[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996

[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993

[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002

[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994

[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008

[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009

[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013

[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press

[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 9: A survey design for a sensitive binary variable correlated ... · Title A survey design for a sensitive binary variable correlated with another nonsensitive binary variable Author(s)

8 Journal of Probability and Statistics

Table 8 MLEs and 95 confidence intervals of 120579 and 120595

Parameter MLE std 95 asymptotic 95 bootstrapCI CI

1205791

023793 002937 [018036 029551] [018006 029879]1205792

024916 002261 [020484 029348] [020375 029359]1205793

035196 002247 [030790 039601] [030731 039666]1205794

016095 001434 [013284 018905] [013400 018908]120595 208830 043971 [122646 295011] [135361 318593]

61 Likelihood-Based Inferences Using 120579(0) = 144 as the

initial values the EM algorithm in (6) and (8) converged in29 iterations The resultant MLEs of 120579 and 120595 are given in thesecond column of Table 8 From (13) and (14) the asymptoticvariance-covariance matrix of the MLEs

minus4is

Iminus1obs (minus4)

= (

000086302 minus0000442447 minus0000384452

minus000044245 0000511253 minus0000010026

minus000038445 minus0000010026 0000505241

)

(32)

The estimated standard errors of 120579119894(119894 = 1 2 3) are square

roots of the main diagonal elements of the above matrixFrom (17) the estimated standard errors of 120579

4= 1minus120579

1minus1205792minus1205793

and = 12057911205793(12057921205794) are given by

se (1205794) = 1⊤

3Iminus1obs(minus4) 13

12

se () = 120572⊤Iminus1obs (minus4)120572

12

(33)

respectively where

120572 = (

1205793(1 minus 120579

2minus 1205793)

12057921205792

4

12057911205793(1205792minus 1205794)

(12057921205794)2

1205791(1 minus 120579

1minus 1205792)

12057921205792

4

)

(34)

These estimated standard errors are listed in the third columnof Table 8 From (15) and (16) we can obtain the 95asymptotic confidence intervals of 120579 and120595 which are showedin the fourth column of Table 8

Based on (18) and (19) we generate 119866 = 10000 bootstrapsamples The corresponding 95 bootstrap confidence inter-vals of 120579 and 120595 are displayed in the last column of Table 8

To test the null hypothesis 1198670 120595 = 1 against 119867

1 120595 = 1

we need to obtain the restricted MLE 119877 Using

(120579(0)

119909 120579(0)

119910)⊤

= (05 05)⊤ (35)

as the initial values the EM algorithm in (26) converged to

120579119909119877

= 064807 120579119910119877

= 052948 (36)

Table 9 Posterior modes and estimates of parameters for thecervical cancer data

Parameter Posterior Bayesian Bayesian 95 Bayesianmode mean std credible interval

1205791

023793 024025 002934 [018574 030050]1205792

024916 024789 002251 [020365 029192]1205793

035196 035029 002246 [030621 039432]1205794

016095 016155 001431 [013448 019042]120595 208830 214538 045838 [139399 318678]

in 19 iterations From (24) the restricted MLEs of 120579 areobtained as

119877= (016559 030493 034314 018634)

⊤ (37)

The log-likelihood ratio statistic Λ is equal to 12469 and the119875-value is 00004137 Since this 119875-value is far less than 005the 119867

0is rejected at the 005 level of significance Thus we

can conclude that there is an association between sex partnersand cervical cancer status based on the current data Thisconclusion is identical to that from the two 95 confidenceintervals of the odds ratio as shown in Table 8 where bothconfidence intervals exclude the value 1

62 Bayesian Inferences When Dirichlet (1 1 1 1) (ie theuniform distribution on T

4) is adopted as the prior dis-

tribution of 120579 the posterior modes of 120579are equal to thecorresponding MLEs Using 120579(0) = 1

44 as the initial values

we employ the data augmentation algorithm to generate1000000 posterior samples of 120579 and discard the first halfof the samples The Bayesian estimates of 120579 and 120595 are givenin Table 9 Since the lower bound of the Bayesian credibleinterval of the 120595 is larger than 1 we believe that there isan association between the number of sexual partners andcervical cancer status

The posterior densities of the 1205791198944

119894=1and 120595 estimated by a

kernel density smoother are plotted in Figures 1 and 2

7 Discussion

In this paper we develop a general framework of designand analysis for the combination questionnaire model whichconsists of a main questionnaire and a supplemental ques-tionnaire In fact the main questionnaire (see Table 1) is ageneralization of the nonrandomized triangular model [18]and is then a special case of the multicategory triangularmodel [19] The supplemental questionnaire (see Table 2) isa design of direct questioning The introduction of the sup-plemental questionnaire can effectively reduce the noncom-pliance behavior since more respondents will not be facedwith the sensitive question while the cost for introducingsuch a supplemental questionnaire is that we definitely losea little of efficiency The combination questionnaire modelcan be used to gather information to evaluate the associationbetween one sensitive binary variable and one non-sensitivebinary variable

We note that the proposed combination questionnairemodel has one limitation in applications that is it cannot be

Journal of Probability and Statistics 9

0 02 04 06 08 1

0

2

4

6

8

10

14

12

The p

oste

rior d

ensit

y

1205791

(a)

0 02 04 06 08 1

0

5

10

15

The p

oste

rior d

ensit

y

1205792

(b)

0 02 04 06 08 1

0

5

10

15

The p

oste

rior d

ensit

y

1205793

(c)

0 02 04 06 08 1

0

5

10

15

20

25

The p

oste

rior d

ensit

y

1205794

(d)

Figure 1 The posterior densities of the 1205791198944

119894=1estimated by a kernel density smoother based on the second 500000 posterior samples of

120579generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1) (a) The posterior density of 1205791 (b) The

posterior density of 1205792 (c) The posterior density of 120579

3 (d) The posterior density of 120579

4

1 2 3 4 5

0

02

04

06

08

The p

oste

rior d

ensit

y

120595

Figure 2 The posterior density of the odds ratio 120595 estimated by a kernel density smoother based on the second 500000 posterior samplesof 120579 generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1)

10 Journal of Probability and Statistics

applied to situationwhere two categories 119883 = 0 and 119883 = 1

are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches

Appendix

The Mode of a Group Dirichlet Density anda Sampling Method from a GDD

Let

V119899minus1

= (1199091 119909

119899minus1)⊤

119909119894⩾ 0

119894 = 1 119899 minus 1

119899minus1

sum

119894=1

119909119894⩽ 1

(A1)

denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X

1 X

119899)⊤

isin T119899is said to follow a

group Dirichlet distribution (GDD) with two partitions if thedensity of x

minus119899= (1198831 119883

119899minus1)⊤isin V119899minus1

is

GD1198992119904

(119909minus119899

| a b)

= 119888minus1

GD (

119899

prod

119894=1

119909119886119894minus1

119894)(

119904

sum

119894=1

119909i)

1198871

(

119899

sum

119894=119904+1

119909i)

1198872

(A2)

where 119909minus119899

= (1199091 119909

119899minus1)⊤ a = (119886

1 119886

119899)⊤ is a positive

parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter

vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by

119888GD = 119861 (1198861 119886

119904) 119861 (119886119904+1

119886119899)

times 119861(

119904

sum

119894=1

119886119894+ 1198871

119899

sum

119894=119904+1

119886119894+ 1198872)

(A3)

We write x sim GD1198992119904

(a b) on T119899or xminus119899

sim GD1198992119904

(a b) onV119899minus1

to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density

(A2) is given by [11 22]

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198871

sum119904

119895=1(119886119895minus 1)

119894 = 1 119904

(A4)

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198872

sum119899

119895=119904+1(119886119895minus 1)

119894 = 119904 + 1 119899

(A5)

The following procedure can be used to generate iidsamples from a GDD [11] Let

(1) y(1) sim Dirichlet (1198861 119886

119904) on T

119904

(2) y(2) sim Dirichlet (119886119904+1

119886119899) on T

119899minus119904

(3) 119877 sim Beta(sum119904119894=1

119886119894+ 1198871 sum119899

119894=119904+1119886119894+ 1198872) and

(4) y(1) y(2) and 119877 are mutually independent

Define

x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)

Then

x = (x(1)x(2)) sim GD

1198992119904(a b) (A7)

on T119899 where a = (119886

1 119886

119899)⊤ and b = (119887

1 1198872)⊤

Acknowledgments

The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region

References

[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965

[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980

[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984

[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986

[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005

[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005

[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005

Journal of Probability and Statistics 11

[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007

[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977

[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003

[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008

[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984

[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987

[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996

[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993

[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002

[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994

[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008

[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009

[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013

[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press

[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 10: A survey design for a sensitive binary variable correlated ... · Title A survey design for a sensitive binary variable correlated with another nonsensitive binary variable Author(s)

Journal of Probability and Statistics 9

0 02 04 06 08 1

0

2

4

6

8

10

14

12

The p

oste

rior d

ensit

y

1205791

(a)

0 02 04 06 08 1

0

5

10

15

The p

oste

rior d

ensit

y

1205792

(b)

0 02 04 06 08 1

0

5

10

15

The p

oste

rior d

ensit

y

1205793

(c)

0 02 04 06 08 1

0

5

10

15

20

25

The p

oste

rior d

ensit

y

1205794

(d)

Figure 1 The posterior densities of the 1205791198944

119894=1estimated by a kernel density smoother based on the second 500000 posterior samples of

120579generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1) (a) The posterior density of 1205791 (b) The

posterior density of 1205792 (c) The posterior density of 120579

3 (d) The posterior density of 120579

4

1 2 3 4 5

0

02

04

06

08

The p

oste

rior d

ensit

y

120595

Figure 2 The posterior density of the odds ratio 120595 estimated by a kernel density smoother based on the second 500000 posterior samplesof 120579 generated by the data augmentation algorithm when the prior distribution is Dirichlet (1 1 1 1)

10 Journal of Probability and Statistics

applied to situationwhere two categories 119883 = 0 and 119883 = 1

are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches

Appendix

The Mode of a Group Dirichlet Density anda Sampling Method from a GDD

Let

V119899minus1

= (1199091 119909

119899minus1)⊤

119909119894⩾ 0

119894 = 1 119899 minus 1

119899minus1

sum

119894=1

119909119894⩽ 1

(A1)

denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X

1 X

119899)⊤

isin T119899is said to follow a

group Dirichlet distribution (GDD) with two partitions if thedensity of x

minus119899= (1198831 119883

119899minus1)⊤isin V119899minus1

is

GD1198992119904

(119909minus119899

| a b)

= 119888minus1

GD (

119899

prod

119894=1

119909119886119894minus1

119894)(

119904

sum

119894=1

119909i)

1198871

(

119899

sum

119894=119904+1

119909i)

1198872

(A2)

where 119909minus119899

= (1199091 119909

119899minus1)⊤ a = (119886

1 119886

119899)⊤ is a positive

parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter

vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by

119888GD = 119861 (1198861 119886

119904) 119861 (119886119904+1

119886119899)

times 119861(

119904

sum

119894=1

119886119894+ 1198871

119899

sum

119894=119904+1

119886119894+ 1198872)

(A3)

We write x sim GD1198992119904

(a b) on T119899or xminus119899

sim GD1198992119904

(a b) onV119899minus1

to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density

(A2) is given by [11 22]

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198871

sum119904

119895=1(119886119895minus 1)

119894 = 1 119904

(A4)

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198872

sum119899

119895=119904+1(119886119895minus 1)

119894 = 119904 + 1 119899

(A5)

The following procedure can be used to generate iidsamples from a GDD [11] Let

(1) y(1) sim Dirichlet (1198861 119886

119904) on T

119904

(2) y(2) sim Dirichlet (119886119904+1

119886119899) on T

119899minus119904

(3) 119877 sim Beta(sum119904119894=1

119886119894+ 1198871 sum119899

119894=119904+1119886119894+ 1198872) and

(4) y(1) y(2) and 119877 are mutually independent

Define

x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)

Then

x = (x(1)x(2)) sim GD

1198992119904(a b) (A7)

on T119899 where a = (119886

1 119886

119899)⊤ and b = (119887

1 1198872)⊤

Acknowledgments

The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region

References

[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965

[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980

[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984

[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986

[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005

[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005

[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005

Journal of Probability and Statistics 11

[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007

[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977

[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003

[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008

[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984

[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987

[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996

[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993

[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002

[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994

[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008

[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009

[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013

[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press

[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 11: A survey design for a sensitive binary variable correlated ... · Title A survey design for a sensitive binary variable correlated with another nonsensitive binary variable Author(s)

10 Journal of Probability and Statistics

applied to situationwhere two categories 119883 = 0 and 119883 = 1

are sensitive like income For example let 119883 = 0 if hisherannual income is $25000 or less and 119883 = 1 if hisher annualincome is more than $25000 For such cases it is worthwhileto develop new designs to address this issue One way is toreplace the main questionnaire in Table 1 by a four-categoryparallel model (see [20 Section 41]) The other way is toemploy the parallel model [21] to collect the information on119883 and to employ the design of direct questioning to collectinformation on 119884 then we could use the logistic regression toestimate the odds ratio which is one of our further researches

Appendix

The Mode of a Group Dirichlet Density anda Sampling Method from a GDD

Let

V119899minus1

= (1199091 119909

119899minus1)⊤

119909119894⩾ 0

119894 = 1 119899 minus 1

119899minus1

sum

119894=1

119909119894⩽ 1

(A1)

denote the (119899 minus 1)-dimensional open simplex in R119899minus1 Arandom vector x = (X

1 X

119899)⊤

isin T119899is said to follow a

group Dirichlet distribution (GDD) with two partitions if thedensity of x

minus119899= (1198831 119883

119899minus1)⊤isin V119899minus1

is

GD1198992119904

(119909minus119899

| a b)

= 119888minus1

GD (

119899

prod

119894=1

119909119886119894minus1

119894)(

119904

sum

119894=1

119909i)

1198871

(

119899

sum

119894=119904+1

119909i)

1198872

(A2)

where 119909minus119899

= (1199091 119909

119899minus1)⊤ a = (119886

1 119886

119899)⊤ is a positive

parameter vector b = (1198871 1198872)⊤ is a nonnegative parameter

vector 119904 is a known positive integer less than 119899 and thenormalizing constant is given by

119888GD = 119861 (1198861 119886

119904) 119861 (119886119904+1

119886119899)

times 119861(

119904

sum

119894=1

119886119894+ 1198871

119899

sum

119894=119904+1

119886119894+ 1198872)

(A3)

We write x sim GD1198992119904

(a b) on T119899or xminus119899

sim GD1198992119904

(a b) onV119899minus1

to distinguish the two equivalent representationsIf 119886119894⩾ 1 then the mode of the grouped Dirichlet density

(A2) is given by [11 22]

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198871

sum119904

119895=1(119886119895minus 1)

119894 = 1 119904

(A4)

119909119894=

119886119894minus 1

sum119899

119895=1(119886119895minus 1) + 119887

1+ 1198872

1 +1198872

sum119899

119895=119904+1(119886119895minus 1)

119894 = 119904 + 1 119899

(A5)

The following procedure can be used to generate iidsamples from a GDD [11] Let

(1) y(1) sim Dirichlet (1198861 119886

119904) on T

119904

(2) y(2) sim Dirichlet (119886119904+1

119886119899) on T

119899minus119904

(3) 119877 sim Beta(sum119904119894=1

119886119894+ 1198871 sum119899

119894=119904+1119886119894+ 1198872) and

(4) y(1) y(2) and 119877 are mutually independent

Define

x(1) = 119877 times y(1) x(2) = (1 minus 119877) times y(2) (A6)

Then

x = (x(1)x(2)) sim GD

1198992119904(a b) (A7)

on T119899 where a = (119886

1 119886

119899)⊤ and b = (119887

1 1198872)⊤

Acknowledgments

The authors are grateful to the Editor an Associate Editorand two anonymous referees for their helpful comments andsuggestions which led to the improvement of the paper Theyalso would like to thank Miss Yin LIU of The University ofHongKong for her help in the simulation studies Jun-WuYursquosresearch was partially supported by a Grant (no 09BTJ012)from the National Social Science Foundation of China anda Grant from Hunan Provincial Science and TechnologyDepartment (no 11JB1176) Partial work was done when Jun-WuYu visitedTheUniversity ofHongKongGuo-LiangTianrsquosresearch was partially supported by a Grant (HKU 779210M)from the Research Grant Council of the Hong Kong SpecialAdministrative Region

References

[1] S LWarner ldquoRandomized response a survey technique for eli-minating evasive answer biasrdquo Journal of the American Statisti-cal Association vol 60 no 309 pp 63ndash69 1965

[2] H C Kraemer ldquoEstimation and testing of bivariate associationusing data generated by the randomized response techniquerdquoPsychological Bulletin vol 87 no 2 pp 304ndash308 1980

[3] J A Fox and P E Tracy ldquoMeasuring associations with random-ized responserdquo Social Science Research vol 13 no 2 pp 188ndash1971984

[4] S E Edgell S Himmelfarb and D J Cira ldquoStatistical efficiencyof using two quantitative randomized response techniques toestimate correlationrdquo Psychological Bulletin vol 100 no 2 pp251ndash256 1986

[5] T C Christofides ldquoRandomized response technique for twosensitive characteristics at the same timerdquo Metrika vol 62 no1 pp 53ndash63 2005

[6] J M Kim and W D Warde ldquoSome new results on the mult-inomial randomized response modelrdquo Communications in Sta-tistics Theory and Methods vol 34 no 4 pp 847ndash856 2005

[7] G J L M Lensvelt-Mulders J J Hox P G M van der Heijdenand C J M Maas ldquoMeta-analysis of randomized responseresearch thirty-five years of validationrdquo Sociological Methods ampResearch vol 33 no 3 pp 319ndash348 2005

Journal of Probability and Statistics 11

[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007

[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977

[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003

[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008

[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984

[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987

[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996

[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993

[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002

[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994

[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008

[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009

[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013

[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press

[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 12: A survey design for a sensitive binary variable correlated ... · Title A survey design for a sensitive binary variable correlated with another nonsensitive binary variable Author(s)

Journal of Probability and Statistics 11

[8] G L Tian J W Yu M L Tang and Z Geng ldquoA new non-randomizedmodel for analysing sensitive questionswith binaryoutcomesrdquo Statistics in Medicine vol 26 no 23 pp 4238ndash42522007

[9] A P Dempster N M Laird and D B Rubin ldquoMaximumlikelihood from incomplete data via the EM algorithmrdquo Journalof the Royal Statistical Society B vol 39 no 1 pp 1ndash38 1977

[10] G-L Tian K W Ng and Z Geng ldquoBayesian computationfor contingency tables with incomplete cell-countsrdquo StatisticaSinica vol 13 no 1 pp 189ndash206 2003

[11] K W Ng M L Tang M Tan and G L Tian ldquoGroupedDirichlet distribution a new tool for incomplete categoricaldata analysisrdquo Journal of Multivariate Analysis vol 99 no 3 pp490ndash509 2008

[12] D R Cox and D OakesAnalysis of Survival Data Monographson Statistics andApplied Probability ChapmanampHall LondonUK 1984

[13] MA Tanner andWHWong ldquoThe calculation of posterior dis-tributions by data augmentationrdquo Journal of the American Sta-tistical Association vol 82 no 398 pp 528ndash550 1987

[14] M A Tanner Tools for Statistical Inference Methods for theExploration of Posterior Distributions and Likelihood Functions Springer Series in Statistics Springer New York NY USA 3rdedition 1996

[15] B Efron and R J Tibshirani An Introduction to the Bootstrapvol 57 of Monographs on Statistics and Applied ProbabilityChapman amp Hall New York NY USA 1993

[16] A Agresti Categorical Data Analysis Wiley Series in Probabil-ity and Statistics Wiley-Interscience New York NY USA 2ndedition 2002

[17] G DWilliamson andMHaber ldquoModels for three-dimensionalcontingency tables with completely and partially cross-classified datardquo Biometrics vol 50 no 1 pp 194ndash203 1994

[18] JWYuG L Tian andM L Tang ldquoTwonewmodels for surveysampling with sensitive characteristic design and analysisrdquoMetrika vol 67 no 3 pp 251ndash263 2008

[19] M L Tang G L Tian N S Tang and Z Q Liu ldquoA new non-randomized multi-category response model for surveys witha single sensitive question design and analysisrdquo Journal of theKorean Statistical Society vol 38 no 4 pp 339ndash349 2009

[20] Y Liu and G L Tian ldquoMulti-category parallel models in thedesign of surveys with sensitive questionsrdquo Statistics and ItsInterface vol 6 no 1 pp 137ndash149 2013

[21] G L Tian ldquoA new non-randomized response model the par-allel modelrdquo Statistica Neerlandica In press

[22] K W Ng G L Tian andM L TangDirichlet and Related Dis-tributions Theory Methods and Applications Wiley Series inProbability and Statistics John Wiley amp Sons Chichester UK2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 13: A survey design for a sensitive binary variable correlated ... · Title A survey design for a sensitive binary variable correlated with another nonsensitive binary variable Author(s)

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of