OPTIMAL GROUP TESTING IN THE PRESENCE OF BLOCKERS

Scott A. Langfeldt, Jacqueline M. Hughes-Oliver and Sujit K. Ghosh
North Carolina State University

S. Stanley Young
Glaxo Wellcome

Institute of Statistics Mimeograph Series No. 2297
May 1997
North Carolina State University, Raleigh, North Carolina

Abstract

Testing in groups can lead to great efficiencies in total testing cost when searching

for individuals with some characteristic. If the presence of a blocking object can cause

a group with a positive object to test negative, there is a need to find optimal pooling

strategies to minimize the cost of testing and reduce the number of missed positive

individuals. We develop application-driven cost functions to determine optimal testing strategies. We also derive formulas that describe the behavior of three different grouping strategies and provide examples illustrating how to determine the optimal strategy. We show that the strategies resulting from these methods provide much lower expected cost than classical methods. These results can be directly applied to HIV blood testing and to screening compounds for potential drugs.

Key Words: Composite Sampling, Pooling, Compound Screening, Square Array Design


1 Introduction

Large pharmaceutical companies have different compounds in collections ranging in size from

several thousands to several hundreds of thousands. The discovery of a new drug begins by

first identifying active compounds (that is, those that have some biological effect) from these

collections. Since a new drug can be worth hundreds of millions or even billions of dollars,

finding new drugs is very competitive; it is critical these huge compound collections be

screened quickly and economically while reducing the chance of missing compounds that

have strong drug potential.

The handling of large numbers of compounds for screening is logistically complex. Archive

dry samples are put into liquid stores so that robotic liquid handling systems can be used.

Small amounts of these liquids are then transferred from master plates to daughter plates. A typical plate has 8 x 12 = 96 wells or 16 x 24 = 384 wells in which the liquid stores of compounds may be placed (see Figure 1 for an example of a 96-well plate). Liquid handling robots are then programmed to select and pool samples. The activity of a pool is measured, and individual compounds in highly active pools are retested.

[Figure 1 graphic: a plate with rows A-H and columns 1-12, one well per cell.]

Figure 1: 8 x 12 = 96 Well Plate used in Compound Screening.

Section 2 contains a review of the methods of pooling and an assessment of activity of a

pool. These methods assume that all pools containing active compounds will be identified

as such. The existence of blocking compounds that mask the effect of active compounds

invalidates the assumption of no false negative test results. This paper provides alternative

formulas and optimal pooling strategies that account for the existence of blockers. Section 3


describes the different components of the cost and optimality criteria, and Section 4 contains

guidelines and examples of using the results in Section 3. A discussion is presented in Section

5. Additional notation and details are included in the Appendix.

2 Group Testing in Compound Screening

2.1 Dorfman

The idea of group testing was first introduced by Dorfman (1943) to improve the efficiency of screening all incoming US servicemen for syphilis prior to World War II. Dorfman suggests that, instead of testing each serviceman's blood individually, the blood samples of k men be pooled together; if the pool tests negative for syphilis, then conclude that all servicemen in that pool are negative. If the pool tests positive, retest each man's blood individually.

Suppose p is the proportion of individuals in the population that possess the characteristic, N is the number of individuals to be classified, and k is the pool size. Under the assumptions of perfect testing and random dispersion of the characteristic, Dorfman shows that the expected number of tests is given by:

$$E_{T,\mathrm{Dorfman}}(k) = N\left[\frac{k+1}{k} - (1-p)^{k}\right] \qquad (1)$$

Feller (1968) shows that (1) is minimized by using k to be the smallest integer larger than $1/\sqrt{p}$. Intuitively, as p gets small, larger pools can be used, since even a large pool will usually test negative and, as a result, many unnecessary individual tests can be eliminated. For example, if 1% of the population has some characteristic, then the optimal k of 11 results in an 80% reduction of tests relative to testing each individual separately!
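The 1% example can be checked numerically. The following is a minimal sketch, not code from the paper: the function names and the search limit `k_max` are assumptions chosen for illustration, and the only formula used is equation (1).

```python
def dorfman_expected_tests(n, p, k):
    """Expected number of tests from equation (1): N[(k+1)/k - (1-p)^k]."""
    return n * ((k + 1) / k - (1 - p) ** k)


def best_pool_size(p, k_max=100):
    """Brute-force search for the pool size k that minimizes equation (1)."""
    return min(range(1, k_max + 1), key=lambda k: dorfman_expected_tests(1, p, k))


p = 0.01
k_search = best_pool_size(p)
# Expected tests per individual at the best pool size (N = 1).
tests_per_individual = dorfman_expected_tests(1, p, k_search)
print(k_search, round(1 - tests_per_individual, 3))
```

Setting N = 1 gives the expected number of tests per individual, so one minus that value is the fractional saving relative to individual testing.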

There are many multi-stage variations on the basic Dorfman strategy. For example, Finucan (1965) suggests a multi-stage method which begins the same as the Dorfman strategy,

but instead of retesting all individuals in each positive pool, the individuals from positive

pools are re-pooled and then tested again in these new pools of possibly a different size. This

can be repeated several times. Also, Sterrett (1957) suggests a sequential scheme which again

begins as Dorfman; however, if a pool is positive, the individuals in that pool are tested one

at a time (sequentially) until a positive is discovered. Then the remaining individuals are

pooled and tested. If this new pool tests negative, then all the positive items in the original

pool were identified. If the pool tests positive, then these individuals are tested sequentially.

Others who have contributed to these multi-stage strategies include Sobel and Groll (1959),


Kumar and Sobel (1971) and many others. Although these strategies yield a smaller expected number of tests, they also require more than two stages and thus will not be considered further.

If pools always register positive when one or more individuals have the characteristic, then the aforementioned methods based on Dorfman's strategy are all expected to

work well. Unfortunately, in compound screening there can be certain compounds, known

as blockers, that block the detection of an active compound when placed in the same pool

as an active. The basic Dorfman strategy with a small p leads to the use of large pool

sizes. However, as the pool size increases, the probability that a blocker is present with an

active also increases, raising the chances of missed active compounds. This clearly violates

the assumption of perfect testing. The criterion for optimality must now simultaneously

consider minimizing the probability of a false negative and minimizing the number of tests.

Our extension of the Dorfman strategy does this.

2.2 Square Array Designs

Phatarfod and Sudbury (1994) observe that, when testing blood for HIV, it is possible

for a blood sample to neutralize a positive sample; the neutralizing sample is called a blocker.

If we let f be the proportion of blockers in the population, then, intuitively, as f gets large

relative to p, it will be more and more difficult to detect positive individuals. Phatarfod

and Sudbury (1994) are interested in reducing the probability of missing positive individuals

when these blockers exist, while at the same time reducing the expected number of tests.

They suggest placing the individuals in square (k x k) trays and using one of two testing

(pooling) strategies.

1. SA1: Each of the k rows and k columns is pooled and tested, for a total of 2k preliminary tests per tray. Then, all individuals that lie in a positive row and a positive column are retested. We term this the AND strategy.

2. SA2: Each of the k rows of a k by k array is pooled and tested. If none of the k rows is positive, then no further testing is conducted on that array. If exactly one row tests positive, each of the individuals in that row is tested. If more than one row tests positive, each of the k columns is pooled and tested, and all individuals that lie in a positive row and a positive column are retested. This is a combination of the Dorfman scheme (zero or one row positive) and SA1 (more than one row positive), and can result in three stages of testing if more than one row tests positive.

Phatarfod and Sudbury (1994) show that the expected numbers of tests for N individuals under SA1 and SA2 when blockers do not exist are given by:

$$E_{T,SA1}(k) = N\left[2k^{-1} + 1 - 2(1-p)^{k} + (1-p)^{2k-1}\right] \qquad (2)$$

$$E_{T,SA2}(k) = N\left[2k^{-1} + 1 - 2(1-p)^{k} + (1-p)^{2k-1} - k^{-1}(1-p)^{k^{2}} - p(1-p)^{k^{2}-k}\right] \qquad (3)$$

For several possible values of p and assuming f = 0, Phatarfod and Sudbury (1994) find

corresponding optimal values of k that minimize the expected number of tests. They then

use these optimum values to examine the probabilities of a false negative and the expected

number of tests and find some improvements over simple Dorfman. One may reasonably

believe that if f were not restricted to equal 0 while finding the optimal k, then even greater

gains may be achieved. We investigate this belief in the remaining sections.
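The blocker-free formulas (2) and (3) are simple to evaluate, which makes the search for an optimal k a one-line minimization. The sketch below is illustrative only: the function names, the prevalence used in the final line and the search limit `k_max` are assumptions, not values from Phatarfod and Sudbury (1994).

```python
def sa1_tests(n, p, k):
    """Equation (2): expected tests for SA1 when no blockers exist (f = 0)."""
    q = 1 - p
    return n * (2 / k + 1 - 2 * q**k + q ** (2 * k - 1))


def sa2_tests(n, p, k):
    """Equation (3): SA1 minus the savings when zero or exactly one row is positive."""
    q = 1 - p
    return sa1_tests(n, p, k) - n * (q ** (k * k) / k + p * q ** (k * k - k))


def argmin_k(tests, p, k_max=60):
    """Pool size in 1..k_max that minimizes expected tests per individual."""
    return min(range(1, k_max + 1), key=lambda k: tests(1, p, k))


# Hypothetical prevalence chosen only to exercise the functions.
print(argmin_k(sa1_tests, 0.01), argmin_k(sa2_tests, 0.01))
```

A quick sanity check is the degenerate case k = 1: SA1 then costs two pool tests plus a retest with probability p, and SA2 costs one pool test plus a retest with probability p, which is what the two functions return.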

3 Costs of Conducting a Two-Stage Group Test

There are four basic steps required in two-stage group testing. First, the individual samples

are collected, organized and prepared for testing. Second, they are pooled. Third, the

pools are tested, and, fourth, the individuals from the positive pools are retested. Once the cost of missing a positive individual is included, there are four basic costs: the startup cost, which is the same regardless of the strategy selected; the cost of pooling (which may or may not depend on the pool size); the cost of testing a cell, either pooled or individual; and the cost of missing a positive individual.

Therefore, the total expected cost of using strategy S with pool size k is:

$$C_S(k) = C_0 + C_1 N_S(k) + C_2 E_{T,S}(k) + C_3 E_{M,S}(k) \qquad (4)$$

where

$C_0$ = Startup cost,
$C_1$ = Cost of constructing one pool (assume the cost of pooling does not depend on the size of the pool),
$C_2$ = Cost of a single test,
$C_3$ = Cost of an undetected positive,
$N_S(k)$ = Number of pools of size k for strategy S,
$E_{T,S}(k)$ = Expected total number of tests for strategy S with pool size k,
$E_{M,S}(k)$ = Expected number of undetected (missed) positives for strategy S using pools of size k.

Note that if we know the probability of missing a positive, $P_{M,S}(k)$, then the expected number of missed positive individuals is simply $N P_{M,S}(k)$.

Our goal is to find the strategy that minimizes the total expected cost. This strategy

will result in an optimum trade-off between the expected number of tests, number of pools,

and the expected number of missed positives.

This is not the first time that optimizing a cost function has been applied to a group

testing problem. Burns and Mauro (1987) suggest using a cost function when there are

probabilities of misclassification inherent in the test. They minimize a linear combination

of the total number of expected tests, the cost of a false positive and the cost of a false

negative.

3.1 Some Strategies

The easiest strategy is simply testing each individual separately. This will always result in

a total cost of $C_0 + C_2 N$. This is certainly not very appealing, but it can be the best strategy in

some cases.

As before, let p be the proportion of individuals with the characteristic, and let f be

the proportion of individuals that are blockers. These events are mutually exclusive. We

will consider three two-stage testing strategies for the purposes of classifying each of N

individuals according to whether they have the characteristic: Dorfman, Square Array with

AND Retesting and Square Array with OR Retesting.

1. Dorfman. Pool each row and test for the characteristic of interest. If the pool is

negative, conclude all individuals in the pool are negative; if a pool is positive, retest each

individual separately.

In a square array strategy, as proposed by Phatarfod and Sudbury (1994), we randomly place the individuals in square (k by k) arrays, and each of the k rows and k columns is pooled and tested. We can imagine two retesting strategies:

2. Square Array with AND retesting. Retest all individuals that are in a positive column and a positive row. (This is equivalent to the Phatarfod and Sudbury (1994) SA1 strategy.) This strategy results in twice as many initial pools as Dorfman but, for certain values of p, requires fewer total tests. In addition, since a positive individual will be missed if a blocker occurs either in the same row or in the same column, the probability of a false negative test result can be high.

3. Square Array with OR retesting. Retest all individuals that are in a positive column

or a positive row. If missed positives are costly, this strategy might be a good choice. There


are many total tests required, but an individual will be missed only if a blocker occurs in

the same row and the same column.

To minimize and compare cost functions, we need to determine the number of pools, $N_S(k)$, the expected number of total tests, $E_{T,S}(k)$, and the expected number of missed positives, $E_{M,S}(k)$, for each of the three proposed strategies, Dorfman, Square Array AND

retesting and Square Array OR retesting. The details of the derivations are given in the

Appendix, and the results are summarized in Table 1.

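Table 1
Expected Values Used in Cost Function Optimization.

Strategy            $N_S(k)$      $E_{T,S}(k)$ (a)                                                            $E_{M,S}(k)$
Dorfman             $Nk^{-1}$     $N[k^{-1} + R_{p,f}(k)]$                                                    $Np[1 - (1-f)^{k-1}]$
Square Array AND    $2Nk^{-1}$    $N[2k^{-1} + p(1-f)^{2(k-1)} + (1-p-f)R_{p,f}^{2}(k-1)]$                    $Np[1 - (1-f)^{2(k-1)}]$
Square Array OR     $2Nk^{-1}$    $N\{2k^{-1} + 2R_{p,f}(k) - p(1-f)^{2(k-1)} - (1-p-f)R_{p,f}^{2}(k-1)\}$    $Np[1 - (1-f)^{k-1}]^{2}$

(a) $R_{p,f}(k) = (1-f)^{k} - (1-p-f)^{k}$, the probability that a pool of size k tests positive (see equation (7) in the Appendix).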

We can now determine optimal strategies for any situation where p, f, $C_0$, $C_1$, $C_2$ and $C_3$ are known.
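As an illustration of how the Table 1 quantities feed the cost function (4), the following sketch evaluates them in Python. It is not the authors' code: the function names, the strategy labels and the input values in the last lines are hypothetical, and $R_{p,f}(k)$ is taken to be $(1-f)^k - (1-p-f)^k$, the pool-positive probability derived in the Appendix.

```python
def r(p, f, k):
    """R_{p,f}(k): probability that a pool of size k tests positive (assumed form)."""
    return (1 - f) ** k - (1 - p - f) ** k


def table1(strategy, n, p, f, k):
    """Return (number of pools, expected tests, expected missed positives)."""
    # Probability that an individual lies in a positive row AND a positive column.
    both = p * (1 - f) ** (2 * (k - 1)) + (1 - p - f) * r(p, f, k - 1) ** 2
    if strategy == "dorfman":
        return n / k, n * (1 / k + r(p, f, k)), n * p * (1 - (1 - f) ** (k - 1))
    if strategy == "and":
        return 2 * n / k, n * (2 / k + both), n * p * (1 - (1 - f) ** (2 * (k - 1)))
    if strategy == "or":
        return (2 * n / k, n * (2 / k + 2 * r(p, f, k) - both),
                n * p * (1 - (1 - f) ** (k - 1)) ** 2)
    raise ValueError(f"unknown strategy: {strategy}")


def total_cost(strategy, n, p, f, k, c0, c1, c2, c3):
    """Cost function (4): C0 + C1*pools + C2*tests + C3*missed positives."""
    pools, tests, missed = table1(strategy, n, p, f, k)
    return c0 + c1 * pools + c2 * tests + c3 * missed


# Hypothetical inputs, chosen only to show the call signature.
for s in ("dorfman", "and", "or"):
    print(s, round(total_cost(s, n=1000, p=0.05, f=0.02, k=5, c0=0, c1=1, c2=1, c3=50), 1))
```

Keeping the three expected values separate from the cost weights mirrors the structure of (4): the same `table1` output can be re-priced under different choices of the cost parameters without recomputing any probabilities.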

3.2 Determining Cost Parameters

Assigning values to the cost parameters, $C_0$ through $C_3$, is a critical step. It may be possible to assign to each parameter a dollar amount which includes manpower, cost of materials, time, and maybe opportunity costs. If that is not possible, then a more creative approach may be necessary. To begin the process, it may be easiest to set $C_2$ (the cost of a single test) to be 1. Then the other cost parameters can be determined relative to $C_2$. $C_1$ would be the cost of pooling relative to conducting a test. So, $C_1 = 2$, for example, would imply that the act of pooling one group is twice as expensive as conducting a test. The cost of missing a positive individual can be thought of as the answer to "How many additional tests would I be willing to run if I were guaranteed to identify one previously missed positive individual?" So, $C_3 = 100$ implies that the experimenter would be willing to run 100 additional tests if it meant identifying one previously missed positive.


In compound screening, the cost of a missed positive can be estimated as the cost of

finding a lead compound. A lead compound is biologically active, amenable to chemical

modification, novel in structure or some other aspect and is not too different in any of

a number of other aspects. It typically costs on the order of half of a million dollars to

develop a high throughput assay and test 50 to 100 thousand compounds. Typically 50 to

100 potential leads are found. So, a lead costs 5 to 10 thousand dollars. There is often

redundancy in a compound collection so that if one particular compound is missed, it is

likely that a similar compound is found which will point back at the missed compound. This

would decrease the effective cost of missing an active compound. Since a single assay costs approximately 1 dollar, $C_3$ would be significantly less than 5 to 10 thousand, with its exact value determined by the goal of the screen: identifying most or just some of the active compounds.
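The arithmetic behind the cost-per-lead figure can be written out as a short worked calculation. It uses only the round numbers quoted above (about 500,000 dollars per screen and 50 to 100 leads); the symbol $C_{\text{lead}}$ is introduced here for illustration and does not appear in the original.

```latex
\[
C_{\text{lead}} \approx \frac{\$500{,}000}{100\ \text{leads}} = \$5{,}000
\quad\text{to}\quad
\frac{\$500{,}000}{50\ \text{leads}} = \$10{,}000 .
\]
% With a single assay costing about $1, one lead is worth roughly
% 5,000 to 10,000 tests, so C_3 (measured in tests) lies below this range
% once redundancy in the collection is taken into account.
```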

4 Minimizing Cost of Strategies

Once p, f, $C_0$, $C_1$, $C_2$ and $C_3$ have been determined, the difficult part is finished. All that remains is to evaluate the cost function and determine which strategy, at which value of k, has the minimum cost.

4.1 Scenario 1: p = 0.02, f = 0.01, $C_0 = 200$, $C_1 = 0.5$, $C_2 = 1$ and $C_3 = 100$

Suppose it is known that p = 0.02 and f = 0.01. Also, the experimenter desires to screen 10,000 individuals and has determined that the initial startup cost is 200, the cost of conducting a single test is 1, the cost of pooling the samples is 0.5 (half the cost of a test) and the cost of a missed positive is 100. This corresponds to cost parameter values of $C_0 = 200$, $C_1 = 0.5$, $C_2 = 1$ and $C_3 = 100$ and a cost function of:

$$C_S(k) = 200 + (0.5)N_S(k) + (1)E_{T,S}(k) + (100)E_{M,S}(k)$$

To determine the best strategy, simply calculate the k that minimizes cost for each of the

three strategies using the formulas given in Table 1. For example, Dorfman with k = 10

results in a total cost of:

$$\begin{aligned}
C_D(10) &= 200 + (0.5)N_D(10) + (1)E_{T,D}(10) + (100)E_{M,D}(10) \\
        &= 200 + (0.5)(1000) + (1)(2670) + (100)(17.3) \\
        &\approx 5099
\end{aligned}$$

Similarly, we can calculate the cost for the Square Array AND and OR retesting strategies

where k = 10 to be:

$$C_A(10) = 6905, \qquad C_O(10) = 6294,$$

respectively. These costs need to be calculated for all reasonable k, and result in the values presented in Table 2. The optimal pool sizes are k = 7 for Dorfman, k = 9 for Square Array AND retesting and k = 11 for Square Array OR retesting; Dorfman (k = 7) has the overall minimum cost of 4754.
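The search over k described above can be sketched as a small grid search. This is an illustrative reimplementation, not the authors' code: the function names and the search range 2 to 30 are assumptions, the default arguments are the Scenario 1 values, and $R_{p,f}(k)$ is taken to be $(1-f)^k - (1-p-f)^k$ as in the Appendix.

```python
def R(p, f, k):
    """Probability that a pool of size k tests positive (assumed form of R_{p,f})."""
    return (1 - f) ** k - (1 - p - f) ** k


def cost(strategy, k, n=10_000, p=0.02, f=0.01, c0=200, c1=0.5, c2=1, c3=100):
    """Total expected cost for one strategy and pool size (Scenario 1 defaults)."""
    both = p * (1 - f) ** (2 * k - 2) + (1 - p - f) * R(p, f, k - 1) ** 2
    blocked = 1 - (1 - f) ** (k - 1)  # a blocker among the k-1 pool mates
    if strategy == "Dorfman":
        pools, tests, missed = 1 / k, 1 / k + R(p, f, k), p * blocked
    elif strategy == "AND":
        pools, tests, missed = 2 / k, 2 / k + both, p * (1 - (1 - f) ** (2 * k - 2))
    elif strategy == "OR":
        pools, tests, missed = 2 / k, 2 / k + 2 * R(p, f, k) - both, p * blocked**2
    else:
        raise ValueError(strategy)
    # pools, tests and missed are per-individual rates, so scale by n.
    return c0 + n * (c1 * pools + c2 * tests + c3 * missed)


best = {}
for s in ("Dorfman", "AND", "OR"):
    k_best = min(range(2, 31), key=lambda k: cost(s, k))
    best[s] = (k_best, cost(s, k_best))
    print(s, k_best, round(best[s][1]))
```

Passing different values for `c1` or `c3` re-runs the same search under other cost assumptions.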

Table 2
Scenario 1: Total Costs for Dorfman, AND and OR retesting strategies: p = 0.02, f = 0.01, $C_0 = 200$, $C_1 = 0.5$, $C_2 = 1$, $C_3 = 100$.

            Total Cost
 k    Dorfman     AND      OR
 6      4765     7376    7155
 7      4754     7050    6744
 8      4824     6898    6499
 9      4945     6862    6361
10      5099     6905    6294
11      5276     7003    6277
12      5469     7142    6295

4.2 Scenario 2: p = 0.02, f = 0.01, $C_0 = 200$, $C_1 = 0.5$, $C_2 = 1$ and $C_3 = 1000$

Now suppose the cost of a missed positive is much higher, $C_3 = 1000$, while everything else remains the same. We now calculate the costs found in Table 3. Notice that the optimal pool sizes have shrunk significantly (k = 3, 3 and 8 for Dorfman, AND and OR, respectively) and the optimal strategy is the Square Array OR retesting (k = 8) with minimum cost 7330. Since the cost of a missed positive is so high and the

Square Array OR retesting strategy helps prevent missed positives, the OR strategy has

the minimal cost. This example illustrates the trade-offs between missed positives and total

expected number of tests.


4.3 Scenario 3: p = 0.02, f = 0.01, $C_0 = 200$, $C_1 = 2$, $C_2 = 1$ and $C_3 = 1000$

Now consider the original scenario, except that the cost of a missed positive is large, $C_3 = 1000$, and the cost of pooling is twice the cost of conducting a test, $C_1 = 2$. We would expect the overall cost of the AND retesting strategy to be higher and that of the OR retesting strategy to be lower relative to the others. Also, the additional cost of constructing a pool will result in higher costs for AND and OR, which use twice as many pools as Dorfman, and higher costs for smaller pool sizes.

Calculating the costs for all reasonable k results in Table 4. The best pool sizes are k = 4, 4 and 10 for Dorfman, AND and OR, respectively. The lowest cost among the pooling strategies, 10640, is achieved by OR with k = 10. Notice, however, that testing each individual separately would result in a total cost of $C_0 + C_2 N = 10200$. This is a case where we should not pool; we should instead test individually.
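The comparison with individual testing can be made explicit. The worked calculation below evaluates cost function (4) for individual testing and for OR with k = 10; the intermediate values are computed from the Table 1 formulas and rounded for display, and the label $C_{\text{ind}}$ is notation assumed here.

```latex
\[
C_{\text{ind}} = C_0 + C_2 N = 200 + (1)(10000) = 10200
\]
\[
C_O(10) = 200 + (2)(2000) + (1)(4944) + (1000)(1.496) \approx 10640 > C_{\text{ind}}
\]
```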

For a fixed p and f, as $C_3$ increases, the square array OR retesting strategy becomes more attractive. Also, as the cost of pooling, $C_1$, increases, Dorfman becomes better relative to the Square Array strategies.

Table 5 shows optimal strategies for a fixed cost function. Using $C_0 = 200$, $C_1 = 2$, $C_2 = 1$ and $C_3 = 1000$ we get optimal strategies for different values of p and f. (These costs correspond to those in Scenario 3.)

As f increases, the cost of the Square Array OR retesting strategy becomes lower and

lower relative to the other strategies' costs. Also, as p increases, the optimal pool sizes

decrease, as expected. Other than pool size, p has little effect on the strategy. There are

examples where the strategy changes as p changes, but this is not very common and usually

only occurs when there is little difference between the total costs of two strategies over the

range of reasonable p.

Figure 5 shows how the cost function changes as k changes. For reference, figures (a), (c) and (g) correspond to Scenarios 1, 2 and 3, respectively. Where the cost of a missed positive, $C_3$, is high and blockers are not too uncommon (figures (c) and (g)), the cost increases rapidly as k increases. Where $C_3$ is low or blockers are uncommon, the cost stays relatively flat, so the selection of k is not extremely critical once the strategy has been determined. Also, in figure (g), none of the costs drops below 10200, the cost of individual testing, reinforcing the fact that in this scenario individual testing is the best choice.


5 Discussion

Square array strategies clearly have some advantages over classical Dorfman in several cases.

When f is high relative to p, or when the cost of a missed positive ($C_3$) is high, the OR retesting strategy has low total cost. The AND retesting strategy works very well when the cost of a missed positive ($C_3$) is low or f is low. It is interesting to note that even when f = 0, the AND retesting strategy generally shows much lower costs than Dorfman.

In short, by using information about the proportion of blockers and the costs of testing,

pooling and missing positive individuals, a design can be determined which minimizes the

cost of identifying individuals with the characteristic.

A disadvantage of using square arrays is that they are not practical for a small group of individuals if $N/k^2$ is not close to an integer. For example, suppose the experimenter is interested in screening 100 individuals where $C_0 = 200$, $C_1 = 2$, $C_2 = 1$ and $C_3 = 1000$, and it is known that p = 0.005 and f = 0.01. For this scenario, the optimal test is OR with k = 18. However, this requires $18^2 = 324$ individuals to fill one square array. In this case, it may be more reasonable to form 12 pools using the optimal Dorfman strategy of k = 8 and conduct a test on the four individuals left over, or to compare the costs of AND and OR with k = 10. There is need for additional research in this area.

If a large number of individuals are to be tested, it is not essential that $N/k^2$ be close to an integer in order to use a square array strategy. For example, a typical size for a collection of compounds is 10000. In the scenario above, where the optimal test is a Square Array OR strategy with k = 18, 30 plates would be filled, with a remaining 280 (2.80%) compounds that are not assigned to a plate. These remainders could either be tested individually, placed in a Square Array with k = 16 (24 remaining), or tested in Dorfman pools of k = 8. In short, as N gets large, a smaller and smaller proportion of the individuals will be left out of the original pools and there will be significant cost savings regardless of the number of individuals remaining after the original pooling. In other words, for large samples, it is not essential that $N/k^2$ be close to an integer.
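The leftover counts quoted in the two examples above follow from integer division. A minimal sketch, with a helper name (`full_arrays`) that is an assumption for illustration:

```python
def full_arrays(n, k):
    """Return (number of full k-by-k arrays, individuals left unassigned)."""
    return divmod(n, k * k)


# Large collection: N = 10000 with k = 18.
plates, rest = full_arrays(10_000, 18)
print(plates, rest, f"{rest / 10_000:.2%}")   # 30 280 2.80%

# Placing the remainder in one further array with k = 16.
extra, still_left = full_arrays(rest, 16)
print(extra, still_left)                      # 1 24

# Small group: N = 100 in Dorfman pools of k = 8.
pools, singles = divmod(100, 8)
print(pools, singles)                         # 12 4
```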

In scenario 3 we found that none of the strategies was better than testing each individual

separately. This was due to a high cost of pooling as well as a high cost of missing a positive

individual. If the cost of pooling is very high, this will result in none of the group testing

strategies being very effective because they are more expensive than individual testing. If

this is the case, the experimenter should not force pools as the cost would be high and many

positive individuals could be missed.

In the family of two-dimensional strategies, only square arrays were considered here.


There may be some large improvements in total cost if we generalize to rectangular k by l arrays. Although the search for a minimum in rectangular arrays will be more difficult than in square arrays, it should be straightforward to find the k and l that globally minimize total cost.

Also, some significant cost benefits may emerge if we no longer limit our investigation

to two-stage tests and consider various multi-stage or sequential testing strategies. There

are situations, however, where these multi-stage strategies are not practical. For example,

in compound screening, programming the robots that pool the compounds is not a trivial

chore, and since the time delay costs are so high, any avoidable delay is unacceptable.

Throughout this paper, we assumed that p and f were both fixed and known. It is

more likely the case that these values are completely unknown or may change as the work

progresses. In these situations, it is important that a strategy be devised to estimate p and

f while at the same time identifying individuals with the characteristic of interest. One

approach would be to run a preliminary test to estimate p and f and then use these values

to design the strategy and then implement it on a large collection of individuals. If the initial

experiment was in a k by l rectangular array, where k is very different from l, then this may

be possible. Also, a Bayesian approach would be appropriate if the experimenter has some

prior information about these values. Either of these methods would assist the experimenter

in estimating p and f.

Some applications require that the pool sizes do not change once the bulk of the experiment has begun, making the initial estimate of p and f even more critical. This may be inconvenient, but would be required only once. However, this would not allow for

modification if later in the screening it becomes clear that p and f are very different than

previously believed.

The strategies discussed in the paper assume that the individuals being tested are stochastically independent. Usually, compound collections are ordered according to when they were

acquired and similar compounds are collected in batches. If these compounds are placed in

trays in the collection order, adjacent compounds are very likely to be similar. This violates

the assumption of stochastic independence, and makes these methods flawed. However, if

the compounds are instead randomly assigned to the pools, then this assumption will still

be valid. In HIV testing, blood collected in batches may tend to be similar (e.g., from the

same geographical location, etc.) which again will violate this assumption.

The Phatarfod and Sudbury (1994) SA1 strategy is identical in description to the AND strategy of Section 3.1: an individual will be retested if it lies in a positive row and a positive column. Intuitively, a false negative will occur if either the row or the column contains a blocker, an event whose probability is higher than that of a blocker in the row alone or in the column alone. Phatarfod and Sudbury (1994) claimed that SA1 would reduce the chance of a missed positive as compared to Dorfman of the same pool size. Phatarfod and Sudbury (1994) incorrectly stated that the probability of a false negative using SA1 is $\{1 - (1-f)^{k-1}\}^2$; this is the probability of a missed positive for the OR retesting strategy. The correct probability is $[1 - (1-f)^{2(k-1)}]$, which is strictly larger than the probability of a false negative using Dorfman, $[1 - (1-f)^{k-1}]$, for all f > 0 (and k > 1). The use of SA1 (a.k.a. AND) does not decrease the probability of a false negative, but instead increases it as compared to Dorfman of the same pool size.

References

[1] Burns, K.C. and Mauro, C.A. (1987). Group Testing with Test Error as a Function of Concentration. Communications in Statistics 16 (10), 2821-2837.

[2] Dorfman, R. (1943). The Detection of Defective Members of Large Populations. Annals of Mathematical Statistics 14, 436-440.

[3] Feller, W. (1968). An Introduction to Probability Theory and its Applications, Vol. 1, 3rd edn. Wiley, New York.

[4] Finucan, H.M. (1965). The Blood Testing Problem. Applied Statistics 13, 43-50.

[5] Kumar, S. and Sobel, M. (1971). Finding a Single Defective in Binomial Group Testing. Journal of the American Statistical Association 66, 824-828.

[6] Phatarfod, R.M. and Sudbury, A. (1994). The Use of a Square-Array Scheme in Blood Testing. Statistics in Medicine 13, 2337-2343.

[7] Sobel, M. and Groll, P.A. (1959). Group Testing to Eliminate Efficiently all Defectives in a Binomial Sample. The Bell System Technical Journal 38, 1179-1252.

[8] Sterrett, A. (1957). On the Detection of Defective Members of Large Populations. Annals of Mathematical Statistics 28, 1033-1036.


APPENDIX

Here we present the details that lead to the formulas given in Table 1. Let

$A_{ij}^{+}$ = event that the individual in row $i$, column $j$ has the characteristic,

$A_{ij}^{-}$ = event that the individual in row $i$, column $j$ is a blocker,

$A_{ij}^{0}$ = event that the individual in row $i$, column $j$ is neither a blocker nor has the characteristic,

$R_{i}^{+}$ = event that row $i$ is tested to have the characteristic,

$R_{i}^{-}$ = event that row $i$ is tested to not have the characteristic,

$C_{j}^{+}$ = event that column $j$ is tested to have the characteristic,

$C_{j}^{-}$ = event that column $j$ is tested to not have the characteristic,

$P(A_{ij}^{+}) = p$, $P(A_{ij}^{-}) = f$, $P(A_{ij}^{0}) = 1 - p - f$.

Then assuming individuals have been randomly placed in pools (rows or columns) of size

k, we can show that

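\begin{align*}
P(R_i^{+} \mid A_{ij}^{+}) &= P(R_i \text{ contains no blockers} \mid A_{ij}^{+}) \\
&= (1 - f)^{k-1}, \tag{5}
\end{align*}

\begin{align*}
P(R_i^{+}) &= P(R_i \text{ contains no blockers and at least one with characteristic}) \tag{6} \\
&= \sum_{m=1}^{k} \frac{k!}{0!\, m!\, (k-m)!}\, p^{m} f^{0} (1 - p - f)^{k-m} \\
&= (1 - f)^{k} \sum_{m=1}^{k} \binom{k}{m} \left(\frac{p}{1-f}\right)^{m} \left(1 - \frac{p}{1-f}\right)^{k-m} \\
&= (1 - f)^{k} \left[1 - \left(1 - \frac{p}{1-f}\right)^{k}\right] \\
&= (1 - f)^{k} - (1 - p - f)^{k}, \tag{7}
\end{align*}

\begin{align*}
P(R_i^{+} \cap C_j^{+}) &= P(A_{ij}^{+}) P(R_i^{+} \cap C_j^{+} \mid A_{ij}^{+}) + P(A_{ij}^{-}) P(R_i^{+} \cap C_j^{+} \mid A_{ij}^{-}) \\
&\quad + P(A_{ij}^{0}) P(R_i^{+} \cap C_j^{+} \mid A_{ij}^{0}) \\
&= p\, P(R_i^{+} \mid A_{ij}^{+}) P(C_j^{+} \mid A_{ij}^{+}) + (1 - p - f) P(R_i^{+} \mid A_{ij}^{0}) P(C_j^{+} \mid A_{ij}^{0}) \\
&= p (1 - f)^{2(k-1)} + (1 - p - f) \left[(1 - f)^{k-1} - (1 - p - f)^{k-1}\right]^{2}, \tag{8}
\end{align*}

\begin{align*}
P(R_i^{+} \cup C_j^{+}) &= P(R_i^{+}) + P(C_j^{+}) - P(R_i^{+} \cap C_j^{+}) \\
&= 2\left[(1 - f)^{k} - (1 - p - f)^{k}\right] - p (1 - f)^{2(k-1)} \\
&\quad - (1 - p - f) \left[(1 - f)^{k-1} - (1 - p - f)^{k-1}\right]^{2}. \tag{9}
\end{align*}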
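Equation (8) uses the conditional probability that a row tests positive given that the shared individual is neither a blocker nor a positive, namely $(1-f)^{k-1} - (1-p-f)^{k-1}$, which follows from the argument behind (7) applied to the other $k-1$ row members. As a sanity check, the sketch below (an illustration written for this transcript, not the authors' code) enumerates every state of the $2k-1$ individuals in one row and one column of a square array and compares the exact probabilities with (8) and (9). The state labels `"+"`, `"b"`, `"0"` and all function names are assumptions made for the example.

```python
from itertools import product


def pool_positive(cells):
    """A pool tests positive iff it holds no blocker and at least one positive."""
    return "b" not in cells and "+" in cells


def exact_joint_and_union(p, f, k):
    """Enumerate row i and column j, which share one cell.

    Returns (P(row and column positive), P(row or column positive)).
    """
    prob = {"+": p, "b": f, "0": 1 - p - f}
    joint = union = 0.0
    for state in product("+b0", repeat=2 * k - 1):
        shared, row_rest, col_rest = state[0], state[1:k], state[k:]
        weight = 1.0
        for s in state:
            weight *= prob[s]
        r = pool_positive((shared,) + row_rest)
        c = pool_positive((shared,) + col_rest)
        joint += weight * (r and c)
        union += weight * (r or c)
    return joint, union


def formula_8(p, f, k):
    """Closed form (8) for P(row and column both test positive)."""
    q = 1 - p - f
    return p * (1 - f) ** (2 * (k - 1)) + q * ((1 - f) ** (k - 1) - q ** (k - 1)) ** 2


def formula_9(p, f, k):
    """Closed form (9), using (7) for each single pool."""
    q = 1 - p - f
    return 2 * ((1 - f) ** k - q ** k) - formula_8(p, f, k)
```

The enumeration grows as $3^{2k-1}$, so it is only practical for small $k$; it is meant to confirm the algebra, not to replace the closed forms.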

These results provide the basis for the required formulas for the three strategies, as given

below.

Dorfman

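\begin{align*}
E_{T,\mathrm{Dorfman}}(k) &= N/k + N P(R_i^{+}) \\
&= N\left[\frac{1}{k} + (1 - f)^{k} - (1 - p - f)^{k}\right] \tag{10}
\end{align*}

\begin{align*}
P_{M,\mathrm{Dorfman}}(k) &= P(R_i^{-} \cap A_{ij}^{+}) \\
&= P(A_{ij}^{+}) P(R_i^{-} \mid A_{ij}^{+}) \\
&= p\left[1 - (1 - f)^{k-1}\right] \tag{11}
\end{align*}

\begin{align*}
N_{\mathrm{Dorfman}}(k) &= N/k \tag{12}
\end{align*}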
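Formulas (10) and (11) are straightforward to evaluate. The sketch below is a hedged illustration rather than the authors' implementation; the function names and the search bound `k_max` are assumptions. Setting $f = 0$ recovers the classical Dorfman expression $N[1/k + 1 - (1-p)^k]$, and for $p = 0.01$ the pool size minimizing expected tests is 11, in line with the $1/\sqrt{p}$ rule attributed to Feller (1968). Note that this search looks only at the expected number of tests, not at the total cost criterion used in the report.

```python
def dorfman_expected_tests(N, p, f, k):
    """Equation (10): N/k pool tests plus k retests for each positive pool."""
    return N * (1 / k + (1 - f) ** k - (1 - p - f) ** k)


def dorfman_missed(p, f, k):
    """Equation (11): probability that an individual is a missed positive."""
    return p * (1 - (1 - f) ** (k - 1))


def best_pool_size(p, f, k_max=100):
    """Pool size in 2..k_max that minimizes expected tests per individual."""
    return min(range(2, k_max + 1),
               key=lambda k: dorfman_expected_tests(1, p, f, k))


# Example: expected tests per individual and miss probability at k = 10.
print(dorfman_expected_tests(1, 0.02, 0.01, 10), dorfman_missed(0.02, 0.01, 10))
```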

Square Array with AND Retesting

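\begin{align*}
E_{T,A}(k) &= 2N/k + N P(R_i^{+} \cap C_j^{+}) \\
&= N\left\{\frac{2}{k} + p (1 - f)^{2(k-1)} + (1 - p - f)\left[(1 - f)^{k-1} - (1 - p - f)^{k-1}\right]^{2}\right\} \tag{13}
\end{align*}

\begin{align*}
P_{M,A}(k) &= P(A_{ij}^{+})\left[1 - P(R_i^{+} \cap C_j^{+} \mid A_{ij}^{+})\right] \\
&= p\left[1 - P(R_i^{+} \mid A_{ij}^{+}) P(C_j^{+} \mid A_{ij}^{+})\right] \\
&= p\left[1 - (1 - f)^{2(k-1)}\right] \tag{14}
\end{align*}

\begin{align*}
N_{A}(k) &= 2N/k \tag{15}
\end{align*}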

Square Array with OR Retesting

\begin{align*}
E_{T,O}(k) &= N\Big\{\frac{2}{k} + 2\left[(1 - f)^{k} - (1 - p - f)^{k}\right] - p (1 - f)^{2(k-1)} \\
&\quad - (1 - p - f)\left[(1 - f)^{k-1} - (1 - p - f)^{k-1}\right]^{2}\Big\} \tag{16}
\end{align*}

\begin{align*}
P_{M,O}(k) &= P(A_{ij}^{+}) P(R_i^{-} \cap C_j^{-} \mid A_{ij}^{+}) \\
&= P(A_{ij}^{+}) P(R_i^{-} \mid A_{ij}^{+}) P(C_j^{-} \mid A_{ij}^{+}) \\
&= p\left[1 - (1 - f)^{k-1}\right]^{2} \tag{17}
\end{align*}

\begin{align*}
N_{O}(k) &= 2N/k \tag{18}
\end{align*}
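The two square-array strategies can be compared side by side by coding (13), (14), (16), and (17) directly. The following Python sketch is illustrative only; the function names are assumptions, and the helper functions restate (7) and (8).

```python
def p_pool_pos(p, f, k):
    """Equation (7): a single pool of size k tests positive."""
    return (1 - f) ** k - (1 - p - f) ** k


def p_both_pos(p, f, k):
    """Equation (8): row i and column j both test positive."""
    q = 1 - p - f
    return p * (1 - f) ** (2 * (k - 1)) + q * ((1 - f) ** (k - 1) - q ** (k - 1)) ** 2


def and_strategy(N, p, f, k):
    """Equations (13) and (14): (expected tests, missed-positive probability)."""
    expected = N * (2 / k + p_both_pos(p, f, k))
    missed = p * (1 - (1 - f) ** (2 * (k - 1)))
    return expected, missed


def or_strategy(N, p, f, k):
    """Equations (16) and (17): (expected tests, missed-positive probability)."""
    p_either = 2 * p_pool_pos(p, f, k) - p_both_pos(p, f, k)
    expected = N * (2 / k + p_either)
    missed = p * (1 - (1 - f) ** (k - 1)) ** 2
    return expected, missed


# Example with the prevalence values used in the report's scenarios.
for k in (5, 10, 20):
    print(k, and_strategy(1000, 0.02, 0.01, k), or_strategy(1000, 0.02, 0.01, k))
```

At every pool size, OR retesting requires at least as many expected tests as AND but has a smaller missed-positive probability, which is the trade-off that the total cost comparison weighs.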


[Figure 1 appears here: eight panels, (a) through (h), each plotting total cost against pool size (10 to 50) for the Dorfman, AND, and OR strategies. Panel titles give the parameter settings: p = 0.02, f = 0.01 or p = 0.01, f = 0.001; C1 = 0.5 or 2; C2 = 1; C3 = 100 or 1000.]

Figure 1: Total Cost as a Function of Pool Size for Three Retesting Strategies


Table 3

Scenario 2 - Total Costs for Dorfman, AND and OR retesting strategies:
p = 0.02, f = 0.01, C0 = 200, C1 = 0.5, C2 = 1, C3 = 1000.

                      Total Cost
Pool size    Dorfman      AND       OR
    3           9756    18288    11225
    4          10643    19624     9162
    5          12003    21891     8116
    6          13587    24587     7587
    7          15288    27500     7361
    8          17052    30524     7330
    9          18850    33599     7435

Table 4

Scenario 3 - Total Costs for Dorfman, AND and OR retesting strategies:
p = 0.02, f = 0.01, C0 = 200, C1 = 2, C2 = 1, C3 = 1000.

                      Total Cost
Pool size    Dorfman      AND       OR
    4          14393    27124    16662
    5          15003    27891    14116
    6          16087    29587    12587
    7          17430    31786    11647
    8          18927    34273    11080
    9          20517    36933    10768
   10          22166    39692    10640
   11          23851    42507    10650


Table 5

3 359257 13452


25